clawbench/scripts
scoootscooob b6f07d9a87 analysis: dynamical-systems diagnostics for agent runs
Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.

Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.
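
A minimal sketch of the TF-IDF state proxy, written in stdlib-only Python for illustration (the scripts themselves use numpy + scipy). The whitespace tokenizer, the smoothed IDF formula, and the names `tfidf` and `cosine` are assumptions, not taken from the scripts.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one sparse {token: weight} vector per document."""
    tokenized = [d.lower().split() for d in docs]  # assumed tokenizer
    n = len(tokenized)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumption; other variants are equally plausible).
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}
    vecs = []
    for toks in tokenized:
        tf = Counter(toks)
        vecs.append({tok: cnt / len(toks) * idf[tok] for tok, cnt in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up assistant turns used only to exercise the functions.
docs = [
    "read the config file",
    "read the config file again",
    "deploy the container now",
]
v0, v1, v2 = tfidf(docs)
sim_near = cosine(v0, v1)  # turns sharing most tokens
sim_far = cosine(v0, v2)   # turns sharing only "the"
```

Turns that share most of their vocabulary end up close in this space, which is all the downstream diagnostics rely on.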

scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is
participation ratio of response covariance, entropy is eigenvalue
entropy, and BOPS is inter-run cosine (predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
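
The C(q) formula above can be sketched as follows, assuming the eigenvalue spectrum of each task's response covariance is already available. The spectra and BOPS values are made up, the helper names are hypothetical, and using the population standard deviation for z-scoring is an assumption.

```python
import math
from statistics import mean, pstdev

def participation_ratio(eigs):
    """(sum of eigenvalues)^2 / sum of squared eigenvalues."""
    s = sum(eigs)
    return s * s / sum(e * e for e in eigs)

def eigen_entropy(eigs):
    """Shannon entropy of the eigenvalues normalized to sum to 1."""
    s = sum(eigs)
    return -sum((e / s) * math.log(e / s) for e in eigs if e > 0)

def zscores(xs):
    """z-score across tasks (population std; an assumption)."""
    mu, sd = mean(xs), pstdev(xs)
    return [(x - mu) / sd if sd > 0 else 0.0 for x in xs]

def constraint_index(pr, entropy, bops):
    """C(q) = -z(PR) - z(entropy) + z(BOPS), one value per task."""
    return [-a - b + c
            for a, b, c in zip(zscores(pr), zscores(entropy), zscores(bops))]

# Hypothetical covariance spectra and mean inter-run cosines for three tasks.
spectra = {
    "tight": [9.0, 0.5, 0.5],
    "medium": [5.0, 3.0, 2.0],
    "open": [3.4, 3.3, 3.3],
}
bops = {"tight": 0.9, "medium": 0.6, "open": 0.3}

tasks = list(spectra)
c_q = dict(zip(tasks, constraint_index(
    [participation_ratio(spectra[t]) for t in tasks],
    [eigen_entropy(spectra[t]) for t in tasks],
    [bops[t] for t in tasks],
)))
```

A task whose responses collapse onto one direction and agree across runs gets the highest C(q); a task with a flat spectrum and low inter-run cosine gets the lowest.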

scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.
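
A rough sketch of three of the trajectory features, under guessed definitions: drift_mean as mean cosine distance between consecutive turns, from_start as cosine distance between the first and last turn, and recurrence as the fraction of non-adjacent turn pairs closer than a hypothetical `eps`. The actual definitions in classify_regimes.py may differ, and vol_log and the quartile thresholds are not reproduced here.

```python
import math

def cos_dist(a, b):
    """Cosine distance between two dense vectors (1.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def trajectory_features(states, eps=0.1):
    """states: per-turn state vectors for one run (needs at least 2 turns)."""
    steps = [cos_dist(a, b) for a, b in zip(states, states[1:])]
    # Non-adjacent pairs only, so a stationary step does not count twice.
    pairs = [(i, j) for i in range(len(states))
             for j in range(i + 2, len(states))]
    revisits = sum(cos_dist(states[i], states[j]) < eps for i, j in pairs)
    return {
        "drift_mean": sum(steps) / len(steps),
        "from_start": cos_dist(states[0], states[-1]),
        "recurrence": revisits / len(pairs) if pairs else 0.0,
    }

a, b = [1.0, 0.0], [0.0, 1.0]
looping = trajectory_features([a, b, a, b, a, b])  # alternates between two states
trapped = trajectory_features([a, a, a, a])        # never leaves the first state
```

Under these definitions a trapped run shows zero drift with full recurrence, while a limit cycle combines high drift with high recurrence.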

scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
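
The per-task decomposition can be illustrated as below. The scores are invented, `decompose` is a hypothetical name, and the use of population variance and of the mean within-model variance as seed_var are assumptions about details the message does not specify.

```python
from statistics import mean, pvariance

def decompose(scores_by_model):
    """scores_by_model: {model: [run_score per seed]} for a single task.

    Returns (seed_var, cap_var, snr).
    """
    # Seed variance: average within-model variance across repeated runs.
    seed_var = mean(pvariance(runs) for runs in scores_by_model.values())
    # Capability variance: variance of the per-model mean scores.
    cap_var = pvariance([mean(runs) for runs in scores_by_model.values()])
    snr = cap_var / seed_var if seed_var > 0 else float("inf")
    return seed_var, cap_var, snr

# Made-up scores: three seeds per model on two tasks.
clean_task = {"model_a": [0.9, 0.8, 1.0], "model_b": [0.2, 0.3, 0.1]}
noisy_task = {"model_a": [0.1, 0.9, 0.5], "model_b": [0.6, 0.2, 0.7]}

_, _, snr_clean = decompose(clean_task)
_, _, snr_noisy = decompose(noisy_task)
```

On the noisy task the two model means coincide, so the observed spread is entirely seed noise and any ranking drawn from it is arbitrary.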

scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where assistant emits empty response
or run ends in failure. Reveals long-horizon capability that flat
scores hide: Kimi's median failure turn is 3, while 60% of GPT runs
survive to turn 8.
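
One way to compute S(t) and h(t) is a Kaplan-Meier style estimate, sketched below. Treating runs that end without failure as right-censored is an assumption (the message does not say how they are handled), and the run data and the name `survival_curve` are hypothetical.

```python
def survival_curve(runs, max_turn):
    """runs: list of (last_turn, failed) pairs.

    failed=False marks a run that ended without failure (right-censored).
    Returns (S, h), each a list indexed by turn 1..max_turn.
    """
    S, h = [], []
    surv = 1.0
    for t in range(1, max_turn + 1):
        # Runs still alive (not yet failed or censored) entering turn t.
        at_risk = sum(1 for last, _ in runs if last >= t)
        # Runs whose failure turn T_F is exactly t.
        deaths = sum(1 for last, failed in runs if failed and last == t)
        hazard = deaths / at_risk if at_risk else 0.0
        surv *= 1.0 - hazard
        h.append(hazard)
        S.append(surv)
    return S, h

# Made-up runs: four failures and one run that finished cleanly at turn 5.
runs = [(2, True), (3, True), (3, True), (5, False), (6, True)]
S, h = survival_curve(runs, max_turn=6)
```

Here `S[t - 1]` is the estimated probability of surviving past turn t, and `h[t - 1]` is the failure rate at turn t among runs that reached it.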

scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.
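
A sketch of the weighting scheme, assuming the weight is SNR × |C(q)| per task, that winsorization caps the weights (rather than the raw SNR) at their 95th percentile, and that a model's headline score is the weighted mean of its per-task mean scores. All numbers and function names are illustrative.

```python
def percentile(xs, p):
    """Linear-interpolation percentile, p in [0, 100]."""
    xs = sorted(xs)
    k = (len(xs) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

def task_weights(snr, c_q, cap_pct=95):
    """Weight = SNR * |C(q)|, capped at the cap_pct percentile of all weights."""
    raw = {t: snr[t] * abs(c_q[t]) for t in snr}
    cap = percentile(list(raw.values()), cap_pct)
    return {t: min(w, cap) for t, w in raw.items()}

def weighted_score(task_means, weights):
    """Weighted mean of one model's per-task mean scores."""
    total = sum(weights.values())
    return sum(weights[t] * task_means[t] for t in weights) / total

# Made-up per-task diagnostics and per-model mean scores.
snr = {"t1": 50.0, "t2": 2.0, "t3": 0.5}
c_q = {"t1": 1.0, "t2": -1.5, "t3": 0.2}
means = {
    "model_a": {"t1": 0.9, "t2": 0.2, "t3": 0.1},
    "model_b": {"t1": 0.3, "t2": 0.9, "t3": 0.9},
}

weights = task_weights(snr, c_q)
ranked = {m: weighted_score(s, weights) for m, s in means.items()}
flat = {m: sum(s.values()) / len(s) for m, s in means.items()}
```

In this toy data the flat average favors model_b, while the weighted score favors model_a because it wins on the one task that actually discriminates between models.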

scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:49:05 -07:00
_archive_cache.sh clawbench: per-sweep cache archiving + generic sweep templates 2026-04-18 12:46:45 -07:00
analyze_open_vs_closed.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
audit_per_run.py analysis: fair-comparison audit and rejudge pipeline 2026-04-20 19:48:43 -07:00
audit_runs.py analysis: fair-comparison audit and rejudge pipeline 2026-04-20 19:48:43 -07:00
classify_regimes.py analysis: dynamical-systems diagnostics for agent runs 2026-04-20 19:49:05 -07:00
compute_constraint_index.py analysis: dynamical-systems diagnostics for agent runs 2026-04-20 19:49:05 -07:00
container_sweep_minimal.sh clawbench: per-sweep cache archiving + generic sweep templates 2026-04-18 12:46:45 -07:00
container_sweep_single.sh sweep: per-container state isolation + qwen model-id fix 2026-04-20 19:48:30 -07:00
generate_dynamical_report.py analysis: dynamical-systems diagnostics for agent runs 2026-04-20 19:49:05 -07:00
generate_fair_report.py analysis: fair-comparison audit and rejudge pipeline 2026-04-20 19:48:43 -07:00
git_checkpoint.py clawbench: per-sweep cache archiving + generic sweep templates 2026-04-18 12:46:45 -07:00
ingest_real_run.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
inject_judge_rubrics.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
refactor_verifiers.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
rejudge_all.py analysis: fair-comparison audit and rejudge pipeline 2026-04-20 19:48:43 -07:00
run_open_vs_closed_bakeoff.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
scale_timeouts.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
seed_historical_db.py ClawBench: 7-model frontier baseline + bake-off tooling 2026-04-10 19:14:11 -07:00
snr_weighted_ranking.py analysis: dynamical-systems diagnostics for agent runs 2026-04-20 19:49:05 -07:00
survival_analysis.py analysis: dynamical-systems diagnostics for agent runs 2026-04-20 19:49:05 -07:00
variance_decomp.py analysis: dynamical-systems diagnostics for agent runs 2026-04-20 19:49:05 -07:00