clawbench

scoootscooob b6f07d9a87 analysis: dynamical-systems diagnostics for agent runs

Treats agent runs as stochastic trajectories in semantic state space and extracts signal that flat run_score averages away. Inspired by the "When LLMs Are Dreaming, Where Do They Go?" framework: task constraint characterization, per-run regime classification, seed-vs-capability variance decomposition, per-turn survival, SNR-weighted ranking.

Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external model dependencies) as the semantic state proxy since sentence embeddings would require torch. Crude but sufficient for the signals the paper calls out.

scripts/compute_constraint_index.py: computes C(q) per task from archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is participation ratio of response covariance, entropy is eigenvalue entropy, and BOPS is inter-run cosine (predictability proxy). High C(q) = tasks where models converge to similar answers; low C(q) = open-ended tasks where models diverge for style reasons.

scripts/classify_regimes.py: per-run regime classifier. Computes drift_mean, from_start, recurrence, vol_log over turn trajectories. Quartile-based thresholds label each run as too_short / trapped / limit_cycle / diffusive / mixed. Reveals per-model tendencies: Gemini traps frequently (one-shot answer without iteration), GPT loops tool patterns, GLM is most balanced.

scripts/variance_decomp.py: decomposes run_score variance per task into seed variance (3 runs of same model) vs capability variance (across model means). SNR = cap_var / seed_var. Exposes that 47% of benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and give essentially random rankings.

scripts/survival_analysis.py: per-turn empirical survival S(t) and hazard h(t). T_F = first turn where assistant emits empty response or run ends in failure. Reveals long-horizon capability that flat scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at 60% rate.
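The seed-versus-capability split described for scripts/variance_decomp.py can be illustrated with a small sketch. This is not the script's actual code: the function name, the (n_models, n_seeds) input layout, and the choice of sample variance (ddof=1) averaged across models are all assumptions made for illustration. Only the ratio SNR = cap_var / seed_var comes from the commit message.

```python
import numpy as np

def variance_decomp(scores):
    """Hypothetical helper: split one task's run_score variance.

    scores: array-like of shape (n_models, n_seeds), e.g. 3 seeded
    runs per model. The layout is an assumption, not the repo's format.
    """
    scores = np.asarray(scores, dtype=float)
    # Seed variance: within-model variance across repeated runs,
    # averaged over models (assumed aggregation).
    seed_var = scores.var(axis=1, ddof=1).mean()
    # Capability variance: variance of the per-model mean scores.
    cap_var = scores.mean(axis=1).var(ddof=1)
    # SNR = cap_var / seed_var, as stated in the commit message.
    snr = cap_var / seed_var if seed_var > 0 else float("inf")
    return seed_var, cap_var, snr

# Two models, three seeds each (made-up numbers).
seed_var, cap_var, snr = variance_decomp([[1.0, 0.8, 0.9],
                                          [0.2, 0.3, 0.1]])
print(f"seed_var={seed_var:.3f} cap_var={cap_var:.3f} snr={snr:.1f}")
```

Under this reading, a task with SNR below 1 has more run-to-run noise than between-model spread, which is why such tasks would rank models almost at random.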
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with winsorization at p95 to prevent single-task dominance). Headline metric that weights discriminating + signal-rich tasks more than noisy or consensus tasks. Also emits SNR-only and flat variants for comparison.

scripts/generate_dynamical_report.py: assembles all four diagnostic JSONs into a single markdown report with per-model regime tables, SNR tiers, survival curves, and integrated interpretation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 19:49:05 -07:00
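As a rough illustration of the SNR × |C(q)| weighting with p95 winsorization mentioned in the commit message, the sketch below clips per-task weights at their 95th percentile and ranks models by the weighted mean score. The function name, argument names, input layout, and the decision to winsorize the weights (rather than the scores) are assumptions. The real scripts/snr_weighted_ranking.py may differ. Finite SNR values are assumed.

```python
import numpy as np

def snr_weighted_ranking(task_scores, snr, c_index, pct=95):
    """Hypothetical sketch of an SNR x |C(q)| weighted ranking.

    task_scores: dict mapping model name -> per-task mean scores.
    snr, c_index: per-task SNR and constraint index C(q), same order.
    """
    raw = np.asarray(snr, dtype=float) * np.abs(np.asarray(c_index, dtype=float))
    # Winsorize the upper tail so no single task dominates the ranking.
    cap = np.percentile(raw, pct)
    weights = np.minimum(raw, cap)
    total = weights.sum()
    if total <= 0:
        raise ValueError("all task weights are zero")
    weights = weights / total
    # Weighted mean score per model, sorted best first.
    scored = [(model, float(np.dot(weights, np.asarray(s, dtype=float))))
              for model, s in task_scores.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up example: three tasks, two hypothetical models.
ranking = snr_weighted_ranking(
    {"model_a": [1.0, 0.0, 0.0], "model_b": [0.0, 1.0, 1.0]},
    snr=[4.0, 1.0, 0.5],
    c_index=[1.0, -2.0, 0.5],
)
for model, score in ranking:
    print(model, round(score, 3))
```

Setting every weight to 1 would give a flat ranking, and dropping the |C(q)| factor would give an SNR-only one, along the lines of the comparison variants the commit message mentions.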
_archive_cache.sh	clawbench: per-sweep cache archiving + generic sweep templates	2026-04-18 12:46:45 -07:00
analyze_open_vs_closed.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
audit_per_run.py	analysis: fair-comparison audit and rejudge pipeline	2026-04-20 19:48:43 -07:00
audit_runs.py	analysis: fair-comparison audit and rejudge pipeline	2026-04-20 19:48:43 -07:00
classify_regimes.py	analysis: dynamical-systems diagnostics for agent runs	2026-04-20 19:49:05 -07:00
compute_constraint_index.py	analysis: dynamical-systems diagnostics for agent runs	2026-04-20 19:49:05 -07:00
container_sweep_minimal.sh	clawbench: per-sweep cache archiving + generic sweep templates	2026-04-18 12:46:45 -07:00
container_sweep_single.sh	sweep: per-container state isolation + qwen model-id fix	2026-04-20 19:48:30 -07:00
generate_dynamical_report.py	analysis: dynamical-systems diagnostics for agent runs	2026-04-20 19:49:05 -07:00
generate_fair_report.py	analysis: fair-comparison audit and rejudge pipeline	2026-04-20 19:48:43 -07:00
git_checkpoint.py	clawbench: per-sweep cache archiving + generic sweep templates	2026-04-18 12:46:45 -07:00
ingest_real_run.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
inject_judge_rubrics.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
refactor_verifiers.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
rejudge_all.py	analysis: fair-comparison audit and rejudge pipeline	2026-04-20 19:48:43 -07:00
run_open_vs_closed_bakeoff.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
scale_timeouts.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
seed_historical_db.py	ClawBench: 7-model frontier baseline + bake-off tooling	2026-04-10 19:14:11 -07:00
snr_weighted_ranking.py	analysis: dynamical-systems diagnostics for agent runs	2026-04-20 19:49:05 -07:00
survival_analysis.py	analysis: dynamical-systems diagnostics for agent runs	2026-04-20 19:49:05 -07:00
variance_decomp.py	analysis: dynamical-systems diagnostics for agent runs	2026-04-20 19:49:05 -07:00