Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.
Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.
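A minimal sketch of the TF-IDF state proxy, using numpy only. The tokenizer regex, the smoothed IDF variant, and the name `tfidf_embed` are assumptions made for illustration; the scripts may tokenize and weight differently.

```python
import re
from collections import Counter

import numpy as np


def tfidf_embed(texts):
    """Return L2-normalized TF-IDF rows (one per text) and the vocabulary."""
    docs = [re.findall(r"[a-z0-9]+", t.lower()) for t in texts]
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: j for j, tok in enumerate(vocab)}

    # Term frequency: token count divided by document length.
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for tok, count in Counter(doc).items():
            tf[i, index[tok]] = count / len(doc)

    # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1.
    df = (tf > 0).sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0

    x = tf * idf
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0, 1.0, norms), vocab


# Hypothetical turn texts; rows are unit length, so the dot product is cosine.
turns = [
    "read the config file",
    "read the config file again",
    "deploy the service to staging",
]
emb, _ = tfidf_embed(turns)
cos = emb @ emb.T
```

Because rows are unit length, consecutive-turn distance can be read off as `1 - cos[i, i + 1]`.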
scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is
participation ratio of response covariance, entropy is eigenvalue
entropy, and BOPS is inter-run cosine (predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
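The C(q) formula can be sketched as follows. This assumes one embedding matrix per task (rows are runs), z-scores taken across tasks, and BOPS taken as the mean pairwise cosine; the function names and the handling of degenerate inputs are illustrative, not the script's actual code.

```python
import numpy as np


def spectrum_stats(x):
    """Participation ratio and eigenvalue entropy of the run covariance.

    x: (n_runs, dim) array of response embeddings for one task.
    """
    centered = x - x.mean(axis=0)
    # Covariance eigenvalues from singular values of the centered data.
    s = np.linalg.svd(centered, compute_uv=False)
    lam = s**2 / max(len(x) - 1, 1)
    total = lam.sum()
    if total <= 0:
        return 0.0, 0.0  # assumed convention when all runs are identical
    pr = total**2 / (lam**2).sum()
    p = lam[lam > 0] / total
    entropy = float(-(p * np.log(p)).sum())
    return float(pr), entropy


def mean_pairwise_cosine(x):
    """BOPS proxy: mean cosine similarity over all distinct run pairs."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    sim = unit @ unit.T
    return float(sim[np.triu_indices(len(x), k=1)].mean())


def zscore(v):
    v = np.asarray(v, dtype=float)
    sd = v.std()
    return (v - v.mean()) / sd if sd > 0 else np.zeros_like(v)


def constraint_index(task_embeddings):
    """task_embeddings: dict task_id -> (n_runs, dim). Returns task_id -> C(q)."""
    ids = list(task_embeddings)
    stats = [spectrum_stats(task_embeddings[q]) for q in ids]
    pr = [s[0] for s in stats]
    ent = [s[1] for s in stats]
    bops = [mean_pairwise_cosine(task_embeddings[q]) for q in ids]
    c = -zscore(pr) - zscore(ent) + zscore(bops)
    return dict(zip(ids, c))


# Synthetic data: "tight" varies along one direction, "loose" is isotropic.
rng = np.random.default_rng(0)
tight = np.ones((6, 8))
tight[:, 0] += 0.1 * rng.normal(size=6)
loose = rng.normal(size=(6, 8))
c = constraint_index({"tight": tight, "loose": loose})
```

On this synthetic pair the low-rank, high-agreement task receives the higher C(q), which matches the intended reading of the index.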
scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.
scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
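A sketch of the per-task decomposition, assuming a (models x seeds) matrix of run_score values and unbiased (ddof=1) variance estimates. Both the estimator choice and the name `decompose` are assumptions; the scores below are made up.

```python
import numpy as np


def decompose(scores):
    """scores: (n_models, n_seeds) run_score matrix for one task."""
    scores = np.asarray(scores, dtype=float)
    # Seed variance: within-model variance across repeated runs, averaged.
    seed_var = scores.var(axis=1, ddof=1).mean()
    # Capability variance: variance of the per-model mean scores.
    cap_var = scores.mean(axis=1).var(ddof=1)
    if seed_var > 0:
        snr = cap_var / seed_var
    else:
        snr = float("inf") if cap_var > 0 else 0.0
    return seed_var, cap_var, snr


# Hypothetical task where models separate cleanly across seeds.
separable = [[0.9, 0.9, 0.8], [0.5, 0.4, 0.5], [0.1, 0.2, 0.1]]
# Hypothetical task where every model has the same mean but noisy runs.
noisy = [[0.0, 1.0, 0.5], [0.5, 0.0, 1.0], [1.0, 0.5, 0.0]]

_, _, snr_separable = decompose(separable)
_, _, snr_noisy = decompose(noisy)
```

A task like `noisy` has SNR below 1, so ranking models on it mostly reflects which seed was drawn.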
scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where the assistant emits an empty
response or the run ends in failure. Reveals long-horizon capability
that flat scores hide: Kimi's median failure turn is 3, while 60% of
GPT runs survive to turn 8.
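One way to compute S(t) and h(t) per turn is a discrete-time Kaplan-Meier style estimate, sketched below. Treating runs that end without failure as censored at their last turn is an assumption, as are the function name and the toy data.

```python
import numpy as np


def survival_curve(fail_turn, last_turn, horizon):
    """Discrete-time survival and hazard per turn.

    fail_turn[i]: 1-based turn T_F of failure, or None if the run never failed.
    last_turn[i]: number of turns observed for run i.
    """
    times = np.array(
        [f if f is not None else l for f, l in zip(fail_turn, last_turn)]
    )
    failed = np.array([f is not None for f in fail_turn])

    survival, hazard = [], []
    alive = 1.0
    for t in range(1, horizon + 1):
        at_risk = int((times >= t).sum())          # runs still observed at turn t
        deaths = int((failed & (times == t)).sum())  # failures exactly at turn t
        h_t = deaths / at_risk if at_risk else 0.0
        alive *= 1.0 - h_t
        hazard.append(h_t)
        survival.append(alive)
    return np.array(survival), np.array(hazard)


# Six hypothetical runs of one model; None marks a run that never failed.
fail_turn = [3, 3, 2, None, 5, None]
last_turn = [3, 3, 2, 8, 5, 6]
S, h = survival_curve(fail_turn, last_turn, horizon=8)
```

In this toy data half of the runs are still alive after turn 3 (`S[2]` is 0.5), and the hazard at turn 3 is 0.4 because two of the five runs at risk fail there.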
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.
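The weighting can be sketched as below, assuming per-task weights SNR × |C(q)| clipped at their 95th percentile and then normalized to sum to 1. The function name, input layout, and all numbers are hypothetical.

```python
import numpy as np


def weighted_ranking(task_means, snr, c_index, cap_pct=95):
    """Rank models by a weighted mean of per-task scores.

    task_means: dict model -> per-task mean run_score (shared task order).
    Assumes at least one task has a nonzero weight.
    """
    w = np.asarray(snr, dtype=float) * np.abs(np.asarray(c_index, dtype=float))
    # Winsorize the upper tail so one task cannot dominate the ranking.
    w = np.minimum(w, np.percentile(w, cap_pct))
    w = w / w.sum()
    scores = {m: float(np.dot(w, v)) for m, v in task_means.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical inputs for four tasks and two models.
snr = [5.0, 0.2, 2.0, 0.1]
c_index = [1.0, -0.5, -1.5, 0.2]
task_means = {
    "model_a": np.array([0.9, 0.1, 0.8, 0.0]),
    "model_b": np.array([0.3, 1.0, 0.4, 1.0]),
}

ranking = weighted_ranking(task_means, snr, c_index)
flat = {m: float(v.mean()) for m, v in task_means.items()}
```

Here `model_b` leads on the flat mean but `model_a` leads once the low-SNR tasks are down-weighted, which is the kind of reordering the flat variant is emitted to expose.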
scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tools for auditing archive coverage, rejudging judge-infra failures
via direct Anthropic API (bypasses the gateway path that sometimes
returns "Gateway is restarting" / empty judge results), and producing
fair multi-model comparison reports.
scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs
and archive JSONs side-by-side. Reports coverage %, clean mean,
coverage-normalized score, infra-zero count, judge-infra remaining
vs rejudged.
scripts/audit_per_run.py: per-run cross-model audit. Flags tasks
where all models score zero (broken task/verifier), verifiers that
reject valid outputs (C=0 even though the agent produced text),
harness-error clusters, and model-specific pathologies.
scripts/rejudge_all.py: re-runs judge scoring on archive runs where
the gateway judge failed. Uses direct anthropic SDK against
claude-sonnet-4-6, rewrites judge_result fields in place, recomputes
run_score per the C+T+B+J weighting.
scripts/generate_fair_report.py: produces an 8/9-model comparison
markdown report. Supports --exclude to drop specific models and
headlines the "clean" score (mean across 120 archived runs). Reports
per-tier scores, C=1.0 task pass counts, and coverage parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.
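The isolation step is implemented in shell; the Python sketch below only illustrates the same idea. The directory prefix, the helper names, and the choice to create the runtime dirs as empty directories are assumptions (`workspace*/` is simply not copied here).

```python
import os
import shutil
import tempfile
from pathlib import Path

# Config entries carried over, as listed in the commit message.
CONFIG_ENTRIES = [
    "openclaw.json", "identity", "devices", "exec-approvals.json",
    "tasks", "subagents", "flows", "cron",
]
# Runtime dirs that start empty in the clone.
RUNTIME_DIRS = ["agents", "logs", "memory", "cache"]


def clone_pristine_state(src, tmp_root="/tmp"):
    """Copy config entries from src into a fresh per-sweep state dir."""
    src = Path(src)
    dst = Path(tempfile.mkdtemp(prefix="openclaw-state-", dir=tmp_root))
    for name in CONFIG_ENTRIES:
        entry = src / name
        if entry.is_dir():
            shutil.copytree(entry, dst / name)
        elif entry.is_file():
            shutil.copy2(entry, dst / name)
    for name in RUNTIME_DIRS:
        (dst / name).mkdir(exist_ok=True)
    return dst


def gateway_env(state_dir):
    """Environment for the gateway process, pointing at the isolated dir."""
    env = dict(os.environ)
    env["OPENCLAW_STATE_DIR"] = str(state_dir)
    return env
```

Passing `gateway_env(dst)` as the `env` argument of `subprocess.Popen` would keep stale agents on the shared mount out of the gateway's view for that sweep.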
profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (dashed) to openrouter/qwen/qwen3.6-plus
(undashed). OpenRouter does not recognize the dashed slug and the
request fails silently; the undashed slug is the canonical one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- scripts/_archive_cache.sh: snapshot run_cache/<model>/ to
run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json.
Sourced by sweep scripts so transcripts survive the next sweep's
cache wipe and stay available for audits.
- scripts/container_sweep_single.sh: base multi-model sweep.
Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so
their caches are force-cleared at sweep start. Calls archive helper
on exit.
- scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast
fix validation (~20 min) instead of full 3-run sweep (~60 min).
- Dockerfile.main: parametrized clawbench-on-openclaw image with
ARG BASE for pinning to any openclaw tag.
- scripts/git_checkpoint.py + README: documented checkpoint workflow
for tagging known-good states during risky work.
- .gitignore: un-ignore scripts/, keep targeted ignores for
__pycache__, .tmp, .local.py.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Re-ran the Sonnet 4.6 judge on all 60 GPT-5.4 runs that hit auth
errors during the original sweep, calling the Anthropic API directly
on cached transcripts. Results:
- judge_task_coverage: 0.6 -> 1.0 (all 40 tasks fully judged)
- judge_error_count: 60 -> 0
- overall_judge_score: 0.438 -> 0.239 (old value was inflated because errored runs were excluded)
- overall_score: 0.456 -> 0.457 (effectively unchanged; judge is gated on C >= 0.9999)
No judge caveat remains. All 6 models now have complete, unbiased
judge coverage across all 720 runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Profiles (profiles/):
- frontier_opus_4_6.yaml (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)
All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.
Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile
through the harness and generates a comparison table. Wraps
`clawbench run --profile` via an inline Click entry (the package
has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer. Reports per-bucket
mean, Taguchi S/N, per-task win rate, calibration, and fANOVA.
Classifies OpenRouter routes by inner vendor prefix so Zhipu/Qwen/
MiniMax/Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
scale_timeouts.py, seed_historical_db.py: task-corpus tooling.
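The route classification in analyze_open_vs_closed.py can be pictured roughly as below. Only the `openrouter/qwen/...` slug appears in this history; the other vendor prefixes, the second example slug, and the rule that everything else counts as closed are hypothetical.

```python
# Hypothetical inner-vendor prefixes for open-weight models behind OpenRouter.
OPEN_VENDORS = {"qwen", "z-ai", "minimax", "moonshotai"}


def classify_route(base_model):
    """Return "open" or "closed" for a base_model slug.

    OpenRouter slugs look like "openrouter/<vendor>/<model>", so the
    bucket is decided by the inner vendor rather than the outer provider.
    """
    parts = base_model.split("/")
    if parts[0] == "openrouter" and len(parts) >= 3:
        return "open" if parts[1] in OPEN_VENDORS else "closed"
    # Assumption: direct vendor routes in this bake-off are closed models.
    return "closed"


bucket_qwen = classify_route("openrouter/qwen/qwen3.6-plus")
bucket_direct = classify_route("anthropic/example-model")  # hypothetical slug
```

Keying on the inner vendor matters because bucketing on the outer `openrouter` prefix alone would lump every routed model together.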
Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run
(3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6
scored 63.9% with real token streaming (174K tok, $0.18 cost).
The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too
easy to separate frontier models at n=1. Documents
infrastructure findings around gateway plugin allowlist behavior,
token streaming gaps for non-Anthropic providers, and hot-reload
cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning
Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
(committed snapshot for reproducibility; runtime results still
go to results/ which remains gitignored)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>