clawbench

Author	SHA1	Message	Date
pllm-uci	c209612d46	Add archive dynamics pipeline and audience-based model presets	2026-04-22 12:03:13 -07:00
scoootscooob	b6f07d9a87	analysis: dynamical-systems diagnostics for agent runs Treats agent runs as stochastic trajectories in semantic state space and extracts signal that flat run_score averages away. Inspired by the "When LLMs Are Dreaming, Where Do They Go?" framework: task constraint characterization, per-run regime classification, seed-vs-capability variance decomposition, per-turn survival, SNR-weighted ranking. Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external model dependencies) as the semantic state proxy since sentence embeddings would require torch. Crude but sufficient for the signals the paper calls out. scripts/compute_constraint_index.py: computes C(q) per task from archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is participation ratio of response covariance, entropy is eigenvalue entropy, and BOPS is inter-run cosine (predictability proxy). High C(q) = tasks where models converge to similar answers; low C(q) = open-ended tasks where models diverge for style reasons. scripts/classify_regimes.py: per-run regime classifier. Computes drift_mean, from_start, recurrence, vol_log over turn trajectories. Quartile-based thresholds label each run as too_short / trapped / limit_cycle / diffusive / mixed. Reveals per-model tendencies: Gemini traps frequently (one-shot answer without iteration), GPT loops tool patterns, GLM is most balanced. scripts/variance_decomp.py: decomposes run_score variance per task into seed variance (3 runs of same model) vs capability variance (across model means). SNR = cap_var / seed_var. Exposes that 47% of benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and give essentially random rankings. scripts/survival_analysis.py: per-turn empirical survival S(t) and hazard h(t). T_F = first turn where assistant emits empty response or run ends in failure. Reveals long-horizon capability that flat scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at 60% rate.
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with winsorization at p95 to prevent single-task dominance). Headline metric that weights discriminating + signal-rich tasks more than noisy or consensus tasks. Also emits SNR-only and flat variants for comparison. scripts/generate_dynamical_report.py: assembles all four diagnostic JSONs into a single markdown report with per-model regime tables, SNR tiers, survival curves, and integrated interpretation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:49:05 -07:00
scoootscooob	afb14c3982	analysis: fair-comparison audit and rejudge pipeline Tools for auditing archive coverage, rejudging judge-infra failures via direct Anthropic API (bypasses the gateway path that sometimes returns "Gateway is restarting" / empty judge results), and producing fair multi-model comparison reports. scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs and archive JSONs side-by-side. Reports coverage %, clean mean, coverage-normalized score, infra-zero count, judge-infra remaining vs rejudged. scripts/audit_per_run.py: per-run cross-model audit. Flags tasks where all models score zero (broken task/verifier), verifier rejects-valid-outputs (C=0 but agent produced text), harness-error clusters, model-specific pathologies. scripts/rejudge_all.py: re-runs judge scoring on archive runs where the gateway judge failed. Uses direct anthropic SDK against claude-sonnet-4-6, rewrites judge_result fields in place, recomputes run_score per the C+T+B+J weighting. scripts/generate_fair_report.py: produces an 8/9-model comparison markdown report. Supports --exclude to drop specific models, headlines "clean" (mean across 120 archived runs). Reports per-tier scores, C=1.0 task pass counts, and coverage parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:48:43 -07:00
scoootscooob	01a31e55fb	sweep: per-container state isolation + qwen model-id fix scripts/container_sweep_single.sh: clone pristine OpenClaw state to /tmp/ per sweep before starting the gateway. Carries over config (openclaw.json, identity/, devices/, exec-approvals.json, tasks/, subagents/, flows/, cron/) but leaves runtime dirs (agents/, workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR to the isolated dir so the gateway writes to /tmp instead of the shared host mount. Fixes the cascading "RPC agents.create timed out after 60s" failures caused by 4k+ stale agents accumulating across sequential sweeps. profiles/frontier_qwen_3_6.yaml: fix base_model from openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus (no dash). The dashed slug is unknown to OpenRouter and silently fails; the no-dash version is the real canonical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:48:30 -07:00
scoootscooob	8a5be9c686	clawbench: per-sweep cache archiving + generic sweep templates - scripts/_archive_cache.sh: snapshot run_cache/<model>/ to run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json. Sourced by sweep scripts so transcripts survive the next sweep's cache wipe and stay available for audits. - scripts/container_sweep_single.sh: base multi-model sweep. Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so their caches are force-cleared at sweep start. Calls archive helper on exit. - scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast fix validation (~20 min) instead of full 3-run sweep (~60 min). - Dockerfile.main: parametrized clawbench-on-openclaw image with ARG BASE for pinning to any openclaw tag. - scripts/git_checkpoint.py + README: documented checkpoint workflow for tagging known-good states during risky work. - .gitignore: un-ignore scripts/, keep targeted ignores for __pycache__, .tmp, .local.py. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 12:46:45 -07:00
scoootscooob	6ab3004d63	Remove reports and scripts from repo, add to gitignore Reports and eval scripts contain internal benchmark data that should not be public. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:51:50 -07:00
scoootscooob	0d07aa4d08	Re-judge GPT 5.4: resolve judge auth caveat, full coverage Re-ran Sonnet 4.6 judge on all 60 GPT 5.4 runs that had auth errors during the original sweep. Called the Anthropic API directly using cached transcripts. Results: - judge_task_coverage: 0.6 -> 1.0 (all 40 tasks fully judged) - judge_error_count: 60 -> 0 - overall_judge_score: 0.438 -> 0.239 (was inflated by excluding errors) - overall_score: 0.456 -> 0.457 (unchanged, judge gated on C >= 0.9999) No judge caveat remains. All 6 models now have complete, unbiased judge coverage across all 720 runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:36:27 -07:00
Codex	4744a6ae7e	ClawBench: 7-model frontier baseline + bake-off tooling Profiles (profiles/): - frontier_opus_4_6.yaml (Anthropic Claude Opus 4.6 — closed) - frontier_gpt_5_4.yaml (OpenAI GPT-5.4 — closed) - frontier_gemini_3_pro.yaml (Google Gemini 3.1 Pro — closed) - frontier_glm_5_1.yaml (Zhipu AI GLM-5.1 via OpenRouter — open) - frontier_qwen_3_6.yaml (Alibaba Qwen3.6-Plus via OpenRouter — open) - frontier_minimax_m27.yaml (MiniMax M2.7 via OpenRouter — open) - frontier_kimi_k25.yaml (Moonshot Kimi K2.5 via OpenRouter — open) - example_research_stack.yaml (example for docs) All seven profiles share an identical plugin stack (anthropic + memory-lancedb + browser-playwright) so base_model is the only structural variable across the bake-off. Scripts (scripts/): - run_open_vs_closed_bakeoff.py: driver that runs each profile through the harness and generates a comparison table. Wraps `clawbench run --profile` via an inline Click entry (the package has no __main__.py so `python -m clawbench.cli` is a no-op). - analyze_open_vs_closed.py: historical DB analyzer — per-bucket mean/Taguchi S/N/per-task win rate/calibration/fANOVA. Classifies OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/Moonshot land in the open bucket. - ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py, scale_timeouts.py, seed_historical_db.py: task-corpus tooling. Reports (reports/): - FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6 scored 63.9% with real token streaming (174K tok, $0.18 cost). The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too easy to separate frontier models at n=1. Documents infrastructure findings around gateway plugin allowlist behavior, token streaming gaps for non-Anthropic providers, and hot-reload cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table - FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run - REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run - PARALLEL_HARNESS_REPORT.md: concurrency validation writeup - V05_DELIVERY_REPORT.md: v0.5 framework delivery notes - CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning Artifacts (reports/artifacts/): - frontier_*.json: the 7 BenchmarkResult files from the bake-off (committed snapshot for reproducibility; runtime results still go to results/ which remains gitignored) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:14:11 -07:00
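
Illustration for commit b6f07d9a87 (constraint index). The message defines C(q) = -z(PR) - z(entropy) + z(BOPS) but does not show how the terms are computed. The following is a minimal sketch, not the contents of scripts/compute_constraint_index.py. It assumes PR is the usual participation ratio (sum of eigenvalues squared over sum of squared eigenvalues), entropy is the Shannon entropy of the normalized eigenvalue spectrum, BOPS is the mean pairwise cosine between runs, and z() standardizes across tasks. The function names and the toy vectors (stand-ins for TF-IDF embeddings) are hypothetical.

```python
import numpy as np

def task_statistics(embeddings):
    """Return (PR, entropy, BOPS) for one task.

    embeddings: (n_runs, dim) array of response vectors (assumed layout).
    """
    X = np.asarray(embeddings, dtype=float)
    centered = X - X.mean(axis=0)
    # Eigenvalues of the response covariance, obtained from singular values.
    s = np.linalg.svd(centered, compute_uv=False)
    eig = s ** 2 / max(len(X) - 1, 1)
    # Drop numerically zero eigenvalues (relative threshold is an assumption).
    eig = eig[eig > eig.max() * 1e-12] if eig.size and eig.max() > 0 else eig[:0]
    if eig.size == 0:
        pr, entropy = 0.0, 0.0
    else:
        pr = eig.sum() ** 2 / (eig ** 2).sum()      # participation ratio
        p = eig / eig.sum()
        entropy = float(-(p * np.log(p)).sum())     # eigenvalue entropy
    # BOPS proxy: mean cosine similarity over all distinct run pairs.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.where(norms > 0, norms, 1.0)
    cos = unit @ unit.T
    bops = float(cos[np.triu_indices(len(X), k=1)].mean())
    return pr, entropy, bops

def zscore(values):
    """Standardize across tasks; returns zeros if there is no spread."""
    v = np.asarray(values, dtype=float)
    sd = v.std()
    return (v - v.mean()) / sd if sd > 0 else np.zeros_like(v)

def constraint_index(per_task_embeddings):
    """C(q) = -z(PR) - z(entropy) + z(BOPS), one value per task."""
    stats = np.array([task_statistics(e) for e in per_task_embeddings])
    pr, entropy, bops = stats.T
    return -zscore(pr) - zscore(entropy) + zscore(bops)

# Hypothetical data: task 0 has near-identical responses, task 1 has
# mutually orthogonal responses.
convergent = [[1.0, 0.0, 0.0], [1.0, 0.01, 0.0], [1.0, 0.0, 0.01]]
divergent = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
c = constraint_index([convergent, divergent])
```

With these toy inputs the convergent task receives the higher C(q), which matches the reading in the commit message that high C(q) marks tasks where responses converge.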

8 Commits
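
Illustration for commit b6f07d9a87 (variance decomposition). The message gives SNR = cap_var / seed_var, with seed variance taken over the 3 runs of the same model and capability variance taken across model means. The following is a minimal sketch of that per-task calculation, not the contents of scripts/variance_decomp.py. It assumes sample variance (ddof=1), averages the within-model variances to get seed_var, and returns infinity when seed_var is zero; the function name and the scores are hypothetical.

```python
import numpy as np

def decompose_task_variance(scores):
    """Split run_score variance for one task into seed and capability parts.

    scores: (n_models, n_seeds) array of run_score values (assumed layout).
    Returns (seed_var, cap_var, snr).
    """
    scores = np.asarray(scores, dtype=float)
    model_means = scores.mean(axis=1)
    # Seed variance: within-model variance over repeated runs, averaged
    # across models (ddof=1 is an assumption).
    seed_var = float(scores.var(axis=1, ddof=1).mean())
    # Capability variance: variance of the per-model mean scores.
    cap_var = float(model_means.var(ddof=1))
    snr = cap_var / seed_var if seed_var > 0 else float("inf")
    return seed_var, cap_var, snr

# Hypothetical task where models separate cleanly: 3 models x 3 runs.
separating_task = [
    [0.80, 0.82, 0.78],
    [0.50, 0.52, 0.48],
    [0.20, 0.22, 0.18],
]
# Hypothetical task where run-to-run noise swamps model differences.
noisy_task = [
    [0.10, 0.90, 0.50],
    [0.20, 0.80, 0.50],
    [0.90, 0.10, 0.50],
]
seed_var, cap_var, snr = decompose_task_variance(separating_task)
noisy_seed_var, noisy_cap_var, noisy_snr = decompose_task_variance(noisy_task)
```

In this sketch the first task yields an SNR far above 1, while the second yields an SNR below 1, the regime the commit message describes as giving essentially random rankings.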