Commit Graph

8 Commits

Author SHA1 Message Date
pllm-uci
c209612d46 Add archive dynamics pipeline and audience-based model presets 2026-04-22 12:03:13 -07:00
scoootscooob
b6f07d9a87 analysis: dynamical-systems diagnostics for agent runs
Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.

Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.

scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is
participation ratio of response covariance, entropy is eigenvalue
entropy, and BOPS is inter-run cosine (predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
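
The C(q) formula above can be sketched as follows. This is a minimal
illustration, not the contents of compute_constraint_index.py: the
function names, the input layout (one embedding matrix per task, one
row per response, at least two responses per task) and the use of the
natural log for eigenvalue entropy are all assumptions.

```python
import numpy as np

def participation_ratio(eigvals):
    # PR = (sum of eigenvalues)^2 / sum of squared eigenvalues
    s = eigvals.sum()
    return float(s * s / (eigvals ** 2).sum())

def eigen_entropy(eigvals):
    # Shannon entropy (natural log) of the normalized eigenvalue spectrum
    p = eigvals / eigvals.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mean_pairwise_cosine(X):
    # Mean off-diagonal cosine similarity between responses (BOPS proxy)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.where(norms == 0, 1.0, norms)
    S = U @ U.T
    n = len(X)
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

def task_stats(X):
    # Covariance eigenvalues via singular values of the centered matrix
    Xc = X - X.mean(axis=0)
    sv = np.linalg.svd(Xc, compute_uv=False)
    eig = sv ** 2 / (len(X) - 1)
    eig = eig[eig > 1e-12]  # drop numerically zero directions
    return participation_ratio(eig), eigen_entropy(eig), mean_pairwise_cosine(X)

def zscore(values):
    v = np.asarray(values, dtype=float)
    sd = v.std()
    return (v - v.mean()) / sd if sd > 0 else np.zeros_like(v)

def constraint_index(task_embeddings):
    # task_embeddings: hypothetical dict of task name -> (n_responses, dim) array
    names = sorted(task_embeddings)
    pr, ent, bops = zip(*(task_stats(task_embeddings[n]) for n in names))
    c = -zscore(pr) - zscore(ent) + zscore(bops)
    return dict(zip(names, c.tolist()))
```

Under these assumptions, a task whose responses are nearly identical
gets a higher C(q) than a task whose responses are mutually orthogonal.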

scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.

scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
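
A minimal sketch of the per-task decomposition described above. The
function name, the input layout, and the choice of sample variance
(ddof=1) with seed variance averaged across models are assumptions;
variance_decomp.py may pool differently.

```python
import numpy as np

def task_snr(scores):
    """scores: hypothetical dict of model name -> list of run_scores for one task."""
    per_model = {m: np.asarray(v, dtype=float) for m, v in scores.items()}
    # Seed variance: average within-model variance across repeated runs
    seed_var = float(np.mean([v.var(ddof=1) for v in per_model.values()]))
    # Capability variance: variance of the per-model mean scores
    means = np.array([v.mean() for v in per_model.values()])
    cap_var = float(means.var(ddof=1))
    snr = cap_var / seed_var if seed_var > 0 else float("inf")
    return {"seed_var": seed_var, "cap_var": cap_var, "snr": snr}
```

A task with SNR < 1 is one where run-to-run noise exceeds the spread
between model means, so its ranking is dominated by seed luck.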

scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where assistant emits empty response
or run ends in failure. Reveals long-horizon capability that flat
scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at
60% rate.
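
The survival and hazard estimates can be sketched as below. This is an
illustration under stated assumptions: failure_turns holds T_F per run
(1-indexed), None marks a run that never failed within the horizon, and
no censoring correction is applied. survival_analysis.py may handle
these cases differently.

```python
def survival_curve(failure_turns, horizon):
    """Return (S, h) lists for turns 1..horizon.

    S[t-1] is the fraction of runs still alive after turn t;
    h[t-1] is the fraction of runs at risk at turn t that fail at turn t.
    """
    n = len(failure_turns)
    if n == 0:
        raise ValueError("need at least one run")
    S, h = [], []
    at_risk = n
    for t in range(1, horizon + 1):
        died = sum(1 for f in failure_turns if f == t)
        h.append(died / at_risk if at_risk else 0.0)
        at_risk -= died
        S.append(at_risk / n)
    return S, h
```

For example, with five runs of which two fail at turn 2 and one at
turn 3, S drops to 0.6 after turn 2 and to 0.4 after turn 3.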

scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.
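
One way the weighted ranking could look, as a sketch only. The function
name and input dicts are hypothetical, SNR values are assumed finite,
and "winsorization at p95" is read here as clipping task weights at
their 95th percentile (upper tail only); snr_weighted_ranking.py may
define it differently.

```python
import numpy as np

def snr_weighted_ranking(task_means, snr, cq, pct=95):
    """task_means: model -> {task: mean run_score}; snr, cq: task -> float."""
    tasks = sorted(snr)
    # Raw weight per task: SNR x |C(q)|
    w = np.array([snr[t] * abs(cq[t]) for t in tasks], dtype=float)
    # Clip the upper tail so one task cannot dominate the ranking
    w = np.minimum(w, np.percentile(w, pct))
    if w.sum() == 0:
        w = np.ones_like(w)  # fall back to a flat average
    w = w / w.sum()
    scores = {
        model: float(sum(wi * per_task[t] for wi, t in zip(w, tasks)))
        for model, per_task in task_means.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With these weights, a model that wins on a high-SNR task can outrank a
model with a better flat average earned mostly on noisy tasks.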

scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:49:05 -07:00
scoootscooob
afb14c3982 analysis: fair-comparison audit and rejudge pipeline
Tools for auditing archive coverage, rejudging judge-infra failures
via direct Anthropic API (bypasses the gateway path that sometimes
returns "Gateway is restarting" / empty judge results), and producing
fair multi-model comparison reports.

scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs
and archive JSONs side-by-side. Reports coverage %, clean mean,
coverage-normalized score, infra-zero count, judge-infra remaining
vs rejudged.

scripts/audit_per_run.py: per-run cross-model audit. Flags tasks
where all models score zero (broken task or verifier), verifiers
that reject valid outputs (C=0 although the agent produced text),
harness-error clusters, and model-specific pathologies.

scripts/rejudge_all.py: re-runs judge scoring on archive runs where
the gateway judge failed. Uses direct anthropic SDK against
claude-sonnet-4-6, rewrites judge_result fields in place, recomputes
run_score per the C+T+B+J weighting.

scripts/generate_fair_report.py: produces an 8/9-model comparison
markdown report. Supports --exclude to drop specific models and
headlines the "clean" score (mean across 120 archived runs). Reports
per-tier scores, C=1.0 task pass counts, and coverage parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:43 -07:00
scoootscooob
01a31e55fb sweep: per-container state isolation + qwen model-id fix
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.
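
The isolation step can be illustrated in Python, although the actual
change lives in a shell script. The entry names and OPENCLAW_STATE_DIR
come from the message above; the function names, the directory prefix,
and the choice not to pre-create the runtime dirs are assumptions.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Config entries carried over; runtime dirs (agents/, workspace*/, logs/,
# memory/, cache/) are deliberately not copied.
CONFIG_ENTRIES = [
    "openclaw.json", "identity", "devices", "exec-approvals.json",
    "tasks", "subagents", "flows", "cron",
]

def clone_pristine_state(src, sweep_tag, tmp_root="/tmp"):
    """Copy only config entries from src into a fresh per-sweep directory."""
    src = Path(src)
    dst = Path(tempfile.mkdtemp(prefix=f"openclaw-state-{sweep_tag}-", dir=tmp_root))
    for name in CONFIG_ENTRIES:
        entry = src / name
        if entry.is_dir():
            shutil.copytree(entry, dst / name)
        elif entry.is_file():
            shutil.copy2(entry, dst / name)
    return dst

def gateway_env(state_dir):
    """Environment for the gateway process, pointing at the isolated state."""
    env = dict(os.environ)
    env["OPENCLAW_STATE_DIR"] = str(state_dir)
    return env
```

Each sweep then starts from a state directory with no stale agents,
and nothing it writes reaches the shared host mount.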

profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (dash after "qwen") to
openrouter/qwen/qwen3.6-plus (no dash). OpenRouter does not recognize
the dashed slug and fails silently; the no-dash slug is the canonical
one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:30 -07:00
scoootscooob
8a5be9c686 clawbench: per-sweep cache archiving + generic sweep templates
- scripts/_archive_cache.sh: snapshot run_cache/<model>/ to
  run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json.
  Sourced by sweep scripts so transcripts survive the next sweep's
  cache wipe and stay available for audits.
- scripts/container_sweep_single.sh: base multi-model sweep.
  Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so
  their caches are force-cleared at sweep start. Calls archive helper
  on exit.
- scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast
  fix validation (~20 min) instead of full 3-run sweep (~60 min).
- Dockerfile.main: parametrized clawbench-on-openclaw image with
  ARG BASE for pinning to any openclaw tag.
- scripts/git_checkpoint.py + README: documented checkpoint workflow
  for tagging known-good states during risky work.
- .gitignore: un-ignore scripts/, keep targeted ignores for
  __pycache__, .tmp, .local.py.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 12:46:45 -07:00
scoootscooob
6ab3004d63 Remove reports and scripts from repo, add to gitignore
Reports and eval scripts contain internal benchmark data that
should not be public.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:51:50 -07:00
scoootscooob
0d07aa4d08 Re-judge GPT 5.4: resolve judge auth caveat, full coverage
Re-ran Sonnet 4.6 judge on all 60 GPT 5.4 runs that had auth errors
during the original sweep. Called the Anthropic API directly using
cached transcripts. Results:

- judge_task_coverage: 0.6 -> 1.0 (all 40 tasks fully judged)
- judge_error_count: 60 -> 0
- overall_judge_score: 0.438 -> 0.239 (was inflated by excluding errors)
- overall_score: 0.456 -> 0.457 (essentially unchanged, since the judge score is gated on C >= 0.9999)

No judge caveat remains. All 6 models now have complete, unbiased
judge coverage across all 720 runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:36:27 -07:00
Codex
4744a6ae7e ClawBench: 7-model frontier baseline + bake-off tooling
Profiles (profiles/):
- frontier_opus_4_6.yaml      (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml       (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml  (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml       (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml      (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml   (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml      (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)

All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.

Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile
  through the harness and generates a comparison table. Wraps
  `clawbench run --profile` via an inline Click entry (the package
  has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer reporting
  per-bucket mean, Taguchi S/N, per-task win rate, calibration, and
  fANOVA. Classifies OpenRouter routes by inner vendor prefix so
  Zhipu/Qwen/MiniMax/Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
  scale_timeouts.py, seed_historical_db.py: task-corpus tooling.

Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run
  (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6
  scored 63.9% with real token streaming (174K tok, $0.18 cost).
  The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too
  easy to separate frontier models at n=1. Documents
  infrastructure findings around gateway plugin allowlist behavior,
  token streaming gaps for non-Anthropic providers, and hot-reload
  cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning

Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
  (committed snapshot for reproducibility; runtime results still
  go to results/ which remains gitignored)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:14:11 -07:00