diff --git a/README.md b/README.md index aa1ef81..8ba30f4 100644 --- a/README.md +++ b/README.md @@ -104,20 +104,56 @@ Core v1 drops the noisy tasks and reports variance decomposition alongside ranki Inspired by *"When LLMs Are Dreaming, Where Do They Go?"* — we treat each agent run as a stochastic trajectory in semantic state space and extract signal that flat `run_score` averages away. -| Diagnostic | Formula / Method | Reveals | -|---|---|---| -| **Constraint Index C(q)** | `-z(PR) - z(entropy) + z(BOPS)` over response embeddings | Which tasks converge to one answer vs diverge openly | -| **Regime classification** | Trajectory drift / recurrence / support-volume thresholds | Per-run dynamical signature (trapped / limit-cycle / diffusive) | -| **Survival analysis** | `S(t) = P(T_F > t)` where T_F = first empty assistant turn | Per-turn failure rates; long-horizon capability | -| **SNR-weighted ranking** | `w(task) = SNR × |C(q)|`, winsorized at p95 | Headline metric that weights tasks by their signal density | -| **Variance decomposition** | `Var(score) = Var_seeds + Var_models` per task | Separate capability signal from coin-flip noise | +Current code-path formulas: + +```text +Per assistant step t: +x_t = [tool_family_proportions(6), error_flag, normalized_tokens, normalized_text_len, progress] +drift_t = cosine_distance(x_0, x_t) +step_t = cosine_distance(x_{t-1}, x_t) + +Task-level Constraint Index: +PR(q) = tr(Σ_q)^2 / tr(Σ_q^2) +H(q) = -Σ_i p_i log2 p_i, p_i = λ_i / Σ_j λ_j, λ = eigvals(Σ_q) +BOPS(q) = mean_m mean_{i t) +h(t) = P(T_F = t | T_F >= t) +``` + +Implemented regime classifier in `clawbench/dynamics.py`: + +```text +trapped if H_tools < 0.5 or (error_rate > 0.6 and std(drift) < 0.05) +convergent if std(drift_last_quartile) < 0.1 and mean(step_last_quartile) < 0.15 and error_rate < 0.2 +diffusive if H_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8 +chaotic if H_tools > 2.0 and var(step[1:]) > 0.02 +limit_cycle if max 
autocorr(centered step[1:], lags 2..5) > 0.3 +unknown otherwise, or <3 assistant turns +``` + +The task-level `C(q)` uses a normalized bag-of-words response vector built from the full assistant trajectory text plus tool-call names and compacted inputs, not just the last assistant turn. From the v4-19 sweep data: - **Gemini 3.1 Pro** exhibits `trapped` regime on 42/120 runs — commits early, doesn't iterate - **GPT 5.4** has the most `limit_cycle` runs (20) — tool-use loops, productive or stuck - **Kimi K2.5** dies at median turn 3 (worst survival); **GPT 5.4** survives to turn 8 at 60% rate (best) -All scripts under `scripts/` — pure numpy + scipy, no torch / sentence-transformers required, runs on any archive dir. +All scripts under `scripts/` run on cached per-run JSONs with plain numpy-based tooling; no torch or sentence-transformers required. ### 4. We ablate configurations, not just models @@ -264,9 +300,12 @@ The `1/y_i^2` term means the worst score dominates. A configuration scoring 0.85 Flat-mean compresses frontier model gaps. An alternative that weights tasks by their signal density: ``` -weight(task) = max(0, SNR(task)) × |C(q)(task)| # unbounded -weight_winsorized(task) = min(weight(task), p95) # prevent single-task dominance -score(model) = Σ weight × mean_run_score / Σ weight +w_q = max(0, SNR(q)) × |C(q)| +w_q^wins = min(w_q, p95({w_q})) + +flat_score(model) = mean_q mean_run_score(model, q) over covered tasks +weighted_score(model) = Σ_q w_q mean_run_score(model, q) / Σ_q w_q +winsorized_score(model) = Σ_q w_q^wins mean_run_score(model, q) / Σ_q w_q^wins ``` Under SNR × |C(q)| winsorized on the same 1,080-run archive, **Opus 4.7 ranks #1** (instead of Opus 4.6 under flat mean) and **GPT 5.4 drops from #3 to #7** — its task-specific cliffs (0.16 on `t3-feature-export`) fall on the highest-signal tasks. This exposes what the flat mean averages away. 
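The three scores above can be sketched in a few lines of numpy. This is a minimal illustration that assumes per-task arrays for one model are already in hand; the function name `snr_weighted_scores` and its arguments are hypothetical and are not taken from `scripts/snr_weighted_ranking.py`.

```python
import numpy as np


def snr_weighted_scores(mean_scores, snr, constraint_index, pct=95.0):
    """Return flat, weighted, and winsorized scores for one model.

    Hypothetical helper: all three inputs are per-task sequences of
    equal length, aligned by task.
    """
    scores = np.asarray(mean_scores, dtype=float)
    # w_q = max(0, SNR(q)) * |C(q)|
    w = np.maximum(0.0, np.asarray(snr, dtype=float)) * np.abs(
        np.asarray(constraint_index, dtype=float)
    )
    # w_q^wins = min(w_q, p95({w_q})) caps any single dominant task
    w_wins = np.minimum(w, np.percentile(w, pct))

    def weighted_mean(weights):
        total = weights.sum()
        # With no positive weight the weighted score is undefined
        return float(np.dot(weights, scores) / total) if total > 0 else float("nan")

    return {
        "flat": float(scores.mean()),
        "weighted": weighted_mean(w),
        "winsorized": weighted_mean(w_wins),
    }
```

With toy inputs (not archive data) such as `snr_weighted_scores([1.0, 0.5], [-1.0, 2.0], [1.0, 1.0])`, the first task has negative SNR and so receives zero weight: the flat score stays at 0.75 while both weighted variants fall to 0.5.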
@@ -349,27 +388,48 @@ clawbench run \ -o results/opus46_core_v1.json ``` -### Analyze an archive with the diagnostic suite +### Analyze a real archive ```bash -# 1. Aggregate coverage + fair-comparison audit +# Fair-comparison audit python3 scripts/audit_runs.py - -# 2. Rejudge any judge-infrastructure failures via direct Anthropic API -python3 scripts/rejudge_all.py \ - --drift-dir data/drift_2026-04-19-full \ - --archive-dir data/run_cache_archive/v2026-4-19-full - -# 3. Generate the fair comparison report python3 scripts/generate_fair_report.py --tag v2026-4-19-full -# 4. Dynamical-systems diagnostics (C(q), regimes, survival, SNR-weighted) -.venv/bin/python3 scripts/compute_constraint_index.py -.venv/bin/python3 scripts/classify_regimes.py -.venv/bin/python3 scripts/variance_decomp.py -.venv/bin/python3 scripts/survival_analysis.py -.venv/bin/python3 scripts/snr_weighted_ranking.py -.venv/bin/python3 scripts/generate_dynamical_report.py +# Posterior dynamics + ranking from cached per-run JSONs +python3 scripts/run_posterior_dynamics_pipeline.py \ + --archive-dir .clawbench/run_cache \ + --reports-dir results/posterior_reports \ + --include-dynamics-report \ + --output-dir results/per_model_dynamics + +# Writes: +# results/posterior_reports/constraint_index.json +# results/posterior_reports/regimes.json +# results/posterior_reports/variance_decomposition.json +# results/posterior_reports/survival_analysis.json +# results/posterior_reports/snr_weighted_ranking.json +# results/posterior_reports/EVAL_REPORT_DYNAMICAL.md +# results/per_model_dynamics//dynamics.json +# results/per_model_dynamics//*.png +``` + +If you only want one model's offline dynamics bundle: + +```bash +clawbench dynamics-report \ + --archive-dir .clawbench/run_cache \ + --model ollama/gpt-oss:20b \ + --output-dir results/gptoss_dynamics + +# Quick CI path: skip plot rendering +clawbench dynamics-report \ + --archive-dir .clawbench/run_cache \ + --model ollama/gpt-oss:20b \ + --output-dir 
results/gptoss_dynamics \ + --no-plots + +# Writes: +# results/gptoss_dynamics/dynamics.json ``` ### Running locally with small models (Ollama) @@ -379,7 +439,24 @@ A single consumer GPU running an open-weight model is enough to develop plugin p ```bash ollama pull gpt-oss:20b export OPENCLAW_GATEWAY_TOKEN= -clawbench run --model ollama/gpt-oss:20b --task t1-fs-quick-note --runs 1 +export CLAWBENCH_RUN_CACHE_DIR=$PWD/.clawbench/run_cache + +# Real benchmark run + immediate per-run dynamics bundle +clawbench run \ + --model ollama/gpt-oss:20b \ + --task t1-fs-quick-note \ + --runs 1 \ + --dynamics \ + -o results/ollama_smoke.json + +# Optional second local model +ollama pull qwen3.5:27b + +# Offline posterior analysis reads CLAWBENCH_RUN_CACHE_DIR +python3 scripts/run_posterior_dynamics_pipeline.py \ + --archive-dir .clawbench/run_cache \ + --reports-dir results/posterior_reports + clawbench diagnose profiles/local_ollama_gpt_oss.yaml ``` @@ -415,6 +492,9 @@ clawbench/ │ ├── profile.py # v0.5 plugin fingerprinting │ ├── diagnostic.py # Configuration Diagnostic report │ ├── factor_analysis.py # fANOVA factor importance +│ ├── dynamics.py # Trajectory metrics + sensitivity analysis +│ ├── dynamics_archive.py # Cached-run loading + offline report assembly +│ ├── dynamics_plots.py # Offline dynamics visualizations │ └── cli.py # CLI entry points │ ├── tasks-public/ # Core v1 PUBLIC release (19 tasks) @@ -431,6 +511,7 @@ clawbench/ │ ├── audit_per_run.py # Per-run cross-model audit │ ├── rejudge_all.py # Direct-API rejudge for broken gateway judges │ ├── generate_fair_report.py # Fair N-model comparison report +│ ├── run_posterior_dynamics_pipeline.py # One-shot posterior analysis driver │ ├── compute_constraint_index.py # C(q) per task │ ├── classify_regimes.py # Per-run dynamical regime classifier │ ├── variance_decomp.py # Seed-noise vs capability-signal decomposition @@ -439,7 +520,7 @@ clawbench/ │ └── generate_dynamical_report.py # Combined dynamical-systems report 
│ ├── profiles/ # v0.5 plugin profile YAMLs -├── tests/ # 107 tests +├── tests/ # Test suite ├── Dockerfile # Layered on ghcr.io/openclaw/openclaw:latest ├── CLAWBENCH_V0_4_SPEC.md # Full specification └── PARTNER_TRACE_SPEC.md # Trace interchange format @@ -469,7 +550,7 @@ clawbench/ ## Testing ```bash -python -m pytest -q # 107 tests +python -m pytest -q ``` Key test invariants: diff --git a/SPACE_README.md b/SPACE_README.md index 270a4a3..1bde868 100644 --- a/SPACE_README.md +++ b/SPACE_README.md @@ -136,6 +136,15 @@ submission Important rule: browser tasks stay serialized on one dedicated lane to avoid Chromium and port-range collisions. +## Submission presets + +The Submit tab now exposes two preset audiences so the Space can serve both general Claw users and lower-budget exploratory runs: + +- `Claw Users` keeps the full preset catalog, including provider-backed frontier models. +- `Budget Researchers` narrows the list to local or lower-cost presets such as `ollama/gpt-oss:20b`, `ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and `huggingface/google/gemma-4-26B-A4B-it`. + +You can still enter any custom model ID directly; the preset audience only filters the shortcut catalog and the bulk-submit action. 
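As a rough sketch of how audience filtering could work, assuming each preset carries a set of audience tags: the `Preset` dataclass, the `CATALOG` tuple, the provider strings, and `presets_for` below are all hypothetical and do not reproduce `clawbench.submission_models`, whose implementation is not shown in this diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Hypothetical preset record, for illustration only."""

    label: str
    model_id: str
    provider: str
    audiences: frozenset


# Two example entries; the real catalog is larger.
CATALOG = (
    Preset(
        "GPT-OSS 20B (Ollama)",
        "ollama/gpt-oss:20b",
        "ollama",  # assumed provider string
        frozenset({"Claw Users", "Budget Researchers"}),
    ),
    Preset(
        "Claude Opus 4.6",
        "anthropic/claude-opus-4-6",
        "anthropic",  # assumed provider string
        frozenset({"Claw Users"}),
    ),
)


def presets_for(audience: str) -> list:
    """Return the presets tagged with the given audience, in catalog order."""
    return [p for p in CATALOG if audience in p.audiences]
```

Under this layout a custom model ID bypasses the catalog entirely, which matches the rule that the audience only narrows the shortcut list and the bulk-submit action.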
+ ## Task inventory | Task | Tier | Family | Main verification | diff --git a/app.py b/app.py index 429015d..c07e6f3 100644 --- a/app.py +++ b/app.py @@ -26,6 +26,15 @@ from clawbench.hub import ( load_submission_rows_from_parquet, resolve_dataset_repo, ) +from clawbench.submission_models import ( + CUSTOM_PRESET_LABEL, + PRESET_AUDIENCE_ALL, + PRESET_AUDIENCE_CHOICES, + PRESET_MODEL_MAP, + preset_labels_for_audience, + preset_models_for_audience, + resolve_model_selection, +) logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s") logger = logging.getLogger("clawbench.app") @@ -51,31 +60,6 @@ def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int: DEFAULT_RUNS_PER_TASK = _env_int("CLAWBENCH_DEFAULT_RUNS_PER_TASK", 3, minimum=1, maximum=10) DEFAULT_PARALLEL_LANES = _env_int("CLAWBENCH_DEFAULT_PARALLEL_LANES", 1, minimum=1, maximum=4) -# --------------------------------------------------------------------------- -# Preset models for quick submission -# --------------------------------------------------------------------------- - -PRESET_MODELS = { - # All models verified working on HF Inference API (free with HF_TOKEN) - # Tested 2026-04-07 via router.huggingface.co/v1/chat/completions - # - # --- Chinese open-source --- - "GLM 5.1 (754B MoE)": "huggingface/zai-org/GLM-5.1", - "GLM 5 (400B MoE)": "huggingface/zai-org/GLM-5", - "Qwen3 32B": "huggingface/Qwen/Qwen3-32B", - "DeepSeek R1": "huggingface/deepseek-ai/DeepSeek-R1", - "Kimi K2 Instruct": "huggingface/moonshotai/Kimi-K2-Instruct", - "MiniMax M2.5": "huggingface/MiniMaxAI/MiniMax-M2.5", - # --- Google open-source --- - "Gemma 4 26B MoE": "huggingface/google/gemma-4-26B-A4B-it", - # --- Meta open-source --- - "Llama 3.3 70B": "huggingface/meta-llama/Llama-3.3-70B-Instruct", - "Llama 3.1 70B": "huggingface/meta-llama/Llama-3.1-70B-Instruct", - # --- Proprietary models (require runtime auth configured for the model provider) --- - "Claude Sonnet 4.6": 
"anthropic/claude-sonnet-4-6", - "Claude Opus 4.6": "anthropic/claude-opus-4-6", -} - # --------------------------------------------------------------------------- # Background worker (starts in a thread) # --------------------------------------------------------------------------- @@ -271,15 +255,14 @@ def submit_model( prompt_variant: str, submitter: str, ) -> str: - # Use preset if selected, otherwise use custom model ID - model_id = PRESET_MODELS.get(preset, "") or model.strip() + model_id, provider_id = resolve_model_selection(model, preset, provider) if not model_id: return "Please enter a model ID or select a preset." selected_tier = tier if tier != "all" else None request = SubmissionRequest( model=model_id, - provider=provider.strip(), + provider=provider_id, judge_model=judge_model.strip(), runs_per_task=int(runs), max_parallel_lanes=int(max_parallel_lanes), @@ -292,20 +275,38 @@ def submit_model( return f"Submitted [{model_id}]! Job ID: {job.job_id}. Check the Queue tab." -def submit_all_presets(runs: int, max_parallel_lanes: int, submitter: str) -> str: - """Submit all preset models at once.""" +def submit_all_presets( + preset_audience: str, + runs: int, + max_parallel_lanes: int, + submitter: str, +) -> str: + """Submit all preset models from the selected audience track.""" + presets = preset_models_for_audience(preset_audience) + if not presets: + return f"No presets configured for {preset_audience}." 
+ submitted = [] - for name, model_id in PRESET_MODELS.items(): + for preset in presets: request = SubmissionRequest( - model=model_id, - provider="", + model=preset.model_id, + provider=preset.provider, runs_per_task=int(runs), max_parallel_lanes=int(max_parallel_lanes), submitter=submitter.strip(), ) job = asyncio.run(queue.submit(request)) - submitted.append(f"{name} ({job.job_id})") - return f"Submitted {len(submitted)} models:\n" + "\n".join(f" - {s}" for s in submitted) + submitted.append(f"{preset.label} ({job.job_id})") + return f"Submitted {len(submitted)} models from {preset_audience}:\n" + "\n".join( + f" - {item}" for item in submitted + ) + + +def update_preset_choices(preset_audience: str): + return gr.update( + choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(preset_audience), + value=CUSTOM_PRESET_LABEL, + ) # --------------------------------------------------------------------------- @@ -952,7 +953,7 @@ STAT_JUDGE = ( ) STAT_PRESETS = ( '
Presets
' - + str(len(PRESET_MODELS)) + + str(len(PRESET_MODEL_MAP)) + "
" ) @@ -986,12 +987,28 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo "run via HuggingFace Inference API. You can also use locally hosted models " "(for example Ollama) when your OpenClaw runtime has them configured." ) + gr.Markdown( + "Use `Preset Audience` to switch between the full Claw catalog and a smaller budget track. " + "The budget track keeps local and lower-cost options upfront, including `ollama/gpt-oss:20b`, " + "`ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and " + "`huggingface/google/gemma-4-26B-A4B-it`." + ) + preset_audience_input = gr.Dropdown( + choices=list(PRESET_AUDIENCE_CHOICES), + value=PRESET_AUDIENCE_ALL, + label="Preset Audience", + ) preset_input = gr.Dropdown( - choices=["(custom)"] + list(PRESET_MODELS.keys()), - value="(custom)", + choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(PRESET_AUDIENCE_ALL), + value=CUSTOM_PRESET_LABEL, label="Preset models", ) + preset_audience_input.change( + fn=update_preset_choices, + inputs=preset_audience_input, + outputs=preset_input, + ) with gr.Row(): model_input = gr.Textbox( label="Custom Model ID (if not using preset)", @@ -1074,26 +1091,35 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo ) submit_all_btn.click( fn=submit_all_presets, - inputs=[runs_input, max_parallel_lanes_input, submitter_input], + inputs=[preset_audience_input, runs_input, max_parallel_lanes_input, submitter_input], outputs=submit_output, ) gr.Markdown(""" -**All presets verified working on HF Inference API (free):** +**Preset audiences:** -| Model | Provider | Size | Runtime | -|-------|----------|------|---------| -| GLM 5.1 | Z.ai | 754B MoE | HF free | -| GLM 5 | Z.ai | 400B MoE | HF free | -| Qwen3 32B | Alibaba | 32B | HF free | -| DeepSeek R1 | DeepSeek | 671B MoE | HF free | -| Kimi K2 Instruct | Moonshot AI | MoE | HF free | -| MiniMax M2.5 | MiniMax | MoE | HF free | -| Gemma 4 26B MoE | Google | 26B MoE | HF free | -| Llama 
3.3 70B | Meta | 70B | HF free | -| Llama 3.1 70B | Meta | 70B | HF free | -| Claude Sonnet 4.6 | Anthropic | - | configured auth | -| Claude Opus 4.6 | Anthropic | - | configured auth | +| Audience | What it optimizes for | Presets | +|---|---|---| +| Claw Users | Full preset catalog, including provider-backed frontier options | Anthropic, HF open-weight, and Ollama presets | +| Budget Researchers | Smaller local/free-friendly track | GPT-OSS 20B, Qwen 3.5 27B, Qwen3 32B, Gemma 4 26B | + +**Current preset catalog:** + +| Model | Provider | Audience | +|---|---|---| +| GPT-OSS 20B (Ollama) | Ollama | Claw Users, Budget Researchers | +| Qwen 3.5 27B (Ollama) | Ollama | Claw Users, Budget Researchers | +| Qwen3 32B | HuggingFace | Claw Users, Budget Researchers | +| Gemma 4 26B MoE | HuggingFace | Claw Users, Budget Researchers | +| GLM 5.1 | HuggingFace | Claw Users | +| GLM 5 | HuggingFace | Claw Users | +| DeepSeek R1 | HuggingFace | Claw Users | +| Kimi K2 Instruct | HuggingFace | Claw Users | +| MiniMax M2.5 | HuggingFace | Claw Users | +| Llama 3.3 70B | HuggingFace | Claw Users | +| Llama 3.1 70B | HuggingFace | Claw Users | +| Claude Sonnet 4.6 | Anthropic | Claw Users | +| Claude Opus 4.6 | Anthropic | Claw Users | """) with gr.Tab("Queue"): diff --git a/clawbench/cli.py b/clawbench/cli.py index a009177..523e42d 100644 --- a/clawbench/cli.py +++ b/clawbench/cli.py @@ -116,6 +116,11 @@ def cli(verbose: bool) -> None: show_default=True, help="Where to write ecosystem insight files after a --profile run.", ) +@click.option( + "--dynamics", + is_flag=True, + help="Run quick post-benchmark dynamics analysis. 
Prefer dynamics-report for offline cache/archive analysis.", +) def run( model: str, gateway_token: str, @@ -137,6 +142,7 @@ def run( browser_concurrency: int, profile: Path | None, insights_dir: Path, + dynamics: bool, ) -> None: gateway_config = GatewayConfig(token=gateway_token) harness = BenchmarkHarness( @@ -165,6 +171,9 @@ def run( json.dump(result.model_dump(), handle, indent=2) click.echo(f"\nResults saved to {out_path}") + if dynamics: + _run_dynamics_analysis(harness.last_task_runs, out_path) + if profile is not None: _run_v05_diagnostic( profile_path=profile, @@ -179,6 +188,83 @@ def run( asyncio.run(upload_result(result)) +@cli.command("dynamics-report") +@click.option( + "--archive-dir", + type=click.Path(exists=True, file_okay=False, path_type=Path), + required=True, + help="Path to a run cache/archive root or a single model cache directory.", +) +@click.option( + "--model", + default=None, + help="Model id to select when the archive root contains multiple model directories.", +) +@click.option("--tier", type=click.Choice(["tier1", "tier2", "tier3", "tier4", "tier5"])) +@click.option("--task", "task_ids", multiple=True, help="Specific task IDs to include from the archive.") +@click.option( + "--output-dir", + type=click.Path(path_type=Path), + default=Path("results/offline_dynamics"), + show_default=True, + help="Directory where dynamics.json and plots will be written.", +) +@click.option( + "--no-plots", + is_flag=True, + help="Write only dynamics.json and skip plot rendering.", +) +def dynamics_report( + archive_dir: Path, + model: str | None, + tier: str | None, + task_ids: tuple[str, ...], + output_dir: Path, + no_plots: bool, +) -> None: + """Generate dynamics plots and a JSON report from cached TaskRunResult archives.""" + from clawbench.dynamics_archive import load_task_runs_archive + + try: + task_runs = load_task_runs_archive( + archive_dir=archive_dir, + model=model, + task_ids=task_ids, + tier=tier, + ) + except ValueError as exc: + raise 
click.ClickException(str(exc)) from exc + + if not task_runs: + raise click.ClickException(f"No cached runs found under {archive_dir}") + + report_path, plots, n_runs = _write_dynamics_report( + task_runs, + output_dir, + generate_plots=not no_plots, + ) + click.echo(f"Loaded {n_runs} cached runs across {len(task_runs)} tasks") + click.echo(f"Dynamics report saved to {report_path}") + click.echo(f"Saved {len(plots)} plots to {output_dir}/") + + +def _write_dynamics_report( + task_runs: dict[str, list], + output_dir: Path, + *, + generate_plots: bool = True, +) -> tuple[Path, list[Path], int]: + from clawbench.dynamics_archive import write_dynamics_report + + report_path, plots = write_dynamics_report( + task_runs, + output_dir, + generate_plots=generate_plots, + ) + n_runs = sum(len(runs) for runs in task_runs.values()) + return report_path, plots, n_runs + + def _run_v05_diagnostic( *, profile_path: Path, @@ -693,5 +779,23 @@ def show(result_file: str) -> None: ) +def _run_dynamics_analysis( + task_runs: dict[str, list], + result_path: str, +) -> None: + """Compute stratified dynamics from raw TaskRunResult objects.""" + run_stem = Path(result_path).stem + dyn_dir = Path(result_path).parent / f"{run_stem}_dynamics" + try: + dyn_path, plots, n_runs = _write_dynamics_report(task_runs, dyn_dir) + except ValueError as exc: + click.echo(str(exc)) + return + + click.echo(f"\n[dynamics] Analysed {n_runs} cached runs") + click.echo(f" Dynamics report saved to {dyn_path}") + click.echo(f" Saved {len(plots)} plots to {dyn_dir}/") + + def main() -> None: cli() diff --git a/clawbench/client.py b/clawbench/client.py index 1e68e3c..ead55bc 100644 --- a/clawbench/client.py +++ b/clawbench/client.py @@ -8,7 +8,9 @@ import logging import math import os import re +import shutil import subprocess +import sys import uuid from dataclasses import dataclass, field from typing import Any @@ -24,10 +26,10 @@ logger = logging.getLogger(__name__) PROTOCOL_VERSION = 3 
DEVICE_IDENTITY_HELPER_JS = r""" -const crypto = require("node:crypto"); -const fs = require("node:fs"); -const os = require("node:os"); -const path = require("node:path"); +const crypto = require("crypto"); +const fs = require("fs"); +const os = require("os"); +const path = require("path"); const ED25519_SPKI_PREFIX = Buffer.from("302a300506032b6570032100", "hex"); @@ -52,7 +54,7 @@ function fingerprintPublicKey(publicKeyPem) { } function generateIdentity() { - const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519"); + const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519", {}); const publicKeyPem = publicKey.export({ type: "spki", format: "pem" }).toString(); const privateKeyPem = privateKey.export({ type: "pkcs8", format: "pem" }).toString(); return { @@ -445,12 +447,48 @@ class GatewayClient: max_wait_seconds=2.0, ) ) + + # Some gateway/provider paths persist assistant messages in session + # history without emitting complete streaming events. Backfill from + # sessions.get if stream capture appears incomplete. 
+ history_messages = await self.get_session_messages(session_key) + collected_assistant = sum( + 1 for msg in collected_messages if msg.role == "assistant" + ) + history_assistant = sum( + 1 for msg in history_messages if msg.role == "assistant" + ) + if history_messages and ( + len(history_messages) > len(collected_messages) + or history_assistant > collected_assistant + ): + collected_messages = history_messages finally: self._event_queues.pop(chat_queue_key, None) self._event_queues.pop(msg_queue_key, None) return _correlate_transcript(Transcript(messages=collected_messages)) + async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]: + try: + response = await self._rpc("sessions.get", {"key": session_key}) + except Exception: + return [] + + payload = response.get("payload", {}) + raw_messages = payload.get("messages", []) + if not isinstance(raw_messages, list): + return [] + + parsed: list[TranscriptMessage] = [] + for raw in raw_messages: + if not isinstance(raw, dict): + continue + msg = _parse_single_message(raw) + if msg is not None: + parsed.append(msg) + return parsed + async def _rpc( self, method: str, @@ -551,9 +589,17 @@ def _build_connect_device( "deviceFamily": device_family or "", } ) + + node_executable = _resolve_node_executable() + if not node_executable: + logger.warning( + "Failed to build device identity payload: no Node executable found" + ) + return None + try: completed = subprocess.run( - ["node", "-e", DEVICE_IDENTITY_HELPER_JS], + [node_executable, "-e", DEVICE_IDENTITY_HELPER_JS], input=helper_input, capture_output=True, text=True, @@ -577,6 +623,25 @@ def _build_connect_device( return payload +def _resolve_node_executable() -> str | None: + """Resolve Node binary, preferring the active Python/conda environment.""" + candidates: list[str] = [] + + # First try the same environment as the active Python interpreter. 
+ candidates.append(os.path.join(os.path.dirname(sys.executable), "node")) + + # Then try CONDA_PREFIX when available. + conda_prefix = os.environ.get("CONDA_PREFIX") + if conda_prefix: + candidates.append(os.path.join(conda_prefix, "bin", "node")) + + for candidate in candidates: + if os.path.isfile(candidate) and os.access(candidate, os.X_OK): + return candidate + + return shutil.which("node") + + def _is_transient_gateway_connect_error(exc: Exception) -> bool: if isinstance(exc, InvalidStatus): return exc.response.status_code in {502, 503, 504} @@ -615,6 +680,9 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N if block_type == "text": text_parts.append(block.get("text", "")) continue + if block_type == "output_text": + text_parts.append(block.get("text", "")) + continue if block_type in {"tool_use", "toolCall"}: arguments = block.get("input", block.get("arguments", {})) if isinstance(arguments, str): @@ -641,6 +709,16 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N if tool_result_content: text_parts.append(tool_result_content) + # Some providers surface assistant failures in a dedicated error field + # with empty content blocks. Preserve that signal in transcript text. + error_message = message_data.get("errorMessage", "") + if isinstance(error_message, str) and error_message.strip(): + text_parts.append(error_message.strip()) + + direct_text = message_data.get("text", "") + if isinstance(direct_text, str) and direct_text.strip(): + text_parts.append(direct_text.strip()) + if not text_parts and not tool_calls and not tool_result_for: return None diff --git a/clawbench/dynamics.py b/clawbench/dynamics.py new file mode 100644 index 0000000..7086c1c --- /dev/null +++ b/clawbench/dynamics.py @@ -0,0 +1,695 @@ +"""Dynamics analysis for ClawBench agent trajectories. 
+ +Treats each agent run as a discrete dynamical system and computes step +embeddings, trajectory metrics, sensitivity analysis, regime classification, +Kaplan-Meier survival, non-Markov memory, and stratified assessment with +Bayesian importance-weight correction for distribution shift. +""" + +from __future__ import annotations + +import math +from collections import Counter +from dataclasses import dataclass, field +from enum import Enum +from typing import TYPE_CHECKING, Callable + +import numpy as np + +if TYPE_CHECKING: + from clawbench.schemas import TaskRunResult, Transcript + +# ── Constants ────────────────────────────────────────────────────────── + +TOOL_FAMILIES = ("browser", "edit", "execute", "memory", "read", "search") +_N_FAM = len(TOOL_FAMILIES) + +# ── Types ────────────────────────────────────────────────────────────── + + +class Regime(str, Enum): + convergent = "convergent" + chaotic = "chaotic" + trapped = "trapped" + diffusive = "diffusive" + limit_cycle = "limit_cycle" + unknown = "unknown" + + +@dataclass +class Dynamics: + """Computed dynamics for a single trajectory.""" + + n_steps: int + embeddings: np.ndarray # (n_steps, 10) + drift: np.ndarray # cosine distance from step 0 + step_size: np.ndarray # cosine distance from step t-1 + entropy_series: list[float] # running tool-family entropy + error_rate_series: list[float] # running error fraction + tokens_series: list[int] + latency_series: list[float] + tool_sequence: list[str] # primary family per step + markov: dict[str, dict[str, float]] + family_dist: dict[str, float] + regime: Regime + mean_drift: float + mean_step_size: float + tool_entropy: float + error_rate: float + constraint_index: float + pca_trajectory: np.ndarray | None = None # (n_steps, 2) + bigram_transitions: dict[str, dict[str, float]] = field(default_factory=dict) + memory_depth: float = 0.0 # I(X_t; X_{t-2} | X_{t-1}) + + +@dataclass +class Sensitivity: + """Pairwise comparison between two runs of the same task.""" 
+ + task_id: str + score_delta: float + tool_edit_distance: int + family_js_divergence: float + embedding_divergence: np.ndarray # (min_steps,) + lyapunov_proxy: float + + +@dataclass +class SurvivalPoint: + time: float + survival: float + + +# ── Helpers ──────────────────────────────────────────────────────────── + + +def _cosine_dist(a: np.ndarray, b: np.ndarray) -> float: + na, nb = np.linalg.norm(a), np.linalg.norm(b) + if na < 1e-12 or nb < 1e-12: + return 1.0 + return float(1.0 - np.dot(a, b) / (na * nb)) + + +def _entropy(counts: dict[str, int]) -> float: + total = sum(counts.values()) + if total == 0: + return 0.0 + return -sum( + (c / total) * math.log2(c / total) for c in counts.values() if c > 0 + ) + + +def _js_divergence(p: dict[str, int], q: dict[str, int]) -> float: + keys = set(p) | set(q) + if not keys: + return 0.0 + tp, tq = sum(p.values()) or 1, sum(q.values()) or 1 + jsd = 0.0 + for k in keys: + pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq + mk = (pk + qk) / 2 + if pk > 0 and mk > 0: + jsd += 0.5 * pk * math.log2(pk / mk) + if qk > 0 and mk > 0: + jsd += 0.5 * qk * math.log2(qk / mk) + return jsd + + +def _levenshtein(a: list, b: list) -> int: + if not a: + return len(b) + if not b: + return len(a) + prev = list(range(len(b) + 1)) + for ca in a: + curr = [prev[0] + 1] + [0] * len(b) + for j, cb in enumerate(b): + curr[j + 1] = min( + prev[j] + (0 if ca == cb else 1), + prev[j + 1] + 1, + curr[j] + 1, + ) + prev = curr + return prev[-1] + + +def _classify_tool(name: str) -> str: + lo = name.lower() + for fam in TOOL_FAMILIES: + if fam in lo: + return fam + _ALIASES = { + "edit": ("write_file", "create_file", "str_replace", "patch"), + "execute": ("bash", "terminal", "shell", "run", "exec"), + "browser": ("browse", "click", "navigate", "screenshot"), + "search": ("grep", "find", "glob", "semantic"), + "read": ("cat", "head", "tail", "view", "list_dir"), + } + for fam, keywords in _ALIASES.items(): + if any(k in lo for k in keywords): + return fam 
+ return "execute" + + +def _normalize_tool_family(name: str, family: str | None) -> str: + if family in TOOL_FAMILIES: + return family + return _classify_tool(name) + + +# ── Feature embedding ────────────────────────────────────────────────── + + +def _embed_transcript( + transcript: Transcript, +) -> tuple[np.ndarray, list[str], list[int], list[float], list[bool]]: + """Build (n_steps, 10) feature matrix from assistant turns. + + Features: [0:6] tool-family proportions, [6] error flag, + [7] normalised tokens, [8] normalised text length, [9] progress. + """ + msgs = transcript.assistant_messages + n = len(msgs) + if n == 0: + return np.empty((0, _N_FAM + 4)), [], [], [], [] + + X = np.zeros((n, _N_FAM + 4)) + families: list[str] = [] + tokens: list[int] = [] + latencies: list[float] = [] + errors: list[bool] = [] + raw_tokens = np.zeros(n) + raw_text = np.zeros(n) + + for i, msg in enumerate(msgs): + fam_counts: Counter = Counter() + has_err = False + for tc in msg.tool_calls: + fam = _normalize_tool_family(tc.name, tc.family) + fam_counts[fam] += 1 + if tc.success is False or tc.error: + has_err = True + n_tc = sum(fam_counts.values()) or 1 + for j, fam in enumerate(TOOL_FAMILIES): + X[i, j] = fam_counts.get(fam, 0) / n_tc + X[i, _N_FAM] = 1.0 if has_err else 0.0 + X[i, _N_FAM + 3] = i / max(n - 1, 1) + + families.append( + max(fam_counts, key=fam_counts.get) if fam_counts else "execute" + ) + errors.append(has_err) + tokens.append(msg.usage.total_tokens) + raw_tokens[i] = float(msg.usage.total_tokens) + raw_text[i] = float(len(msg.text)) + dt = msg.timestamp_ms - msgs[i - 1].timestamp_ms if i > 0 else 0 + latencies.append(max(float(dt), 0.0)) + + mx_tok = raw_tokens.max() or 1 + mx_txt = raw_text.max() or 1 + X[:, _N_FAM + 1] = raw_tokens / mx_tok + X[:, _N_FAM + 2] = raw_text / mx_txt + + return X, families, tokens, latencies, errors + + +# ── Non-Markov memory ──────────────────────────────────────────────── + + +def _compute_bigram_transitions(seq: 
list[str]) -> dict[str, dict[str, float]]: + """P(family_t | family_{t-1}, family_{t-2}) grouped by bigram context.""" + if len(seq) < 3: + return {} + bigrams: dict[str, Counter] = {} + for a, b, c in zip(seq[:-2], seq[1:-1], seq[2:]): + ctx = f"{a}->{b}" + bigrams.setdefault(ctx, Counter())[c] += 1 + return { + ctx: {k: v / sum(cnts.values()) for k, v in cnts.items()} + for ctx, cnts in bigrams.items() + } + + +def _conditional_mi(seq: list[str]) -> float: + """I(X_t ; X_{t-2} | X_{t-1}): non-Markov memory indicator.""" + if len(seq) < 3: + return 0.0 + n = len(seq) - 2 + triple = Counter(zip(seq[:-2], seq[1:-1], seq[2:])) + pair_01 = Counter(zip(seq[:-2], seq[1:-1])) + pair_12 = Counter(zip(seq[1:-1], seq[2:])) + single = Counter(seq[1:-1]) + + mi = 0.0 + for (a, b, c), count in triple.items(): + p_abc = count / n + p_ab, p_bc, p_b = pair_01[(a, b)] / n, pair_12[(b, c)] / n, single[b] / n + if p_ab > 0 and p_bc > 0 and p_b > 0: + mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc)) + return max(mi, 0.0) + + +# ── Core analysis ────────────────────────────────────────────────────── + + +def compute_dynamics(transcript: Transcript) -> Dynamics: + """Compute trajectory dynamics from a single run transcript.""" + X, families, tokens, latencies, errors = _embed_transcript(transcript) + n = len(families) + + drift = ( + np.array([_cosine_dist(X[0], X[i]) for i in range(n)]) + if n else np.array([]) + ) + step_sz = np.zeros(n) + for i in range(1, n): + step_sz[i] = _cosine_dist(X[i - 1], X[i]) + + fam_acc: Counter = Counter() + err_count = 0 + entropy_s: list[float] = [] + error_s: list[float] = [] + for i, (fam, err) in enumerate(zip(families, errors)): + fam_acc[fam] += 1 + err_count += int(err) + entropy_s.append(_entropy(dict(fam_acc))) + error_s.append(err_count / (i + 1)) + + total = sum(fam_acc.values()) or 1 + fam_dist = {k: v / total for k, v in fam_acc.items()} + + mc: dict[str, Counter] = {f: Counter() for f in TOOL_FAMILIES} + for a, b in 
zip(families[:-1], families[1:]): + mc[a][b] += 1 + markov = { + src: ({dst: c / t for dst, c in cnts.items()} if (t := sum(cnts.values())) else {}) + for src, cnts in mc.items() + } + + ci = 0.5 + if n > 2: + cov = np.cov(X.T) + eigvals = np.maximum(np.linalg.eigvalsh(cov), 0) + tv = eigvals.sum() + if tv > 1e-10: + p = eigvals / tv + pr = 1.0 / np.sum(p**2) + ci = 1.0 - (pr - 1) / (X.shape[1] - 1) + + h = _entropy(dict(fam_acc)) + er = err_count / n if n else 0 + regime = _classify_regime(drift, step_sz, h, er, ci, n) + + return Dynamics( + n_steps=n, + embeddings=X, + drift=drift, + step_size=step_sz, + entropy_series=entropy_s, + error_rate_series=error_s, + tokens_series=tokens, + latency_series=latencies, + tool_sequence=families, + markov=markov, + family_dist=fam_dist, + regime=regime, + mean_drift=float(np.mean(drift)) if n else 0, + mean_step_size=float(np.mean(step_sz)) if n else 0, + tool_entropy=h, + error_rate=er, + constraint_index=ci, + bigram_transitions=_compute_bigram_transitions(families), + memory_depth=_conditional_mi(families), + ) + + +def _classify_regime(drift, step_sz, entropy, error_rate, ci, n) -> Regime: + if n < 3: + return Regime.unknown + if entropy < 0.5 or (error_rate > 0.6 and float(np.std(drift)) < 0.05): + return Regime.trapped + q = max(1, n // 4) + late_drift_std = float(np.std(drift[-q:])) + late_step_mean = float(np.mean(step_sz[-q:])) + if late_drift_std < 0.1 and late_step_mean < 0.15 and error_rate < 0.2: + return Regime.convergent + if entropy > 1.5 and error_rate < 0.15 and ci < 0.8: + return Regime.diffusive + step_var = float(np.var(step_sz[1:])) if n > 1 else 0 + if entropy > 2.0 and step_var > 0.02: + return Regime.chaotic + if n > 6: + ss = step_sz[1:] + ss_c = ss - ss.mean() + norm = np.dot(ss_c, ss_c) + if norm > 1e-10: + ac = np.correlate(ss_c, ss_c, mode="full") + ac = ac[len(ac) // 2:] / norm + if len(ac) > 5 and max(ac[2:6]) > 0.3: + return Regime.limit_cycle + return Regime.unknown + + +# ── Sensitivity 
──────────────────────────────────────────────────────── + + +def compute_sensitivity( + run_a: TaskRunResult, + run_b: TaskRunResult, + task_id: str = "", +) -> Sensitivity: + """Compare two runs of the same task for prompt sensitivity.""" + Xa, fam_a, *_ = _embed_transcript(run_a.transcript) + Xb, fam_b, *_ = _embed_transcript(run_b.transcript) + + min_n = min(len(Xa), len(Xb)) + emb_div = ( + np.array([_cosine_dist(Xa[i], Xb[i]) for i in range(min_n)]) + if min_n else np.array([]) + ) + + lyap = 0.0 + if min_n > 1: + d0 = max(_cosine_dist(Xa[0], Xb[0]), 1e-6) + lyap = sum( + math.log(max(emb_div[t], 1e-6) / d0) / t for t in range(1, min_n) + ) / (min_n - 1) + + return Sensitivity( + task_id=task_id or run_a.task_id, + score_delta=abs(run_a.run_score - run_b.run_score), + tool_edit_distance=_levenshtein(fam_a, fam_b), + family_js_divergence=_js_divergence(dict(Counter(fam_a)), dict(Counter(fam_b))), + embedding_divergence=emb_div, + lyapunov_proxy=lyap, + ) + + +# ── Survival analysis ───────────────────────────────────────────────── + + +def kaplan_meier( + event_times: list[float], + censored: list[bool] | None = None, +) -> list[SurvivalPoint]: + """Kaplan-Meier survival estimator.""" + n = len(event_times) + if n == 0: + return [] + if censored is None: + censored = [False] * n + pairs = sorted(zip(event_times, censored)) + pts = [SurvivalPoint(0.0, 1.0)] + at_risk = n + surv = 1.0 + for t, cens in pairs: + if cens: + at_risk -= 1 + continue + if at_risk > 0: + surv *= (at_risk - 1) / at_risk + at_risk -= 1 + pts.append(SurvivalPoint(t, surv)) + return pts + + +def find_event_step(transcript: Transcript, event: str) -> float | None: + """Return step index of the first occurrence of *event*, or None.""" + msgs = transcript.assistant_messages + if event == "first_error_recovery": + in_err = False + for i, m in enumerate(msgs): + any_err = any(tc.success is False or tc.error for tc in m.tool_calls) + if any_err: + in_err = True + elif in_err: + return float(i) + 
elif event == "first_correct_write": + for i, m in enumerate(msgs): + for tc in m.tool_calls: + fam = tc.family or _classify_tool(tc.name) + if fam == "edit" and tc.success is not False and not tc.error: + return float(i) + elif event == "task_completion": + if msgs: + last = msgs[-1] + if not any(tc.success is False or tc.error for tc in last.tool_calls): + return float(len(msgs) - 1) + elif event == "failure_absorption": + err_seen = False + for i, m in enumerate(msgs): + any_err = any(tc.success is False or tc.error for tc in m.tool_calls) + if any_err: + err_seen = True + elif err_seen and m.tool_calls: + return float(i) + return None + + +# ── PCA trajectory bundles ───────────────────────────────────────────── + + +def compute_pca_bundle( + dynamics_list: list[Dynamics], +) -> tuple[np.ndarray, list[np.ndarray]]: + """Fit PCA on pooled embeddings, project each trajectory into PC1-PC2.""" + non_empty = [d.embeddings for d in dynamics_list if d.n_steps > 0] + if not non_empty: + for d in dynamics_list: + d.pca_trajectory = np.empty((0, 2)) + return np.zeros((2, _N_FAM + 4)), [] + all_emb = np.vstack(non_empty) + mean = all_emb.mean(axis=0) + centred = all_emb - mean + _, _, Vt = np.linalg.svd(centred, full_matrices=False) + components = Vt[:2] + + projections: list[np.ndarray] = [] + for d in dynamics_list: + proj = (d.embeddings - mean) @ components.T if d.n_steps else np.empty((0, 2)) + d.pca_trajectory = proj + projections.append(proj) + return components, projections + + +# ── Stratified assessment with Bayesian reweighting ─────────────────── + + +@dataclass +class StratumStats: + """Distributional statistics for one stratum of runs.""" + + name: str + n_runs: int + weight: float + + # Score distribution + scores: np.ndarray + score_mean: float + score_std: float + score_quantiles: dict[str, float] # q10, q25, q50, q75, q90 + + # Dynamics distributions + entropy_dist: np.ndarray + error_rate_dist: np.ndarray + constraint_dist: np.ndarray + 
memory_depth_dist: np.ndarray + mean_drift_dist: np.ndarray + mean_step_size_dist: np.ndarray + + # Time-series curves (aligned by step index) + drift_curve_mean: np.ndarray + drift_curve_std: np.ndarray + step_curve_mean: np.ndarray + step_curve_std: np.ndarray + + regime_counts: dict[str, int] + sensitivity_deltas: np.ndarray + + +# Scalar fields on StratumStats that reweight() aggregates. +_REWEIGHT_FIELDS = [ + ("entropy", "entropy_dist"), + ("error_rate", "error_rate_dist"), + ("constraint", "constraint_dist"), + ("memory_depth", "memory_depth_dist"), + ("mean_drift", "mean_drift_dist"), + ("mean_step_size", "mean_step_size_dist"), +] + + +@dataclass +class StratifiedAssessment: + """Full stratified assessment with Bayesian reweighting. + + Call ``reweight(target_weights)`` with a different task distribution + to obtain importance-weighted aggregate estimates. + """ + + strata: list[StratumStats] + stratifier_name: str + total_runs: int + observed_mean_score: float + observed_std_score: float + + def stratum_names(self) -> list[str]: + return [s.name for s in self.strata] + + def reweight(self, target_weights: dict[str, float]) -> dict[str, float]: + """Bayesian importance-weight correction. + + w_k = p_target(k) / p_observed(k), then normalised. 
+ """ + t_total = sum(target_weights.values()) or 1.0 + p_target = {k: v / t_total for k, v in target_weights.items()} + by_name = {s.name: s for s in self.strata} + + weights = { + name: pt / by_name[name].weight + for name, pt in p_target.items() + if name in by_name and by_name[name].weight > 1e-12 + } + if not weights: + return {"score_mean": self.observed_mean_score, + "score_std": self.observed_std_score} + + w_total = sum(weights.values()) + w = {k: v / w_total for k, v in weights.items()} + + # Reweight score (mean + law-of-total-variance) + score_mu = sum(w[k] * by_name[k].score_mean for k in w) + score_var = sum( + w[k] * (by_name[k].score_std ** 2 + (by_name[k].score_mean - score_mu) ** 2) + for k in w + ) + result = {"score_mean": score_mu, "score_std": math.sqrt(max(score_var, 0.0))} + + def _safe_mean(arr: np.ndarray) -> float: + return float(np.mean(arr)) if len(arr) > 0 else 0.0 + + for label, dist_attr in _REWEIGHT_FIELDS: + result[f"{label}_mean"] = sum( + w[k] * _safe_mean(getattr(by_name[k], dist_attr)) for k in w + ) + return result + + +def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]: + """Mean and std of variable-length arrays aligned at step 0.""" + if not arrays: + return np.array([]), np.array([]) + max_len = max(len(a) for a in arrays) + mat = np.full((len(arrays), max_len), np.nan) + for i, a in enumerate(arrays): + mat[i, :len(a)] = a + return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0) + + +def build_strata( + runs: list[TaskRunResult], + dynamics_list: list[Dynamics], + scores: list[float], + stratifier: Callable[[TaskRunResult, Dynamics], str], + stratifier_name: str = "custom", + sensitivities: list[Sensitivity] | None = None, +) -> StratifiedAssessment: + """Group runs into strata and compute per-stratum distributions.""" + assert len(runs) == len(dynamics_list) == len(scores) + + groups: dict[str, list[int]] = {} + for idx, (r, d) in enumerate(zip(runs, dynamics_list)): + 
groups.setdefault(stratifier(r, d), []).append(idx) + + total = len(runs) + all_scores = np.array(scores) + + sens_by_task: dict[str, list[Sensitivity]] = {} + if sensitivities: + for s in sensitivities: + sens_by_task.setdefault(s.task_id, []).append(s) + + strata: list[StratumStats] = [] + for name, idxs in sorted(groups.items()): + n = len(idxs) + sc = np.array([scores[i] for i in idxs]) + dyns = [dynamics_list[i] for i in idxs] + + qs = {f"q{q}": float(np.percentile(sc, q)) if n else 0.0 + for q in (10, 25, 50, 75, 90)} + + drift_m, drift_s = _aligned_mean_std([d.drift for d in dyns]) + step_m, step_s = _aligned_mean_std([d.step_size for d in dyns]) + + stratum_tasks = {runs[i].task_id for i in idxs} + sens_deltas = [ + s.score_delta + for tid in stratum_tasks + for s in sens_by_task.get(tid, []) + ] + + strata.append(StratumStats( + name=name, n_runs=n, weight=n / total if total else 0.0, + scores=sc, + score_mean=float(np.mean(sc)) if n else 0.0, + score_std=float(np.std(sc)) if n else 0.0, + score_quantiles=qs, + entropy_dist=np.array([d.tool_entropy for d in dyns]), + error_rate_dist=np.array([d.error_rate for d in dyns]), + constraint_dist=np.array([d.constraint_index for d in dyns]), + memory_depth_dist=np.array([d.memory_depth for d in dyns]), + mean_drift_dist=np.array([d.mean_drift for d in dyns]), + mean_step_size_dist=np.array([d.mean_step_size for d in dyns]), + drift_curve_mean=drift_m, drift_curve_std=drift_s, + step_curve_mean=step_m, step_curve_std=step_s, + regime_counts=dict(Counter(d.regime.value for d in dyns)), + sensitivity_deltas=np.array(sens_deltas) if sens_deltas else np.array([]), + )) + + return StratifiedAssessment( + strata=strata, + stratifier_name=stratifier_name, + total_runs=total, + observed_mean_score=float(np.mean(all_scores)) if total else 0.0, + observed_std_score=float(np.std(all_scores)) if total else 0.0, + ) + + +# ── Built-in stratifiers ────────────────────────────────────────────── + + +def stratify_by_regime(run: 
TaskRunResult, dyn: Dynamics) -> str: + return dyn.regime.value + + +def stratify_by_task(run: TaskRunResult, dyn: Dynamics) -> str: + return run.task_id + + +def stratify_by_tier(run: TaskRunResult, dyn: Dynamics) -> str: + tid = run.task_id.lower() + for i in range(1, 6): + if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"): + return f"tier{i}" + return "unknown" + + +def stratify_by_tool_mix(run: TaskRunResult, dyn: Dynamics) -> str: + if not dyn.family_dist: + return "unknown" + return max(dyn.family_dist, key=dyn.family_dist.get) + + +def stratify_by_prompt_style(run: TaskRunResult, dyn: Dynamics) -> str: + user_msgs = [m for m in run.transcript.messages if m.role == "user"] + if not user_msgs: + return "unknown" + wc = len(user_msgs[0].text.split()) + return "terse" if wc <= 6 else ("medium" if wc <= 15 else "verbose") + + +def stratify_by_scenario(run: TaskRunResult, dyn: Dynamics) -> str: + return run.scenario or "unknown" + + +def stratify_by_family(run: TaskRunResult, dyn: Dynamics) -> str: + return run.family or "unknown" diff --git a/clawbench/dynamics_archive.py b/clawbench/dynamics_archive.py new file mode 100644 index 0000000..92470c1 --- /dev/null +++ b/clawbench/dynamics_archive.py @@ -0,0 +1,493 @@ +"""Offline dynamics analysis helpers for cached ClawBench runs.""" + +from __future__ import annotations + +import json +from itertools import combinations +from pathlib import Path +from typing import Iterable + +import numpy as np + +from clawbench.dynamics import ( + build_strata, + compute_dynamics, + compute_pca_bundle, + compute_sensitivity, + find_event_step, + kaplan_meier, + stratify_by_regime, + stratify_by_scenario, + stratify_by_tier, + stratify_by_tool_mix, +) +from clawbench.dynamics_plots import generate_all_plots +from clawbench.schemas import TaskRunResult + +_TIER_PREFIXES = { + "tier1": ("t1-", "t1_"), + "tier2": ("t2-", "t2_"), + "tier3": ("t3-", "t3_"), + "tier4": ("t4-", "t4_"), + "tier5": ("t5-", "t5_"), +} + + +def 
safe_model_name(model: str) -> str: + return model.replace("/", "_").replace(":", "_") + + +def _candidate_model_dir_names(model: str) -> set[str]: + return { + model, + safe_model_name(model), + model.replace("/", "_"), + model.replace("/", "-").replace(":", "-"), + } + + +def _has_run_files(path: Path) -> bool: + try: + for child in path.iterdir(): + if child.is_file() and child.name.startswith("run") and child.suffix == ".json": + return True + except FileNotFoundError: + return False + return False + + +def _is_task_collection_root(path: Path) -> bool: + try: + for child in path.iterdir(): + if child.is_dir() and _has_run_files(child): + return True + except FileNotFoundError: + return False + return False + + +def _resolve_model_roots(archive_dir: Path, model: str | None) -> list[Path]: + if _is_task_collection_root(archive_dir): + if model is not None and archive_dir.name not in _candidate_model_dir_names(model): + raise ValueError( + f"Archive dir {archive_dir} does not match requested model {model}." + ) + return [archive_dir] + + roots = [ + child + for child in sorted(archive_dir.iterdir()) + if child.is_dir() and _is_task_collection_root(child) + ] + if model is not None: + candidates = _candidate_model_dir_names(model) + roots = [root for root in roots if root.name in candidates] + elif len(roots) > 1: + raise ValueError( + "Archive root contains multiple model directories. Pass --model or point " + "--archive-dir at a specific model directory." + ) + return roots + + +def discover_model_roots(archive_dir: Path) -> dict[str, Path]: + """Discover model directories inside an archive root. + + Returns a mapping of model directory name to its path. If archive_dir is + itself a model cache root (contains task directories with run*.json), the + mapping contains a single entry. 
+ """ + if not archive_dir.exists(): + raise ValueError(f"Archive dir does not exist: {archive_dir}") + + if _is_task_collection_root(archive_dir): + return {archive_dir.name: archive_dir} + + roots = { + child.name: child + for child in sorted(archive_dir.iterdir()) + if child.is_dir() and _is_task_collection_root(child) + } + return roots + + +def _matches_tier(task_id: str, tier: str | None) -> bool: + if tier is None: + return True + return task_id.lower().startswith(_TIER_PREFIXES[tier]) + + +def load_task_runs_archive( + archive_dir: Path, + model: str | None = None, + task_ids: Iterable[str] | None = None, + tier: str | None = None, +) -> dict[str, list[TaskRunResult]]: + """Load cached TaskRunResult objects from a run cache/archive directory.""" + task_filter = set(task_ids or []) + task_runs: dict[str, list[TaskRunResult]] = {} + + if not archive_dir.exists(): + raise ValueError(f"Archive dir does not exist: {archive_dir}") + + roots = _resolve_model_roots(archive_dir, model) + if not roots: + return {} + + for root in roots: + for task_dir in sorted(child for child in root.iterdir() if child.is_dir()): + task_id = task_dir.name + if task_filter and task_id not in task_filter: + continue + if not _matches_tier(task_id, tier): + continue + + runs = [] + for run_file in sorted(task_dir.glob("run*.json")): + try: + run = TaskRunResult.model_validate_json( + run_file.read_text(encoding="utf-8") + ) + except Exception: + continue + runs.append(run) + + if runs: + task_runs.setdefault(task_id, []).extend(runs) + + for task_id, runs in task_runs.items(): + runs.sort(key=lambda run: run.run_index) + + return task_runs + + +def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]: + if not arrays: + return np.array([]), np.array([]) + max_len = max(len(arr) for arr in arrays) + if max_len == 0: + return np.array([]), np.array([]) + mat = np.full((len(arrays), max_len), np.nan) + for idx, arr in enumerate(arrays): + mat[idx, :len(arr)] = arr 
+ return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0) + + +def _round_list(values: np.ndarray, digits: int = 4) -> list[float]: + return [round(float(value), digits) for value in values.tolist()] + + +def _empty_sensitivity_summary() -> dict[str, object]: + return { + "n_pairs": 0, + "mean_score_delta": 0.0, + "mean_tool_edit_distance": 0.0, + "mean_family_js_divergence": 0.0, + "mean_lyapunov_proxy": 0.0, + "mean_initial_divergence": 0.0, + "mean_final_divergence": 0.0, + "mean_contraction_delta": 0.0, + "mean_contraction_ratio": 0.0, + "fraction_converging_pairs": 0.0, + "mean_divergence_curve": [], + "std_divergence_curve": [], + "pair_points": [], + } + + +def _summarize_sensitivity_group(pairs: list) -> dict[str, object]: + if not pairs: + return _empty_sensitivity_summary() + + divergence_curves = [pair.embedding_divergence for pair in pairs if len(pair.embedding_divergence) > 0] + curve_mean, curve_std = _aligned_mean_std(divergence_curves) + + pair_points = [] + for pair in pairs: + if len(pair.embedding_divergence) > 0: + initial_divergence = float(pair.embedding_divergence[0]) + final_divergence = float(pair.embedding_divergence[-1]) + contraction_delta = final_divergence - initial_divergence + contraction_ratio = final_divergence / max(initial_divergence, 1e-6) + else: + initial_divergence = 0.0 + final_divergence = 0.0 + contraction_delta = 0.0 + contraction_ratio = 0.0 + pair_points.append( + { + "score_delta": round(float(pair.score_delta), 4), + "tool_edit_distance": int(pair.tool_edit_distance), + "family_js_divergence": round(float(pair.family_js_divergence), 4), + "lyapunov_proxy": round(float(pair.lyapunov_proxy), 4), + "initial_divergence": round(initial_divergence, 4), + "final_divergence": round(final_divergence, 4), + "contraction_delta": round(contraction_delta, 4), + "contraction_ratio": round(contraction_ratio, 4), + } + ) + + converging_pairs = sum( + 1 for point in pair_points if point["final_divergence"] < point["initial_divergence"] 
+ ) + + return { + "n_pairs": len(pairs), + "mean_score_delta": round(float(np.mean([pair.score_delta for pair in pairs])), 4), + "mean_tool_edit_distance": round(float(np.mean([pair.tool_edit_distance for pair in pairs])), 4), + "mean_family_js_divergence": round(float(np.mean([pair.family_js_divergence for pair in pairs])), 4), + "mean_lyapunov_proxy": round(float(np.mean([pair.lyapunov_proxy for pair in pairs])), 4), + "mean_initial_divergence": round(float(np.mean([point["initial_divergence"] for point in pair_points])), 4), + "mean_final_divergence": round(float(np.mean([point["final_divergence"] for point in pair_points])), 4), + "mean_contraction_delta": round(float(np.mean([point["contraction_delta"] for point in pair_points])), 4), + "mean_contraction_ratio": round(float(np.mean([point["contraction_ratio"] for point in pair_points])), 4), + "fraction_converging_pairs": round(converging_pairs / len(pair_points), 4), + "mean_divergence_curve": _round_list(curve_mean), + "std_divergence_curve": _round_list(curve_std), + "pair_points": pair_points, + } + + +def _build_sensitivity_sections( + valid_runs_by_task: dict[str, list[TaskRunResult]], +) -> tuple[list, dict[str, object]]: + same_task_pairs = [] + per_task: dict[str, object] = {} + for task_id, runs in sorted(valid_runs_by_task.items()): + if len(runs) < 2: + continue + task_pairs = [ + compute_sensitivity(run_a, run_b, task_id=task_id) + for run_a, run_b in combinations(runs, 2) + ] + if task_pairs: + same_task_pairs.extend(task_pairs) + per_task[task_id] = _summarize_sensitivity_group(task_pairs) + + same_task_summary = _summarize_sensitivity_group(same_task_pairs) + same_task_summary["per_task"] = per_task + + perturbation_pairs = [] + per_variant_group: dict[str, object] = {} + runs_by_variant_group: dict[str, list[TaskRunResult]] = {} + for runs in valid_runs_by_task.values(): + for run in runs: + runs_by_variant_group.setdefault(run.variant_group or run.task_id, []).append(run) + + for 
variant_group, runs in sorted(runs_by_variant_group.items()): + distinct_members = { + (run.task_id, run.prompt_variant, run.variant_id) + for run in runs + } + if len(distinct_members) < 2: + continue + + group_pairs = [] + for run_a, run_b in combinations(runs, 2): + if ( + run_a.task_id == run_b.task_id + and run_a.prompt_variant == run_b.prompt_variant + and run_a.variant_id == run_b.variant_id + ): + continue + group_pairs.append(compute_sensitivity(run_a, run_b, task_id=variant_group)) + + if not group_pairs: + continue + + perturbation_pairs.extend(group_pairs) + group_summary = _summarize_sensitivity_group(group_pairs) + group_summary["members"] = [ + { + "task_id": task_id, + "prompt_variant": prompt_variant, + "variant_id": variant_id, + } + for task_id, prompt_variant, variant_id in sorted(distinct_members) + ] + per_variant_group[variant_group] = group_summary + + perturbation_summary = _summarize_sensitivity_group(perturbation_pairs) + perturbation_summary["per_variant_group"] = per_variant_group + + return same_task_pairs, { + "same_task": same_task_summary, + "prompt_perturbation": perturbation_summary, + } + + +def build_dynamics_report( + task_runs: dict[str, list[TaskRunResult]], + include_pca: bool = True, +) -> tuple[dict, object, dict]: + """Compute the stratified dynamics report; returns (report, plotter, plot_data).""" + all_runs = [run for runs in task_runs.values() for run in runs] + if not all_runs: + raise ValueError("No cached runs were loaded.") + + dynamics_list = [] + scores = [] + valid_runs = [] + for run in all_runs: + if not run.transcript.messages: + continue + dynamics_list.append(compute_dynamics(run.transcript)) + scores.append(run.run_score) + valid_runs.append(run) + + if not valid_runs: + raise ValueError("No runs with transcripts were found in the archive.") + + valid_runs_by_task: dict[str, list[TaskRunResult]] = {} + for run in valid_runs: + valid_runs_by_task.setdefault(run.task_id, []).append(run) + + same_task_sensitivities, sensitivity_summary 
= _build_sensitivity_sections(valid_runs_by_task) + + stratifiers = { + "tier": stratify_by_tier, + "regime": stratify_by_regime, + "tool_mix": stratify_by_tool_mix, + "scenario": stratify_by_scenario, + } + + report: dict[str, object] = { + "n_runs": len(valid_runs), + "n_tasks": len(task_runs), + "strata": {}, + } + + stratified = {} + for name, fn in stratifiers.items(): + assessment = build_strata( + valid_runs, + dynamics_list, + scores, + fn, + name, + sensitivities=same_task_sensitivities, + ) + stratified[name] = assessment + strata_summary = [] + for stratum in assessment.strata: + strata_summary.append( + { + "name": stratum.name, + "n_runs": stratum.n_runs, + "weight": round(stratum.weight, 4), + "score_mean": round(stratum.score_mean, 4), + "score_std": round(stratum.score_std, 4), + "score_quantiles": { + key: round(value, 4) + for key, value in stratum.score_quantiles.items() + }, + "entropy_mean": round(float(stratum.entropy_dist.mean()), 4) + if len(stratum.entropy_dist) + else 0.0, + "error_rate_mean": round(float(stratum.error_rate_dist.mean()), 4) + if len(stratum.error_rate_dist) + else 0.0, + "constraint_mean": round(float(stratum.constraint_dist.mean()), 4) + if len(stratum.constraint_dist) + else 0.0, + "memory_depth_mean": round(float(stratum.memory_depth_dist.mean()), 4) + if len(stratum.memory_depth_dist) + else 0.0, + "sensitivity_pairs": int(len(stratum.sensitivity_deltas)), + "sensitivity_mean_score_delta": round(float(stratum.sensitivity_deltas.mean()), 4) + if len(stratum.sensitivity_deltas) + else 0.0, + "regime_counts": stratum.regime_counts, + } + ) + report["strata"][name] = { + "observed_mean_score": round(assessment.observed_mean_score, 4), + "observed_std_score": round(assessment.observed_std_score, 4), + "strata": strata_summary, + } + + report["per_run"] = [ + { + "task_id": run.task_id, + "run_index": run.run_index, + "score": round(run.run_score, 4), + "regime": dynamics.regime.value, + "entropy": 
round(dynamics.tool_entropy, 4), + "error_rate": round(dynamics.error_rate, 4), + "constraint_index": round(dynamics.constraint_index, 4), + "memory_depth": round(dynamics.memory_depth, 4), + "n_steps": dynamics.n_steps, + "mean_drift": round(dynamics.mean_drift, 4), + "mean_step_size": round(dynamics.mean_step_size, 4), + } + for run, dynamics in zip(valid_runs, dynamics_list) + ] + report["sensitivity"] = sensitivity_summary + + if include_pca: + compute_pca_bundle(dynamics_list) + + events = [] + censored = [] + for run in valid_runs: + step = find_event_step(run.transcript, "first_correct_write") + if step is not None: + events.append(step) + censored.append(False) + else: + events.append(float(len(run.transcript.assistant_messages))) + censored.append(True) + km_points = kaplan_meier(events, censored) + return report, generate_all_plots, { + "valid_runs": valid_runs, + "dynamics_list": dynamics_list, + "stratified": stratified, + "km_points": km_points, + "sensitivity": sensitivity_summary, + } + + +def write_dynamics_report( + task_runs: dict[str, list[TaskRunResult]], + out_dir: Path, + report_name: str = "dynamics.json", + generate_plots: bool = True, +) -> tuple[Path, list[Path]]: + """Write the dynamics report JSON and plots to an output directory.""" + report, plotter, plot_data = build_dynamics_report(task_runs, include_pca=generate_plots) + out_dir.mkdir(parents=True, exist_ok=True) + + report_path = out_dir / report_name + report_path.write_text(json.dumps(report, indent=2), encoding="utf-8") + + plots: list[Path] = [] + if generate_plots: + plots = plotter( + plot_data["dynamics_list"], + plot_data["valid_runs"], + plot_data["stratified"], + km_points=plot_data["km_points"], + event_name="first_correct_write", + out_dir=out_dir, + sensitivity_summary=plot_data["sensitivity"], + ) + return report_path, plots + + +def load_task_runs_by_model( + archive_dir: Path, + tier: str | None = None, + task_ids: Iterable[str] | None = None, +) -> dict[str, 
dict[str, list[TaskRunResult]]]: + """Load cached TaskRunResult objects grouped by model directory name.""" + grouped: dict[str, dict[str, list[TaskRunResult]]] = {} + for model_name, model_dir in discover_model_roots(archive_dir).items(): + task_runs = load_task_runs_archive( + archive_dir=model_dir, + model=None, + task_ids=task_ids, + tier=tier, + ) + if task_runs: + grouped[model_name] = task_runs + return grouped \ No newline at end of file diff --git a/clawbench/dynamics_plots.py b/clawbench/dynamics_plots.py new file mode 100644 index 0000000..9cc6097 --- /dev/null +++ b/clawbench/dynamics_plots.py @@ -0,0 +1,411 @@ +"""Plotting utilities for dynamics analysis. + +Generates publication-ready figures from dynamics data and saves to a +results directory. All plots use matplotlib with the Agg backend so they +work headlessly. +""" +from __future__ import annotations + +from pathlib import Path + +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt +import numpy as np + +from clawbench.dynamics import ( + Dynamics, + StratifiedAssessment, + StratumStats, + SurvivalPoint, +) + + +def _savefig(fig: plt.Figure, path: Path) -> None: + fig.savefig(path, dpi=150, bbox_inches="tight") + plt.close(fig) + + +def _plot_series_curves( + dynamics_list: list[Dynamics], + labels: list[str], + out_path: Path, + *, + series_attr: str, + ylabel: str, + title: str, +) -> None: + """Plot a step-aligned per-run series coloured by label.""" + fig, ax = plt.subplots(figsize=(10, 5)) + cmap = plt.cm.tab10 + unique = sorted(set(labels)) + colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)} + + for d, lbl in zip(dynamics_list, labels): + series = np.asarray(getattr(d, series_attr), dtype=float) + if len(series) < 2: + continue + ax.plot(series, alpha=0.6, color=colour_map[lbl], linewidth=1) + + for lbl in unique: + ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2) + ax.legend(fontsize=8, loc="upper left") + 
ax.set_xlabel("Step") + ax.set_ylabel(ylabel) + ax.set_title(title) + _savefig(fig, out_path) + + +def plot_drift_curves( + dynamics_list: list[Dynamics], + labels: list[str], + out_path: Path, +) -> None: + """Drift-from-origin curves coloured by label (e.g. task_id or regime).""" + _plot_series_curves( + dynamics_list, + labels, + out_path, + series_attr="drift", + ylabel="Cosine distance from step 0", + title="Drift from Origin", + ) + + +def plot_step_size_curves( + dynamics_list: list[Dynamics], + labels: list[str], + out_path: Path, +) -> None: + """Step-to-step movement curves coloured by label.""" + _plot_series_curves( + dynamics_list, + labels, + out_path, + series_attr="step_size", + ylabel="Cosine distance from previous step", + title="Step-to-Step Movement", + ) + + +def plot_pca_trajectories( + dynamics_list: list[Dynamics], + labels: list[str], + out_path: Path, +) -> None: + """PCA phase portraits (PC1 vs PC2) coloured by label.""" + fig, ax = plt.subplots(figsize=(8, 8)) + cmap = plt.cm.tab10 + unique = sorted(set(labels)) + colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)} + + for d, lbl in zip(dynamics_list, labels): + if d.pca_trajectory is None or len(d.pca_trajectory) < 2: + continue + traj = d.pca_trajectory + ax.plot(traj[:, 0], traj[:, 1], alpha=0.5, color=colour_map[lbl], linewidth=1) + ax.scatter(traj[0, 0], traj[0, 1], color=colour_map[lbl], marker="o", s=30, zorder=5) + ax.scatter(traj[-1, 0], traj[-1, 1], color=colour_map[lbl], marker="x", s=30, zorder=5) + + for lbl in unique: + ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2) + ax.legend(fontsize=8) + ax.set_xlabel("PC1") + ax.set_ylabel("PC2") + ax.set_title("PCA Phase Portrait (o=start, x=end)") + _savefig(fig, out_path) + + +def plot_regime_distribution( + strata: list[StratumStats], + stratifier_name: str, + out_path: Path, +) -> None: + """Stacked bar chart of regime counts per stratum.""" + fig, ax = plt.subplots(figsize=(10, 
5))
+    all_regimes = sorted({r for s in strata for r in s.regime_counts})
+    x = np.arange(len(strata))
+    bottom = np.zeros(len(strata))
+    cmap = plt.cm.Set2
+
+    for j, regime in enumerate(all_regimes):
+        counts = [s.regime_counts.get(regime, 0) for s in strata]
+        ax.bar(x, counts, bottom=bottom, label=regime, color=cmap(j / max(len(all_regimes) - 1, 1)))
+        bottom += np.array(counts)
+
+    ax.set_xticks(x)
+    ax.set_xticklabels([s.name for s in strata], rotation=30, ha="right")
+    ax.set_ylabel("Count")
+    ax.set_title(f"Regime Distribution by {stratifier_name}")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+
+
+def plot_score_distributions(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Box plots of score distributions per stratum."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    data = [s.scores for s in strata if len(s.scores) > 0]
+    labels = [s.name for s in strata if len(s.scores) > 0]
+
+    if data:
+        ax.boxplot(data, labels=labels, patch_artist=True,
+                   boxprops=dict(facecolor="lightblue", alpha=0.7))
+    ax.set_ylabel("Score")
+    ax.set_title(f"Score Distribution by {stratifier_name}")
+    plt.xticks(rotation=30, ha="right")
+    _savefig(fig, out_path)
+
+
+def plot_survival_curve(
+    km_points: list[SurvivalPoint],
+    event_name: str,
+    out_path: Path,
+) -> None:
+    """Kaplan-Meier survival curve."""
+    if not km_points:
+        return
+    fig, ax = plt.subplots(figsize=(8, 5))
+    times = [p.time for p in km_points]
+    surv = [p.survival for p in km_points]
+    ax.step(times, surv, where="post", linewidth=2, color="steelblue")
+    ax.fill_between(times, surv, step="post", alpha=0.15, color="steelblue")
+    ax.set_xlabel("Step")
+    ax.set_ylabel("Survival probability")
+    ax.set_title(f"Kaplan-Meier: {event_name}")
+    ax.set_ylim(-0.05, 1.05)
+    _savefig(fig, out_path)
+
+
+def plot_stratum_dynamics_heatmap(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Heatmap of mean dynamics metrics across strata."""
+
metrics = ["entropy", "error_rate", "constraint", "memory_depth", "mean_drift", "mean_step_size"]
+    data = np.zeros((len(strata), len(metrics)))
+    for i, s in enumerate(strata):
+        arrays = [s.entropy_dist, s.error_rate_dist, s.constraint_dist,
+                  s.memory_depth_dist, s.mean_drift_dist, s.mean_step_size_dist]
+        for j, arr in enumerate(arrays):
+            data[i, j] = float(np.mean(arr)) if len(arr) > 0 else 0.0
+
+    fig, ax = plt.subplots(figsize=(10, max(3, len(strata) * 0.6)))
+    im = ax.imshow(data, aspect="auto", cmap="YlOrRd")
+    ax.set_xticks(range(len(metrics)))
+    ax.set_xticklabels(metrics, rotation=30, ha="right")
+    ax.set_yticks(range(len(strata)))
+    ax.set_yticklabels([s.name for s in strata])
+    for i in range(len(strata)):
+        for j in range(len(metrics)):
+            ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", fontsize=8)
+    fig.colorbar(im, ax=ax, shrink=0.8)
+    ax.set_title(f"Dynamics Metrics by {stratifier_name}")
+    _savefig(fig, out_path)
+
+
+def plot_pairwise_divergence_curves(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Plot mean pairwise trajectory divergence over aligned steps."""
+    if not per_task_sensitivity:
+        return False
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    cmap = plt.cm.tab10
+    tasks = sorted(per_task_sensitivity)
+    colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
+
+    plotted = False
+    for task in tasks:
+        summary = per_task_sensitivity[task]
+        mean_curve = np.asarray(summary.get("mean_divergence_curve", []), dtype=float)
+        std_curve = np.asarray(summary.get("std_divergence_curve", []), dtype=float)
+        if len(mean_curve) == 0:
+            continue
+        steps = np.arange(len(mean_curve))
+        ax.plot(steps, mean_curve, linewidth=2, color=colour_map[task], label=task)
+        if len(std_curve) == len(mean_curve):
+            ax.fill_between(steps, mean_curve - std_curve, mean_curve + std_curve, color=colour_map[task], alpha=0.12)
+        plotted = True
+
+    if not plotted:
+        plt.close(fig)
+        return False
+
+
ax.set_xlabel("Aligned step")
+    ax.set_ylabel("Pairwise embedding divergence")
+    ax.set_title("Do Repeated Trajectories Converge or Diverge?")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+    return True
+
+
+def plot_pairwise_contraction_scatter(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Scatter initial vs final pairwise divergence; below diagonal means convergence."""
+    if not per_task_sensitivity:
+        return False
+
+    fig, ax = plt.subplots(figsize=(7, 6))
+    cmap = plt.cm.tab10
+    tasks = sorted(per_task_sensitivity)
+    colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
+
+    max_seen = 0.0
+    plotted = False
+    for task in tasks:
+        points = per_task_sensitivity[task].get("pair_points", [])
+        if not points:
+            continue
+        xs = [point["initial_divergence"] for point in points]
+        ys = [point["final_divergence"] for point in points]
+        max_seen = max(max_seen, *(xs + ys))
+        ax.scatter(xs, ys, s=60, alpha=0.8, color=colour_map[task], label=task)
+        plotted = True
+
+    if not plotted:
+        plt.close(fig)
+        return False
+
+    limit = max(max_seen, 0.1)
+    ax.plot([0, limit], [0, limit], linestyle="--", color="black", linewidth=1)
+    ax.set_xlabel("Initial pairwise divergence")
+    ax.set_ylabel("Final pairwise divergence")
+    ax.set_title("Pairwise Trajectory Contraction")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+    return True
+
+
+def plot_sensitivity_heatmap(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Heatmap of per-task sensitivity metrics."""
+    if not per_task_sensitivity:
+        return False
+
+    metrics = [
+        ("mean_score_delta", "score_delta"),
+        ("mean_tool_edit_distance", "tool_edit"),
+        ("mean_family_js_divergence", "js_div"),
+        ("mean_lyapunov_proxy", "lyapunov"),
+        ("fraction_converging_pairs", "frac_converging"),
+    ]
+    tasks = sorted(per_task_sensitivity)
+    data = np.zeros((len(tasks), len(metrics)))
+    for row_idx, task in enumerate(tasks):
+        summary =
per_task_sensitivity[task]
+        for col_idx, (key, _label) in enumerate(metrics):
+            data[row_idx, col_idx] = float(summary.get(key, 0.0))
+
+    fig, ax = plt.subplots(figsize=(9, max(3, len(tasks) * 0.7)))
+    im = ax.imshow(data, aspect="auto", cmap="Blues")
+    ax.set_xticks(range(len(metrics)))
+    ax.set_xticklabels([label for _key, label in metrics], rotation=30, ha="right")
+    ax.set_yticks(range(len(tasks)))
+    ax.set_yticklabels(tasks)
+    for row_idx in range(len(tasks)):
+        for col_idx in range(len(metrics)):
+            ax.text(col_idx, row_idx, f"{data[row_idx, col_idx]:.2f}", ha="center", va="center", fontsize=8)
+    fig.colorbar(im, ax=ax, shrink=0.8)
+    ax.set_title("Pairwise Sensitivity by Task")
+    _savefig(fig, out_path)
+    return True
+
+
+def generate_all_plots(
+    dynamics_list: list[Dynamics],
+    runs: list,
+    stratified: dict[str, StratifiedAssessment],
+    km_points: list[SurvivalPoint] | None = None,
+    event_name: str = "first_correct_write",
+    out_dir: Path = Path("results"),
+    sensitivity_summary: dict[str, dict] | None = None,
+) -> list[Path]:
+    """Generate all dynamics plots and return list of saved paths."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    saved: list[Path] = []
+
+    # Labels by regime
+    regime_labels = [d.regime.value for d in dynamics_list]
+    tier_labels = []
+    for r in runs:
+        tid = r.task_id.lower()
+        tier = "unknown"
+        for i in range(1, 6):
+            if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
+                tier = f"tier{i}"
+                break
+        tier_labels.append(tier)
+
+    # Drift curves by regime
+    p = out_dir / "drift_by_regime.png"
+    plot_drift_curves(dynamics_list, regime_labels, p)
+    saved.append(p)
+
+    # Drift curves by tier
+    p = out_dir / "drift_by_tier.png"
+    plot_drift_curves(dynamics_list, tier_labels, p)
+    saved.append(p)
+
+    p = out_dir / "step_size_by_regime.png"
+    plot_step_size_curves(dynamics_list, regime_labels, p)
+    saved.append(p)
+
+    p = out_dir / "step_size_by_tier.png"
+    plot_step_size_curves(dynamics_list, tier_labels, p)
+    saved.append(p)
+
+
# PCA trajectories
+    has_pca = any(d.pca_trajectory is not None for d in dynamics_list)
+    if has_pca:
+        p = out_dir / "pca_by_regime.png"
+        plot_pca_trajectories(dynamics_list, regime_labels, p)
+        saved.append(p)
+        p = out_dir / "pca_by_tier.png"
+        plot_pca_trajectories(dynamics_list, tier_labels, p)
+        saved.append(p)
+
+    # Per-stratifier plots
+    for name, sa in stratified.items():
+        p = out_dir / f"regimes_by_{name}.png"
+        plot_regime_distribution(sa.strata, name, p)
+        saved.append(p)
+
+        p = out_dir / f"scores_by_{name}.png"
+        plot_score_distributions(sa.strata, name, p)
+        saved.append(p)
+
+        p = out_dir / f"dynamics_heatmap_{name}.png"
+        plot_stratum_dynamics_heatmap(sa.strata, name, p)
+        saved.append(p)
+
+    # Survival curve
+    if km_points:
+        p = out_dir / f"survival_{event_name}.png"
+        plot_survival_curve(km_points, event_name, p)
+        saved.append(p)
+
+    per_task_sensitivity = (sensitivity_summary or {}).get("same_task", {}).get("per_task", {})
+    p = out_dir / "pairwise_divergence_by_task.png"
+    if plot_pairwise_divergence_curves(per_task_sensitivity, p):
+        saved.append(p)
+
+    p = out_dir / "pairwise_contraction_scatter.png"
+    if plot_pairwise_contraction_scatter(per_task_sensitivity, p):
+        saved.append(p)
+
+    p = out_dir / "sensitivity_heatmap.png"
+    if plot_sensitivity_heatmap(per_task_sensitivity, p):
+        saved.append(p)
+
+    return saved
diff --git a/clawbench/harness.py b/clawbench/harness.py
index 6dc521d..b955aa6 100644
--- a/clawbench/harness.py
+++ b/clawbench/harness.py
@@ -103,6 +103,7 @@ class BenchmarkHarness:
         self.concurrency = max(1, int(concurrency))
         self.browser_concurrency = max(1, int(browser_concurrency))
         self.repo_root = Path(__file__).parent.parent
+        self.last_task_runs: dict[str, list[TaskRunResult]] = {}
 
     async def run(self) -> BenchmarkResult:
         tasks = load_all_tasks(
@@ -148,6 +149,7 @@ class BenchmarkHarness:
                 f"({mean_run:.1f}s avg, concurrency={self.concurrency})[/dim]"
             )
 
+        self.last_task_runs = all_results
         return self._aggregate(tasks,
all_results)
 
     async def _execute_runs(
diff --git a/clawbench/submission_models.py b/clawbench/submission_models.py
new file mode 100644
index 0000000..0988b83
--- /dev/null
+++ b/clawbench/submission_models.py
@@ -0,0 +1,147 @@
+"""Preset model catalog and selection helpers for the Space submit UI."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+CUSTOM_PRESET_LABEL = "(custom)"
+
+PRESET_AUDIENCE_ALL = "All Presets"
+PRESET_AUDIENCE_CLAW = "Claw Users"
+PRESET_AUDIENCE_BUDGET = "Budget Researchers"
+
+PRESET_AUDIENCE_CHOICES = (
+    PRESET_AUDIENCE_ALL,
+    PRESET_AUDIENCE_CLAW,
+    PRESET_AUDIENCE_BUDGET,
+)
+
+
+@dataclass(frozen=True)
+class PresetModel:
+    label: str
+    model_id: str
+    provider: str
+    audiences: tuple[str, ...]
+
+
+PRESET_MODELS = (
+    PresetModel(
+        label="GPT-OSS 20B (Ollama)",
+        model_id="ollama/gpt-oss:20b",
+        provider="ollama",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Qwen 3.5 27B (Ollama)",
+        model_id="ollama/qwen3.5:27b",
+        provider="ollama",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Qwen3 32B",
+        model_id="huggingface/Qwen/Qwen3-32B",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Gemma 4 26B MoE",
+        model_id="huggingface/google/gemma-4-26B-A4B-it",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="GLM 5.1 (754B MoE)",
+        model_id="huggingface/zai-org/GLM-5.1",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="GLM 5 (400B MoE)",
+        model_id="huggingface/zai-org/GLM-5",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="DeepSeek R1",
+        model_id="huggingface/deepseek-ai/DeepSeek-R1",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Kimi K2 Instruct",
+
model_id="huggingface/moonshotai/Kimi-K2-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="MiniMax M2.5",
+        model_id="huggingface/MiniMaxAI/MiniMax-M2.5",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Llama 3.3 70B",
+        model_id="huggingface/meta-llama/Llama-3.3-70B-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Llama 3.1 70B",
+        model_id="huggingface/meta-llama/Llama-3.1-70B-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Claude Sonnet 4.6",
+        model_id="anthropic/claude-sonnet-4-6",
+        provider="anthropic",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Claude Opus 4.6",
+        model_id="anthropic/claude-opus-4-6",
+        provider="anthropic",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+)
+
+PRESET_MODEL_MAP = {preset.label: preset.model_id for preset in PRESET_MODELS}
+_PRESET_BY_LABEL = {preset.label: preset for preset in PRESET_MODELS}
+
+
+def infer_provider(model_id: str) -> str:
+    normalized = model_id.strip()
+    if not normalized or "/" not in normalized:
+        return ""
+    return normalized.split("/", 1)[0].strip().lower()
+
+
+def preset_models_for_audience(audience: str | None) -> list[PresetModel]:
+    if not audience or audience == PRESET_AUDIENCE_ALL:
+        return list(PRESET_MODELS)
+    return [preset for preset in PRESET_MODELS if audience in preset.audiences]
+
+
+def preset_labels_for_audience(audience: str | None) -> list[str]:
+    return [preset.label for preset in preset_models_for_audience(audience)]
+
+
+def resolve_model_selection(
+    model: str,
+    preset_label: str,
+    provider: str = "",
+) -> tuple[str, str]:
+    selected_model = model.strip()
+    selected_provider = provider.strip()
+
+    preset = _PRESET_BY_LABEL.get(preset_label)
+    if preset is not None:
+        selected_model = preset.model_id
+        if not selected_provider:
+            selected_provider =
preset.provider
+
+    if not selected_provider:
+        selected_provider = infer_provider(selected_model)
+
+    return selected_model, selected_provider
\ No newline at end of file
diff --git a/scripts/classify_regimes.py b/scripts/classify_regimes.py
index a896720..899b8b1 100644
--- a/scripts/classify_regimes.py
+++ b/scripts/classify_regimes.py
@@ -1,140 +1,112 @@
-"""Classify each archived run's dynamical regime from its turn trajectory.
+#!/usr/bin/env python3
+"""Classify posterior run trajectories into dynamical regimes.
 
-Following "When LLMs Are Dreaming..." §What We Expect to See:
+We embed each assistant turn using bag-of-words text plus tool-call summaries,
+then compute simple geometric proxies:
 
-  TRAPPED/ATTRACTOR — low support (Vol_log), high recurrence, high BOPS.
-                      Agent converged to a point; may be good (solved it)
-                      or bad (got stuck in a loop on a single idea).
+    drift_mean = mean ||x_t - x_{t-1}||
+    from_start = max ||x_t - x_0||
+    recurrence = max cosine(x_i, x_j) for non-adjacent turns
+    vol_log = log det(Sigma + eps I)
 
-  LIMIT-CYCLE — high recurrence + bounded drift + quasi-periodic revisits.
-                Agent loops between a few states.
-
-  DIFFUSIVE/WANDERING — growing support, rising drift, low recurrence.
-                        Agent explores without converging; often "goal drift".
-
-  SENSITIVE — (requires paraphrased-pair runs; skip here.)
-
-  TOO-SHORT — trajectory < 3 assistant turns; can't classify dynamics.
-
-We work in a TF-IDF bag-of-words embedding space (same vocab as C(q)),
-with each turn's state vector = its assistant text + tool-call args.
-
-Metrics per run:
-  - drift_mean: mean ||e_t − e_{t−1}|| across turns
-  - from_start: max ||e_t − e_0|| (farthest the run drifted from origin)
-  - recurrence: max_{i 0.80 and drift_mean < 0.25 → limit_cycle
-  elif drift_mean > 0.35 and vol_log > −3 → diffusive
-  else → mixed
-
-Output: reports/regimes.json with per-run classification.
-
-Usage:
-    .venv/bin/python3 scripts/classify_regimes.py
+Runs are then bucketed into coarse regimes such as trapped, limit_cycle, and
+diffusive using quartile-based thresholds estimated from the observed archive.
 """
 from __future__ import annotations
 
+import argparse
 import json
 import re
+import sys
 from collections import Counter, defaultdict
 from pathlib import Path
 
 import numpy as np
 
-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 
-MODELS = [
-    "anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
-    "anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
-    "google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
-    "openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
-    "openrouter_qwen_qwen3.6-plus",
-]
+from clawbench.dynamics_archive import load_task_runs_by_model
 
 WORD_RE = re.compile(r"[a-z]{3,}")
-STOPWORDS = set("the and that with this have from what your will can but not "
-                "was will are been one would there been they will their has "
-                "had its were only some than about these which into also each "
-                "when where them how who them very much more most other then "
-                "here such does like just make many like want need take".split())
+STOPWORDS = set(
+    "the and that with this have from what your will can but not "
+    "was are been one would there they their has had its were only some "
+    "than about these which into also each when where them how who very "
+    "much more most other then here such does like just make many want need take".split()
+)
 
 
 def tokenize(text: str) -> list[str]:
     return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]
 
 
-def build_vocab(all_turn_texts: list[str], top_k: int = 500) -> dict[str, int]:
-    c = Counter()
-    for t in all_turn_texts:
-        c.update(set(tokenize(t)))
-    return {w: i for i, (w, _) in enumerate(c.most_common(top_k))}
+def build_vocab(texts: list[str],
top_k: int = 500) -> dict[str, int]:
+    counter = Counter()
+    for text in texts:
+        counter.update(set(tokenize(text)))
+    return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
 
 
 def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
-    v = np.zeros(len(vocab), dtype=np.float32)
-    for w, c in Counter(tokenize(text)).items():
-        if w in vocab:
-            v[vocab[w]] = c
-    n = np.linalg.norm(v)
-    return v / n if n > 0 else v
+    vec = np.zeros(len(vocab), dtype=np.float32)
+    for word, cnt in Counter(tokenize(text)).items():
+        if word in vocab:
+            vec[vocab[word]] = cnt
+    norm = np.linalg.norm(vec)
+    return vec / norm if norm > 0 else vec
 
 
-def turn_texts(run_data: dict) -> list[str]:
-    """Extract one text string per assistant turn (text + tool-call summary)."""
+def turn_texts(run, fallback_any_message: bool = False) -> list[str]:
+    source = run.transcript.messages if fallback_any_message else run.transcript.assistant_messages
     out = []
-    for m in run_data.get("transcript", {}).get("messages", []):
-        if m.get("role") != "assistant":
-            continue
+    for msg in source:
         parts = []
-        if m.get("text"):
-            parts.append(m["text"])
-        for tc in (m.get("tool_calls") or []):
-            name = tc.get("name", "")
-            args_str = json.dumps(tc.get("arguments", {}))[:200]
-            parts.append(f"{name} {args_str}")
+        if msg.text:
+            parts.append(msg.text)
+        for tc in msg.tool_calls:
+            parts.append(tc.name)
+            if tc.input:
+                parts.append(json.dumps(tc.input, sort_keys=True)[:200])
         if parts:
             out.append(" ".join(parts))
     return out
 
 
-def trajectory_metrics(vecs: np.ndarray) -> dict:
-    """Compute dynamical metrics over a (n_turns, d) trajectory matrix."""
+def trajectory_metrics(vecs: np.ndarray) -> dict[str, float]:
+    """Compute drift, recurrence, and support-volume proxies for one run."""
     n = vecs.shape[0]
     if n < 2:
-        return {"n_turns": n, "drift_mean": 0.0, "from_start": 0.0,
-                "recurrence": 0.0, "vol_log": -12.0}
-    # Drift: consecutive distances
+        return {
+            "n_turns": float(n),
+            "drift_mean": 0.0,
+
"from_start": 0.0, + "recurrence": 0.0, + "vol_log": -12.0, + } + diffs = np.linalg.norm(np.diff(vecs, axis=0), axis=1) drift_mean = float(diffs.mean()) - # From start: max distance from turn 0 - dists_from_0 = np.linalg.norm(vecs - vecs[0:1], axis=1) - from_start = float(dists_from_0.max()) - # Recurrence: best non-adjacent cosine similarity (ignoring immediate neighbors) + from_start = float(np.linalg.norm(vecs - vecs[0:1], axis=1).max()) + recurrence = 0.0 for i in range(n): for j in range(i + 2, n): - ni, nj = np.linalg.norm(vecs[i]), np.linalg.norm(vecs[j]) + ni = np.linalg.norm(vecs[i]) + nj = np.linalg.norm(vecs[j]) if ni > 0 and nj > 0: - c = float(vecs[i] @ vecs[j] / (ni * nj)) - if c > recurrence: - recurrence = c - # Vol_log: log det of turn-state covariance + sim = float(vecs[i] @ vecs[j] / (ni * nj)) + recurrence = max(recurrence, sim) + if n >= 3: - Sigma = np.cov(vecs.T) - # Use log|Σ + εI|; since d is large (500) we take eigenvalues + clip - eigs = np.linalg.eigvalsh(Sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32)) + sigma = np.cov(vecs.T) + eigs = np.linalg.eigvalsh(sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32)) vol_log = float(np.log(np.clip(eigs, 1e-12, None)).sum()) else: vol_log = -12.0 + return { - "n_turns": n, + "n_turns": float(n), "drift_mean": drift_mean, "from_start": from_start, "recurrence": recurrence, @@ -142,109 +114,105 @@ def trajectory_metrics(vecs: np.ndarray) -> dict: } -def classify(m: dict, thresholds: dict) -> str: - """Classify based on quartile thresholds of the actual distribution. 
-
-    Thresholds (set empirically from observed distribution):
-      drift_low = p25 drift_hi = p75
-      vol_low = p25 vol_hi = p75
-      rec_hi = p75
-
-    Rules (priority order):
-      n_turns < 3 → too_short
-      drift < drift_low AND vol < vol_low → trapped
-      rec > rec_hi AND drift < median → limit_cycle
-      drift > drift_hi AND vol > vol_hi → diffusive
-      else → mixed
-    """
-    n = m["n_turns"]
-    if n < 3:
+def classify(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
+    """Map trajectory metrics to a coarse regime label."""
+    n_turns = int(metrics["n_turns"])
+    if n_turns < 3:
         return "too_short"
-    d = m["drift_mean"]
-    rec = m["recurrence"]
-    vol = m["vol_log"]
-    if d < thresholds["drift_low"] and vol < thresholds["vol_low"]:
+    drift = metrics["drift_mean"]
+    recurrence = metrics["recurrence"]
+    vol = metrics["vol_log"]
+
+    if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
         return "trapped"
-    if rec > thresholds["rec_hi"] and d < thresholds["drift_med"]:
+    if recurrence > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
         return "limit_cycle"
-    if d > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
+    if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
         return "diffusive"
     return "mixed"
 
 
 def main() -> None:
-    # First pass: collect turn texts to build vocab
+    parser = argparse.ArgumentParser(description="Classify cached run regimes")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
     all_turn_texts: list[str] = []
-    run_turns: dict[tuple, list[str]] = {}
-    for model in MODELS:
-        for rf in (ARCH / model).rglob("run*.json"):
-            try:
-                d =
json.loads(rf.read_text())
-            except Exception:
-                continue
-            task = rf.parent.name
-            run_idx = int(re.match(r"run(\d+)", rf.stem).group(1))
-            ts = turn_texts(d)
-            run_turns[(model, task, run_idx)] = ts
-            all_turn_texts.extend(ts)
+    run_turns: dict[str, list[str]] = {}
+
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            for run in runs:
+                ts = turn_texts(run, fallback_any_message=False)
+                key = f"{model_name}/{task_id}/run{run.run_index}"
+                run_turns[key] = ts
+                all_turn_texts.extend(ts)
+
+    used_fallback_messages = False
+    if not all_turn_texts:
+        used_fallback_messages = True
+        all_turn_texts = []
+        run_turns = {}
+        for model_name, task_runs in grouped.items():
+            for task_id, runs in task_runs.items():
+                for run in runs:
+                    ts = turn_texts(run, fallback_any_message=True)
+                    key = f"{model_name}/{task_id}/run{run.run_index}"
+                    run_turns[key] = ts
+                    all_turn_texts.extend(ts)
+
+    if not all_turn_texts:
+        raise SystemExit("No usable turn text found in archive.")
 
     vocab = build_vocab(all_turn_texts, top_k=500)
-    print(f"Runs collected: {len(run_turns)} vocab size: {len(vocab)}")
 
-    # Second pass: vectorize + compute metrics
-    per_run: dict[str, dict] = {}
+    per_run: dict[str, dict[str, float | str]] = {}
     for key, ts in run_turns.items():
-        model, task, run_idx = key
         if not ts:
             continue
-        vecs = np.stack([vectorize(t, vocab) for t in ts])
-        m = trajectory_metrics(vecs)
-        per_run[f"{model}/{task}/run{run_idx}"] = m
+        vecs = np.stack([vectorize(text, vocab) for text in ts])
+        per_run[key] = trajectory_metrics(vecs)
 
-    # Derive thresholds from actual distribution of n_turns>=3 runs
-    drifts = np.array([v["drift_mean"] for v in per_run.values() if v["n_turns"] >= 3])
-    recs = np.array([v["recurrence"] for v in per_run.values() if v["n_turns"] >= 3])
-    vols = np.array([v["vol_log"] for v in per_run.values() if v["n_turns"] >= 3])
-    thresholds = {
-        "drift_low": float(np.percentile(drifts, 25)),
-        "drift_med": float(np.percentile(drifts,
50)),
-        "drift_hi": float(np.percentile(drifts, 75)),
-        "vol_low": float(np.percentile(vols, 25)),
-        "vol_hi": float(np.percentile(vols, 75)),
-        "rec_hi": float(np.percentile(recs, 75)),
-    }
-    print(f"\nThresholds (quartile-based from observed distribution):")
-    for k, v in thresholds.items():
-        print(f" {k:<12} {v:>10.3f}")
+    eligible = [r for r in per_run.values() if int(r["n_turns"]) >= 3]
+    if eligible:
+        drifts = np.array([float(v["drift_mean"]) for v in eligible])
+        recs = np.array([float(v["recurrence"]) for v in eligible])
+        vols = np.array([float(v["vol_log"]) for v in eligible])
+        thresholds = {
+            "drift_low": float(np.percentile(drifts, 25)),
+            "drift_med": float(np.percentile(drifts, 50)),
+            "drift_hi": float(np.percentile(drifts, 75)),
+            "vol_low": float(np.percentile(vols, 25)),
+            "vol_hi": float(np.percentile(vols, 75)),
+            "rec_hi": float(np.percentile(recs, 75)),
+        }
+    else:
+        thresholds = {
+            "drift_low": 0.15,
+            "drift_med": 0.25,
+            "drift_hi": 0.35,
+            "vol_low": -6.0,
+            "vol_hi": -3.0,
+            "rec_hi": 0.8,
+        }
 
-    # Apply classifier with thresholds
-    for key in per_run:
-        per_run[key]["regime"] = classify(per_run[key], thresholds)
+    for key, metrics in per_run.items():
+        metrics["regime"] = classify(metrics, thresholds)
+        metrics["turn_source"] = "any_message" if used_fallback_messages else "assistant"
 
-    # Summary by regime
-    counts = Counter(v["regime"] for v in per_run.values())
-    print(f"\nRegime distribution (n={len(per_run)} runs):")
-    for regime, n in counts.most_common():
-        print(f" {regime:<14} {n:>4} ({100*n/len(per_run):>4.1f}%)")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out = args.reports_dir / "regimes.json"
+    out.write_text(json.dumps(per_run, indent=2), encoding="utf-8")
 
-    # Per-model regime breakdown
-    print(f"\n{'Model':<10} " + " ".join(f"{r:>11}" for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]))
-    print("-" * 70)
-    pm_counts = defaultdict(Counter)
-    for key, v in per_run.items():
-        model =
key.split("/")[0]
-        pm_counts[model][v["regime"]] += 1
-    for model in MODELS:
-        row = [f"{model.split('_')[-1][:9]:<10}"]
-        for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]:
-            row.append(f"{pm_counts[model][r]:>11}")
-        print(" ".join(row))
-
-    # Write output
-    out = ROOT / "reports" / "regimes.json"
-    out.parent.mkdir(exist_ok=True)
-    out.write_text(json.dumps(per_run, indent=2))
-    print(f"\nWrote: {out}")
+    counts = Counter(str(v["regime"]) for v in per_run.values())
+    print(f"Wrote: {out}")
+    print(f"Regime counts: {dict(counts)}")
 
 
 if __name__ == "__main__":
diff --git a/scripts/compute_constraint_index.py b/scripts/compute_constraint_index.py
index 8ab5dd7..4f6adae 100644
--- a/scripts/compute_constraint_index.py
+++ b/scripts/compute_constraint_index.py
@@ -1,145 +1,127 @@
-"""Compute Constraint Index C(q) per task from existing v4-19-full archive.
+#!/usr/bin/env python3
+"""Compute posterior Constraint Index C(q) from cached runs.
 
-Following "When LLMs Are Dreaming..." paper §Query-design:
+Task-level constraint index:
 
-    C(q) = z(PR(q)) + z(entropy(q)) + z(BOPS(q))
+    C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))
 
 Where:
-  - PR(q): participation ratio = (tr Σ)² / tr(Σ²) of response embeddings
-           across all (model, run) responses to query q. Low PR = everyone
-           writes similar thing (prompt is constrained). High PR = responses
-           spread out (prompt is open-ended).
-  - entropy(q): Shannon entropy of (discretized) response-feature distribution.
-  - BOPS(q): Bayesian Optimal Prediction Score — how well can we predict
-             response given q? Proxied here as inter-run cosine similarity
-             for the same model (high similarity = high predictability).
 
-Since we don't have sentence-transformers, we use TF-IDF-style bag-of-words
-from the final assistant message per run. This is crude but measures the
-same signal — whether models produce similar vs divergent output.
+    PR(q) = participation ratio of the task response covariance
+    H(q) = Shannon entropy of the covariance eigenspectrum
+    BOPS(q) = within-model inter-run predictability proxy
 
-Output: reports/constraint_index.json with per-task C(q) components +
-        combined z-score.
+High C(q) means a task is more constrained: models and repeated runs tend to
+land in a narrower response manifold. Low C(q) means the task is more open or
+stylistically underconstrained.
 
-Usage:
-    .venv/bin/python3 scripts/compute_constraint_index.py
+This implementation uses a normalized bag-of-words representation built from
+the full assistant trajectory text plus tool-call names and compacted inputs.
 """
 from __future__ import annotations
 
+import argparse
 import json
 import re
-import glob
+import sys
 from collections import Counter, defaultdict
 from pathlib import Path
 
 import numpy as np
-from scipy.stats import entropy as shannon_entropy
 
-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
 
-MODELS = [
-    "anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
-    "anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
-    "google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
-    "openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
-    "openrouter_qwen_qwen3.6-plus",
-]
+from clawbench.dynamics_archive import load_task_runs_by_model
 
 WORD_RE = re.compile(r"[a-z]{3,}")
-STOPWORDS = set("the and that with this have from what your will can but not "
-                "was will are been one would there been they will their has "
-                "had its were only some than about these which into also each "
-                "when where them how who them very much more most other then "
-                "here such does like just make many like want need take".split())
+STOPWORDS = set(
+    "the and that with this have from what your will can but not "
+    "was are been one would there they their has had its were only some "
+    "than
about these which into also each when where them how who very "
+    "much more most other then here such does like just make many want need take".split()
+)
 
 
-def final_assistant_text(run_path: Path, max_chars: int = 4000) -> str:
-    """Extract the last assistant message text + tool-call arg summary."""
-    try:
-        d = json.loads(run_path.read_text())
-    except Exception:
-        return ""
-    msgs = d.get("transcript", {}).get("messages", [])
-    texts = []
-    for m in msgs:
-        if m.get("role") != "assistant":
-            continue
-        if m.get("text"):
-            texts.append(m["text"])
-        for tc in (m.get("tool_calls") or []):
-            name = tc.get("name", "")
-            args_str = json.dumps(tc.get("arguments", {}))[:200]
-            texts.append(f"{name} {args_str}")
-    blob = " ".join(texts)[:max_chars]
-    return blob
+def _assistant_trajectory_text(run, max_chars: int = 4000) -> str:
+    parts = []
+    for message in run.transcript.assistant_messages:
+        if message.text:
+            parts.append(message.text)
+        for call in message.tool_calls:
+            parts.append(call.name)
+            if call.input:
+                parts.append(json.dumps(call.input, sort_keys=True)[:200])
+    return " ".join(p for p in parts if p).strip()[:max_chars]
+
+
+def _fallback_text_from_any_message(run) -> str:
+    for msg in reversed(run.transcript.messages):
+        parts = []
+        if msg.text:
+            parts.append(msg.text)
+        for call in msg.tool_calls:
+            parts.append(call.name)
+            if call.input:
+                parts.append(json.dumps(call.input, sort_keys=True)[:200])
+        if parts:
+            return " ".join(parts).strip()
+    return ""
 
 
 def tokenize(text: str) -> list[str]:
-    return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]
+    return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]
 
 
 def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
-    """Build a vocab of the top-k most common tokens across all texts."""
-    counter = Counter()
-    for t in texts:
-        counter.update(set(tokenize(t)))
-    return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
+    counts = Counter()
+
for text in texts: + counts.update(set(tokenize(text))) + return {word: idx for idx, (word, _) in enumerate(counts.most_common(top_k))} def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray: - """TF-IDF-ish: token frequency normalized to unit L2 for cosine geometry.""" - v = np.zeros(len(vocab), dtype=np.float32) + vec = np.zeros(len(vocab), dtype=np.float32) toks = tokenize(text) if not toks: - return v + return vec counts = Counter(toks) - for w, c in counts.items(): - if w in vocab: - v[vocab[w]] = c - n = np.linalg.norm(v) - return v / n if n > 0 else v + for word, cnt in counts.items(): + if word in vocab: + vec[vocab[word]] = cnt + norm = np.linalg.norm(vec) + return vec / norm if norm > 0 else vec def participation_ratio(X: np.ndarray) -> float: - """PR(X) = (tr Σ)² / tr(Σ²). Measures effective dimensionality 1–d.""" + """PR(X) = (tr Sigma)^2 / tr(Sigma^2), an effective dimensionality proxy.""" if X.shape[0] < 2: return 1.0 - Sigma = np.cov(X.T) - if Sigma.ndim == 0: + sigma = np.cov(X.T) + if sigma.ndim == 0: return 1.0 - tr = np.trace(Sigma) - tr_sq = np.trace(Sigma @ Sigma) + tr = np.trace(sigma) + tr_sq = np.trace(sigma @ sigma) if tr_sq < 1e-12: return 1.0 - return float(tr ** 2 / tr_sq) + return float((tr**2) / tr_sq) -def response_entropy(X: np.ndarray, n_clusters: int = 8) -> float: - """Entropy of a k-means-like discretization of responses. - - Since we have small n per task (~27 responses), we cluster by nearest- - centroid using the top-few PCA directions. Simpler: use normalized - eigenvalues of covariance as a proxy for entropy over principal modes. 
- """ +def response_entropy(X: np.ndarray) -> float: + """Entropy over normalized covariance eigenvalues, in bits.""" if X.shape[0] < 2: return 0.0 - Sigma = np.cov(X.T) - eigs = np.linalg.eigvalsh(Sigma) + sigma = np.cov(X.T) + eigs = np.linalg.eigvalsh(sigma) eigs = np.clip(eigs, 1e-12, None) - eigs = eigs / eigs.sum() - return float(shannon_entropy(eigs, base=2)) + probs = eigs / eigs.sum() + return float(-np.sum(probs * np.log2(probs))) def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> float: - """BOPS proxy: inter-run cosine similarity within same model. - - High similarity = predictable (high BOPS). Low similarity = novel each run. - Returns mean cosine across all pairs within each model, averaged across models. - """ + """Mean within-model pairwise cosine similarity across repeated runs.""" per_model_means = [] - for _model, vecs in run_vecs.items(): + for vecs in run_vecs.values(): if len(vecs) < 2: continue sims = [] @@ -154,91 +136,88 @@ def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> floa return float(np.mean(per_model_means)) if per_model_means else 0.0 +def zscore(value: float, arr: np.ndarray) -> float: + std = arr.std() + return float((value - arr.mean()) / std) if std > 1e-12 else 0.0 + + def main() -> None: - # Gather: per-task list of texts + per-model list of per-run vectors + parser = argparse.ArgumentParser(description="Compute posterior constraint index per task") + parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache")) + parser.add_argument("--reports-dir", type=Path, default=Path("reports")) + parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None) + args = parser.parse_args() + + grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier) + if not grouped: + raise SystemExit(f"No cached runs found under {args.archive_dir}") + per_task_texts: dict[str, list[str]] = defaultdict(list) - per_task_model_runs: 
dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list)) - for model in MODELS: - model_dir = ARCH / model - if not model_dir.exists(): - continue - for task_dir in model_dir.iterdir(): - if not task_dir.is_dir(): - continue - task = task_dir.name - for rf in sorted(task_dir.glob("run*.json")): - text = final_assistant_text(rf) + per_task_model_texts: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list)) + + use_fallback_messages = False + for model_name, task_runs in grouped.items(): + for task_id, runs in task_runs.items(): + for run in runs: + text = _assistant_trajectory_text(run) if text: - per_task_texts[task].append(text) - per_task_model_runs[task][model].append(text) + per_task_texts[task_id].append(text) + per_task_model_texts[task_id][model_name].append(text) - print(f"Tasks with responses: {len(per_task_texts)}") + all_texts = [text for texts in per_task_texts.values() for text in texts] + if not all_texts: + use_fallback_messages = True + for model_name, task_runs in grouped.items(): + for task_id, runs in task_runs.items(): + for run in runs: + text = _fallback_text_from_any_message(run) + if text: + per_task_texts[task_id].append(text) + per_task_model_texts[task_id][model_name].append(text) + all_texts = [text for texts in per_task_texts.values() for text in texts] + + if not all_texts: + raise SystemExit("No usable text found in cached transcripts.") - # Build a GLOBAL vocab across all tasks for comparable vector spaces - all_texts = [t for ts in per_task_texts.values() for t in ts] vocab = build_vocab(all_texts, top_k=500) - print(f"Global vocab size: {len(vocab)}") - - # Compute per-task metrics - per_task: dict[str, dict] = {} - for task, texts in sorted(per_task_texts.items()): - if len(texts) < 5: - continue - X = np.stack([vectorize(t, vocab) for t in texts]) # (n_responses, vocab_dim) + per_task: dict[str, dict[str, float | str]] = {} + for task_id, texts in sorted(per_task_texts.items()): + X = 
np.stack([vectorize(text, vocab) for text in texts]) pr = participation_ratio(X) ent = response_entropy(X) - # BOPS: within-model run predictability - model_vecs: dict[str, list[np.ndarray]] = {} - for m, ts in per_task_model_runs[task].items(): - model_vecs[m] = [vectorize(t, vocab) for t in ts] + model_vecs = { + model_name: [vectorize(text, vocab) for text in model_texts] + for model_name, model_texts in per_task_model_texts[task_id].items() + } bops = bops_inter_run_predictability(model_vecs) - per_task[task] = { + per_task[task_id] = { "n_responses": len(texts), "PR": pr, "entropy": ent, "BOPS": bops, + "data_source": "fallback_any_message" if use_fallback_messages else "assistant_trajectory", } - # Z-score each component across tasks → combine into C(q) + if not per_task: + raise SystemExit("Not enough data to compute C(q).") + prs = np.array([v["PR"] for v in per_task.values()]) ents = np.array([v["entropy"] for v in per_task.values()]) bopss = np.array([v["BOPS"] for v in per_task.values()]) - def z(x, arr): - return float((x - arr.mean()) / (arr.std() or 1.0)) + for task_id, v in per_task.items(): + z_pr = zscore(v["PR"], prs) + z_ent = zscore(v["entropy"], ents) + z_bops = zscore(v["BOPS"], bopss) + v["z_PR"] = z_pr + v["z_entropy"] = z_ent + v["z_BOPS"] = z_bops + v["C_q"] = -z_pr - z_ent + z_bops - for task, v in per_task.items(): - zpr = z(v["PR"], prs) - zent = z(v["entropy"], ents) - zbops = z(v["BOPS"], bopss) - # Paper: higher PR/entropy = MORE open-ended. Higher BOPS = MORE predictable. - # "Constraint" = opposite of openness. C(q) high ⇒ constrained task.
- # So: C(q) = −z(PR) − z(entropy) + z(BOPS) - v["z_PR"] = zpr - v["z_entropy"] = zent - v["z_BOPS"] = zbops - v["C_q"] = -zpr - zent + zbops - - # Sort + print - ranked = sorted(per_task.items(), key=lambda kv: -kv[1]["C_q"]) - print(f"\n{'Task':<38} {'n':>3} {'PR':>5} {'H':>5} {'BOPS':>5} {'C(q)':>6} (constraint level)") - print("-" * 78) - for task, v in ranked: - print(f"{task:<38} {v['n_responses']:>3} {v['PR']:>5.2f} {v['entropy']:>5.2f} " - f"{v['BOPS']:>5.2f} {v['C_q']:>+6.2f}") - - out_path = ROOT / "reports" / "constraint_index.json" - out_path.parent.mkdir(exist_ok=True) - out_path.write_text(json.dumps(per_task, indent=2)) - print(f"\nWrote: {out_path}") - - # Bucket summary - highs = [t for t, v in per_task.items() if v["C_q"] > 0.5] - lows = [t for t, v in per_task.items() if v["C_q"] < -0.5] - mids = [t for t, v in per_task.items() if -0.5 <= v["C_q"] <= 0.5] - print(f"\nHigh-constraint (C>+0.5): {len(highs)} tasks (responses converge)") - print(f"Mid: {len(mids)} tasks") - print(f"Low-constraint (C<-0.5): {len(lows)} tasks (responses diverge — open-ended)") + args.reports_dir.mkdir(parents=True, exist_ok=True) + out_path = args.reports_dir / "constraint_index.json" + out_path.write_text(json.dumps(per_task, indent=2), encoding="utf-8") + print(f"Wrote: {out_path}") if __name__ == "__main__": diff --git a/scripts/generate_dynamical_report.py b/scripts/generate_dynamical_report.py index 5aa2210..55d52e8 100644 --- a/scripts/generate_dynamical_report.py +++ b/scripts/generate_dynamical_report.py @@ -1,221 +1,144 @@ -"""Assemble a combined dynamical-systems report integrating: - - Constraint Index C(q) per task - - Regime classification per run - - Seed vs capability variance - - Survival / hazard analysis +#!/usr/bin/env python3 +"""Assemble a combined posterior dynamical-systems markdown report. 
-Requires: reports/constraint_index.json, reports/regimes.json, - reports/variance_decomposition.json, reports/survival_analysis.json +Inputs: + - constraint_index.json + - regimes.json + - variance_decomposition.json + - survival_analysis.json + - snr_weighted_ranking.json (optional) -Output: reports/EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md +Output: + - EVAL_REPORT_DYNAMICAL.md + +The goal is to keep a compact human-readable summary next to the machine +outputs produced by the posterior analysis pipeline. """ from __future__ import annotations +import argparse import json from collections import Counter, defaultdict from pathlib import Path -from statistics import mean -ROOT = Path(__file__).resolve().parent.parent -REPORTS = ROOT / "reports" -MODEL_MAP = { - "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"), - "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"), - "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"), - "gpt54": ("openai_gpt-5.4", "GPT 5.4"), - "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"), - "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"), - "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"), - "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"), - "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"), -} +def _read_json(path: Path): + if not path.exists(): + raise SystemExit(f"Missing required report file: {path}") + return json.loads(path.read_text(encoding="utf-8")) def main() -> None: - cq = json.loads((REPORTS / "constraint_index.json").read_text()) - regimes = json.loads((REPORTS / "regimes.json").read_text()) - variance = json.loads((REPORTS / "variance_decomposition.json").read_text()) - survival = json.loads((REPORTS / "survival_analysis.json").read_text()) - - lines = [] - L = lines.append - L("# ClawBench — Dynamical Systems Analysis (v2026-4-19-full)") - L("") - L("Inspired by *\"When LLMs Are Dreaming, Where Do They Go?\"* — treats") - L("agent runs as dynamical systems and extracts signal 
ClawBench's flat") - L("run_score can't: task constraint level, per-run regime, noise vs") - L("signal ratio, and per-turn survival curves.") - L("") - - # ----------------- 1. Constraint Index summary ----------------- - L("## 1. Constraint Index C(q) per task") - L("") - L("C(q) = −z(PR) − z(entropy) + z(BOPS). High C(q) = task is constrained") - L("(responses converge); low C(q) = open-ended (responses diverge).") - L("") - high = sorted([(t, v) for t, v in cq.items() if v["C_q"] > 0.5], - key=lambda kv: -kv[1]["C_q"]) - low = sorted([(t, v) for t, v in cq.items() if v["C_q"] < -0.5], - key=lambda kv: kv[1]["C_q"]) - mid = [t for t, v in cq.items() if -0.5 <= v["C_q"] <= 0.5] - L(f"- **High-constraint ({len(high)} tasks, C>+0.5):** {', '.join(t for t, _ in high[:5])}, …") - L(f"- **Low-constraint ({len(low)} tasks, C<−0.5):** {', '.join(t for t, _ in low[:5])}, …") - L(f"- **Middle ({len(mid)} tasks):** {', '.join(mid[:5])}, …") - L("") - L("Top 5 most-constrained and most-divergent tasks:") - L("") - L("| Constraint | Task | PR | Entropy | BOPS | C(q) |") - L("|---|---|:---:|:---:|:---:|:---:|") - for t, v in high[:5]: - L(f"| HIGH | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |") - for t, v in low[:5]: - L(f"| LOW | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |") - L("") - - # ----------------- 2. Regime distribution ----------------- - L("## 2. 
Dynamical regime per run") - L("") - L("Each run's turn-by-turn trajectory classified by drift, recurrence,") - L("and support volume thresholds (quartile-based).") - L("") - pm = defaultdict(Counter) - for key, v in regimes.items(): - model_sub = key.split("/")[0] - # Reverse-map to label - label = next((l for l, (s, _) in MODEL_MAP.items() if s == model_sub), None) - if label: - pm[label][v["regime"]] += 1 - L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |") - L("|---|:---:|:---:|:---:|:---:|:---:|") - for label, (_sub, pretty) in MODEL_MAP.items(): - c = pm[label] - L(f"| {pretty} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | " - f"{c['diffusive']} | {c['mixed']} |") - L("") - L("**Interpretation:**") - L("- `trapped` = low drift + small support: agent converges to a point.") - L(" Often good on constrained tasks, sometimes 'stuck'.") - L("- `limit_cycle` = repeats similar states non-consecutively: tool-use loop.") - L("- `diffusive` = keeps exploring without converging. Goal drift risk.") - L("- `mixed` = no strong signature.") - L("") - L("Notable findings:") - L("") - # Find outliers - trap_counts = [(label, pm[label]["trapped"]) for label in MODEL_MAP] - cycle_counts = [(label, pm[label]["limit_cycle"]) for label in MODEL_MAP] - trap_counts.sort(key=lambda x: -x[1]) - cycle_counts.sort(key=lambda x: -x[1]) - L(f"- Most `trapped` runs: **{MODEL_MAP[trap_counts[0][0]][1]}** ({trap_counts[0][1]} runs) —") - L(f" converges aggressively; often one-shot answer without iteration.") - L(f"- Most `limit_cycle` runs: **{MODEL_MAP[cycle_counts[0][0]][1]}** ({cycle_counts[0][1]} runs) —") - L(f" repeats tool patterns between turns; check for productive vs stuck loops.") - L("") - - # ----------------- 3. Variance decomposition ----------------- - L("## 3. 
Seed-noise vs capability-signal") - L("") - agg = variance["aggregate"] - L(f"- **Seed-noise variance** (same model, 3 runs): **{agg['mean_seed_var']:.4f}**") - L(f"- **Capability variance** (across models): **{agg['mean_cap_var']:.4f}**") - L(f"- **Capability fraction: {agg['capability_fraction']:.1%}**") - L(f" (= fraction of benchmark variance that reflects real model differences)") - L("") - L("**The other ~47% is seed noise.** Any ranking gap < √(2·seed_var) ≈") - L(f"0.20 between two models is within noise. Top-5 models' gap is 0.02 →") - L("**statistically indistinguishable.**") - L("") - L("### SNR tiers across 40 tasks") - L("") - per_task = variance["per_task"] - hi = [r for r in per_task if r["snr"] >= 5] - mid = [r for r in per_task if 1 <= r["snr"] < 5] - lo = [r for r in per_task if r["snr"] < 1] - L(f"- **High-SNR ({len(hi)} tasks, SNR ≥ 5):** reliably discriminate models") - for r in hi[:3]: - L(f" - `{r['task']}` (SNR={r['snr']:.1f})") - L(f"- **Mid-SNR ({len(mid)} tasks, 1 ≤ SNR < 5):** moderate signal") - L(f"- **Low-SNR ({len(lo)} tasks, SNR < 1):** seed noise dominates; these") - L(f" tasks give essentially random rankings") - for r in sorted(lo, key=lambda x: x['snr'])[:3]: - L(f" - `{r['task']}` (SNR={r['snr']:.2f}) — random") - L("") - - # ----------------- 4. Survival analysis ----------------- - L("## 4. Per-turn survival: when do runs fail?") - L("") - L("T_F = first turn where agent emits empty response or run ends in failure.") - L("S(t) = fraction of runs still on-track past turn t. 
Low = dies early.") - L("") - L("| Model | Median fail turn | S(3) | S(5) | S(8) | S(12) | S(20) |") - L("|---|:---:|:---:|:---:|:---:|:---:|:---:|") - for label, (_sub, pretty) in MODEL_MAP.items(): - d = survival.get(label, {}) - surv = d.get("survival", [0]*20) - med = d.get("median_fail_turn", "—") - med_str = f"{med:.1f}" if isinstance(med, (int, float)) and med != float("inf") else str(med) - L(f"| {pretty} | {med_str} | {surv[2]:.2f} | {surv[4]:.2f} | " - f"{surv[7]:.2f} | {surv[11]:.2f} | {surv[19]:.2f} |") - L("") - # Narrative - surv_rank_t8 = sorted( - [(label, survival[label]["survival"][7]) - for label in MODEL_MAP if label in survival], - key=lambda x: -x[1] + parser = argparse.ArgumentParser(description="Generate a combined dynamical report markdown") + parser.add_argument("--reports-dir", type=Path, default=Path("reports")) + parser.add_argument( + "--output", + type=Path, + default=None, + help="Markdown output path; defaults to EVAL_REPORT_DYNAMICAL.md under --reports-dir", ) - best = MODEL_MAP[surv_rank_t8[0][0]][1] - worst = MODEL_MAP[surv_rank_t8[-1][0]][1] - L(f"- **{best}** survives longest — {surv_rank_t8[0][1]:.0%} of runs still") - L(f" producing output at turn 8.") - L(f"- **{worst}** dies earliest — only {surv_rank_t8[-1][1]:.0%} make it to turn 8.") + args = parser.parse_args() + + reports = args.reports_dir + output_path = args.output or (reports / "EVAL_REPORT_DYNAMICAL.md") + cq = _read_json(reports / "constraint_index.json") + regimes = _read_json(reports / "regimes.json") + variance = _read_json(reports / "variance_decomposition.json") + survival = _read_json(reports / "survival_analysis.json") + ranking_path = reports / "snr_weighted_ranking.json" + ranking = json.loads(ranking_path.read_text(encoding="utf-8")) if ranking_path.exists() else None + + lines: list[str] = [] + L = lines.append + + L("# ClawBench Posterior Dynamical Report") L("") - L("This is signal invisible in flat run_score: two models can score") - L("similarly but have very 
different failure profiles. Pick accordingly") - L("for long-horizon deployments.") + L("This report combines posterior-only diagnostics from cached run artifacts.") L("") - # ----------------- 5. Integrated view ----------------- - L("## 5. Integrated view — combining all four lenses") + L("## 1. Constraint Index C(q)") L("") - L("For a model to be **reliably good** at a task, we need:") - L("- (a) It scores well (run_score high)") - L("- (b) Variance across seeds is low (predictable)") - L("- (c) It doesn't exhibit pathological regime (trapped on wrong answer / cycling)") - L("- (d) It survives multi-turn without dying early") + values = [(task, float(data.get("C_q", 0.0))) for task, data in cq.items()] + values.sort(key=lambda row: row[1], reverse=True) + highs = [row for row in values if row[1] > 0.5] + lows = [row for row in values if row[1] < -0.5] + L(f"- High-constraint tasks (C > 0.5): {len(highs)}") + L(f"- Low-constraint tasks (C < -0.5): {len(lows)}") L("") - L("These lenses disagree constructively:") + if values: + L("Top tasks by C(q):") + L("") + L("| Task | C(q) |") + L("|---|---:|") + for task, c_q in values[:10]: + L(f"| {task} | {c_q:+.3f} |") + L("") + + L("## 2. Regime Classification") L("") - L("- **Opus 4.6** tops flat run_score but median failure at turn 5.5 (earlier than Opus 4.7's 7).") - L("- **GPT 5.4** is mid-pack on flat score but has highest S(8)=0.60 — long-horizon champion.") - L("- **Sonnet 4.6** most `trapped` runs — it commits early and sticks. Good on") - L(" constrained tasks, bad on open-ended (cf. 
memory-recall-continuation 0.15).") - L("- **GLM 5.1** most balanced regime distribution; justifies broad performance.") - L("- **Kimi K2.5** median fail at turn 3 — it's not just low-scoring, it's") - L(" specifically fragile under multi-turn execution.") + by_model = defaultdict(Counter) + for key, row in regimes.items(): + model = key.split("/")[0] + regime = row.get("regime", "unknown") + by_model[model][regime] += 1 + + L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |") + L("|---|---:|---:|---:|---:|---:|") + for model in sorted(by_model): + c = by_model[model] + L( + f"| {model} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | " + f"{c['diffusive']} | {c['mixed']} |" + ) L("") - # ----------------- 6. What to do next ----------------- - L("## 6. Implications for the benchmark") + L("## 3. Variance Decomposition") + L("") + agg = variance.get("aggregate", {}) + L(f"- Mean seed variance: {agg.get('mean_seed_var', 0.0):.6f}") + L(f"- Mean capability variance: {agg.get('mean_cap_var', 0.0):.6f}") + L(f"- Capability fraction: {agg.get('capability_fraction', 0.0):.1%}") + L(f"- High-SNR tasks: {agg.get('high_snr_tasks', 0)}") + L(f"- Mid-SNR tasks: {agg.get('mid_snr_tasks', 0)}") + L(f"- Low-SNR tasks: {agg.get('low_snr_tasks', 0)}") L("") - L("- **47% seed noise** means any gap < 0.02 is meaningless. Treat top-5") - L(" as a statistical tie. Dropping the 21 low-SNR tasks would sharpen") - L(" remaining rankings considerably.") - L("- **Weight tasks by SNR × |C(q)|** instead of flat mean. 
High-SNR,") - L(" high-|C(q)| tasks give the cleanest capability signal.") - L("- **Report survival curves alongside run_score** to surface long-horizon") - L(" capability that single-number metrics hide.") - L("- **Flag 'trapped' runs that scored high** — the model may have") - L(" guessed-and-committed rather than reasoned; not same reliability.") - L("- **Add a Tier 6 long-horizon (100+ turn) task set** to actually") - L(" measure the dynamical regimes the paper proposes — current") - L(" trajectories are too short (median 6 assistant turns) for clean") - L(" Lyapunov or attractor diagnostics.") - out = REPORTS / "EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md" - out.write_text("\n".join(lines) + "\n") - print(f"Wrote: {out}") + L("## 4. Survival Analysis") + L("") + L("| Model | Runs | Events | Median failure turn | S(3) | S(5) | S(8) |") + L("|---|---:|---:|---:|---:|---:|---:|") + for model in sorted(survival): + row = survival[model] + surv = row.get("survival", [0.0] * 8) + med = row.get("median_fail_turn", "inf") + if isinstance(med, float) and med == float("inf"): + med_display = "inf" + else: + med_display = f"{float(med):.1f}" + L( + f"| {model} | {row.get('n_runs', 0)} | {row.get('n_events', 0)} | " + f"{med_display} | {surv[2] if len(surv) > 2 else 0.0:.2f} | " + f"{surv[4] if len(surv) > 4 else 0.0:.2f} | {surv[7] if len(surv) > 7 else 0.0:.2f} |" + ) + L("") + + if ranking is not None: + L("## 5. 
SNR-weighted Ranking") + L("") + L("| Rank | Model | Flat | SNR x \\|C(q)\\| | Winsorized | Coverage |") + L("|---:|---|---:|---:|---:|---:|") + for idx, row in enumerate(ranking.get("results", []), start=1): + L( + f"| {idx} | {row.get('model', '')} | {row.get('flat', 0.0):.4f} | " + f"{row.get('snr_x_abs_cq', 0.0):.4f} | {row.get('snr_x_abs_cq_winsorized', 0.0):.4f} | " + f"{row.get('coverage', 0)} |" + ) + L("") + + output_path.parent.mkdir(parents=True, exist_ok=True) + output_path.write_text("\n".join(lines) + "\n", encoding="utf-8") + print(f"Wrote: {output_path}") if __name__ == "__main__": diff --git a/scripts/run_posterior_dynamics_pipeline.py b/scripts/run_posterior_dynamics_pipeline.py new file mode 100644 index 0000000..eff95a7 --- /dev/null +++ b/scripts/run_posterior_dynamics_pipeline.py @@ -0,0 +1,89 @@ +#!/usr/bin/env python3 +"""Run the full posterior dynamical analysis pipeline.""" + +from __future__ import annotations + +import argparse +import subprocess +import sys +from pathlib import Path + + +REPO_ROOT = Path(__file__).resolve().parent.parent +sys.path.insert(0, str(REPO_ROOT)) + +from clawbench.dynamics_archive import discover_model_roots, load_task_runs_archive, write_dynamics_report + + +def _run(cmd: list[str]) -> None: + print("$", " ".join(cmd)) + result = subprocess.run(cmd, cwd=REPO_ROOT) + if result.returncode != 0: + raise SystemExit(result.returncode) + + +def _resolve_path(path: Path) -> Path: + return path if path.is_absolute() else (REPO_ROOT / path) + + +def _write_dynamics_reports( + archive_dir: Path, + output_dir: Path, + tier: str | None, +) -> None: + roots = discover_model_roots(archive_dir) + if not roots: + raise SystemExit(f"No cached runs found under {archive_dir}") + + multiple_models = len(roots) > 1 + wrote_any = False + for model_name, model_dir in roots.items(): + task_runs = load_task_runs_archive(model_dir, tier=tier) + if not task_runs: + continue + + wrote_any = True + model_output_dir = output_dir / model_name 
if multiple_models else output_dir + report_path, plots = write_dynamics_report(task_runs, model_output_dir) + n_runs = sum(len(runs) for runs in task_runs.values()) + + print(f"[dynamics] {model_name}: loaded {n_runs} cached runs across {len(task_runs)} tasks") + print(f"[dynamics] {model_name}: wrote {report_path}") + print(f"[dynamics] {model_name}: saved {len(plots)} plots to {model_output_dir}/") + + if not wrote_any: + raise SystemExit(f"No cached runs found under {archive_dir}") + + +def main() -> None: + parser = argparse.ArgumentParser(description="Run posterior dynamics pipeline end to end") + parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache")) + parser.add_argument("--reports-dir", type=Path, default=Path("reports")) + parser.add_argument("--output-dir", type=Path, default=Path("results/posterior_dynamics")) + parser.add_argument( + "--include-dynamics-report", + action="store_true", + help="Also build per-model dynamics.json files and plots from the archive.", + ) + parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None) + args = parser.parse_args() + + py = sys.executable + archive_dir = _resolve_path(args.archive_dir) + reports_dir = _resolve_path(args.reports_dir) + output_dir = _resolve_path(args.output_dir) + tier_args = ["--tier", args.tier] if args.tier else [] + scripts_dir = REPO_ROOT / "scripts" + + _run([py, str(scripts_dir / "compute_constraint_index.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args]) + _run([py, str(scripts_dir / "classify_regimes.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args]) + _run([py, str(scripts_dir / "variance_decomp.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args]) + _run([py, str(scripts_dir / "survival_analysis.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args]) + _run([py, 
str(scripts_dir / "snr_weighted_ranking.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args]) + _run([py, str(scripts_dir / "generate_dynamical_report.py"), "--reports-dir", str(reports_dir)]) + if args.include_dynamics_report: + _write_dynamics_reports(archive_dir, output_dir, args.tier) + + +if __name__ == "__main__": + main() diff --git a/scripts/snr_weighted_ranking.py b/scripts/snr_weighted_ranking.py index 52185d3..a308f8a 100644 --- a/scripts/snr_weighted_ranking.py +++ b/scripts/snr_weighted_ranking.py @@ -1,148 +1,130 @@ -"""SNR × |C(q)|-weighted ranking — the dynamical-systems-informed metric. +#!/usr/bin/env python3 +"""SNR x |C(q)| weighted ranking from posterior cached runs. -Motivation: from variance_decomp.py we know 47% of run_score variance is -seed noise. From compute_constraint_index.py we know some tasks are -high-constraint (everyone converges) and others are open-ended (responses -diverge for style reasons, not capability). +Weighted headline score: -Weighted mean: - w(task) = SNR(task) × |C(q)(task)| - score(model) = Σ_task w(task) · mean_run_score(task, model) / Σ_task w(task) + w(q) = max(0, SNR(q)) * |C(q)| + score(model) = sum_q w(q) * mean_run_score(model, q) / sum_q w(q) -Why: -- High SNR tasks contribute more than low-SNR tasks (noise-weighted) -- |C(q)| amplifies tasks that are either strongly constrained OR strongly - open-ended (i.e. measures what they're supposed to measure, regardless - of polarity) -- Moderate C(q) tasks (C near 0) are inherently ambiguous — down-weighted +We also report: -Outputs: - - Per-model weighted score - - Comparison against flat-mean ranking - - Published to reports/snr_weighted_ranking.json + snr_only = SNR-weighted mean + snr_x_abs_cq = SNR x |C(q)| weighted mean + snr_x_abs_cq_winsorized = same, but top task weights are clamped at p95 + +This keeps noisy low-SNR tasks from dominating and upweights tasks whose +response geometry suggests a stronger capability signal. 
""" from __future__ import annotations -import glob +import argparse import json +import sys from collections import defaultdict from pathlib import Path from statistics import mean import numpy as np -ROOT = Path(__file__).resolve().parent.parent -ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full" -REPORTS = ROOT / "reports" +sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) -MODELS = { - "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"), - "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"), - "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"), - "gpt54": ("openai_gpt-5.4", "GPT 5.4"), - "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"), - "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"), - "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"), - "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"), - "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"), -} +from clawbench.dynamics_archive import load_task_runs_by_model def main() -> None: - cq = json.loads((REPORTS / "constraint_index.json").read_text()) - var = json.loads((REPORTS / "variance_decomposition.json").read_text()) - snr_by_task = {r["task"]: r["snr"] for r in var["per_task"]} + parser = argparse.ArgumentParser(description="Compute SNR-weighted posterior model ranking") + parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache")) + parser.add_argument("--reports-dir", type=Path, default=Path("reports")) + parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None) + args = parser.parse_args() - # Per (model, task): mean run_score over the 3 runs - per_mt: dict[str, dict[str, list[float]]] = defaultdict(dict) - for label, (sub, _) in MODELS.items(): - for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"): - try: - d = json.loads(Path(p).read_text()) - except Exception: - continue - task = p.split("/")[-2] - per_mt[label].setdefault(task, []).append(d.get("run_score", 0)) - per_mt_mean = { 
- m: {t: mean(v) for t, v in d.items() if v} for m, d in per_mt.items() + cq_path = args.reports_dir / "constraint_index.json" + var_path = args.reports_dir / "variance_decomposition.json" + if not cq_path.exists() or not var_path.exists(): + raise SystemExit("Missing prerequisite reports: run compute_constraint_index.py and variance_decomp.py first.") + + cq = json.loads(cq_path.read_text(encoding="utf-8")) + var = json.loads(var_path.read_text(encoding="utf-8")) + snr_by_task = {row["task"]: row["snr"] for row in var.get("per_task", [])} + + grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier) + if not grouped: + raise SystemExit(f"No cached runs found under {args.archive_dir}") + + per_model_task_scores: dict[str, dict[str, list[float]]] = defaultdict(dict) + for model_name, task_runs in grouped.items(): + for task_id, runs in task_runs.items(): + per_model_task_scores[model_name][task_id] = [float(run.run_score) for run in runs] + + per_model_task_mean = { + model_name: { + task_id: mean(vals) + for task_id, vals in task_scores.items() + if vals + } + for model_name, task_scores in per_model_task_scores.items() } - # Only consider tasks present in both C(q) and SNR common_tasks = sorted(set(cq) & set(snr_by_task)) - print(f"Using {len(common_tasks)} tasks with both C(q) and SNR.") + if not common_tasks: + raise SystemExit("No overlap between constraint_index and variance_decomposition task sets.") - # Compute weights w(task) = SNR × |C(q)|, clamped to [0, ∞) - weights = {} - for t in common_tasks: - w = max(0.0, snr_by_task[t]) * abs(cq[t]["C_q"]) - weights[t] = w - # Also: SNR-only weighting (simpler, no C(q)) - snr_weights = {t: max(0.0, snr_by_task[t]) for t in common_tasks} - # Also: Winsorize — clamp top-1 task's weight to 95th percentile to - # prevent single task from dominating - import numpy as _np - _w95 = float(_np.percentile(list(weights.values()), 95)) - weights_wins = {t: min(w, _w95) for t, w in weights.items()} - wsum = 
sum(weights.values()) - if wsum == 0: - print("All weights zero — bail.") - return + weights = {task: max(0.0, snr_by_task[task]) * abs(cq[task].get("C_q", 0.0)) for task in common_tasks} + snr_weights = {task: max(0.0, snr_by_task[task]) for task in common_tasks} - # Compute per-model scores under 3 variants - results = [] + w95 = float(np.percentile(list(weights.values()), 95)) if weights else 0.0 + winsorized = {task: min(weight, w95) for task, weight in weights.items()} + + w_sum = sum(weights.values()) snr_sum = sum(snr_weights.values()) - wins_sum = sum(weights_wins.values()) - for label, (sub, pretty) in MODELS.items(): - task_means = per_mt_mean.get(label, {}) - if not task_means: + wins_sum = sum(winsorized.values()) + + results = [] + for model_name, task_means in per_model_task_mean.items(): + covered = [task for task in common_tasks if task in task_means] + if not covered: continue - num_cq = sum(weights[t] * task_means.get(t, 0) for t in common_tasks) - num_snr = sum(snr_weights[t] * task_means.get(t, 0) for t in common_tasks) - num_wins = sum(weights_wins[t] * task_means.get(t, 0) for t in common_tasks) - wscore = num_cq / wsum - snr_only = num_snr / snr_sum if snr_sum > 0 else 0 - wins_score = num_wins / wins_sum if wins_sum > 0 else 0 - flat = mean(task_means[t] for t in common_tasks if t in task_means) - results.append((label, pretty, flat, wscore, snr_only, wins_score)) - print() - print(f"{'Model':<16} {'Flat':>7} {'SNR×|C|':>8} {'Winsorized':>11} {'SNR-only':>9}") - print("-" * 66) - # Rank by winsorized variant (primary) - for label, pretty, flat, w, snr_only, wins in sorted(results, key=lambda x: -x[5]): - print(f"{pretty:<16} {flat:>7.4f} {w:>8.4f} {wins:>11.4f} {snr_only:>9.4f}") + flat = mean(task_means[task] for task in covered) + weighted = ( + sum(weights[task] * task_means.get(task, 0.0) for task in common_tasks) / w_sum + if w_sum > 1e-12 + else 0.0 + ) + snr_only = ( + sum(snr_weights[task] * task_means.get(task, 0.0) for task in 
common_tasks) / snr_sum + if snr_sum > 1e-12 + else 0.0 + ) + wins_score = ( + sum(winsorized[task] * task_means.get(task, 0.0) for task in common_tasks) / wins_sum + if wins_sum > 1e-12 + else 0.0 + ) - # Rank comparisons - print("\n=== Ranking shifts vs flat-mean (winsorized) ===") - flat_rank_order = sorted(results, key=lambda x: -x[2]) - flat_rank = {r[0]: i + 1 for i, r in enumerate(flat_rank_order)} - wins_rank_order = sorted(results, key=lambda x: -x[5]) - print(f"{'Rank':<5}{'Model':<16} {'Flat':>8} {'Winsorized':>11} {'Δrank':>6}") - for i, (label, pretty, flat, _w, _snr, wins) in enumerate(wins_rank_order, 1): - fr = flat_rank[label] - move = "" - if fr > i: move = f"↑{fr-i}" - elif fr < i: move = f"↓{i-fr}" - print(f"{i:<5}{pretty:<16} {flat:>8.4f} {wins:>11.4f} {move:>6}") + results.append( + { + "model": model_name, + "flat": float(flat), + "snr_x_abs_cq": float(weighted), + "snr_only": float(snr_only), + "snr_x_abs_cq_winsorized": float(wins_score), + "coverage": len(covered), + } + ) + + results.sort(key=lambda row: row["snr_x_abs_cq_winsorized"], reverse=True) - # Save out = { - "flat_score": {r[0]: r[2] for r in results}, - "snr_x_cq_weighted": {r[0]: r[3] for r in results}, - "snr_x_cq_winsorized": {r[0]: r[5] for r in results}, - "snr_only_weighted": {r[0]: r[4] for r in results}, - "weights_per_task": weights, "common_tasks": common_tasks, + "weights_per_task": weights, + "results": results, } - (REPORTS / "snr_weighted_ranking.json").write_text(json.dumps(out, indent=2)) - print(f"\nWrote reports/snr_weighted_ranking.json") - # Show top-5 contributing tasks (highest weight) for context - print() - print("Top-10 tasks by weight (SNR × |C(q)|):") - for t, w in sorted(weights.items(), key=lambda kv: -kv[1])[:10]: - print(f" {t:<38} SNR={snr_by_task[t]:>5.1f} |C(q)|={abs(cq[t]['C_q']):>5.2f} w={w:>6.2f}") + out_path = args.reports_dir / "snr_weighted_ranking.json" + out_path.write_text(json.dumps(out, indent=2), encoding="utf-8") + print(f"Wrote: 
{out_path}") if __name__ == "__main__": diff --git a/scripts/survival_analysis.py b/scripts/survival_analysis.py index f9859b7..846ed4f 100644 --- a/scripts/survival_analysis.py +++ b/scripts/survival_analysis.py @@ -1,164 +1,118 @@ -"""Per-turn survival analysis: when do agent runs fail? +#!/usr/bin/env python3 +"""Per-turn survival analysis on posterior cached runs. -Following paper §Latent-state survival: - T_F = inf { t ≥ 0 : failure at time t } - S(t) = P(T_F > t) — survival function - h(t) = P(T_F = t | T_F ≥ t) — hazard rate +For each run, define a failure time T_F as the first assistant turn where the +agent emits neither text nor tool calls, or the final assistant turn of an +unsuccessful run (run_score < 0.7) with delivery outcome in {fail, partial}. -For each run, we define FAILURE as the first turn where: - (a) the assistant emits no text AND no tool calls, OR - (b) the run's delivery_outcome is 'fail'/'partial' AND the transcript - ended at this turn (no more assistant turns follow). +We then estimate: -T_F = assistant-turn index of first failure (starting at 1). -If the run succeeded (run_score ≥ 0.7), T_F is right-censored at the -final turn count N (i.e. survived the whole trajectory). + S(t) = P(T_F > t) + h(t) = P(T_F = t | T_F >= t) -Output per model: - - Median turn-to-failure - - Empirical survival curve S(t) for t = 1..20 - - Hazard profile h(t) - - Stratified by task-constraint bucket (using C(q) from earlier) - -Usage: - .venv/bin/python3 scripts/survival_analysis.py +This exposes long-horizon fragility that is easy to hide in flat mean scores.
""" from __future__ import annotations -import glob +import argparse import json -import re -from collections import defaultdict +import sys from pathlib import Path from statistics import median -import numpy as np +sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) -ROOT = Path(__file__).resolve().parent.parent -ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full" - -MODELS = { - "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"), - "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"), - "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"), - "gpt54": ("openai_gpt-5.4", "GPT 5.4"), - "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"), - "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"), - "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"), - "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"), - "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"), -} +from clawbench.dynamics_archive import load_task_runs_by_model SUCCESS_THRESHOLD = 0.7 -def assistant_turns(d: dict) -> list[dict]: - return [m for m in d.get("transcript", {}).get("messages", []) - if m.get("role") == "assistant"] +def assistant_turns(run) -> list: + return run.transcript.assistant_messages -def find_failure_turn(d: dict) -> tuple[int, bool]: - """Return (T_F, is_event). T_F is 1-indexed turn of failure. - - is_event=True means failure actually happened; False means the run was - censored (survived to end without failing). 
- """ - turns = assistant_turns(d) +def find_failure_turn(run) -> tuple[int, bool]: + """Return (failure_turn, is_event) with 1-indexed assistant turns.""" + turns = assistant_turns(run) n = len(turns) - run_score = d.get("run_score", 0) or 0 - delivery = d.get("delivery_outcome", "") - # Scan for first empty-turn - for i, t in enumerate(turns, 1): - has_text = bool((t.get("text") or "").strip()) - has_tool_call = bool(t.get("tool_calls")) + for idx, turn in enumerate(turns, 1): + has_text = bool((turn.text or "").strip()) + has_tool_call = bool(turn.tool_calls) if not has_text and not has_tool_call: - return i, True # failure event + return idx, True - # If run was unsuccessful and ended early, mark last turn as failure - if run_score < SUCCESS_THRESHOLD and delivery in ("fail", "partial"): + if run.run_score < SUCCESS_THRESHOLD and run.delivery_outcome.value in {"fail", "partial"}: return max(n, 1), True - # Survived: right-censored at n return max(n, 1), False def empirical_survival(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]: - """Kaplan-Meier-like survival curve, non-parametric. - - S(t) = fraction of runs that survived past turn t. 
- """ - survival = [] + """Empirical survival curve S(t) over assistant-turn index.""" total = len(times_events) + if total == 0: + return [0.0] * max_t + + survival = [] for t in range(1, max_t + 1): - # Survived past t = either censored at ≥t or event at >t - survived = sum(1 for tf, is_event in times_events - if (not is_event and tf >= t) or (is_event and tf > t)) - survival.append(survived / total if total > 0 else 0.0) + survived = sum( + 1 + for tf, is_event in times_events + if (not is_event and tf >= t) or (is_event and tf > t) + ) + survival.append(survived / total) return survival def hazard(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]: - """Hazard rate h(t) = events at t / at-risk at t.""" - h = [] + """Discrete hazard h(t) = events_at_t / at_risk_at_t.""" + hazard_vals = [] for t in range(1, max_t + 1): at_risk = sum(1 for tf, _ in times_events if tf >= t) - events_at_t = sum(1 for tf, is_event in times_events - if is_event and tf == t) - h.append(events_at_t / at_risk if at_risk > 0 else 0.0) - return h + events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t) + hazard_vals.append(events_at_t / at_risk if at_risk > 0 else 0.0) + return hazard_vals def main() -> None: - per_model: dict[str, list[tuple[int, bool]]] = defaultdict(list) - for label, (sub, _) in MODELS.items(): - for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"): - try: - d = json.loads(Path(p).read_text()) - except Exception: - continue - tf, is_event = find_failure_turn(d) - per_model[label].append((tf, is_event)) + parser = argparse.ArgumentParser(description="Survival analysis on cached runs") + parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache")) + parser.add_argument("--reports-dir", type=Path, default=Path("reports")) + parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None) + parser.add_argument("--max-turn", type=int, default=20) + args = parser.parse_args() - # 
Load C(q) to stratify - cq_path = ROOT / "reports" / "constraint_index.json" - cq_by_task = {} - if cq_path.exists(): - cq = json.loads(cq_path.read_text()) - cq_by_task = {t: v["C_q"] for t, v in cq.items()} + grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier) + if not grouped: + raise SystemExit(f"No cached runs found under {args.archive_dir}") - # Print summary - print(f"{'Model':<14} {'n_runs':>6} {'events':>6} {'med_tf':>8} " - f"{'S(3)':>6} {'S(5)':>6} {'S(8)':>6} {'S(12)':>6} {'S(20)':>6}") - print("-" * 90) out = {} - for label, (_sub, pretty) in MODELS.items(): - evs = per_model[label] - n = len(evs) - n_events = sum(1 for _, e in evs if e) - tfs_events = [tf for tf, e in evs if e] - med = median(tfs_events) if tfs_events else float("inf") - surv = empirical_survival(evs, max_t=20) - haz = hazard(evs, max_t=20) - print(f"{pretty:<14} {n:>6} {n_events:>6} {med:>8.1f} " - f"{surv[2]:>6.2f} {surv[4]:>6.2f} {surv[7]:>6.2f} " - f"{surv[11]:>6.2f} {surv[19]:>6.2f}") - out[label] = { - "pretty": pretty, - "n_runs": n, + for model_name, task_runs in grouped.items(): + events = [] + for runs in task_runs.values(): + for run in runs: + events.append(find_failure_turn(run)) + + n_runs = len(events) + n_events = sum(1 for _, is_event in events if is_event) + event_times = [t for t, is_event in events if is_event] + med = median(event_times) if event_times else float("inf") + + out[model_name] = { + "pretty": model_name, + "n_runs": n_runs, "n_events": n_events, "median_fail_turn": med, - "survival": surv, - "hazard": haz, + "survival": empirical_survival(events, max_t=args.max_turn), + "hazard": hazard(events, max_t=args.max_turn), } - print("\n(Interpretation: S(t) = fraction of runs still on-track past turn t.") - print(" Lower values = more frequent early failure.)") - - out_path = ROOT / "reports" / "survival_analysis.json" - out_path.write_text(json.dumps(out, indent=2)) - print(f"\nWrote: {out_path}") + args.reports_dir.mkdir(parents=True, 
exist_ok=True) + out_path = args.reports_dir / "survival_analysis.json" + out_path.write_text(json.dumps(out, indent=2), encoding="utf-8") + print(f"Wrote: {out_path}") if __name__ == "__main__": diff --git a/scripts/variance_decomp.py b/scripts/variance_decomp.py index 3d009ac..1ffde71 100644 --- a/scripts/variance_decomp.py +++ b/scripts/variance_decomp.py @@ -1,132 +1,118 @@ -"""Decompose run_score variance into seed-noise vs capability-signal. +#!/usr/bin/env python3 +"""Decompose posterior run_score variance into seed noise and capability signal. -Each task has 3 runs per model (same prompt, different random seed). - σ²_seed(task, model) = variance across the 3 runs of (task, model) - σ²_capability(task) = variance across model means for the task +Each task has repeated runs per model. + + sigma^2_seed(task, model) = variance across repeated runs for one model + sigma^2_capability(task) = variance across model means for that task Signal-to-noise ratio per task: - SNR(task) = σ²_capability / σ²_seed -High SNR → differences between models on this task are REAL (not noise). -Low SNR → the 3-run variance per model is so large that cross-model gaps - are indistinguishable from seed noise. These tasks don't - discriminate models reliably. + SNR(task) = sigma^2_capability / mean_model sigma^2_seed -Aggregated over all 40 tasks, we also decompose TOTAL variance: - total_var = mean_capability_var + mean_seed_var - capability_fraction = mean_capability_var / total_var +High SNR means cross-model differences are likely real. Low SNR means the +benchmark signal is dominated by run-to-run variance rather than capability. -This answers "what fraction of the benchmark signal is real model -capability vs. run-to-run luck?" 
+Aggregate decomposition: -Usage: - .venv/bin/python3 scripts/variance_decomp.py + total_var = mean_task seed_var + mean_task cap_var + capability_fraction = mean_task cap_var / total_var + +This script keeps the posterior/archive-based workflow used by the current +pipeline, but the statistical meaning is the same as the earlier analysis. """ from __future__ import annotations -import glob +import argparse import json -import re +import sys from collections import defaultdict from pathlib import Path from statistics import mean, variance -import numpy as np +sys.path.insert(0, str(Path(__file__).resolve().parent.parent)) -ROOT = Path(__file__).resolve().parent.parent -ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full" - -MODELS = { - "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"), - "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"), - "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"), - "gpt54": ("openai_gpt-5.4", "GPT 5.4"), - "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"), - "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"), - "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"), - "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"), - "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"), -} +from clawbench.dynamics_archive import load_task_runs_by_model def main() -> None: - # {task: {model: [run_scores]}} - scores: dict[str, dict[str, list[float]]] = defaultdict(dict) - for label, (sub, _) in MODELS.items(): - for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"): - task = p.split("/")[-2] - try: - d = json.loads(Path(p).read_text()) - except Exception: - continue - scores[task].setdefault(label, []).append(d.get("run_score", 0)) + parser = argparse.ArgumentParser(description="Variance decomposition on cached runs") + parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache")) + parser.add_argument("--reports-dir", type=Path, default=Path("reports")) + parser.add_argument("--tier", 
choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None) + args = parser.parse_args() + + grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier) + if not grouped: + raise SystemExit(f"No cached runs found under {args.archive_dir}") + + # Collect repeated run scores as {task -> {model -> [run_scores]}}. + scores: dict[str, dict[str, list[float]]] = defaultdict(dict) + for model_name, task_runs in grouped.items(): + for task_id, runs in task_runs.items(): + vals = [float(run.run_score) for run in runs] + if vals: + scores[task_id][model_name] = vals - # Per-task: seed var per model, cross-model var of means, SNR task_stats = [] - for task, per_model in scores.items(): - # Only use models with all 3 runs for clean seed-variance estimate + for task_id, per_model in scores.items(): model_vars = [] model_means = [] - for m, runs in per_model.items(): + for runs in per_model.values(): if len(runs) >= 2: model_vars.append(variance(runs)) + if runs: model_means.append(mean(runs)) - if len(model_means) < 2 or not model_vars: - continue - mean_seed_var = mean(model_vars) # noise - cap_var = variance(model_means) # signal + + # Mean within-model variance is the seed-noise term. + mean_seed_var = mean(model_vars) if model_vars else 0.0 + # Variance of model means is the capability-signal term. 
+ cap_var = variance(model_means) if len(model_means) >= 2 else 0.0 snr = cap_var / (mean_seed_var + 1e-9) - task_stats.append({ - "task": task, - "seed_var": mean_seed_var, - "cap_var": cap_var, - "snr": snr, - "n_models": len(model_means), - }) + task_stats.append( + { + "task": task_id, + "seed_var": float(mean_seed_var), + "cap_var": float(cap_var), + "snr": float(snr), + "n_models": len(model_means), + "limited_model_diversity": len(model_means) < 2, + } + ) - # Sort by SNR - task_stats.sort(key=lambda x: -x["snr"]) + task_stats.sort(key=lambda row: row["snr"], reverse=True) + if not task_stats: + raise SystemExit("No task-level scores found in archive.") - print(f"{'Task':<38} {'seed_var':>9} {'cap_var':>9} {'SNR':>8}") - print("-" * 70) - for r in task_stats: - print(f"{r['task']:<38} {r['seed_var']:>9.4f} {r['cap_var']:>9.4f} " - f"{r['snr']:>8.2f}") - - # Aggregate decomposition - total_seed = mean(r["seed_var"] for r in task_stats) - total_cap = mean(r["cap_var"] for r in task_stats) + # Aggregate over tasks to estimate how much of benchmark variance is real + # capability signal versus run-to-run noise. + total_seed = mean(row["seed_var"] for row in task_stats) + total_cap = mean(row["cap_var"] for row in task_stats) total = total_seed + total_cap - cap_frac = total_cap / (total + 1e-9) + capability_fraction = total_cap / total if total > 1e-12 else 0.0 - print("\n=== AGGREGATE VARIANCE DECOMPOSITION ===") - print(f" Mean seed variance (noise): {total_seed:.5f}") - print(f" Mean capability variance (signal): {total_cap:.5f}") - print(f" Capability fraction: {cap_frac:.1%}") - print(f" (= what % of run_score variance comes from real model differences)") + # Coarse SNR buckets help downstream reporting and task weighting. 
+ high_snr = [row for row in task_stats if row["snr"] >= 5] + mid_snr = [row for row in task_stats if 1 <= row["snr"] < 5] + low_snr = [row for row in task_stats if row["snr"] < 1] - # Classify tasks by SNR tiers - high_snr = [r for r in task_stats if r["snr"] >= 5] - mid_snr = [r for r in task_stats if 1 <= r["snr"] < 5] - low_snr = [r for r in task_stats if r["snr"] < 1] - print(f"\n=== SNR TIERS ===") - print(f" High SNR (≥5): {len(high_snr)} tasks — differentiate models reliably") - print(f" Mid SNR (1–5): {len(mid_snr)} tasks — moderate signal") - print(f" Low SNR (<1): {len(low_snr)} tasks — seed noise ≥ capability signal") - print(f" (these tasks give random-ish results; weight down)") - - # Write output - out_path = ROOT / "reports" / "variance_decomposition.json" - out_path.write_text(json.dumps({ + out = { "per_task": task_stats, "aggregate": { - "mean_seed_var": total_seed, - "mean_cap_var": total_cap, - "capability_fraction": cap_frac, + "mean_seed_var": float(total_seed), + "mean_cap_var": float(total_cap), + "capability_fraction": float(capability_fraction), + "high_snr_tasks": len(high_snr), + "mid_snr_tasks": len(mid_snr), + "low_snr_tasks": len(low_snr), }, - }, indent=2)) - print(f"\nWrote: {out_path}") + } + + args.reports_dir.mkdir(parents=True, exist_ok=True) + out_path = args.reports_dir / "variance_decomposition.json" + out_path.write_text(json.dumps(out, indent=2), encoding="utf-8") + print(f"Wrote: {out_path}") if __name__ == "__main__": diff --git a/tests/test_dynamics.py b/tests/test_dynamics.py new file mode 100644 index 0000000..fcd9b52 --- /dev/null +++ b/tests/test_dynamics.py @@ -0,0 +1,356 @@ +"""Tests for clawbench.dynamics.""" + +from __future__ import annotations + +import math + +import numpy as np +import pytest + +from clawbench.dynamics import ( + TOOL_FAMILIES, + Dynamics, + Regime, + Sensitivity, + SurvivalPoint, + StratumStats, + StratifiedAssessment, + _classify_tool, + _cosine_dist, + _entropy, + _js_divergence, + 
_levenshtein, + build_strata, + compute_dynamics, + compute_sensitivity, + find_event_step, + kaplan_meier, + stratify_by_regime, + stratify_by_tier, +) +from clawbench.schemas import ( + TokenUsage, + ToolCall, + Transcript, + TranscriptMessage, + TaskRunResult, +) + + +# ── helpers ────────────────────────────────────────────────────────── + + +def _msg(role, text="", family=None, success=True, error="", ts=0, tok=100): + tcs = [] + if family: + tcs.append(ToolCall( + name=f"tool_{family}", family=family, + success=success, error=error, mutating=family == "edit", + )) + return TranscriptMessage( + role=role, text=text, tool_calls=tcs, timestamp_ms=ts, + usage=TokenUsage(input_tokens=tok, output_tokens=tok // 2, + total_tokens=tok + tok // 2), + ) + + +def _simple_transcript(families, errors=None): + if errors is None: + errors = [False] * len(families) + msgs = [_msg("user", "task")] + for i, (fam, err) in enumerate(zip(families, errors)): + msgs.append(_msg("assistant", f"step {i}", family=fam, + success=not err, error="err" if err else "", + ts=(i + 1) * 1000, tok=100 + i * 10)) + return Transcript(messages=msgs) + + +def _run(transcript, score=0.5, task_id="t1"): + return TaskRunResult( + task_id=task_id, run_index=0, transcript=transcript, + run_score=score, duration_ms=10000, + token_usage=transcript.total_usage, + ) + + +# ── _cosine_dist ───────────────────────────────────────────────────── + + +def test_cosine_dist_identical(): + a = np.array([1.0, 0.0, 0.5]) + assert _cosine_dist(a, a) == pytest.approx(0.0, abs=1e-9) + + +def test_cosine_dist_orthogonal(): + assert _cosine_dist(np.array([1, 0, 0.0]), np.array([0, 1, 0.0])) == pytest.approx(1.0) + + +def test_cosine_dist_zero_vector(): + assert _cosine_dist(np.zeros(3), np.array([1, 2, 3.0])) == 1.0 + + +# ── _entropy ───────────────────────────────────────────────────────── + + +def test_entropy_uniform(): + assert _entropy({"a": 10, "b": 10}) == pytest.approx(1.0) + + +def test_entropy_single(): + 
assert _entropy({"a": 100}) == pytest.approx(0.0) + + +def test_entropy_empty(): + assert _entropy({}) == 0.0 + + +# ── _js_divergence ─────────────────────────────────────────────────── + + +def test_jsd_identical(): + d = {"a": 5, "b": 5} + assert _js_divergence(d, d) == pytest.approx(0.0, abs=1e-9) + + +def test_jsd_disjoint(): + assert _js_divergence({"a": 10}, {"b": 10}) > 0.5 + + +# ── _levenshtein ──────────────────────────────────────────────────── + + +def test_levenshtein_equal(): + assert _levenshtein([1, 2, 3], [1, 2, 3]) == 0 + + +def test_levenshtein_empty(): + assert _levenshtein([], [1, 2]) == 2 + + +def test_levenshtein_different(): + assert _levenshtein(["a", "b"], ["c", "d"]) == 2 + + +# ── _classify_tool ────────────────────────────────────────────────── + + +@pytest.mark.parametrize("name,expected", [ + ("bash_execute", "execute"), + ("file_read", "read"), + ("tool_edit", "edit"), + ("web_browser", "browser"), + ("grep_search", "search"), + ("write_file", "edit"), + ("run_tests", "execute"), +]) +def test_classify_tool(name, expected): + assert _classify_tool(name) == expected + + +# ── compute_dynamics ───────────────────────────────────────────────── + + +def test_dynamics_basic(): + t = _simple_transcript(["read", "edit", "execute", "read", "edit"]) + d = compute_dynamics(t) + assert d.n_steps == 5 + assert len(d.drift) == 5 + assert len(d.step_size) == 5 + assert len(d.entropy_series) == 5 + assert len(d.tool_sequence) == 5 + assert d.tool_entropy > 0 + + +def test_dynamics_empty(): + t = Transcript(messages=[_msg("user", "hi")]) + d = compute_dynamics(t) + assert d.n_steps == 0 + assert d.regime == Regime.unknown + + +def test_dynamics_trapped(): + t = _simple_transcript(["execute"] * 15, errors=[True] * 15) + d = compute_dynamics(t) + assert d.regime == Regime.trapped + assert d.error_rate > 0.5 + + +def test_dynamics_convergent(): + cycle = ["read", "search", "edit", "read", "execute"] * 6 + t = _simple_transcript(cycle[:30]) + d = 
compute_dynamics(t) + assert d.regime in (Regime.convergent, Regime.limit_cycle, Regime.diffusive, Regime.unknown) + assert d.error_rate == 0.0 + + +def test_dynamics_markov_keys(): + t = _simple_transcript(["read", "edit", "read"]) + d = compute_dynamics(t) + assert "read" in d.markov + assert "edit" in d.markov["read"] + + +def test_dynamics_constraint_index_range(): + t = _simple_transcript(["read", "edit", "search", "execute", "browser", "memory"] * 3) + d = compute_dynamics(t) + assert 0 <= d.constraint_index <= 1 + + +def test_dynamics_memory_depth(): + t = _simple_transcript(["read", "edit", "read", "edit", "read", "edit"] * 3) + d = compute_dynamics(t) + assert d.memory_depth >= 0 + + +def test_dynamics_normalizes_unknown_tool_family(): + transcript = Transcript( + messages=[ + _msg("user", "task"), + TranscriptMessage( + role="assistant", + text="searching", + tool_calls=[ + ToolCall( + name="grep_search", + family="unknown", + success=True, + error="", + mutating=False, + ) + ], + timestamp_ms=1000, + usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15), + ), + _msg("assistant", "next", family="read", ts=2000), + _msg("assistant", "done", family="edit", ts=3000), + ] + ) + + dynamics = compute_dynamics(transcript) + + assert dynamics.tool_sequence[0] == "search" + assert "search" in dynamics.markov + + +# ── compute_sensitivity ────────────────────────────────────────────── + + +def test_sensitivity_identical_runs(): + t = _simple_transcript(["read", "edit", "execute"]) + ra = _run(t, score=0.8) + rb = _run(t, score=0.8) + s = compute_sensitivity(ra, rb) + assert s.score_delta == pytest.approx(0.0) + assert s.tool_edit_distance == 0 + + +def test_sensitivity_different_runs(): + ta = _simple_transcript(["read", "edit", "execute"]) + tb = _simple_transcript(["search", "browser", "memory"]) + ra = _run(ta, score=0.9) + rb = _run(tb, score=0.3) + s = compute_sensitivity(ra, rb) + assert s.score_delta == pytest.approx(0.6) + assert 
s.tool_edit_distance > 0 + assert s.family_js_divergence > 0 + + +# ── kaplan_meier ───────────────────────────────────────────────────── + + +def test_km_basic(): + pts = kaplan_meier([1, 2, 3]) + assert pts[0].time == 0.0 + assert pts[0].survival == 1.0 + assert pts[-1].survival == pytest.approx(0.0) + + +def test_km_with_censoring(): + pts = kaplan_meier([1, 5, 3], censored=[False, True, False]) + assert len(pts) == 3 + assert pts[-1].survival > 0 + + +def test_km_empty(): + assert kaplan_meier([]) == [] + + +# ── find_event_step ────────────────────────────────────────────────── + + +def test_find_first_correct_write(): + t = _simple_transcript(["read", "search", "edit", "execute"]) + assert find_event_step(t, "first_correct_write") == 2.0 + + +def test_find_first_error_recovery(): + t = _simple_transcript( + ["read", "execute", "read"], + errors=[False, True, False], + ) + assert find_event_step(t, "first_error_recovery") == 2.0 + + +def test_find_task_completion(): + t = _simple_transcript(["read", "edit"]) + assert find_event_step(t, "task_completion") == 1.0 + + +def test_find_event_none(): + t = _simple_transcript(["read", "read"]) + assert find_event_step(t, "first_correct_write") is None + + +# ── build_strata + reweight ────────────────────────────────────────── + + +def test_build_strata_by_tier(): + runs, dyns, scores = [], [], [] + for tid, sc in [("t1-a", 0.8), ("t1-b", 0.6), ("t2-a", 0.4), ("t2-b", 0.3)]: + t = _simple_transcript(["read", "edit", "execute"]) + r = _run(t, score=sc, task_id=tid) + runs.append(r) + dyns.append(compute_dynamics(t)) + scores.append(sc) + + sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier") + assert sa.total_runs == 4 + names = sa.stratum_names() + assert "tier1" in names + assert "tier2" in names + for s in sa.strata: + assert s.n_runs == 2 + assert s.weight == pytest.approx(0.5) + + +def test_reweight_shifts_mean(): + runs, dyns, scores = [], [], [] + for tid, sc in [("t1-a", 0.9), ("t1-b", 0.8), ("t2-a", 
0.2), ("t2-b", 0.1)]:
+        t = _simple_transcript(["read", "edit", "execute"])
+        r = _run(t, score=sc, task_id=tid)
+        runs.append(r)
+        dyns.append(compute_dynamics(t))
+        scores.append(sc)
+
+    sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
+
+    # Reweight towards tier1 (high scores)
+    high = sa.reweight({"tier1": 0.9, "tier2": 0.1})
+    # Reweight towards tier2 (low scores)
+    low = sa.reweight({"tier1": 0.1, "tier2": 0.9})
+
+    assert high["score_mean"] > low["score_mean"]
+
+
+def test_reweight_unknown_stratum():
+    runs, dyns, scores = [], [], []
+    t = _simple_transcript(["read", "edit"])
+    r = _run(t, score=0.5, task_id="t1-x")
+    runs.append(r)
+    dyns.append(compute_dynamics(t))
+    scores.append(0.5)
+
+    sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
+    # Reweight with a stratum that doesn't exist — should fall back
+    result = sa.reweight({"nonexistent": 1.0})
+    assert "score_mean" in result
diff --git a/tests/test_dynamics_archive.py b/tests/test_dynamics_archive.py
new file mode 100644
index 0000000..b9aa87a
--- /dev/null
+++ b/tests/test_dynamics_archive.py
@@ -0,0 +1,115 @@
+"""Tests for offline dynamics archive helpers."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from clawbench.dynamics_archive import build_dynamics_report, load_task_runs_archive, safe_model_name, write_dynamics_report
+from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
+
+
+def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
+    tool_calls = []
+    if family is not None:
+        tool_calls.append(
+            ToolCall(
+                name=f"tool_{family}",
+                family=family,
+                success=True,
+                error="",
+                mutating=family == "edit",
+            )
+        )
+    return TranscriptMessage(
+        role=role,
+        text=text,
+        tool_calls=tool_calls,
+        timestamp_ms=ts,
+        usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
+    )
+
+
+def _run(task_id: str, score: float = 0.5, run_index: int = 0) -> TaskRunResult:
+    transcript = Transcript(
+        messages=[
+            _msg("user", f"Solve {task_id}"),
+            _msg("assistant", "inspect", family="read", ts=1000),
+            _msg("assistant", "edit", family="edit", ts=2000),
+            _msg("assistant", "verify", family="execute", ts=3000),
+        ]
+    )
+    return TaskRunResult(
+        task_id=task_id,
+        run_index=run_index,
+        transcript=transcript,
+        run_score=score,
+        duration_ms=3000,
+        token_usage=transcript.total_usage,
+    )
+
+
+def test_load_task_runs_archive_filters_model_and_tier(tmp_path: Path):
+    model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b")
+    other_dir = tmp_path / safe_model_name("openai/gpt-5.4")
+    for root, task_id in ((model_dir, "t1-demo-task"), (other_dir, "t2-other-task")):
+        task_dir = root / task_id
+        task_dir.mkdir(parents=True)
+        run = _run(task_id)
+        (task_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
+
+    loaded = load_task_runs_archive(
+        archive_dir=tmp_path,
+        model="ollama/gpt-oss:20b",
+        tier="tier1",
+    )
+
+    assert list(loaded) == ["t1-demo-task"]
+    assert loaded["t1-demo-task"][0].task_id == "t1-demo-task"
+
+
+def test_write_dynamics_report_creates_report_without_plots(tmp_path: Path):
+    task_runs = {
+        "t1-demo-task": [_run("t1-demo-task", score=0.8)],
+        "t2-demo-task": [_run("t2-demo-task", score=0.4)],
+    }
+
+    report_path, plots = write_dynamics_report(task_runs, tmp_path, generate_plots=False)
+
+    assert report_path.exists()
+    assert report_path.name == "dynamics.json"
+    assert plots == []
+
+    report = json.loads(report_path.read_text(encoding="utf-8"))
+    assert "sensitivity" in report
+    assert report["sensitivity"]["same_task"]["n_pairs"] == 0
+
+
+def test_build_dynamics_report_includes_pairwise_sensitivity():
+    task_runs = {
+        "t1-demo-task": [
+            _run("t1-demo-task", score=0.8, run_index=0),
+            TaskRunResult(
+                task_id="t1-demo-task",
+                run_index=1,
+                transcript=Transcript(
+                    messages=[
+                        _msg("user", "Solve t1-demo-task"),
+                        _msg("assistant", "inspect", family="search", ts=1000),
+                        _msg("assistant", "edit", family="edit", ts=2000),
+                        _msg("assistant", "verify", family="execute", ts=3000),
+                    ]
+                ),
+                run_score=0.5,
+                duration_ms=3000,
+                token_usage=TokenUsage(input_tokens=30, output_tokens=15, total_tokens=45),
+            ),
+        ]
+    }
+
+    report, _plotter, _plot_data = build_dynamics_report(task_runs, include_pca=False)
+
+    same_task = report["sensitivity"]["same_task"]
+    assert same_task["n_pairs"] == 1
+    assert "t1-demo-task" in same_task["per_task"]
+    assert same_task["per_task"]["t1-demo-task"]["mean_score_delta"] > 0
\ No newline at end of file
diff --git a/tests/test_dynamics_cli.py b/tests/test_dynamics_cli.py
new file mode 100644
index 0000000..94c5fec
--- /dev/null
+++ b/tests/test_dynamics_cli.py
@@ -0,0 +1,76 @@
+from pathlib import Path
+
+from click.testing import CliRunner
+
+from clawbench.cli import cli
+from clawbench.dynamics_archive import safe_model_name
+from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
+
+
+def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
+    tool_calls = []
+    if family is not None:
+        tool_calls.append(
+            ToolCall(
+                name=f"tool_{family}",
+                family=family,
+                success=True,
+                error="",
+                mutating=family == "edit",
+            )
+        )
+    return TranscriptMessage(
+        role=role,
+        text=text,
+        tool_calls=tool_calls,
+        timestamp_ms=ts,
+        usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
+    )
+
+
+def _run(task_id: str, run_index: int = 0) -> TaskRunResult:
+    transcript = Transcript(
+        messages=[
+            _msg("user", f"Solve {task_id}"),
+            _msg("assistant", "inspect", family="read", ts=1000),
+            _msg("assistant", "edit", family="edit", ts=2000),
+            _msg("assistant", "verify", family="execute", ts=3000),
+        ]
+    )
+    return TaskRunResult(
+        task_id=task_id,
+        run_index=run_index,
+        transcript=transcript,
+        run_score=0.8,
+        duration_ms=3000,
+        token_usage=transcript.total_usage,
+    )
+
+
+def test_dynamics_report_cli_supports_no_plots(tmp_path: Path):
+    model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b") / "t1-demo-task"
+    model_dir.mkdir(parents=True)
+    run = _run("t1-demo-task")
+    (model_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
+
+    runner = CliRunner()
+    output_dir = tmp_path / "out"
+    result = runner.invoke(
+        cli,
+        [
+            "dynamics-report",
+            "--archive-dir",
+            str(tmp_path),
+            "--model",
+            "ollama/gpt-oss:20b",
+            "--output-dir",
+            str(output_dir),
+            "--no-plots",
+        ],
+    )
+
+    assert result.exit_code == 0, result.output
+    assert "Loaded 1 cached runs across 1 tasks" in result.output
+    assert "Saved 0 plots" in result.output
+    assert (output_dir / "dynamics.json").exists()
+    assert list(output_dir.glob("*.png")) == []
\ No newline at end of file
diff --git a/tests/test_submission_models.py b/tests/test_submission_models.py
new file mode 100644
index 0000000..bd636ea
--- /dev/null
+++ b/tests/test_submission_models.py
@@ -0,0 +1,44 @@
+from clawbench.submission_models import (
+    CUSTOM_PRESET_LABEL,
+    PRESET_AUDIENCE_BUDGET,
+    PRESET_AUDIENCE_CLAW,
+    infer_provider,
+    preset_labels_for_audience,
+    resolve_model_selection,
+)
+
+
+def test_budget_audience_keeps_budget_friendly_presets():
+    labels = preset_labels_for_audience(PRESET_AUDIENCE_BUDGET)
+
+    assert "GPT-OSS 20B (Ollama)" in labels
+    assert "Qwen 3.5 27B (Ollama)" in labels
+    assert "Claude Opus 4.6" not in labels
+
+
+def test_claw_audience_keeps_full_catalog():
+    labels = preset_labels_for_audience(PRESET_AUDIENCE_CLAW)
+
+    assert "GPT-OSS 20B (Ollama)" in labels
+    assert "Claude Opus 4.6" in labels
+
+
+def test_resolve_model_selection_prefers_preset_provider():
+    model_id, provider = resolve_model_selection("", "GPT-OSS 20B (Ollama)")
+
+    assert model_id == "ollama/gpt-oss:20b"
+    assert provider == "ollama"
+
+
+def test_resolve_model_selection_infers_custom_provider():
+    model_id, provider = resolve_model_selection(
+        "huggingface/Qwen/Qwen3-32B",
+        CUSTOM_PRESET_LABEL,
+    )
+
+    assert model_id == "huggingface/Qwen/Qwen3-32B"
+    assert provider == "huggingface"
+
+
+def test_infer_provider_requires_provider_prefix():
+    assert infer_provider("qwen3.5:27b") == ""
\ No newline at end of file