Add archive dynamics pipeline and audience-based model presets

pllm-uci 2026-04-21 20:24:41 -07:00 committed by scoootscooob
parent 5b50814dfc
commit c209612d46
21 changed files with 3446 additions and 928 deletions

141
README.md

@ -104,20 +104,56 @@ Core v1 drops the noisy tasks and reports variance decomposition alongside ranki
Inspired by *"When LLMs Are Dreaming, Where Do They Go?"* — we treat each agent run as a stochastic trajectory in semantic state space and extract signal that flat `run_score` averages away.
| Diagnostic | Formula / Method | Reveals |
|---|---|---|
| **Constraint Index C(q)** | `-z(PR) - z(entropy) + z(BOPS)` over response embeddings | Which tasks converge to one answer vs diverge openly |
| **Regime classification** | Trajectory drift / recurrence / support-volume thresholds | Per-run dynamical signature (trapped / limit-cycle / diffusive) |
| **Survival analysis** | `S(t) = P(T_F > t)` where T_F = first empty assistant turn | Per-turn failure rates; long-horizon capability |
| **SNR-weighted ranking** | `w(task) = SNR × |C(q)|`, winsorized at p95 | Headline metric that weights tasks by their signal density |
| **Variance decomposition** | `Var(score) = Var_seeds + Var_models` per task | Separate capability signal from coin-flip noise |
Current code-path formulas:
```text
Per assistant step t:
x_t = [tool_family_proportions(6), error_flag, normalized_tokens, normalized_text_len, progress]
drift_t = cosine_distance(x_0, x_t)
step_t = cosine_distance(x_{t-1}, x_t)
Task-level Constraint Index:
PR(q) = tr(Σ_q)^2 / tr(Σ_q^2)
H(q) = -Σ_i p_i log2 p_i, p_i = λ_i / Σ_j λ_j, λ = eigvals(Σ_q)
BOPS(q) = mean_m mean_{i<j} cos(v_{q,m,i}, v_{q,m,j})
C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))
Per-run constraint index used inside the regime classifier:
PR_run = 1 / Σ_i p_i^2
constraint_index_run = 1 - (PR_run - 1) / (d - 1)
Variance decomposition:
seed_var(q) = mean_m Var(run_score_{q,m,*})
cap_var(q) = Var_m Mean(run_score_{q,m,*})
SNR(q) = cap_var(q) / (seed_var(q) + 1e-9)
capability_fraction = mean_q cap_var(q) / (mean_q cap_var(q) + mean_q seed_var(q))
Survival:
T_F = first assistant turn with empty text and no tool calls,
else final assistant turn if run_score < 0.7 and delivery_outcome in {fail, partial}
S(t) = P(T_F > t)
h(t) = P(T_F = t | T_F >= t)
```
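As a rough illustration of the variance-decomposition formulas above, here is a minimal sketch. It assumes scores are already grouped as `scores[task][model] -> list of run_score values`, and it uses population variance (`ddof=0`); the function name `variance_decomposition` and the input layout are hypothetical and may not match `scripts/variance_decomp.py`.

```python
import numpy as np


def variance_decomposition(scores):
    """Hypothetical helper: scores[task][model] is a list of run_score values."""
    per_task = {}
    for task, by_model in scores.items():
        # seed_var(q): mean over models of the within-model (seed) variance
        seed_var = float(np.mean([np.var(runs) for runs in by_model.values()]))
        # cap_var(q): variance over models of the per-model mean score
        cap_var = float(np.var([np.mean(runs) for runs in by_model.values()]))
        per_task[task] = {
            "seed_var": seed_var,
            "cap_var": cap_var,
            "snr": cap_var / (seed_var + 1e-9),
        }
    mean_cap = float(np.mean([v["cap_var"] for v in per_task.values()]))
    mean_seed = float(np.mean([v["seed_var"] for v in per_task.values()]))
    total = mean_cap + mean_seed
    # Assumption: report 0.0 when there is no variance at all
    fraction = mean_cap / total if total > 0 else 0.0
    return {"per_task": per_task, "capability_fraction": fraction}


report = variance_decomposition({"t1": {"model_a": [0.0, 1.0], "model_b": [1.0, 1.0]}})
print(report["per_task"]["t1"]["snr"], report["capability_fraction"])
```

A task where models disagree consistently gets a high SNR; a task where one model flips between 0 and 1 across seeds gets a low one.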
Implemented regime classifier in `clawbench/dynamics.py`:
```text
trapped if H_tools < 0.5 or (error_rate > 0.6 and std(drift) < 0.05)
convergent if std(drift_last_quartile) < 0.1 and mean(step_last_quartile) < 0.15 and error_rate < 0.2
diffusive if H_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8
chaotic if H_tools > 2.0 and var(step[1:]) > 0.02
limit_cycle if max autocorr(centered step[1:], lags 2..5) > 0.3
unknown otherwise, or <3 assistant turns
```
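The threshold table above can be sketched as a single function. This is only an illustration under stated assumptions: rules are checked in the order listed, the "last quartile" is taken as the final `n // 4` steps, and autocorrelation is normalized by the zero-lag energy. The real `_classify_regime` in `clawbench/dynamics.py` may differ on any of these points.

```python
import numpy as np


def classify_regime(drift, step, h_tools, error_rate, constraint_index_run):
    """Illustrative sketch of the regime thresholds; not the shipped classifier."""
    drift = np.asarray(drift, dtype=float)
    step = np.asarray(step, dtype=float)
    n = len(drift)
    if n < 3:
        return "unknown"
    if h_tools < 0.5 or (error_rate > 0.6 and np.std(drift) < 0.05):
        return "trapped"
    q = max(n // 4, 1)  # assumption: last quartile = final n // 4 steps
    if np.std(drift[-q:]) < 0.1 and np.mean(step[-q:]) < 0.15 and error_rate < 0.2:
        return "convergent"
    if h_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8:
        return "diffusive"
    tail = step[1:]  # step[0] is always 0, so it is excluded
    if h_tools > 2.0 and np.var(tail) > 0.02:
        return "chaotic"
    centered = tail - tail.mean()
    energy = float(centered @ centered)
    if energy > 0:
        # Assumption: autocorrelation at lags 2..5, normalized by zero-lag energy
        autocorr = max(
            (
                float(centered[:-lag] @ centered[lag:]) / energy
                for lag in range(2, 6)
                if lag < len(centered)
            ),
            default=0.0,
        )
        if autocorr > 0.3:
            return "limit_cycle"
    return "unknown"


# An alternating step pattern with moderate errors and mid-range tool entropy
steps = [0, 1, 0, 1, 0, 1, 0, 1, 0]
print(classify_regime(steps, steps, h_tools=1.0, error_rate=0.5, constraint_index_run=0.5))
```

Because the rules are evaluated top to bottom in this sketch, a run that satisfies both the `diffusive` and `chaotic` conditions is labeled `diffusive`.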
The task-level `C(q)` uses a normalized bag-of-words response vector built from the full assistant trajectory text plus tool-call names and compacted inputs, not just the last assistant turn.
From the v4-19 sweep data:
- **Gemini 3.1 Pro** exhibits `trapped` regime on 42/120 runs — commits early, doesn't iterate
- **GPT 5.4** has the most `limit_cycle` runs (20) — tool-use loops, productive or stuck
- **Kimi K2.5** dies at median turn 3 (worst survival); **GPT 5.4** survives to turn 8 at 60% rate (best)
All scripts under `scripts/` — pure numpy + scipy, no torch / sentence-transformers required, runs on any archive dir.
All scripts under `scripts/` run on cached per-run JSONs with plain numpy-based tooling; no torch or sentence-transformers required.
### 4. We ablate configurations, not just models
@ -264,9 +300,12 @@ The `1/y_i^2` term means the worst score dominates. A configuration scoring 0.85
Flat-mean compresses frontier model gaps. An alternative that weights tasks by their signal density:
```
weight(task) = max(0, SNR(task)) × |C(q)(task)| # unbounded
weight_winsorized(task) = min(weight(task), p95) # prevent single-task dominance
score(model) = Σ weight × mean_run_score / Σ weight
w_q = max(0, SNR(q)) × |C(q)|
w_q^wins = min(w_q, p95({w_q}))
flat_score(model) = mean_q mean_run_score(model, q) over covered tasks
weighted_score(model) = Σ_q w_q mean_run_score(model, q) / Σ_q w_q
winsorized_score(model) = Σ_q w_q^wins mean_run_score(model, q) / Σ_q w_q^wins
```
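The three scores above can be computed for one model in a few lines. This is a sketch, assuming per-task inputs are plain dicts keyed by task ID and that `p95` is NumPy's default (linearly interpolated) 95th percentile of the task weights; the function name `ranking_scores` is hypothetical and may not match `scripts/snr_weighted_ranking.py`.

```python
import numpy as np


def ranking_scores(mean_scores, snr, c_q):
    """Hypothetical helper: all three inputs map task_id -> float for one model."""
    tasks = sorted(mean_scores)
    y = np.array([mean_scores[t] for t in tasks])
    # w_q = max(0, SNR(q)) * |C(q)|
    w = np.array([max(0.0, snr[t]) * abs(c_q[t]) for t in tasks])
    # Winsorize: cap every weight at the 95th percentile of all weights
    w_wins = np.minimum(w, np.percentile(w, 95))

    def weighted_mean(weights):
        total = weights.sum()
        # Assumption: fall back to the flat mean when every weight is zero
        return float(weights @ y / total) if total > 0 else float(y.mean())

    return {
        "flat_score": float(y.mean()),
        "weighted_score": weighted_mean(w),
        "winsorized_score": weighted_mean(w_wins),
    }


print(ranking_scores(
    mean_scores={"task_a": 1.0, "task_b": 0.0},
    snr={"task_a": 2.0, "task_b": -1.0},
    c_q={"task_a": -1.5, "task_b": 1.0},
))
```

In this toy input `task_b` has negative SNR, so its weight clips to zero and only `task_a` contributes to the weighted scores, while the flat mean still averages both tasks.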
Under SNR × |C(q)| winsorized on the same 1,080-run archive, **Opus 4.7 ranks #1** (instead of Opus 4.6 under flat mean) and **GPT 5.4 drops from #3 to #7** — its task-specific cliffs (0.16 on `t3-feature-export`) fall on the highest-signal tasks. This exposes what the flat mean averages away.
@ -349,27 +388,48 @@ clawbench run \
-o results/opus46_core_v1.json
```
### Analyze an archive with the diagnostic suite
### Analyze a real archive
```bash
# 1. Aggregate coverage + fair-comparison audit
# Fair-comparison audit
python3 scripts/audit_runs.py
# 2. Rejudge any judge-infrastructure failures via direct Anthropic API
python3 scripts/rejudge_all.py \
--drift-dir data/drift_2026-04-19-full \
--archive-dir data/run_cache_archive/v2026-4-19-full
# 3. Generate the fair comparison report
python3 scripts/generate_fair_report.py --tag v2026-4-19-full
# 4. Dynamical-systems diagnostics (C(q), regimes, survival, SNR-weighted)
.venv/bin/python3 scripts/compute_constraint_index.py
.venv/bin/python3 scripts/classify_regimes.py
.venv/bin/python3 scripts/variance_decomp.py
.venv/bin/python3 scripts/survival_analysis.py
.venv/bin/python3 scripts/snr_weighted_ranking.py
.venv/bin/python3 scripts/generate_dynamical_report.py
# Posterior dynamics + ranking from cached per-run JSONs
python3 scripts/run_posterior_dynamics_pipeline.py \
--archive-dir .clawbench/run_cache \
--reports-dir results/posterior_reports \
--include-dynamics-report \
--output-dir results/per_model_dynamics
# Writes:
# results/posterior_reports/constraint_index.json
# results/posterior_reports/regimes.json
# results/posterior_reports/variance_decomposition.json
# results/posterior_reports/survival_analysis.json
# results/posterior_reports/snr_weighted_ranking.json
# results/posterior_reports/EVAL_REPORT_DYNAMICAL.md
# results/per_model_dynamics/<safe_model_name>/dynamics.json
# results/per_model_dynamics/<safe_model_name>/*.png
```
If you only want one model's offline dynamics bundle:
```bash
clawbench dynamics-report \
--archive-dir .clawbench/run_cache \
--model ollama/gpt-oss:20b \
--output-dir results/gptoss_dynamics
# Quick CI path: skip plot rendering
clawbench dynamics-report \
--archive-dir .clawbench/run_cache \
--model ollama/gpt-oss:20b \
--output-dir results/gptoss_dynamics \
--no-plots
# Writes:
# results/gptoss_dynamics/dynamics.json
```
### Running locally with small models (Ollama)
@ -379,7 +439,24 @@ A single consumer GPU running an open-weight model is enough to develop plugin p
```bash
ollama pull gpt-oss:20b
export OPENCLAW_GATEWAY_TOKEN=<your-gateway-token>
clawbench run --model ollama/gpt-oss:20b --task t1-fs-quick-note --runs 1
export CLAWBENCH_RUN_CACHE_DIR=$PWD/.clawbench/run_cache
# Real benchmark run + immediate per-run dynamics bundle
clawbench run \
--model ollama/gpt-oss:20b \
--task t1-fs-quick-note \
--runs 1 \
--dynamics \
-o results/ollama_smoke.json
# Optional second local model
ollama pull qwen3.5:27b
# Offline posterior analysis reads CLAWBENCH_RUN_CACHE_DIR
python3 scripts/run_posterior_dynamics_pipeline.py \
--archive-dir .clawbench/run_cache \
--reports-dir results/posterior_reports
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
```
@ -415,6 +492,9 @@ clawbench/
│ ├── profile.py # v0.5 plugin fingerprinting
│ ├── diagnostic.py # Configuration Diagnostic report
│ ├── factor_analysis.py # fANOVA factor importance
│ ├── dynamics.py # Trajectory metrics + sensitivity analysis
│ ├── dynamics_archive.py # Cached-run loading + offline report assembly
│ ├── dynamics_plots.py # Offline dynamics visualizations
│ └── cli.py # CLI entry points
├── tasks-public/ # Core v1 PUBLIC release (19 tasks)
@ -431,6 +511,7 @@ clawbench/
│ ├── audit_per_run.py # Per-run cross-model audit
│ ├── rejudge_all.py # Direct-API rejudge for broken gateway judges
│ ├── generate_fair_report.py # Fair N-model comparison report
│ ├── run_posterior_dynamics_pipeline.py # One-shot posterior analysis driver
│ ├── compute_constraint_index.py # C(q) per task
│ ├── classify_regimes.py # Per-run dynamical regime classifier
│ ├── variance_decomp.py # Seed-noise vs capability-signal decomposition
@ -439,7 +520,7 @@ clawbench/
│ └── generate_dynamical_report.py # Combined dynamical-systems report
├── profiles/ # v0.5 plugin profile YAMLs
├── tests/ # 107 tests
├── tests/ # Test suite
├── Dockerfile # Layered on ghcr.io/openclaw/openclaw:latest
├── CLAWBENCH_V0_4_SPEC.md # Full specification
└── PARTNER_TRACE_SPEC.md # Trace interchange format
@ -469,7 +550,7 @@ clawbench/
## Testing
```bash
python -m pytest -q # 107 tests
python -m pytest -q
```
Key test invariants:


@ -136,6 +136,15 @@ submission
Important rule: browser tasks stay serialized on one dedicated lane to avoid Chromium and port-range collisions.
## Submission presets
The Submit tab now exposes two preset audiences so the Space can serve both general Claw users and lower-budget exploratory runs:
- `Claw Users` keeps the full preset catalog, including provider-backed frontier models.
- `Budget Researchers` narrows the list to local or lower-cost presets such as `ollama/gpt-oss:20b`, `ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and `huggingface/google/gemma-4-26B-A4B-it`.
You can still enter any custom model ID directly; the preset audience only filters the shortcut catalog and the bulk-submit action.
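One way to picture the audience filter is a catalog in which each preset carries the set of audiences it belongs to. The sketch below is hypothetical: the `Preset` dataclass, the `CATALOG` tuple, and `presets_for` are illustrative names, not the contents of `clawbench/submission_models.py`, and only three of the presets are shown.

```python
from dataclasses import dataclass

AUDIENCE_ALL = "Claw Users"
AUDIENCE_BUDGET = "Budget Researchers"


@dataclass(frozen=True)
class Preset:
    """Hypothetical preset record; field names are assumptions."""
    label: str
    model_id: str
    audiences: frozenset


CATALOG = (
    Preset("GPT-OSS 20B (Ollama)", "ollama/gpt-oss:20b", frozenset({AUDIENCE_ALL, AUDIENCE_BUDGET})),
    Preset("Qwen3 32B", "huggingface/Qwen/Qwen3-32B", frozenset({AUDIENCE_ALL, AUDIENCE_BUDGET})),
    Preset("Claude Opus 4.6", "anthropic/claude-opus-4-6", frozenset({AUDIENCE_ALL})),
)


def presets_for(audience):
    """Return the presets visible to one audience, in catalog order."""
    return [preset for preset in CATALOG if audience in preset.audiences]


print([preset.model_id for preset in presets_for(AUDIENCE_BUDGET)])
```

With this shape, both the dropdown choices and the bulk-submit action can read from the same filtered list, so the two cannot drift apart.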
## Task inventory
| Task | Tier | Family | Main verification |

132
app.py

@ -26,6 +26,15 @@ from clawbench.hub import (
load_submission_rows_from_parquet,
resolve_dataset_repo,
)
from clawbench.submission_models import (
CUSTOM_PRESET_LABEL,
PRESET_AUDIENCE_ALL,
PRESET_AUDIENCE_CHOICES,
PRESET_MODEL_MAP,
preset_labels_for_audience,
preset_models_for_audience,
resolve_model_selection,
)
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("clawbench.app")
@ -51,31 +60,6 @@ def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int:
DEFAULT_RUNS_PER_TASK = _env_int("CLAWBENCH_DEFAULT_RUNS_PER_TASK", 3, minimum=1, maximum=10)
DEFAULT_PARALLEL_LANES = _env_int("CLAWBENCH_DEFAULT_PARALLEL_LANES", 1, minimum=1, maximum=4)
# ---------------------------------------------------------------------------
# Preset models for quick submission
# ---------------------------------------------------------------------------
PRESET_MODELS = {
# All models verified working on HF Inference API (free with HF_TOKEN)
# Tested 2026-04-07 via router.huggingface.co/v1/chat/completions
#
# --- Chinese open-source ---
"GLM 5.1 (754B MoE)": "huggingface/zai-org/GLM-5.1",
"GLM 5 (400B MoE)": "huggingface/zai-org/GLM-5",
"Qwen3 32B": "huggingface/Qwen/Qwen3-32B",
"DeepSeek R1": "huggingface/deepseek-ai/DeepSeek-R1",
"Kimi K2 Instruct": "huggingface/moonshotai/Kimi-K2-Instruct",
"MiniMax M2.5": "huggingface/MiniMaxAI/MiniMax-M2.5",
# --- Google open-source ---
"Gemma 4 26B MoE": "huggingface/google/gemma-4-26B-A4B-it",
# --- Meta open-source ---
"Llama 3.3 70B": "huggingface/meta-llama/Llama-3.3-70B-Instruct",
"Llama 3.1 70B": "huggingface/meta-llama/Llama-3.1-70B-Instruct",
# --- Proprietary models (require runtime auth configured for the model provider) ---
"Claude Sonnet 4.6": "anthropic/claude-sonnet-4-6",
"Claude Opus 4.6": "anthropic/claude-opus-4-6",
}
# ---------------------------------------------------------------------------
# Background worker (starts in a thread)
# ---------------------------------------------------------------------------
@ -271,15 +255,14 @@ def submit_model(
prompt_variant: str,
submitter: str,
) -> str:
# Use preset if selected, otherwise use custom model ID
model_id = PRESET_MODELS.get(preset, "") or model.strip()
model_id, provider_id = resolve_model_selection(model, preset, provider)
if not model_id:
return "Please enter a model ID or select a preset."
selected_tier = tier if tier != "all" else None
request = SubmissionRequest(
model=model_id,
provider=provider.strip(),
provider=provider_id,
judge_model=judge_model.strip(),
runs_per_task=int(runs),
max_parallel_lanes=int(max_parallel_lanes),
@ -292,20 +275,38 @@ def submit_model(
return f"Submitted [{model_id}]! Job ID: {job.job_id}. Check the Queue tab."
def submit_all_presets(runs: int, max_parallel_lanes: int, submitter: str) -> str:
"""Submit all preset models at once."""
def submit_all_presets(
preset_audience: str,
runs: int,
max_parallel_lanes: int,
submitter: str,
) -> str:
"""Submit all preset models from the selected audience track."""
presets = preset_models_for_audience(preset_audience)
if not presets:
return f"No presets configured for {preset_audience}."
submitted = []
for name, model_id in PRESET_MODELS.items():
for preset in presets:
request = SubmissionRequest(
model=model_id,
provider="",
model=preset.model_id,
provider=preset.provider,
runs_per_task=int(runs),
max_parallel_lanes=int(max_parallel_lanes),
submitter=submitter.strip(),
)
job = asyncio.run(queue.submit(request))
submitted.append(f"{name} ({job.job_id})")
return f"Submitted {len(submitted)} models:\n" + "\n".join(f" - {s}" for s in submitted)
submitted.append(f"{preset.label} ({job.job_id})")
return f"Submitted {len(submitted)} models from {preset_audience}:\n" + "\n".join(
f" - {item}" for item in submitted
)
def update_preset_choices(preset_audience: str):
return gr.update(
choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(preset_audience),
value=CUSTOM_PRESET_LABEL,
)
# ---------------------------------------------------------------------------
@ -952,7 +953,7 @@ STAT_JUDGE = (
)
STAT_PRESETS = (
'<div class="stat-pill"><div class="label">Presets</div><div class="value teal">'
+ str(len(PRESET_MODELS))
+ str(len(PRESET_MODEL_MAP))
+ "</div></div>"
)
@ -986,12 +987,28 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
"run via HuggingFace Inference API. You can also use locally hosted models "
"(for example Ollama) when your OpenClaw runtime has them configured."
)
gr.Markdown(
"Use `Preset Audience` to switch between the full Claw catalog and a smaller budget track. "
"The budget track keeps local and lower-cost options upfront, including `ollama/gpt-oss:20b`, "
"`ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and "
"`huggingface/google/gemma-4-26B-A4B-it`."
)
preset_audience_input = gr.Dropdown(
choices=list(PRESET_AUDIENCE_CHOICES),
value=PRESET_AUDIENCE_ALL,
label="Preset Audience",
)
preset_input = gr.Dropdown(
choices=["(custom)"] + list(PRESET_MODELS.keys()),
value="(custom)",
choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(PRESET_AUDIENCE_ALL),
value=CUSTOM_PRESET_LABEL,
label="Preset models",
)
preset_audience_input.change(
fn=update_preset_choices,
inputs=preset_audience_input,
outputs=preset_input,
)
with gr.Row():
model_input = gr.Textbox(
label="Custom Model ID (if not using preset)",
@ -1074,26 +1091,35 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
)
submit_all_btn.click(
fn=submit_all_presets,
inputs=[runs_input, max_parallel_lanes_input, submitter_input],
inputs=[preset_audience_input, runs_input, max_parallel_lanes_input, submitter_input],
outputs=submit_output,
)
gr.Markdown("""
**All presets verified working on HF Inference API (free):**
**Preset audiences:**
| Model | Provider | Size | Runtime |
|-------|----------|------|---------|
| GLM 5.1 | Z.ai | 754B MoE | HF free |
| GLM 5 | Z.ai | 400B MoE | HF free |
| Qwen3 32B | Alibaba | 32B | HF free |
| DeepSeek R1 | DeepSeek | 671B MoE | HF free |
| Kimi K2 Instruct | Moonshot AI | MoE | HF free |
| MiniMax M2.5 | MiniMax | MoE | HF free |
| Gemma 4 26B MoE | Google | 26B MoE | HF free |
| Llama 3.3 70B | Meta | 70B | HF free |
| Llama 3.1 70B | Meta | 70B | HF free |
| Claude Sonnet 4.6 | Anthropic | - | configured auth |
| Claude Opus 4.6 | Anthropic | - | configured auth |
| Audience | What it optimizes for | Presets |
|---|---|---|
| Claw Users | Full preset catalog, including provider-backed frontier options | Anthropic, HF open-weight, and Ollama presets |
| Budget Researchers | Smaller local/free-friendly track | GPT-OSS 20B, Qwen 3.5 27B, Qwen3 32B, Gemma 4 26B |
**Current preset catalog:**
| Model | Provider | Audience |
|---|---|---|
| GPT-OSS 20B (Ollama) | Ollama | Claw Users, Budget Researchers |
| Qwen 3.5 27B (Ollama) | Ollama | Claw Users, Budget Researchers |
| Qwen3 32B | HuggingFace | Claw Users, Budget Researchers |
| Gemma 4 26B MoE | HuggingFace | Claw Users, Budget Researchers |
| GLM 5.1 | HuggingFace | Claw Users |
| GLM 5 | HuggingFace | Claw Users |
| DeepSeek R1 | HuggingFace | Claw Users |
| Kimi K2 Instruct | HuggingFace | Claw Users |
| MiniMax M2.5 | HuggingFace | Claw Users |
| Llama 3.3 70B | HuggingFace | Claw Users |
| Llama 3.1 70B | HuggingFace | Claw Users |
| Claude Sonnet 4.6 | Anthropic | Claw Users |
| Claude Opus 4.6 | Anthropic | Claw Users |
""")
with gr.Tab("Queue"):


@ -116,6 +116,11 @@ def cli(verbose: bool) -> None:
show_default=True,
help="Where to write ecosystem insight files after a --profile run.",
)
@click.option(
"--dynamics",
is_flag=True,
help="Run quick post-benchmark dynamics analysis. Prefer dynamics-report for offline cache/archive analysis.",
)
def run(
model: str,
gateway_token: str,
@ -137,6 +142,7 @@ def run(
browser_concurrency: int,
profile: Path | None,
insights_dir: Path,
dynamics: bool,
) -> None:
gateway_config = GatewayConfig(token=gateway_token)
harness = BenchmarkHarness(
@ -165,6 +171,9 @@ def run(
json.dump(result.model_dump(), handle, indent=2)
click.echo(f"\nResults saved to {out_path}")
if dynamics:
_run_dynamics_analysis(harness.last_task_runs, out_path)
if profile is not None:
_run_v05_diagnostic(
profile_path=profile,
@ -179,6 +188,83 @@ def run(
asyncio.run(upload_result(result))
@cli.command("dynamics-report")
@click.option(
"--archive-dir",
type=click.Path(exists=True, file_okay=False, path_type=Path),
required=True,
help="Path to a run cache/archive root or a single model cache directory.",
)
@click.option(
"--model",
default=None,
help="Model id to select when the archive root contains multiple model directories.",
)
@click.option("--tier", type=click.Choice(["tier1", "tier2", "tier3", "tier4", "tier5"]))
@click.option("--task", "task_ids", multiple=True, help="Specific task IDs to include from the archive.")
@click.option(
"--output-dir",
type=click.Path(path_type=Path),
default=Path("results/offline_dynamics"),
show_default=True,
help="Directory where dynamics.json and plots will be written.",
)
@click.option(
"--no-plots",
is_flag=True,
help="Write only dynamics.json and skip plot rendering.",
)
def dynamics_report(
archive_dir: Path,
model: str | None,
tier: str | None,
task_ids: tuple[str, ...],
output_dir: Path,
no_plots: bool,
) -> None:
"""Generate dynamics plots and a JSON report from cached TaskRunResult archives."""
from clawbench.dynamics_archive import load_task_runs_archive
try:
task_runs = load_task_runs_archive(
archive_dir=archive_dir,
model=model,
task_ids=task_ids,
tier=tier,
)
except ValueError as exc:
raise click.ClickException(str(exc)) from exc
if not task_runs:
raise click.ClickException(f"No cached runs found under {archive_dir}")
report_path, plots, n_runs = _write_dynamics_report(
task_runs,
output_dir,
generate_plots=not no_plots,
)
click.echo(f"Loaded {n_runs} cached runs across {len(task_runs)} tasks")
click.echo(f"Dynamics report saved to {report_path}")
click.echo(f"Saved {len(plots)} plots to {output_dir}/")
def _write_dynamics_report(
task_runs: dict[str, list],
output_dir: Path,
*,
generate_plots: bool = True,
) -> tuple[Path, list[Path], int]:
from clawbench.dynamics_archive import write_dynamics_report
report_path, plots = write_dynamics_report(
task_runs,
output_dir,
generate_plots=generate_plots,
)
n_runs = sum(len(runs) for runs in task_runs.values())
return report_path, plots, n_runs
def _run_v05_diagnostic(
*,
profile_path: Path,
@ -693,5 +779,23 @@ def show(result_file: str) -> None:
)
def _run_dynamics_analysis(
task_runs: dict[str, list],
result_path: str,
) -> None:
"""Compute stratified dynamics from raw TaskRunResult objects."""
run_stem = Path(result_path).stem
dyn_dir = Path(result_path).parent / f"{run_stem}_dynamics"
try:
dyn_path, plots, n_runs = _write_dynamics_report(task_runs, dyn_dir)
except ValueError as exc:
click.echo(str(exc))
return
click.echo(f"\n[dynamics] Analysed {n_runs} cached runs")
click.echo(f" Dynamics report saved to {dyn_path}")
click.echo(f" Saved {len(plots)} plots to {dyn_dir}/")
def main() -> None:
cli()


@ -8,7 +8,9 @@ import logging
import math
import os
import re
import shutil
import subprocess
import sys
import uuid
from dataclasses import dataclass, field
from typing import Any
@ -24,10 +26,10 @@ logger = logging.getLogger(__name__)
PROTOCOL_VERSION = 3
DEVICE_IDENTITY_HELPER_JS = r"""
const crypto = require("node:crypto");
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");
const crypto = require("crypto");
const fs = require("fs");
const os = require("os");
const path = require("path");
const ED25519_SPKI_PREFIX = Buffer.from("302a300506032b6570032100", "hex");
@ -52,7 +54,7 @@ function fingerprintPublicKey(publicKeyPem) {
}
function generateIdentity() {
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519", {});
const publicKeyPem = publicKey.export({ type: "spki", format: "pem" }).toString();
const privateKeyPem = privateKey.export({ type: "pkcs8", format: "pem" }).toString();
return {
@ -445,12 +447,48 @@ class GatewayClient:
max_wait_seconds=2.0,
)
)
# Some gateway/provider paths persist assistant messages in session
# history without emitting complete streaming events. Backfill from
# sessions.get if stream capture appears incomplete.
history_messages = await self.get_session_messages(session_key)
collected_assistant = sum(
1 for msg in collected_messages if msg.role == "assistant"
)
history_assistant = sum(
1 for msg in history_messages if msg.role == "assistant"
)
if history_messages and (
len(history_messages) > len(collected_messages)
or history_assistant > collected_assistant
):
collected_messages = history_messages
finally:
self._event_queues.pop(chat_queue_key, None)
self._event_queues.pop(msg_queue_key, None)
return _correlate_transcript(Transcript(messages=collected_messages))
async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]:
try:
response = await self._rpc("sessions.get", {"key": session_key})
except Exception:
return []
payload = response.get("payload", {})
raw_messages = payload.get("messages", [])
if not isinstance(raw_messages, list):
return []
parsed: list[TranscriptMessage] = []
for raw in raw_messages:
if not isinstance(raw, dict):
continue
msg = _parse_single_message(raw)
if msg is not None:
parsed.append(msg)
return parsed
async def _rpc(
self,
method: str,
@ -551,9 +589,17 @@ def _build_connect_device(
"deviceFamily": device_family or "",
}
)
node_executable = _resolve_node_executable()
if not node_executable:
logger.warning(
"Failed to build device identity payload: no Node executable found"
)
return None
try:
completed = subprocess.run(
["node", "-e", DEVICE_IDENTITY_HELPER_JS],
[node_executable, "-e", DEVICE_IDENTITY_HELPER_JS],
input=helper_input,
capture_output=True,
text=True,
@ -577,6 +623,25 @@ def _build_connect_device(
return payload
def _resolve_node_executable() -> str | None:
"""Resolve Node binary, preferring the active Python/conda environment."""
candidates: list[str] = []
# First try the same environment as the active Python interpreter.
candidates.append(os.path.join(os.path.dirname(sys.executable), "node"))
# Then try CONDA_PREFIX when available.
conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:
candidates.append(os.path.join(conda_prefix, "bin", "node"))
for candidate in candidates:
if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
return candidate
return shutil.which("node")
def _is_transient_gateway_connect_error(exc: Exception) -> bool:
if isinstance(exc, InvalidStatus):
return exc.response.status_code in {502, 503, 504}
@ -615,6 +680,9 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
if block_type == "text":
text_parts.append(block.get("text", ""))
continue
if block_type == "output_text":
text_parts.append(block.get("text", ""))
continue
if block_type in {"tool_use", "toolCall"}:
arguments = block.get("input", block.get("arguments", {}))
if isinstance(arguments, str):
@ -641,6 +709,16 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
if tool_result_content:
text_parts.append(tool_result_content)
# Some providers surface assistant failures in a dedicated error field
# with empty content blocks. Preserve that signal in transcript text.
error_message = message_data.get("errorMessage", "")
if isinstance(error_message, str) and error_message.strip():
text_parts.append(error_message.strip())
direct_text = message_data.get("text", "")
if isinstance(direct_text, str) and direct_text.strip():
text_parts.append(direct_text.strip())
if not text_parts and not tool_calls and not tool_result_for:
return None

695
clawbench/dynamics.py Normal file

@ -0,0 +1,695 @@
"""Dynamics analysis for ClawBench agent trajectories.
Treats each agent run as a discrete dynamical system and computes step
embeddings, trajectory metrics, sensitivity analysis, regime classification,
Kaplan-Meier survival, non-Markov memory, and stratified assessment with
Bayesian importance-weight correction for distribution shift.
"""
from __future__ import annotations
import math
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import TYPE_CHECKING, Callable
import numpy as np
if TYPE_CHECKING:
from clawbench.schemas import TaskRunResult, Transcript
# ── Constants ──────────────────────────────────────────────────────────
TOOL_FAMILIES = ("browser", "edit", "execute", "memory", "read", "search")
_N_FAM = len(TOOL_FAMILIES)
# ── Types ──────────────────────────────────────────────────────────────
class Regime(str, Enum):
convergent = "convergent"
chaotic = "chaotic"
trapped = "trapped"
diffusive = "diffusive"
limit_cycle = "limit_cycle"
unknown = "unknown"
@dataclass
class Dynamics:
"""Computed dynamics for a single trajectory."""
n_steps: int
embeddings: np.ndarray # (n_steps, 10)
drift: np.ndarray # cosine distance from step 0
step_size: np.ndarray # cosine distance from step t-1
entropy_series: list[float] # running tool-family entropy
error_rate_series: list[float] # running error fraction
tokens_series: list[int]
latency_series: list[float]
tool_sequence: list[str] # primary family per step
markov: dict[str, dict[str, float]]
family_dist: dict[str, float]
regime: Regime
mean_drift: float
mean_step_size: float
tool_entropy: float
error_rate: float
constraint_index: float
pca_trajectory: np.ndarray | None = None # (n_steps, 2)
bigram_transitions: dict[str, dict[str, float]] = field(default_factory=dict)
memory_depth: float = 0.0 # I(X_t; X_{t-2} | X_{t-1})
@dataclass
class Sensitivity:
"""Pairwise comparison between two runs of the same task."""
task_id: str
score_delta: float
tool_edit_distance: int
family_js_divergence: float
embedding_divergence: np.ndarray # (min_steps,)
lyapunov_proxy: float
@dataclass
class SurvivalPoint:
time: float
survival: float
# ── Helpers ────────────────────────────────────────────────────────────
def _cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
na, nb = np.linalg.norm(a), np.linalg.norm(b)
if na < 1e-12 or nb < 1e-12:
return 1.0
return float(1.0 - np.dot(a, b) / (na * nb))
def _entropy(counts: dict[str, int]) -> float:
total = sum(counts.values())
if total == 0:
return 0.0
return -sum(
(c / total) * math.log2(c / total) for c in counts.values() if c > 0
)
def _js_divergence(p: dict[str, int], q: dict[str, int]) -> float:
keys = set(p) | set(q)
if not keys:
return 0.0
tp, tq = sum(p.values()) or 1, sum(q.values()) or 1
jsd = 0.0
for k in keys:
pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq
mk = (pk + qk) / 2
if pk > 0 and mk > 0:
jsd += 0.5 * pk * math.log2(pk / mk)
if qk > 0 and mk > 0:
jsd += 0.5 * qk * math.log2(qk / mk)
return jsd
def _levenshtein(a: list, b: list) -> int:
if not a:
return len(b)
if not b:
return len(a)
prev = list(range(len(b) + 1))
for ca in a:
curr = [prev[0] + 1] + [0] * len(b)
for j, cb in enumerate(b):
curr[j + 1] = min(
prev[j] + (0 if ca == cb else 1),
prev[j + 1] + 1,
curr[j] + 1,
)
prev = curr
return prev[-1]
def _classify_tool(name: str) -> str:
lo = name.lower()
for fam in TOOL_FAMILIES:
if fam in lo:
return fam
_ALIASES = {
"edit": ("write_file", "create_file", "str_replace", "patch"),
"execute": ("bash", "terminal", "shell", "run", "exec"),
"browser": ("browse", "click", "navigate", "screenshot"),
"search": ("grep", "find", "glob", "semantic"),
"read": ("cat", "head", "tail", "view", "list_dir"),
}
for fam, keywords in _ALIASES.items():
if any(k in lo for k in keywords):
return fam
return "execute"
def _normalize_tool_family(name: str, family: str | None) -> str:
if family in TOOL_FAMILIES:
return family
return _classify_tool(name)
# ── Feature embedding ──────────────────────────────────────────────────
def _embed_transcript(
transcript: Transcript,
) -> tuple[np.ndarray, list[str], list[int], list[float], list[bool]]:
"""Build (n_steps, 10) feature matrix from assistant turns.
Features: [0:6] tool-family proportions, [6] error flag,
[7] normalised tokens, [8] normalised text length, [9] progress.
"""
msgs = transcript.assistant_messages
n = len(msgs)
if n == 0:
return np.empty((0, _N_FAM + 4)), [], [], [], []
X = np.zeros((n, _N_FAM + 4))
families: list[str] = []
tokens: list[int] = []
latencies: list[float] = []
errors: list[bool] = []
raw_tokens = np.zeros(n)
raw_text = np.zeros(n)
for i, msg in enumerate(msgs):
fam_counts: Counter = Counter()
has_err = False
for tc in msg.tool_calls:
fam = _normalize_tool_family(tc.name, tc.family)
fam_counts[fam] += 1
if tc.success is False or tc.error:
has_err = True
n_tc = sum(fam_counts.values()) or 1
for j, fam in enumerate(TOOL_FAMILIES):
X[i, j] = fam_counts.get(fam, 0) / n_tc
X[i, _N_FAM] = 1.0 if has_err else 0.0
X[i, _N_FAM + 3] = i / max(n - 1, 1)
families.append(
max(fam_counts, key=fam_counts.get) if fam_counts else "execute"
)
errors.append(has_err)
tokens.append(msg.usage.total_tokens)
raw_tokens[i] = float(msg.usage.total_tokens)
raw_text[i] = float(len(msg.text))
dt = msg.timestamp_ms - msgs[i - 1].timestamp_ms if i > 0 else 0
latencies.append(max(float(dt), 0.0))
mx_tok = raw_tokens.max() or 1
mx_txt = raw_text.max() or 1
X[:, _N_FAM + 1] = raw_tokens / mx_tok
X[:, _N_FAM + 2] = raw_text / mx_txt
return X, families, tokens, latencies, errors
# ── Non-Markov memory ────────────────────────────────────────────────
def _compute_bigram_transitions(seq: list[str]) -> dict[str, dict[str, float]]:
"""P(family_t | family_{t-1}, family_{t-2}) grouped by bigram context."""
if len(seq) < 3:
return {}
bigrams: dict[str, Counter] = {}
for a, b, c in zip(seq[:-2], seq[1:-1], seq[2:]):
ctx = f"{a}->{b}"
bigrams.setdefault(ctx, Counter())[c] += 1
return {
ctx: {k: v / sum(cnts.values()) for k, v in cnts.items()}
for ctx, cnts in bigrams.items()
}
def _conditional_mi(seq: list[str]) -> float:
"""I(X_t ; X_{t-2} | X_{t-1}): non-Markov memory indicator."""
if len(seq) < 3:
return 0.0
n = len(seq) - 2
triple = Counter(zip(seq[:-2], seq[1:-1], seq[2:]))
pair_01 = Counter(zip(seq[:-2], seq[1:-1]))
pair_12 = Counter(zip(seq[1:-1], seq[2:]))
single = Counter(seq[1:-1])
mi = 0.0
for (a, b, c), count in triple.items():
p_abc = count / n
p_ab, p_bc, p_b = pair_01[(a, b)] / n, pair_12[(b, c)] / n, single[b] / n
if p_ab > 0 and p_bc > 0 and p_b > 0:
mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc))
return max(mi, 0.0)
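# Illustrative sketch (not part of the module): a standalone copy of the
# conditional mutual information estimate above, run on two toy tool-family
# sequences. The name `conditional_mi` and both sequences are assumptions made
# for this example. A strict alternation should score about zero because the
# previous step already determines the next one; the second sequence needs two
# steps of history, so it should score well above zero.

```python
import math
from collections import Counter


def conditional_mi(seq):
    """Estimate I(X_t ; X_{t-2} | X_{t-1}) in bits from triple counts."""
    if len(seq) < 3:
        return 0.0
    n = len(seq) - 2  # number of (t-2, t-1, t) windows
    triple = Counter(zip(seq[:-2], seq[1:-1], seq[2:]))
    pair_ab = Counter(zip(seq[:-2], seq[1:-1]))
    pair_bc = Counter(zip(seq[1:-1], seq[2:]))
    middle = Counter(seq[1:-1])
    mi = 0.0
    for (a, b, c), count in triple.items():
        p_abc = count / n
        p_ab = pair_ab[(a, b)] / n
        p_bc = pair_bc[(b, c)] / n
        p_b = middle[b] / n
        mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc))
    return max(mi, 0.0)


# A strict alternation is fully explained by the previous step alone.
markov_like = conditional_mi(["read", "edit"] * 4)
# Here the step after "a" depends on what came before that "a".
with_memory = conditional_mi(list("aabbaabb"))
```

# On short sequences the estimate is biased upward, so values from runs with
# only a handful of steps are best compared rather than read in absolute terms.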
# ── Core analysis ──────────────────────────────────────────────────────
def compute_dynamics(transcript: Transcript) -> Dynamics:
"""Compute trajectory dynamics from a single run transcript."""
X, families, tokens, latencies, errors = _embed_transcript(transcript)
n = len(families)
drift = (
np.array([_cosine_dist(X[0], X[i]) for i in range(n)])
if n else np.array([])
)
step_sz = np.zeros(n)
for i in range(1, n):
step_sz[i] = _cosine_dist(X[i - 1], X[i])
fam_acc: Counter = Counter()
err_count = 0
entropy_s: list[float] = []
error_s: list[float] = []
for i, (fam, err) in enumerate(zip(families, errors)):
fam_acc[fam] += 1
err_count += int(err)
entropy_s.append(_entropy(dict(fam_acc)))
error_s.append(err_count / (i + 1))
total = sum(fam_acc.values()) or 1
fam_dist = {k: v / total for k, v in fam_acc.items()}
mc: dict[str, Counter] = {f: Counter() for f in TOOL_FAMILIES}
for a, b in zip(families[:-1], families[1:]):
mc[a][b] += 1
markov = {
src: ({dst: c / t for dst, c in cnts.items()} if (t := sum(cnts.values())) else {})
for src, cnts in mc.items()
}
ci = 0.5
if n > 2:
cov = np.cov(X.T)
eigvals = np.maximum(np.linalg.eigvalsh(cov), 0)
tv = eigvals.sum()
if tv > 1e-10:
p = eigvals / tv
pr = 1.0 / np.sum(p**2)
ci = 1.0 - (pr - 1) / (X.shape[1] - 1)
h = _entropy(dict(fam_acc))
er = err_count / n if n else 0
regime = _classify_regime(drift, step_sz, h, er, ci, n)
return Dynamics(
n_steps=n,
embeddings=X,
drift=drift,
step_size=step_sz,
entropy_series=entropy_s,
error_rate_series=error_s,
tokens_series=tokens,
latency_series=latencies,
tool_sequence=families,
markov=markov,
family_dist=fam_dist,
regime=regime,
mean_drift=float(np.mean(drift)) if n else 0,
mean_step_size=float(np.mean(step_sz)) if n else 0,
tool_entropy=h,
error_rate=er,
constraint_index=ci,
bigram_transitions=_compute_bigram_transitions(families),
memory_depth=_conditional_mi(families),
)
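# Illustrative sketch (not part of the module): the per-run constraint index
# computed in compute_dynamics, written against a plain list of covariance
# eigenvalues so the mapping is easy to check by hand. The function name
# `constraint_index` is an assumption for this example. A spectrum with all
# variance on one axis has participation ratio 1 and maps to 1.0; a flat
# spectrum over `dim` axes has participation ratio `dim` and maps to 0.0.

```python
def constraint_index(eigvals, dim):
    """Map a covariance spectrum to [0, 1]; 1 means one dominant direction."""
    total = sum(eigvals)
    if total <= 1e-10 or dim < 2:
        return 0.5  # neutral default, mirroring the fallback in compute_dynamics
    p = [v / total for v in eigvals]
    participation_ratio = 1.0 / sum(x * x for x in p)
    return 1.0 - (participation_ratio - 1.0) / (dim - 1)


one_direction = constraint_index([1.0, 0.0, 0.0, 0.0], dim=4)
isotropic = constraint_index([1.0, 1.0, 1.0, 1.0], dim=4)
```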
def _classify_regime(drift, step_sz, entropy, error_rate, ci, n) -> Regime:
if n < 3:
return Regime.unknown
if entropy < 0.5 or (error_rate > 0.6 and float(np.std(drift)) < 0.05):
return Regime.trapped
q = max(1, n // 4)
late_drift_std = float(np.std(drift[-q:]))
late_step_mean = float(np.mean(step_sz[-q:]))
if late_drift_std < 0.1 and late_step_mean < 0.15 and error_rate < 0.2:
return Regime.convergent
if entropy > 1.5 and error_rate < 0.15 and ci < 0.8:
return Regime.diffusive
step_var = float(np.var(step_sz[1:])) if n > 1 else 0
if entropy > 2.0 and step_var > 0.02:
return Regime.chaotic
if n > 6:
ss = step_sz[1:]
ss_c = ss - ss.mean()
norm = np.dot(ss_c, ss_c)
if norm > 1e-10:
ac = np.correlate(ss_c, ss_c, mode="full")
ac = ac[len(ac) // 2:] / norm
if len(ac) > 5 and max(ac[2:6]) > 0.3:
return Regime.limit_cycle
return Regime.unknown
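# Illustrative sketch (not part of the module): the limit-cycle test above
# looks for a peak in the normalised autocorrelation of step sizes at lags 2-5.
# This pure-Python stand-in for the np.correlate call uses assumed names
# (`normalized_autocorr`, `looks_periodic`) and a made-up period-2 series.

```python
def normalized_autocorr(series):
    """Autocorrelation at lags 0..n-1, scaled so that lag 0 equals 1."""
    n = len(series)
    if n == 0:
        return []
    mean = sum(series) / n
    centred = [x - mean for x in series]
    norm = sum(x * x for x in centred)
    if norm <= 1e-10:
        return []  # constant series: no periodic structure to detect
    return [
        sum(centred[i] * centred[i + lag] for i in range(n - lag)) / norm
        for lag in range(n)
    ]


def looks_periodic(series, threshold=0.3):
    """Same lag window and threshold as the classifier's limit-cycle branch."""
    ac = normalized_autocorr(series)
    return len(ac) > 5 and max(ac[2:6]) > threshold


alternating = [0.1, 0.5] * 4  # step sizes that repeat every two steps
ac = normalized_autocorr(alternating)
```

# Lag 1 is skipped by the classifier: a period-2 pattern is strongly negative
# there and only shows up as a positive peak from lag 2 onward.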
# ── Sensitivity ────────────────────────────────────────────────────────
def compute_sensitivity(
run_a: TaskRunResult,
run_b: TaskRunResult,
task_id: str = "",
) -> Sensitivity:
"""Compare two runs of the same task for prompt sensitivity."""
Xa, fam_a, *_ = _embed_transcript(run_a.transcript)
Xb, fam_b, *_ = _embed_transcript(run_b.transcript)
min_n = min(len(Xa), len(Xb))
emb_div = (
np.array([_cosine_dist(Xa[i], Xb[i]) for i in range(min_n)])
if min_n else np.array([])
)
lyap = 0.0
if min_n > 1:
d0 = max(_cosine_dist(Xa[0], Xb[0]), 1e-6)
lyap = sum(
math.log(max(emb_div[t], 1e-6) / d0) / t for t in range(1, min_n)
) / (min_n - 1)
return Sensitivity(
task_id=task_id or run_a.task_id,
score_delta=abs(run_a.run_score - run_b.run_score),
tool_edit_distance=_levenshtein(fam_a, fam_b),
family_js_divergence=_js_divergence(dict(Counter(fam_a)), dict(Counter(fam_b))),
embedding_divergence=emb_div,
lyapunov_proxy=lyap,
)
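# Illustrative sketch (not part of the module): the Lyapunov proxy averages
# log(d_t / d_0) / t over aligned steps, so a divergence curve that grows like
# d_0 * exp(rate * t) should return roughly `rate`. The function name and the
# synthetic curves are assumptions for this example.

```python
import math


def lyapunov_proxy(divergence, floor=1e-6):
    """Mean per-step log growth of a pairwise divergence curve."""
    n = len(divergence)
    if n < 2:
        return 0.0
    d0 = max(divergence[0], floor)
    return sum(
        math.log(max(divergence[t], floor) / d0) / t for t in range(1, n)
    ) / (n - 1)


expanding = lyapunov_proxy([0.01 * math.exp(0.5 * t) for t in range(6)])
contracting = lyapunov_proxy([0.4, 0.2, 0.1])
```

# A positive value means the two runs drift apart over time; a negative value
# means they contract toward each other.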
# ── Survival analysis ─────────────────────────────────────────────────
def kaplan_meier(
event_times: list[float],
censored: list[bool] | None = None,
) -> list[SurvivalPoint]:
"""Kaplan-Meier survival estimator."""
n = len(event_times)
if n == 0:
return []
if censored is None:
censored = [False] * n
pairs = sorted(zip(event_times, censored))
pts = [SurvivalPoint(0.0, 1.0)]
at_risk = n
surv = 1.0
for t, cens in pairs:
if cens:
at_risk -= 1
continue
if at_risk > 0:
surv *= (at_risk - 1) / at_risk
at_risk -= 1
pts.append(SurvivalPoint(t, surv))
return pts
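# Illustrative sketch (not part of the module): a hand-checkable run of the
# estimator above using plain tuples instead of SurvivalPoint. The name
# `km_curve` and the toy data are assumptions. Four runs, one censored at
# step 3: survival drops to 3/4 at step 2, to 1/2 at step 3, and to 0 at
# step 5, because the censored run leaves the risk set without causing a drop.

```python
def km_curve(event_times, censored=None):
    """Return (time, survival) points; censored runs only shrink the risk set."""
    n = len(event_times)
    if censored is None:
        censored = [False] * n
    at_risk = n
    surv = 1.0
    points = [(0.0, 1.0)]
    # Sorting (time, flag) pairs puts events before censorings at equal times.
    for t, is_censored in sorted(zip(event_times, censored)):
        if not is_censored and at_risk > 0:
            surv *= (at_risk - 1) / at_risk
            points.append((t, surv))
        at_risk -= 1
    return points


curve = km_curve([2.0, 3.0, 3.0, 5.0], [False, True, False, False])
```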
def find_event_step(transcript: Transcript, event: str) -> float | None:
"""Return step index of the first occurrence of *event*, or None."""
msgs = transcript.assistant_messages
if event == "first_error_recovery":
in_err = False
for i, m in enumerate(msgs):
any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
if any_err:
in_err = True
elif in_err:
return float(i)
elif event == "first_correct_write":
for i, m in enumerate(msgs):
for tc in m.tool_calls:
fam = tc.family or _classify_tool(tc.name)
if fam == "edit" and tc.success is not False and not tc.error:
return float(i)
elif event == "task_completion":
if msgs:
last = msgs[-1]
if not any(tc.success is False or tc.error for tc in last.tool_calls):
return float(len(msgs) - 1)
elif event == "failure_absorption":
err_seen = False
for i, m in enumerate(msgs):
any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
if any_err:
err_seen = True
elif err_seen and m.tool_calls:
return float(i)
return None
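# Illustrative sketch (not part of the module): the "first_error_recovery"
# branch above reduced to a list of per-step error flags, to show which step
# index is reported. The function name and the flag lists are assumptions.

```python
def first_recovery_step(error_flags):
    """Index of the first clean step that follows at least one errored step."""
    in_error = False
    for i, has_error in enumerate(error_flags):
        if has_error:
            in_error = True
        elif in_error:
            return i
    return None  # never errored, or never recovered


recovered_at = first_recovery_step([False, True, True, False, False])
```

# A run that never recovers returns None, which the report builder treats as
# a censored observation when it feeds event steps into the survival curve.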
# ── PCA trajectory bundles ─────────────────────────────────────────────
def compute_pca_bundle(
dynamics_list: list[Dynamics],
) -> tuple[np.ndarray, list[np.ndarray]]:
"""Fit PCA on pooled embeddings, project each trajectory into PC1-PC2."""
non_empty = [d.embeddings for d in dynamics_list if d.n_steps > 0]
if not non_empty:
for d in dynamics_list:
d.pca_trajectory = np.empty((0, 2))
return np.zeros((2, _N_FAM + 4)), [d.pca_trajectory for d in dynamics_list]
all_emb = np.vstack(non_empty)
mean = all_emb.mean(axis=0)
centred = all_emb - mean
_, _, Vt = np.linalg.svd(centred, full_matrices=False)
components = Vt[:2]
projections: list[np.ndarray] = []
for d in dynamics_list:
proj = (d.embeddings - mean) @ components.T if d.n_steps else np.empty((0, 2))
d.pca_trajectory = proj
projections.append(proj)
return components, projections
# ── Stratified assessment with Bayesian reweighting ───────────────────
@dataclass
class StratumStats:
"""Distributional statistics for one stratum of runs."""
name: str
n_runs: int
weight: float
# Score distribution
scores: np.ndarray
score_mean: float
score_std: float
score_quantiles: dict[str, float] # q10, q25, q50, q75, q90
# Dynamics distributions
entropy_dist: np.ndarray
error_rate_dist: np.ndarray
constraint_dist: np.ndarray
memory_depth_dist: np.ndarray
mean_drift_dist: np.ndarray
mean_step_size_dist: np.ndarray
# Time-series curves (aligned by step index)
drift_curve_mean: np.ndarray
drift_curve_std: np.ndarray
step_curve_mean: np.ndarray
step_curve_std: np.ndarray
regime_counts: dict[str, int]
sensitivity_deltas: np.ndarray
# Scalar fields on StratumStats that reweight() aggregates.
_REWEIGHT_FIELDS = [
("entropy", "entropy_dist"),
("error_rate", "error_rate_dist"),
("constraint", "constraint_dist"),
("memory_depth", "memory_depth_dist"),
("mean_drift", "mean_drift_dist"),
("mean_step_size", "mean_step_size_dist"),
]
@dataclass
class StratifiedAssessment:
"""Full stratified assessment with Bayesian reweighting.
Call ``reweight(target_weights)`` with a different task distribution
to obtain importance-weighted aggregate estimates.
"""
strata: list[StratumStats]
stratifier_name: str
total_runs: int
observed_mean_score: float
observed_std_score: float
def stratum_names(self) -> list[str]:
return [s.name for s in self.strata]
def reweight(self, target_weights: dict[str, float]) -> dict[str, float]:
"""Bayesian importance-weight correction.
w_k = p_target(k) / p_observed(k), then normalised.
"""
t_total = sum(target_weights.values()) or 1.0
p_target = {k: v / t_total for k, v in target_weights.items()}
by_name = {s.name: s for s in self.strata}
weights = {
name: pt / by_name[name].weight
for name, pt in p_target.items()
if name in by_name and by_name[name].weight > 1e-12
}
if not weights:
return {"score_mean": self.observed_mean_score,
"score_std": self.observed_std_score}
w_total = sum(weights.values())
w = {k: v / w_total for k, v in weights.items()}
# Reweight score (mean + law-of-total-variance)
score_mu = sum(w[k] * by_name[k].score_mean for k in w)
score_var = sum(
w[k] * (by_name[k].score_std ** 2 + (by_name[k].score_mean - score_mu) ** 2)
for k in w
)
result = {"score_mean": score_mu, "score_std": math.sqrt(max(score_var, 0.0))}
def _safe_mean(arr: np.ndarray) -> float:
return float(np.mean(arr)) if len(arr) > 0 else 0.0
for label, dist_attr in _REWEIGHT_FIELDS:
result[f"{label}_mean"] = sum(
w[k] * _safe_mean(getattr(by_name[k], dist_attr)) for k in w
)
return result
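# Illustrative sketch (not part of the module): one standard way to combine
# per-stratum score means and standard deviations under a target stratum
# distribution, using the law of total variance. The function name, the
# (mean, std) tuple layout and the numbers are assumptions for this example.

```python
import math


def poststratify(strata, target):
    """strata: name -> (mean, std); target: name -> non-negative weight."""
    known = {k: v for k, v in target.items() if k in strata and v > 0}
    total = sum(known.values())
    if total <= 0:
        raise ValueError("target weights do not overlap the observed strata")
    w = {k: v / total for k, v in known.items()}
    mu = sum(w[k] * strata[k][0] for k in w)
    # Total variance = within-stratum variance + between-stratum variance.
    var = sum(
        w[k] * (strata[k][1] ** 2 + (strata[k][0] - mu) ** 2) for k in w
    )
    return mu, math.sqrt(max(var, 0.0))


observed = {"tier1": (0.9, 0.1), "tier2": (0.5, 0.2)}
mean, std = poststratify(observed, {"tier1": 1.0, "tier2": 3.0})
```

# With a 25/75 target mix the mean is 0.25 * 0.9 + 0.75 * 0.5 = 0.6 and the
# variance is 0.25 * (0.01 + 0.09) + 0.75 * (0.04 + 0.01) = 0.0625.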
def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
"""Mean and std of variable-length arrays aligned at step 0."""
if not arrays:
return np.array([]), np.array([])
max_len = max(len(a) for a in arrays)
mat = np.full((len(arrays), max_len), np.nan)
for i, a in enumerate(arrays):
mat[i, :len(a)] = a
return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
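# Illustrative sketch (not part of the module): what step-0 alignment of
# variable-length curves means, written without NumPy so each column is
# explicit. The function name and the toy curves are assumptions. Shorter runs
# simply stop contributing once they end, which is the effect of the NaN
# padding above; the spread is a population standard deviation, as in np.nanstd.

```python
import statistics


def aligned_mean_std(series):
    """Per-step mean and population std over the runs still active at that step."""
    max_len = max((len(s) for s in series), default=0)
    means, stds = [], []
    for step in range(max_len):
        column = [s[step] for s in series if len(s) > step]
        means.append(statistics.fmean(column))
        stds.append(statistics.pstdev(column))
    return means, stds


means, stds = aligned_mean_std([[0.0, 1.0, 2.0], [0.0, 3.0]])
```

# The last step is backed by a single run, so its spread is 0.0; late-step
# bands in the plots should be read with that shrinking sample size in mind.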
def build_strata(
runs: list[TaskRunResult],
dynamics_list: list[Dynamics],
scores: list[float],
stratifier: Callable[[TaskRunResult, Dynamics], str],
stratifier_name: str = "custom",
sensitivities: list[Sensitivity] | None = None,
) -> StratifiedAssessment:
"""Group runs into strata and compute per-stratum distributions."""
assert len(runs) == len(dynamics_list) == len(scores)
groups: dict[str, list[int]] = {}
for idx, (r, d) in enumerate(zip(runs, dynamics_list)):
groups.setdefault(stratifier(r, d), []).append(idx)
total = len(runs)
all_scores = np.array(scores)
sens_by_task: dict[str, list[Sensitivity]] = {}
if sensitivities:
for s in sensitivities:
sens_by_task.setdefault(s.task_id, []).append(s)
strata: list[StratumStats] = []
for name, idxs in sorted(groups.items()):
n = len(idxs)
sc = np.array([scores[i] for i in idxs])
dyns = [dynamics_list[i] for i in idxs]
qs = {f"q{q}": float(np.percentile(sc, q)) if n else 0.0
for q in (10, 25, 50, 75, 90)}
drift_m, drift_s = _aligned_mean_std([d.drift for d in dyns])
step_m, step_s = _aligned_mean_std([d.step_size for d in dyns])
stratum_tasks = {runs[i].task_id for i in idxs}
sens_deltas = [
s.score_delta
for tid in stratum_tasks
for s in sens_by_task.get(tid, [])
]
strata.append(StratumStats(
name=name, n_runs=n, weight=n / total if total else 0.0,
scores=sc,
score_mean=float(np.mean(sc)) if n else 0.0,
score_std=float(np.std(sc)) if n else 0.0,
score_quantiles=qs,
entropy_dist=np.array([d.tool_entropy for d in dyns]),
error_rate_dist=np.array([d.error_rate for d in dyns]),
constraint_dist=np.array([d.constraint_index for d in dyns]),
memory_depth_dist=np.array([d.memory_depth for d in dyns]),
mean_drift_dist=np.array([d.mean_drift for d in dyns]),
mean_step_size_dist=np.array([d.mean_step_size for d in dyns]),
drift_curve_mean=drift_m, drift_curve_std=drift_s,
step_curve_mean=step_m, step_curve_std=step_s,
regime_counts=dict(Counter(d.regime.value for d in dyns)),
sensitivity_deltas=np.array(sens_deltas) if sens_deltas else np.array([]),
))
return StratifiedAssessment(
strata=strata,
stratifier_name=stratifier_name,
total_runs=total,
observed_mean_score=float(np.mean(all_scores)) if total else 0.0,
observed_std_score=float(np.std(all_scores)) if total else 0.0,
)
# ── Built-in stratifiers ──────────────────────────────────────────────
def stratify_by_regime(run: TaskRunResult, dyn: Dynamics) -> str:
return dyn.regime.value
def stratify_by_task(run: TaskRunResult, dyn: Dynamics) -> str:
return run.task_id
def stratify_by_tier(run: TaskRunResult, dyn: Dynamics) -> str:
tid = run.task_id.lower()
for i in range(1, 6):
if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
return f"tier{i}"
return "unknown"
def stratify_by_tool_mix(run: TaskRunResult, dyn: Dynamics) -> str:
if not dyn.family_dist:
return "unknown"
return max(dyn.family_dist, key=dyn.family_dist.get)
def stratify_by_prompt_style(run: TaskRunResult, dyn: Dynamics) -> str:
user_msgs = [m for m in run.transcript.messages if m.role == "user"]
if not user_msgs:
return "unknown"
wc = len(user_msgs[0].text.split())
return "terse" if wc <= 6 else ("medium" if wc <= 15 else "verbose")
def stratify_by_scenario(run: TaskRunResult, dyn: Dynamics) -> str:
return run.scenario or "unknown"
def stratify_by_family(run: TaskRunResult, dyn: Dynamics) -> str:
return run.family or "unknown"


@ -0,0 +1,493 @@
"""Offline dynamics analysis helpers for cached ClawBench runs."""
from __future__ import annotations
import json
from itertools import combinations
from pathlib import Path
from typing import Callable, Iterable
import numpy as np
from clawbench.dynamics import (
build_strata,
compute_dynamics,
compute_pca_bundle,
compute_sensitivity,
find_event_step,
kaplan_meier,
stratify_by_regime,
stratify_by_scenario,
stratify_by_tier,
stratify_by_tool_mix,
)
from clawbench.dynamics_plots import generate_all_plots
from clawbench.schemas import TaskRunResult
_TIER_PREFIXES = {
"tier1": ("t1-", "t1_"),
"tier2": ("t2-", "t2_"),
"tier3": ("t3-", "t3_"),
"tier4": ("t4-", "t4_"),
"tier5": ("t5-", "t5_"),
}
def safe_model_name(model: str) -> str:
return model.replace("/", "_").replace(":", "_")
def _candidate_model_dir_names(model: str) -> set[str]:
return {
model,
safe_model_name(model),
model.replace("/", "_"),
model.replace("/", "-").replace(":", "-"),
}
def _has_run_files(path: Path) -> bool:
try:
for child in path.iterdir():
if child.is_file() and child.name.startswith("run") and child.suffix == ".json":
return True
except FileNotFoundError:
return False
return False
def _is_task_collection_root(path: Path) -> bool:
try:
for child in path.iterdir():
if child.is_dir() and _has_run_files(child):
return True
except FileNotFoundError:
return False
return False
def _resolve_model_roots(archive_dir: Path, model: str | None) -> list[Path]:
if _is_task_collection_root(archive_dir):
if model is not None and archive_dir.name not in _candidate_model_dir_names(model):
raise ValueError(
f"Archive dir {archive_dir} does not match requested model {model}."
)
return [archive_dir]
roots = [
child
for child in sorted(archive_dir.iterdir())
if child.is_dir() and _is_task_collection_root(child)
]
if model is not None:
candidates = _candidate_model_dir_names(model)
roots = [root for root in roots if root.name in candidates]
elif len(roots) > 1:
raise ValueError(
"Archive root contains multiple model directories. Pass --model or point "
"--archive-dir at a specific model directory."
)
return roots
def discover_model_roots(archive_dir: Path) -> dict[str, Path]:
"""Discover model directories inside an archive root.
Returns a mapping of model directory name to its path. If archive_dir is
itself a model cache root (contains task directories with run*.json), the
mapping contains a single entry.
"""
if not archive_dir.exists():
raise ValueError(f"Archive dir does not exist: {archive_dir}")
if _is_task_collection_root(archive_dir):
return {archive_dir.name: archive_dir}
roots = {
child.name: child
for child in sorted(archive_dir.iterdir())
if child.is_dir() and _is_task_collection_root(child)
}
return roots
def _matches_tier(task_id: str, tier: str | None) -> bool:
if tier is None:
return True
return task_id.lower().startswith(_TIER_PREFIXES[tier])
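# Illustrative sketch (not part of the module): tier filtering relies on
# str.startswith accepting a tuple of prefixes. The names below and the
# explicit error for an unknown tier are assumptions for this example; the
# helper above would raise KeyError in that case.

```python
TIER_PREFIXES = {f"tier{i}": (f"t{i}-", f"t{i}_") for i in range(1, 6)}


def matches_tier(task_id, tier):
    """True when no tier is requested or the task id carries that tier's prefix."""
    if tier is None:
        return True
    if tier not in TIER_PREFIXES:
        raise ValueError(f"unknown tier: {tier!r}")
    return task_id.lower().startswith(TIER_PREFIXES[tier])
```

# Including the separator in each prefix keeps "t21_x" from matching tier2.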
def load_task_runs_archive(
archive_dir: Path,
model: str | None = None,
task_ids: Iterable[str] | None = None,
tier: str | None = None,
) -> dict[str, list[TaskRunResult]]:
"""Load cached TaskRunResult objects from a run cache/archive directory."""
task_filter = set(task_ids or [])
task_runs: dict[str, list[TaskRunResult]] = {}
if not archive_dir.exists():
raise ValueError(f"Archive dir does not exist: {archive_dir}")
roots = _resolve_model_roots(archive_dir, model)
if not roots:
return {}
for root in roots:
for task_dir in sorted(child for child in root.iterdir() if child.is_dir()):
task_id = task_dir.name
if task_filter and task_id not in task_filter:
continue
if not _matches_tier(task_id, tier):
continue
runs = []
for run_file in sorted(task_dir.glob("run*.json")):
try:
run = TaskRunResult.model_validate_json(
run_file.read_text(encoding="utf-8")
)
except Exception:
continue
runs.append(run)
if runs:
task_runs.setdefault(task_id, []).extend(runs)
for task_id, runs in task_runs.items():
runs.sort(key=lambda run: run.run_index)
return task_runs
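# Illustrative sketch (not part of the module): the directory layout the
# loader above expects, one directory per task containing run*.json files,
# built in a temporary directory. The model and task names, the file names and
# the JSON payload are made up; real run files hold serialized TaskRunResult
# objects.

```python
import json
import tempfile
from pathlib import Path


def list_run_files(model_root):
    """Map each task directory name to its sorted run*.json file names."""
    found = {}
    for task_dir in sorted(p for p in model_root.iterdir() if p.is_dir()):
        runs = sorted(task_dir.glob("run*.json"))
        if runs:  # directories without run files are not task directories
            found[task_dir.name] = [r.name for r in runs]
    return found


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "example_model"
        for task, n_runs in (("t1-hello", 2), ("t2-refactor", 1)):
            task_dir = root / task
            task_dir.mkdir(parents=True)
            for i in range(n_runs):
                payload = json.dumps({"run_index": i})
                (task_dir / f"run{i}.json").write_text(payload, encoding="utf-8")
        (root / "notes").mkdir()  # ignored: contains no run files
        return list_run_files(root)


layout = demo()
```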
def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
if not arrays:
return np.array([]), np.array([])
max_len = max(len(arr) for arr in arrays)
if max_len == 0:
return np.array([]), np.array([])
mat = np.full((len(arrays), max_len), np.nan)
for idx, arr in enumerate(arrays):
mat[idx, :len(arr)] = arr
return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
def _round_list(values: np.ndarray, digits: int = 4) -> list[float]:
return [round(float(value), digits) for value in values.tolist()]
def _empty_sensitivity_summary() -> dict[str, object]:
return {
"n_pairs": 0,
"mean_score_delta": 0.0,
"mean_tool_edit_distance": 0.0,
"mean_family_js_divergence": 0.0,
"mean_lyapunov_proxy": 0.0,
"mean_initial_divergence": 0.0,
"mean_final_divergence": 0.0,
"mean_contraction_delta": 0.0,
"mean_contraction_ratio": 0.0,
"fraction_converging_pairs": 0.0,
"mean_divergence_curve": [],
"std_divergence_curve": [],
"pair_points": [],
}
def _summarize_sensitivity_group(pairs: list) -> dict[str, object]:
if not pairs:
return _empty_sensitivity_summary()
divergence_curves = [pair.embedding_divergence for pair in pairs if len(pair.embedding_divergence) > 0]
curve_mean, curve_std = _aligned_mean_std(divergence_curves)
pair_points = []
for pair in pairs:
if len(pair.embedding_divergence) > 0:
initial_divergence = float(pair.embedding_divergence[0])
final_divergence = float(pair.embedding_divergence[-1])
contraction_delta = final_divergence - initial_divergence
contraction_ratio = final_divergence / max(initial_divergence, 1e-6)
else:
initial_divergence = 0.0
final_divergence = 0.0
contraction_delta = 0.0
contraction_ratio = 0.0
pair_points.append(
{
"score_delta": round(float(pair.score_delta), 4),
"tool_edit_distance": int(pair.tool_edit_distance),
"family_js_divergence": round(float(pair.family_js_divergence), 4),
"lyapunov_proxy": round(float(pair.lyapunov_proxy), 4),
"initial_divergence": round(initial_divergence, 4),
"final_divergence": round(final_divergence, 4),
"contraction_delta": round(contraction_delta, 4),
"contraction_ratio": round(contraction_ratio, 4),
}
)
converging_pairs = sum(
1 for point in pair_points if point["final_divergence"] < point["initial_divergence"]
)
return {
"n_pairs": len(pairs),
"mean_score_delta": round(float(np.mean([pair.score_delta for pair in pairs])), 4),
"mean_tool_edit_distance": round(float(np.mean([pair.tool_edit_distance for pair in pairs])), 4),
"mean_family_js_divergence": round(float(np.mean([pair.family_js_divergence for pair in pairs])), 4),
"mean_lyapunov_proxy": round(float(np.mean([pair.lyapunov_proxy for pair in pairs])), 4),
"mean_initial_divergence": round(float(np.mean([point["initial_divergence"] for point in pair_points])), 4),
"mean_final_divergence": round(float(np.mean([point["final_divergence"] for point in pair_points])), 4),
"mean_contraction_delta": round(float(np.mean([point["contraction_delta"] for point in pair_points])), 4),
"mean_contraction_ratio": round(float(np.mean([point["contraction_ratio"] for point in pair_points])), 4),
"fraction_converging_pairs": round(converging_pairs / len(pair_points), 4),
"mean_divergence_curve": _round_list(curve_mean),
"std_divergence_curve": _round_list(curve_std),
"pair_points": pair_points,
}
def _build_sensitivity_sections(
valid_runs_by_task: dict[str, list[TaskRunResult]],
) -> tuple[list, dict[str, object]]:
same_task_pairs = []
per_task: dict[str, object] = {}
for task_id, runs in sorted(valid_runs_by_task.items()):
if len(runs) < 2:
continue
task_pairs = [
compute_sensitivity(run_a, run_b, task_id=task_id)
for run_a, run_b in combinations(runs, 2)
]
if task_pairs:
same_task_pairs.extend(task_pairs)
per_task[task_id] = _summarize_sensitivity_group(task_pairs)
same_task_summary = _summarize_sensitivity_group(same_task_pairs)
same_task_summary["per_task"] = per_task
perturbation_pairs = []
per_variant_group: dict[str, object] = {}
runs_by_variant_group: dict[str, list[TaskRunResult]] = {}
for runs in valid_runs_by_task.values():
for run in runs:
runs_by_variant_group.setdefault(run.variant_group or run.task_id, []).append(run)
for variant_group, runs in sorted(runs_by_variant_group.items()):
distinct_members = {
(run.task_id, run.prompt_variant, run.variant_id)
for run in runs
}
if len(distinct_members) < 2:
continue
group_pairs = []
for run_a, run_b in combinations(runs, 2):
if (
run_a.task_id == run_b.task_id
and run_a.prompt_variant == run_b.prompt_variant
and run_a.variant_id == run_b.variant_id
):
continue
group_pairs.append(compute_sensitivity(run_a, run_b, task_id=variant_group))
if not group_pairs:
continue
perturbation_pairs.extend(group_pairs)
group_summary = _summarize_sensitivity_group(group_pairs)
group_summary["members"] = [
{
"task_id": task_id,
"prompt_variant": prompt_variant,
"variant_id": variant_id,
}
for task_id, prompt_variant, variant_id in sorted(distinct_members)
]
per_variant_group[variant_group] = group_summary
perturbation_summary = _summarize_sensitivity_group(perturbation_pairs)
perturbation_summary["per_variant_group"] = per_variant_group
return same_task_pairs, {
"same_task": same_task_summary,
"prompt_perturbation": perturbation_summary,
}
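# Illustrative sketch (not part of the module): how the prompt-perturbation
# section pairs runs. Every pair inside a variant group is considered, and
# pairs whose (task_id, prompt_variant, variant_id) keys are identical are
# skipped because they are plain repeats. The tuple layout and labels are
# assumptions for this example.

```python
from itertools import combinations


def perturbation_pairs(runs):
    """runs: (task_id, prompt_variant, variant_id, label) tuples from one group."""
    return [
        (a[3], b[3])
        for a, b in combinations(runs, 2)
        if a[:3] != b[:3]  # identical keys are repeats, not perturbations
    ]


group = [
    ("t1-hello", "base", "v0", "run0"),
    ("t1-hello", "base", "v0", "run1"),
    ("t1-hello", "terse", "v1", "run2"),
]
pairs = perturbation_pairs(group)
```

# The two "base" runs are still compared with each other, but in the
# same-task section rather than in the perturbation section.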
def build_dynamics_report(
task_runs: dict[str, list[TaskRunResult]],
include_pca: bool = True,
) -> tuple[dict, Callable[..., list[Path]], dict[str, object]]:
"""Compute stratified dynamics report data from cached runs."""
all_runs = [run for runs in task_runs.values() for run in runs]
if not all_runs:
raise ValueError("No cached runs were loaded.")
dynamics_list = []
scores = []
valid_runs = []
for run in all_runs:
if not run.transcript.messages:
continue
dynamics_list.append(compute_dynamics(run.transcript))
scores.append(run.run_score)
valid_runs.append(run)
if not valid_runs:
raise ValueError("No runs with transcripts were found in the archive.")
valid_runs_by_task: dict[str, list[TaskRunResult]] = {}
for run in valid_runs:
valid_runs_by_task.setdefault(run.task_id, []).append(run)
same_task_sensitivities, sensitivity_summary = _build_sensitivity_sections(valid_runs_by_task)
stratifiers = {
"tier": stratify_by_tier,
"regime": stratify_by_regime,
"tool_mix": stratify_by_tool_mix,
"scenario": stratify_by_scenario,
}
report: dict[str, object] = {
"n_runs": len(valid_runs),
"n_tasks": len(task_runs),
"strata": {},
}
stratified = {}
for name, fn in stratifiers.items():
assessment = build_strata(
valid_runs,
dynamics_list,
scores,
fn,
name,
sensitivities=same_task_sensitivities,
)
stratified[name] = assessment
strata_summary = []
for stratum in assessment.strata:
strata_summary.append(
{
"name": stratum.name,
"n_runs": stratum.n_runs,
"weight": round(stratum.weight, 4),
"score_mean": round(stratum.score_mean, 4),
"score_std": round(stratum.score_std, 4),
"score_quantiles": {
key: round(value, 4)
for key, value in stratum.score_quantiles.items()
},
"entropy_mean": round(float(stratum.entropy_dist.mean()), 4)
if len(stratum.entropy_dist)
else 0.0,
"error_rate_mean": round(float(stratum.error_rate_dist.mean()), 4)
if len(stratum.error_rate_dist)
else 0.0,
"constraint_mean": round(float(stratum.constraint_dist.mean()), 4)
if len(stratum.constraint_dist)
else 0.0,
"memory_depth_mean": round(float(stratum.memory_depth_dist.mean()), 4)
if len(stratum.memory_depth_dist)
else 0.0,
"sensitivity_pairs": int(len(stratum.sensitivity_deltas)),
"sensitivity_mean_score_delta": round(float(stratum.sensitivity_deltas.mean()), 4)
if len(stratum.sensitivity_deltas)
else 0.0,
"regime_counts": stratum.regime_counts,
}
)
report["strata"][name] = {
"observed_mean_score": round(assessment.observed_mean_score, 4),
"observed_std_score": round(assessment.observed_std_score, 4),
"strata": strata_summary,
}
report["per_run"] = [
{
"task_id": run.task_id,
"run_index": run.run_index,
"score": round(run.run_score, 4),
"regime": dynamics.regime.value,
"entropy": round(dynamics.tool_entropy, 4),
"error_rate": round(dynamics.error_rate, 4),
"constraint_index": round(dynamics.constraint_index, 4),
"memory_depth": round(dynamics.memory_depth, 4),
"n_steps": dynamics.n_steps,
"mean_drift": round(dynamics.mean_drift, 4),
"mean_step_size": round(dynamics.mean_step_size, 4),
}
for run, dynamics in zip(valid_runs, dynamics_list)
]
report["sensitivity"] = sensitivity_summary
if include_pca:
compute_pca_bundle(dynamics_list)
events = []
censored = []
for run in valid_runs:
step = find_event_step(run.transcript, "first_correct_write")
if step is not None:
events.append(step)
censored.append(False)
else:
events.append(float(len(run.transcript.assistant_messages)))
censored.append(True)
km_points = kaplan_meier(events, censored)
return report, generate_all_plots, {
"valid_runs": valid_runs,
"dynamics_list": dynamics_list,
"stratified": stratified,
"km_points": km_points,
"sensitivity": sensitivity_summary,
}
def write_dynamics_report(
task_runs: dict[str, list[TaskRunResult]],
out_dir: Path,
report_name: str = "dynamics.json",
generate_plots: bool = True,
) -> tuple[Path, list[Path]]:
"""Write the dynamics report JSON and plots to an output directory."""
report, plotter, plot_data = build_dynamics_report(task_runs, include_pca=generate_plots)
out_dir.mkdir(parents=True, exist_ok=True)
report_path = out_dir / report_name
report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
plots: list[Path] = []
if generate_plots:
plots = plotter(
plot_data["dynamics_list"],
plot_data["valid_runs"],
plot_data["stratified"],
km_points=plot_data["km_points"],
event_name="first_correct_write",
out_dir=out_dir,
sensitivity_summary=plot_data["sensitivity"],
)
return report_path, plots
def load_task_runs_by_model(
archive_dir: Path,
tier: str | None = None,
task_ids: Iterable[str] | None = None,
) -> dict[str, dict[str, list[TaskRunResult]]]:
"""Load cached TaskRunResult objects grouped by model directory name."""
grouped: dict[str, dict[str, list[TaskRunResult]]] = {}
for model_name, model_dir in discover_model_roots(archive_dir).items():
task_runs = load_task_runs_archive(
archive_dir=model_dir,
model=None,
task_ids=task_ids,
tier=tier,
)
if task_runs:
grouped[model_name] = task_runs
return grouped

411
clawbench/dynamics_plots.py Normal file

@ -0,0 +1,411 @@
"""Plotting utilities for dynamics analysis.
Generates publication-ready figures from dynamics data and saves to a
results directory. All plots use matplotlib with the Agg backend so they
work headlessly.
"""
from __future__ import annotations
from pathlib import Path
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import numpy as np
from clawbench.dynamics import (
Dynamics,
StratifiedAssessment,
StratumStats,
SurvivalPoint,
)
def _savefig(fig: plt.Figure, path: Path) -> None:
fig.savefig(path, dpi=150, bbox_inches="tight")
plt.close(fig)
def _plot_series_curves(
dynamics_list: list[Dynamics],
labels: list[str],
out_path: Path,
*,
series_attr: str,
ylabel: str,
title: str,
) -> None:
"""Plot a step-aligned per-run series coloured by label."""
fig, ax = plt.subplots(figsize=(10, 5))
cmap = plt.cm.tab10
unique = sorted(set(labels))
colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
for d, lbl in zip(dynamics_list, labels):
series = np.asarray(getattr(d, series_attr), dtype=float)
if len(series) < 2:
continue
ax.plot(series, alpha=0.6, color=colour_map[lbl], linewidth=1)
for lbl in unique:
ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
ax.legend(fontsize=8, loc="upper left")
ax.set_xlabel("Step")
ax.set_ylabel(ylabel)
ax.set_title(title)
_savefig(fig, out_path)
def plot_drift_curves(
dynamics_list: list[Dynamics],
labels: list[str],
out_path: Path,
) -> None:
"""Drift-from-origin curves coloured by label (e.g. task_id or regime)."""
_plot_series_curves(
dynamics_list,
labels,
out_path,
series_attr="drift",
ylabel="Cosine distance from step 0",
title="Drift from Origin",
)
def plot_step_size_curves(
dynamics_list: list[Dynamics],
labels: list[str],
out_path: Path,
) -> None:
"""Step-to-step movement curves coloured by label."""
_plot_series_curves(
dynamics_list,
labels,
out_path,
series_attr="step_size",
ylabel="Cosine distance from previous step",
title="Step-to-Step Movement",
)
def plot_pca_trajectories(
dynamics_list: list[Dynamics],
labels: list[str],
out_path: Path,
) -> None:
"""PCA phase portraits (PC1 vs PC2) coloured by label."""
fig, ax = plt.subplots(figsize=(8, 8))
cmap = plt.cm.tab10
unique = sorted(set(labels))
colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
for d, lbl in zip(dynamics_list, labels):
if d.pca_trajectory is None or len(d.pca_trajectory) < 2:
continue
traj = d.pca_trajectory
ax.plot(traj[:, 0], traj[:, 1], alpha=0.5, color=colour_map[lbl], linewidth=1)
ax.scatter(traj[0, 0], traj[0, 1], color=colour_map[lbl], marker="o", s=30, zorder=5)
ax.scatter(traj[-1, 0], traj[-1, 1], color=colour_map[lbl], marker="x", s=30, zorder=5)
for lbl in unique:
ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
ax.legend(fontsize=8)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("PCA Phase Portrait (o=start, x=end)")
_savefig(fig, out_path)
def plot_regime_distribution(
strata: list[StratumStats],
stratifier_name: str,
out_path: Path,
) -> None:
"""Stacked bar chart of regime counts per stratum."""
fig, ax = plt.subplots(figsize=(10, 5))
all_regimes = sorted({r for s in strata for r in s.regime_counts})
x = np.arange(len(strata))
bottom = np.zeros(len(strata))
cmap = plt.cm.Set2
for j, regime in enumerate(all_regimes):
counts = [s.regime_counts.get(regime, 0) for s in strata]
ax.bar(x, counts, bottom=bottom, label=regime, color=cmap(j / max(len(all_regimes) - 1, 1)))
bottom += np.array(counts)
ax.set_xticks(x)
ax.set_xticklabels([s.name for s in strata], rotation=30, ha="right")
ax.set_ylabel("Count")
ax.set_title(f"Regime Distribution by {stratifier_name}")
ax.legend(fontsize=8)
_savefig(fig, out_path)
def plot_score_distributions(
strata: list[StratumStats],
stratifier_name: str,
out_path: Path,
) -> None:
"""Box plots of score distributions per stratum."""
fig, ax = plt.subplots(figsize=(10, 5))
data = [s.scores for s in strata if len(s.scores) > 0]
labels = [s.name for s in strata if len(s.scores) > 0]
if data:
ax.boxplot(data, labels=labels, patch_artist=True,
boxprops=dict(facecolor="lightblue", alpha=0.7))
ax.set_ylabel("Score")
ax.set_title(f"Score Distribution by {stratifier_name}")
plt.xticks(rotation=30, ha="right")
_savefig(fig, out_path)
def plot_survival_curve(
km_points: list[SurvivalPoint],
event_name: str,
out_path: Path,
) -> None:
"""Kaplan-Meier survival curve."""
if not km_points:
return
fig, ax = plt.subplots(figsize=(8, 5))
times = [p.time for p in km_points]
surv = [p.survival for p in km_points]
ax.step(times, surv, where="post", linewidth=2, color="steelblue")
ax.fill_between(times, surv, step="post", alpha=0.15, color="steelblue")
ax.set_xlabel("Step")
ax.set_ylabel("Survival probability")
ax.set_title(f"Kaplan-Meier: {event_name}")
ax.set_ylim(-0.05, 1.05)
_savefig(fig, out_path)
def plot_stratum_dynamics_heatmap(
strata: list[StratumStats],
stratifier_name: str,
out_path: Path,
) -> None:
"""Heatmap of mean dynamics metrics across strata."""
metrics = ["entropy", "error_rate", "constraint", "memory_depth", "mean_drift", "mean_step_size"]
data = np.zeros((len(strata), len(metrics)))
for i, s in enumerate(strata):
arrays = [s.entropy_dist, s.error_rate_dist, s.constraint_dist,
s.memory_depth_dist, s.mean_drift_dist, s.mean_step_size_dist]
for j, arr in enumerate(arrays):
data[i, j] = float(np.mean(arr)) if len(arr) > 0 else 0.0
fig, ax = plt.subplots(figsize=(10, max(3, len(strata) * 0.6)))
im = ax.imshow(data, aspect="auto", cmap="YlOrRd")
ax.set_xticks(range(len(metrics)))
ax.set_xticklabels(metrics, rotation=30, ha="right")
ax.set_yticks(range(len(strata)))
ax.set_yticklabels([s.name for s in strata])
for i in range(len(strata)):
for j in range(len(metrics)):
ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", fontsize=8)
fig.colorbar(im, ax=ax, shrink=0.8)
ax.set_title(f"Dynamics Metrics by {stratifier_name}")
_savefig(fig, out_path)
def plot_pairwise_divergence_curves(
per_task_sensitivity: dict[str, dict],
out_path: Path,
) -> bool:
"""Plot mean pairwise trajectory divergence over aligned steps."""
if not per_task_sensitivity:
return False
fig, ax = plt.subplots(figsize=(10, 5))
cmap = plt.cm.tab10
tasks = sorted(per_task_sensitivity)
colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
plotted = False
for task in tasks:
summary = per_task_sensitivity[task]
mean_curve = np.asarray(summary.get("mean_divergence_curve", []), dtype=float)
std_curve = np.asarray(summary.get("std_divergence_curve", []), dtype=float)
if len(mean_curve) == 0:
continue
steps = np.arange(len(mean_curve))
ax.plot(steps, mean_curve, linewidth=2, color=colour_map[task], label=task)
if len(std_curve) == len(mean_curve):
ax.fill_between(steps, mean_curve - std_curve, mean_curve + std_curve, color=colour_map[task], alpha=0.12)
plotted = True
if not plotted:
plt.close(fig)
return False
ax.set_xlabel("Aligned step")
ax.set_ylabel("Pairwise embedding divergence")
ax.set_title("Do Repeated Trajectories Converge or Diverge?")
ax.legend(fontsize=8)
_savefig(fig, out_path)
return True
def plot_pairwise_contraction_scatter(
per_task_sensitivity: dict[str, dict],
out_path: Path,
) -> bool:
"""Scatter initial vs final pairwise divergence; below diagonal means convergence."""
if not per_task_sensitivity:
return False
fig, ax = plt.subplots(figsize=(7, 6))
cmap = plt.cm.tab10
tasks = sorted(per_task_sensitivity)
colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
max_seen = 0.0
plotted = False
for task in tasks:
points = per_task_sensitivity[task].get("pair_points", [])
if not points:
continue
xs = [point["initial_divergence"] for point in points]
ys = [point["final_divergence"] for point in points]
max_seen = max(max_seen, *(xs + ys))
ax.scatter(xs, ys, s=60, alpha=0.8, color=colour_map[task], label=task)
plotted = True
if not plotted:
plt.close(fig)
return False
limit = max(max_seen, 0.1)
ax.plot([0, limit], [0, limit], linestyle="--", color="black", linewidth=1)
ax.set_xlabel("Initial pairwise divergence")
ax.set_ylabel("Final pairwise divergence")
ax.set_title("Pairwise Trajectory Contraction")
ax.legend(fontsize=8)
_savefig(fig, out_path)
return True
def plot_sensitivity_heatmap(
per_task_sensitivity: dict[str, dict],
out_path: Path,
) -> bool:
"""Heatmap of per-task sensitivity metrics."""
if not per_task_sensitivity:
return False
metrics = [
("mean_score_delta", "score_delta"),
("mean_tool_edit_distance", "tool_edit"),
("mean_family_js_divergence", "js_div"),
("mean_lyapunov_proxy", "lyapunov"),
("fraction_converging_pairs", "frac_converging"),
]
tasks = sorted(per_task_sensitivity)
data = np.zeros((len(tasks), len(metrics)))
for row_idx, task in enumerate(tasks):
summary = per_task_sensitivity[task]
for col_idx, (key, _label) in enumerate(metrics):
data[row_idx, col_idx] = float(summary.get(key, 0.0))
fig, ax = plt.subplots(figsize=(9, max(3, len(tasks) * 0.7)))
im = ax.imshow(data, aspect="auto", cmap="Blues")
ax.set_xticks(range(len(metrics)))
ax.set_xticklabels([label for _key, label in metrics], rotation=30, ha="right")
ax.set_yticks(range(len(tasks)))
ax.set_yticklabels(tasks)
for row_idx in range(len(tasks)):
for col_idx in range(len(metrics)):
ax.text(col_idx, row_idx, f"{data[row_idx, col_idx]:.2f}", ha="center", va="center", fontsize=8)
fig.colorbar(im, ax=ax, shrink=0.8)
ax.set_title("Pairwise Sensitivity by Task")
_savefig(fig, out_path)
return True
def generate_all_plots(
dynamics_list: list[Dynamics],
runs: list,
stratified: dict[str, StratifiedAssessment],
km_points: list[SurvivalPoint] | None = None,
event_name: str = "first_correct_write",
out_dir: Path = Path("results"),
sensitivity_summary: dict[str, dict] | None = None,
) -> list[Path]:
"""Generate all dynamics plots and return list of saved paths."""
out_dir.mkdir(parents=True, exist_ok=True)
saved: list[Path] = []
# Labels by regime
regime_labels = [d.regime.value for d in dynamics_list]
tier_labels = []
for r in runs:
tid = r.task_id.lower()
tier = "unknown"
for i in range(1, 6):
if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
tier = f"tier{i}"
break
tier_labels.append(tier)
# Drift curves by regime
p = out_dir / "drift_by_regime.png"
plot_drift_curves(dynamics_list, regime_labels, p)
saved.append(p)
# Drift curves by tier
p = out_dir / "drift_by_tier.png"
plot_drift_curves(dynamics_list, tier_labels, p)
saved.append(p)
p = out_dir / "step_size_by_regime.png"
plot_step_size_curves(dynamics_list, regime_labels, p)
saved.append(p)
p = out_dir / "step_size_by_tier.png"
plot_step_size_curves(dynamics_list, tier_labels, p)
saved.append(p)
# PCA trajectories
has_pca = any(d.pca_trajectory is not None for d in dynamics_list)
if has_pca:
p = out_dir / "pca_by_regime.png"
plot_pca_trajectories(dynamics_list, regime_labels, p)
saved.append(p)
p = out_dir / "pca_by_tier.png"
plot_pca_trajectories(dynamics_list, tier_labels, p)
saved.append(p)
# Per-stratifier plots
for name, sa in stratified.items():
p = out_dir / f"regimes_by_{name}.png"
plot_regime_distribution(sa.strata, name, p)
saved.append(p)
p = out_dir / f"scores_by_{name}.png"
plot_score_distributions(sa.strata, name, p)
saved.append(p)
p = out_dir / f"dynamics_heatmap_{name}.png"
plot_stratum_dynamics_heatmap(sa.strata, name, p)
saved.append(p)
# Survival curve
if km_points:
p = out_dir / f"survival_{event_name}.png"
plot_survival_curve(km_points, event_name, p)
saved.append(p)
per_task_sensitivity = (sensitivity_summary or {}).get("same_task", {}).get("per_task", {})
p = out_dir / "pairwise_divergence_by_task.png"
if plot_pairwise_divergence_curves(per_task_sensitivity, p):
saved.append(p)
p = out_dir / "pairwise_contraction_scatter.png"
if plot_pairwise_contraction_scatter(per_task_sensitivity, p):
saved.append(p)
p = out_dir / "sensitivity_heatmap.png"
if plot_sensitivity_heatmap(per_task_sensitivity, p):
saved.append(p)
return saved
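`plot_survival_curve` only consumes points that expose `.time` and `.survival`; how those points are produced is not shown in this part of the diff. The following is a minimal sketch of a Kaplan-Meier estimator that could feed it, assuming each run contributes one duration (in steps) and a flag saying whether the event was observed or the run was censored. The `SurvivalPoint` class here is a hypothetical stand-in, and `kaplan_meier` is an illustrative name, not the project's function.

```python
from dataclasses import dataclass


@dataclass
class SurvivalPoint:
    """Hypothetical stand-in: only .time and .survival are assumed."""
    time: int
    survival: float


def kaplan_meier(durations: list[int], observed: list[bool]) -> list[SurvivalPoint]:
    """Product-limit estimate S(t) = prod over event times (1 - d_i / n_i)."""
    points = [SurvivalPoint(0, 1.0)]
    survival = 1.0
    # Only times at which at least one event was observed change the curve.
    event_times = sorted({d for d, seen in zip(durations, observed) if seen})
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)  # n_i
        events = sum(1 for d, seen in zip(durations, observed) if d == t and seen)  # d_i
        survival *= 1.0 - events / at_risk
        points.append(SurvivalPoint(t, survival))
    return points


# Four toy runs; the third one is censored at step 3.
curve = kaplan_meier([2, 3, 3, 5], [True, True, False, True])
for point in curve:
    print(point.time, round(point.survival, 3))
```

Censored runs still count toward the at-risk set up to their last step, which is why the estimate differs from a plain fraction of surviving runs.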


@ -103,6 +103,7 @@ class BenchmarkHarness:
self.concurrency = max(1, int(concurrency))
self.browser_concurrency = max(1, int(browser_concurrency))
self.repo_root = Path(__file__).parent.parent
self.last_task_runs: dict[str, list[TaskRunResult]] = {}
async def run(self) -> BenchmarkResult:
tasks = load_all_tasks(
@ -148,6 +149,7 @@ class BenchmarkHarness:
f"({mean_run:.1f}s avg, concurrency={self.concurrency})[/dim]"
)
self.last_task_runs = all_results
return self._aggregate(tasks, all_results)
async def _execute_runs(


@ -0,0 +1,147 @@
"""Preset model catalog and selection helpers for the Space submit UI."""
from __future__ import annotations
from dataclasses import dataclass
CUSTOM_PRESET_LABEL = "(custom)"
PRESET_AUDIENCE_ALL = "All Presets"
PRESET_AUDIENCE_CLAW = "Claw Users"
PRESET_AUDIENCE_BUDGET = "Budget Researchers"
PRESET_AUDIENCE_CHOICES = (
PRESET_AUDIENCE_ALL,
PRESET_AUDIENCE_CLAW,
PRESET_AUDIENCE_BUDGET,
)
@dataclass(frozen=True)
class PresetModel:
    label: str
    model_id: str
    provider: str
    audiences: tuple[str, ...]
PRESET_MODELS = (
PresetModel(
label="GPT-OSS 20B (Ollama)",
model_id="ollama/gpt-oss:20b",
provider="ollama",
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
),
PresetModel(
label="Qwen 3.5 27B (Ollama)",
model_id="ollama/qwen3.5:27b",
provider="ollama",
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
),
PresetModel(
label="Qwen3 32B",
model_id="huggingface/Qwen/Qwen3-32B",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
),
PresetModel(
label="Gemma 4 26B MoE",
model_id="huggingface/google/gemma-4-26B-A4B-it",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
),
PresetModel(
label="GLM 5.1 (754B MoE)",
model_id="huggingface/zai-org/GLM-5.1",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="GLM 5 (400B MoE)",
model_id="huggingface/zai-org/GLM-5",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="DeepSeek R1",
model_id="huggingface/deepseek-ai/DeepSeek-R1",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="Kimi K2 Instruct",
model_id="huggingface/moonshotai/Kimi-K2-Instruct",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="MiniMax M2.5",
model_id="huggingface/MiniMaxAI/MiniMax-M2.5",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="Llama 3.3 70B",
model_id="huggingface/meta-llama/Llama-3.3-70B-Instruct",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="Llama 3.1 70B",
model_id="huggingface/meta-llama/Llama-3.1-70B-Instruct",
provider="huggingface",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="Claude Sonnet 4.6",
model_id="anthropic/claude-sonnet-4-6",
provider="anthropic",
audiences=(PRESET_AUDIENCE_CLAW,),
),
PresetModel(
label="Claude Opus 4.6",
model_id="anthropic/claude-opus-4-6",
provider="anthropic",
audiences=(PRESET_AUDIENCE_CLAW,),
),
)
PRESET_MODEL_MAP = {preset.label: preset.model_id for preset in PRESET_MODELS}
_PRESET_BY_LABEL = {preset.label: preset for preset in PRESET_MODELS}
def infer_provider(model_id: str) -> str:
    # The provider is the prefix before the first "/" (e.g. "ollama/...").
    normalized = model_id.strip()
    if not normalized or "/" not in normalized:
        return ""
    return normalized.split("/", 1)[0].strip().lower()


def preset_models_for_audience(audience: str | None) -> list[PresetModel]:
    if not audience or audience == PRESET_AUDIENCE_ALL:
        return list(PRESET_MODELS)
    return [preset for preset in PRESET_MODELS if audience in preset.audiences]


def preset_labels_for_audience(audience: str | None) -> list[str]:
    return [preset.label for preset in preset_models_for_audience(audience)]


def resolve_model_selection(
    model: str,
    preset_label: str,
    provider: str = "",
) -> tuple[str, str]:
    selected_model = model.strip()
    selected_provider = provider.strip()
    # A known preset overrides the typed model; an explicit provider is kept.
    preset = _PRESET_BY_LABEL.get(preset_label)
    if preset is not None:
        selected_model = preset.model_id
        if not selected_provider:
            selected_provider = preset.provider
    if not selected_provider:
        selected_provider = infer_provider(selected_model)
    return selected_model, selected_provider
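The precedence in `resolve_model_selection` is easy to misread, so here is a small self-contained sketch of it. It copies two entries from the catalog above and re-implements the same logic under the shorter, illustrative names `resolve` and `labels_for`; the model ID `example-org/Some-Model` is hypothetical and exists only to exercise the custom path.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PresetModel:
    label: str
    model_id: str
    provider: str
    audiences: tuple[str, ...]


# Two entries copied from the catalog; the remaining presets are omitted.
PRESETS = (
    PresetModel("Qwen3 32B", "huggingface/Qwen/Qwen3-32B", "huggingface",
                ("Claw Users", "Budget Researchers")),
    PresetModel("DeepSeek R1", "huggingface/deepseek-ai/DeepSeek-R1", "huggingface",
                ("Claw Users",)),
)
BY_LABEL = {preset.label: preset for preset in PRESETS}


def infer_provider(model_id: str) -> str:
    """Use the text before the first '/' as the provider, lowercased."""
    normalized = model_id.strip()
    if not normalized or "/" not in normalized:
        return ""
    return normalized.split("/", 1)[0].strip().lower()


def resolve(model: str, preset_label: str, provider: str = "") -> tuple[str, str]:
    """A known preset replaces the typed model; an explicit provider is kept."""
    selected_model = model.strip()
    selected_provider = provider.strip()
    preset = BY_LABEL.get(preset_label)
    if preset is not None:
        selected_model = preset.model_id
        if not selected_provider:
            selected_provider = preset.provider
    if not selected_provider:
        selected_provider = infer_provider(selected_model)
    return selected_model, selected_provider


def labels_for(audience: str) -> list[str]:
    """Labels of the presets tagged with the given audience."""
    return [preset.label for preset in PRESETS if audience in preset.audiences]


print(resolve("typed-but-ignored", "DeepSeek R1"))
print(resolve(" example-org/Some-Model ", "(custom)"))
print(labels_for("Budget Researchers"))
```

Note that an unknown label such as `(custom)` simply misses the lookup, so the typed model is used and the provider falls back to the prefix before the first slash.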


@ -1,140 +1,112 @@
"""Classify each archived run's dynamical regime from its turn trajectory.
#!/usr/bin/env python3
"""Classify posterior run trajectories into dynamical regimes.
Following "When LLMs Are Dreaming..." §What We Expect to See:
We embed each assistant turn using bag-of-words text plus tool-call summaries,
then compute simple geometric proxies:
TRAPPED/ATTRACTOR: low support (Vol_log), high recurrence, high BOPS.
Agent converged to a point; may be good (solved it)
or bad (got stuck in a loop on a single idea).
drift_mean = mean ||x_t - x_{t-1}||
from_start = max ||x_t - x_0||
recurrence = max cosine(x_i, x_j) for non-adjacent turns
vol_log = log det(Sigma + eps I)
LIMIT-CYCLE: high recurrence + bounded drift + quasi-periodic revisits.
Agent loops between a few states.
DIFFUSIVE/WANDERING: growing support, rising drift, low recurrence.
Agent explores without converging; often "goal drift".
SENSITIVE: (requires paraphrased-pair runs; skipped here.)
TOO-SHORT: trajectory < 3 assistant turns; can't classify dynamics.
We work in a TF-IDF bag-of-words embedding space (same vocab as C(q)),
with each turn's state vector = its assistant text + tool-call args.
Metrics per run:
- drift_mean: mean ||e_t - e_{t-1}|| across turns
- from_start: max ||e_t - e_0|| (farthest the run drifted from origin)
- recurrence: max_{i<j, j-i>=2} cos(e_i, e_j), the best return-after-gap match
- vol_log: log det(Σ + εI) over turn states, a support-volume proxy
Classifier rules (tuned empirically on the distribution):
if n_turns < 3 -> too_short
elif drift_mean < 0.15 and vol_log < -6 -> trapped
elif recurrence > 0.80 and drift_mean < 0.25 -> limit_cycle
elif drift_mean > 0.35 and vol_log > -3 -> diffusive
else -> mixed
Output: reports/regimes.json with per-run classification.
Usage:
.venv/bin/python3 scripts/classify_regimes.py
Runs are then bucketed into coarse regimes such as trapped, limit_cycle, and
diffusive using quartile-based thresholds estimated from the observed archive.
"""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
import numpy as np
ROOT = Path(__file__).resolve().parent.parent
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
MODELS = [
"anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
"anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
"google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
"openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
"openrouter_qwen_qwen3.6-plus",
]
from clawbench.dynamics_archive import load_task_runs_by_model
WORD_RE = re.compile(r"[a-z]{3,}")
STOPWORDS = set("the and that with this have from what your will can but not "
"was will are been one would there been they will their has "
"had its were only some than about these which into also each "
"when where them how who them very much more most other then "
"here such does like just make many like want need take".split())
STOPWORDS = set(
"the and that with this have from what your will can but not "
"was are been one would there they their has had its were only some "
"than about these which into also each when where them how who very "
"much more most other then here such does like just make many want need take".split()
)
def tokenize(text: str) -> list[str]:
return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]
def build_vocab(all_turn_texts: list[str], top_k: int = 500) -> dict[str, int]:
c = Counter()
for t in all_turn_texts:
c.update(set(tokenize(t)))
return {w: i for i, (w, _) in enumerate(c.most_common(top_k))}
def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
counter = Counter()
for text in texts:
counter.update(set(tokenize(text)))
return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
v = np.zeros(len(vocab), dtype=np.float32)
for w, c in Counter(tokenize(text)).items():
if w in vocab:
v[vocab[w]] = c
n = np.linalg.norm(v)
return v / n if n > 0 else v
vec = np.zeros(len(vocab), dtype=np.float32)
for word, cnt in Counter(tokenize(text)).items():
if word in vocab:
vec[vocab[word]] = cnt
norm = np.linalg.norm(vec)
return vec / norm if norm > 0 else vec
def turn_texts(run_data: dict) -> list[str]:
"""Extract one text string per assistant turn (text + tool-call summary)."""
def turn_texts(run, fallback_any_message: bool = False) -> list[str]:
source = run.transcript.messages if fallback_any_message else run.transcript.assistant_messages
out = []
for m in run_data.get("transcript", {}).get("messages", []):
if m.get("role") != "assistant":
continue
for msg in source:
parts = []
if m.get("text"):
parts.append(m["text"])
for tc in (m.get("tool_calls") or []):
name = tc.get("name", "")
args_str = json.dumps(tc.get("arguments", {}))[:200]
parts.append(f"{name} {args_str}")
if msg.text:
parts.append(msg.text)
for tc in msg.tool_calls:
parts.append(tc.name)
if tc.input:
parts.append(json.dumps(tc.input, sort_keys=True)[:200])
if parts:
out.append(" ".join(parts))
return out
def trajectory_metrics(vecs: np.ndarray) -> dict:
"""Compute dynamical metrics over a (n_turns, d) trajectory matrix."""
def trajectory_metrics(vecs: np.ndarray) -> dict[str, float]:
"""Compute drift, recurrence, and support-volume proxies for one run."""
n = vecs.shape[0]
if n < 2:
return {"n_turns": n, "drift_mean": 0.0, "from_start": 0.0,
"recurrence": 0.0, "vol_log": -12.0}
# Drift: consecutive distances
return {
"n_turns": float(n),
"drift_mean": 0.0,
"from_start": 0.0,
"recurrence": 0.0,
"vol_log": -12.0,
}
diffs = np.linalg.norm(np.diff(vecs, axis=0), axis=1)
drift_mean = float(diffs.mean())
# From start: max distance from turn 0
dists_from_0 = np.linalg.norm(vecs - vecs[0:1], axis=1)
from_start = float(dists_from_0.max())
# Recurrence: best non-adjacent cosine similarity (ignoring immediate neighbors)
from_start = float(np.linalg.norm(vecs - vecs[0:1], axis=1).max())
recurrence = 0.0
for i in range(n):
for j in range(i + 2, n):
ni, nj = np.linalg.norm(vecs[i]), np.linalg.norm(vecs[j])
ni = np.linalg.norm(vecs[i])
nj = np.linalg.norm(vecs[j])
if ni > 0 and nj > 0:
c = float(vecs[i] @ vecs[j] / (ni * nj))
if c > recurrence:
recurrence = c
# Vol_log: log det of turn-state covariance
sim = float(vecs[i] @ vecs[j] / (ni * nj))
recurrence = max(recurrence, sim)
if n >= 3:
Sigma = np.cov(vecs.T)
# Use log|Σ + εI|; since d is large (500) we take eigenvalues + clip
eigs = np.linalg.eigvalsh(Sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
sigma = np.cov(vecs.T)
eigs = np.linalg.eigvalsh(sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
vol_log = float(np.log(np.clip(eigs, 1e-12, None)).sum())
else:
vol_log = -12.0
return {
"n_turns": n,
"n_turns": float(n),
"drift_mean": drift_mean,
"from_start": from_start,
"recurrence": recurrence,
@ -142,109 +114,105 @@ def trajectory_metrics(vecs: np.ndarray) -> dict:
}
def classify(m: dict, thresholds: dict) -> str:
"""Classify based on quartile thresholds of the actual distribution.
Thresholds (set empirically from observed distribution):
drift_low = p25 drift_hi = p75
vol_low = p25 vol_hi = p75
rec_hi = p75
Rules (priority order):
n_turns < 3 -> too_short
drift < drift_low AND vol < vol_low -> trapped
rec > rec_hi AND drift < median -> limit_cycle
drift > drift_hi AND vol > vol_hi -> diffusive
else -> mixed
"""
n = m["n_turns"]
if n < 3:
def classify(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
"""Map trajectory metrics to a coarse regime label."""
n_turns = int(metrics["n_turns"])
if n_turns < 3:
return "too_short"
d = m["drift_mean"]
rec = m["recurrence"]
vol = m["vol_log"]
if d < thresholds["drift_low"] and vol < thresholds["vol_low"]:
drift = metrics["drift_mean"]
recurrence = metrics["recurrence"]
vol = metrics["vol_log"]
if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
return "trapped"
if rec > thresholds["rec_hi"] and d < thresholds["drift_med"]:
if recurrence > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
return "limit_cycle"
if d > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
return "diffusive"
return "mixed"
def main() -> None:
# First pass: collect turn texts to build vocab
parser = argparse.ArgumentParser(description="Classify cached run regimes")
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
args = parser.parse_args()
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
if not grouped:
raise SystemExit(f"No cached runs found under {args.archive_dir}")
all_turn_texts: list[str] = []
run_turns: dict[tuple, list[str]] = {}
for model in MODELS:
for rf in (ARCH / model).rglob("run*.json"):
try:
d = json.loads(rf.read_text())
except Exception:
continue
task = rf.parent.name
run_idx = int(re.match(r"run(\d+)", rf.stem).group(1))
ts = turn_texts(d)
run_turns[(model, task, run_idx)] = ts
all_turn_texts.extend(ts)
run_turns: dict[str, list[str]] = {}
for model_name, task_runs in grouped.items():
for task_id, runs in task_runs.items():
for run in runs:
ts = turn_texts(run, fallback_any_message=False)
key = f"{model_name}/{task_id}/run{run.run_index}"
run_turns[key] = ts
all_turn_texts.extend(ts)
used_fallback_messages = False
if not all_turn_texts:
used_fallback_messages = True
all_turn_texts = []
run_turns = {}
for model_name, task_runs in grouped.items():
for task_id, runs in task_runs.items():
for run in runs:
ts = turn_texts(run, fallback_any_message=True)
key = f"{model_name}/{task_id}/run{run.run_index}"
run_turns[key] = ts
all_turn_texts.extend(ts)
if not all_turn_texts:
raise SystemExit("No usable turn text found in archive.")
vocab = build_vocab(all_turn_texts, top_k=500)
print(f"Runs collected: {len(run_turns)} vocab size: {len(vocab)}")
# Second pass: vectorize + compute metrics
per_run: dict[str, dict] = {}
per_run: dict[str, dict[str, float | str]] = {}
for key, ts in run_turns.items():
model, task, run_idx = key
if not ts:
continue
vecs = np.stack([vectorize(t, vocab) for t in ts])
m = trajectory_metrics(vecs)
per_run[f"{model}/{task}/run{run_idx}"] = m
vecs = np.stack([vectorize(text, vocab) for text in ts])
per_run[key] = trajectory_metrics(vecs)
# Derive thresholds from actual distribution of n_turns>=3 runs
drifts = np.array([v["drift_mean"] for v in per_run.values() if v["n_turns"] >= 3])
recs = np.array([v["recurrence"] for v in per_run.values() if v["n_turns"] >= 3])
vols = np.array([v["vol_log"] for v in per_run.values() if v["n_turns"] >= 3])
thresholds = {
"drift_low": float(np.percentile(drifts, 25)),
"drift_med": float(np.percentile(drifts, 50)),
"drift_hi": float(np.percentile(drifts, 75)),
"vol_low": float(np.percentile(vols, 25)),
"vol_hi": float(np.percentile(vols, 75)),
"rec_hi": float(np.percentile(recs, 75)),
}
print(f"\nThresholds (quartile-based from observed distribution):")
for k, v in thresholds.items():
print(f" {k:<12} {v:>10.3f}")
eligible = [r for r in per_run.values() if int(r["n_turns"]) >= 3]
if eligible:
drifts = np.array([float(v["drift_mean"]) for v in eligible])
recs = np.array([float(v["recurrence"]) for v in eligible])
vols = np.array([float(v["vol_log"]) for v in eligible])
thresholds = {
"drift_low": float(np.percentile(drifts, 25)),
"drift_med": float(np.percentile(drifts, 50)),
"drift_hi": float(np.percentile(drifts, 75)),
"vol_low": float(np.percentile(vols, 25)),
"vol_hi": float(np.percentile(vols, 75)),
"rec_hi": float(np.percentile(recs, 75)),
}
else:
thresholds = {
"drift_low": 0.15,
"drift_med": 0.25,
"drift_hi": 0.35,
"vol_low": -6.0,
"vol_hi": -3.0,
"rec_hi": 0.8,
}
# Apply classifier with thresholds
for key in per_run:
per_run[key]["regime"] = classify(per_run[key], thresholds)
for key, metrics in per_run.items():
metrics["regime"] = classify(metrics, thresholds)
metrics["turn_source"] = "any_message" if used_fallback_messages else "assistant"
# Summary by regime
counts = Counter(v["regime"] for v in per_run.values())
print(f"\nRegime distribution (n={len(per_run)} runs):")
for regime, n in counts.most_common():
print(f" {regime:<14} {n:>4} ({100*n/len(per_run):>4.1f}%)")
args.reports_dir.mkdir(parents=True, exist_ok=True)
out = args.reports_dir / "regimes.json"
out.write_text(json.dumps(per_run, indent=2), encoding="utf-8")
# Per-model regime breakdown
print(f"\n{'Model':<10} " + " ".join(f"{r:>11}" for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]))
print("-" * 70)
pm_counts = defaultdict(Counter)
for key, v in per_run.items():
model = key.split("/")[0]
pm_counts[model][v["regime"]] += 1
for model in MODELS:
row = [f"{model.split('_')[-1][:9]:<10}"]
for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]:
row.append(f"{pm_counts[model][r]:>11}")
print(" ".join(row))
# Write output
out = ROOT / "reports" / "regimes.json"
out.parent.mkdir(exist_ok=True)
out.write_text(json.dumps(per_run, indent=2))
print(f"\nWrote: {out}")
counts = Counter(str(v["regime"]) for v in per_run.values())
print(f"Wrote: {out}")
print(f"Regime counts: {dict(counts)}")
if __name__ == "__main__":
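The classifier rules are applied in priority order, so a run that satisfies several conditions takes the first matching label. The sketch below re-implements that rule order in plain Python and evaluates it against the fallback thresholds from the script (the values used when no run has at least three turns). The helper `drift_and_recurrence` is an illustrative pure-Python version of two of the trajectory metrics; the support-volume term `vol_log` is passed in by hand here because it needs an eigendecomposition, which the real script does with NumPy.

```python
import math

# Fallback thresholds copied from the script; in normal operation these are
# replaced by quartiles of the observed archive.
THRESHOLDS = {
    "drift_low": 0.15, "drift_med": 0.25, "drift_hi": 0.35,
    "vol_low": -6.0, "vol_hi": -3.0, "rec_hi": 0.8,
}


def drift_and_recurrence(vecs: list[list[float]]) -> tuple[float, float]:
    """Mean consecutive distance and best cosine match between non-adjacent turns."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def cosine(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb) if na > 0 and nb > 0 else 0.0

    steps = [dist(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    drift_mean = sum(steps) / len(steps) if steps else 0.0
    recurrence = 0.0
    for i in range(len(vecs)):
        for j in range(i + 2, len(vecs)):  # skip immediate neighbours
            recurrence = max(recurrence, cosine(vecs[i], vecs[j]))
    return drift_mean, recurrence


def classify(metrics: dict, thresholds: dict) -> str:
    """Same rule order as the script: the first matching rule wins."""
    if int(metrics["n_turns"]) < 3:
        return "too_short"
    drift, rec, vol = metrics["drift_mean"], metrics["recurrence"], metrics["vol_log"]
    if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
        return "trapped"
    if rec > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
        return "limit_cycle"
    if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
        return "diffusive"
    return "mixed"


# A toy run that alternates between two orthogonal states A, B, A, B.
drift, rec = drift_and_recurrence([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(round(drift, 3), rec)

# Hand-picked metric values, one per regime.
print(classify({"n_turns": 5, "drift_mean": 0.10, "recurrence": 0.95, "vol_log": -8.0}, THRESHOLDS))
print(classify({"n_turns": 5, "drift_mean": 0.20, "recurrence": 0.90, "vol_log": -4.0}, THRESHOLDS))
```

The first classified example has both low drift with low volume and high recurrence; it is labelled `trapped` because that rule is checked before `limit_cycle`.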


@ -1,145 +1,127 @@
"""Compute Constraint Index C(q) per task from existing v4-19-full archive.
#!/usr/bin/env python3
"""Compute posterior Constraint Index C(q) from cached runs.
Following "When LLMs Are Dreaming..." paper §Query-design:
Task-level constraint index:
C(q) = z(PR(q)) + z(entropy(q)) + z(BOPS(q))
C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))
Where:
- PR(q): participation ratio = (tr Σ)² / tr(Σ²) of response embeddings
across all (model, run) responses to query q. Low PR = everyone
writes similar thing (prompt is constrained). High PR = responses
spread out (prompt is open-ended).
- entropy(q): Shannon entropy of (discretized) response-feature distribution.
- BOPS(q): Bayesian Optimal Prediction Score, i.e. how well can we predict
response given q? Proxied here as inter-run cosine similarity
for the same model (high similarity = high predictability).
Since we don't have sentence-transformers, we use TF-IDF-style bag-of-words
from the final assistant message per run. This is crude but measures the
same signal: whether models produce similar vs divergent output.
PR(q) = participation ratio of the task response covariance
H(q) = Shannon entropy of the covariance eigenspectrum
BOPS(q) = within-model inter-run predictability proxy
Output: reports/constraint_index.json with per-task C(q) components +
combined z-score.
High C(q) means a task is more constrained: models and repeated runs tend to
land in a narrower response manifold. Low C(q) means the task is more open or
stylistically underconstrained.
Usage:
.venv/bin/python3 scripts/compute_constraint_index.py
This implementation uses a normalized bag-of-words representation built from
the full assistant trajectory text plus tool-call names and compacted inputs.
"""
from __future__ import annotations
import argparse
import json
import re
import glob
import sys
from collections import Counter, defaultdict
from pathlib import Path
import numpy as np
from scipy.stats import entropy as shannon_entropy
ROOT = Path(__file__).resolve().parent.parent
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
MODELS = [
"anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
"anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
"google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
"openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
"openrouter_qwen_qwen3.6-plus",
]
from clawbench.dynamics_archive import load_task_runs_by_model
WORD_RE = re.compile(r"[a-z]{3,}")
STOPWORDS = set("the and that with this have from what your will can but not "
"was will are been one would there been they will their has "
"had its were only some than about these which into also each "
"when where them how who them very much more most other then "
"here such does like just make many like want need take".split())
STOPWORDS = set(
"the and that with this have from what your will can but not "
"was are been one would there they their has had its were only some "
"than about these which into also each when where them how who very "
"much more most other then here such does like just make many want need take".split()
)
def final_assistant_text(run_path: Path, max_chars: int = 4000) -> str:
"""Extract the last assistant message text + tool-call arg summary."""
try:
d = json.loads(run_path.read_text())
except Exception:
return ""
msgs = d.get("transcript", {}).get("messages", [])
texts = []
for m in msgs:
if m.get("role") != "assistant":
continue
if m.get("text"):
texts.append(m["text"])
for tc in (m.get("tool_calls") or []):
name = tc.get("name", "")
args_str = json.dumps(tc.get("arguments", {}))[:200]
texts.append(f"{name} {args_str}")
blob = " ".join(texts)[:max_chars]
return blob
def _assistant_trajectory_text(run, max_chars: int = 4000) -> str:
parts = []
for message in run.transcript.assistant_messages:
if message.text:
parts.append(message.text)
for call in message.tool_calls:
parts.append(call.name)
if call.input:
parts.append(json.dumps(call.input, sort_keys=True)[:200])
return " ".join(p for p in parts if p).strip()[:max_chars]
def _fallback_text_from_any_message(run) -> str:
for msg in reversed(run.transcript.messages):
parts = []
if msg.text:
parts.append(msg.text)
for call in msg.tool_calls:
parts.append(call.name)
if call.input:
parts.append(json.dumps(call.input, sort_keys=True)[:200])
if parts:
return " ".join(parts).strip()
return ""
def tokenize(text: str) -> list[str]:
return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]
return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]
def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
"""Build a vocab of the top-k most common tokens across all texts."""
counter = Counter()
for t in texts:
counter.update(set(tokenize(t)))
return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
counts = Counter()
for text in texts:
counts.update(set(tokenize(text)))
return {word: idx for idx, (word, _) in enumerate(counts.most_common(top_k))}
def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
"""TF-IDF-ish: token frequency normalized to unit L2 for cosine geometry."""
v = np.zeros(len(vocab), dtype=np.float32)
vec = np.zeros(len(vocab), dtype=np.float32)
toks = tokenize(text)
if not toks:
return v
return vec
counts = Counter(toks)
for w, c in counts.items():
if w in vocab:
v[vocab[w]] = c
n = np.linalg.norm(v)
return v / n if n > 0 else v
for word, cnt in counts.items():
if word in vocab:
vec[vocab[word]] = cnt
norm = np.linalg.norm(vec)
return vec / norm if norm > 0 else vec
def participation_ratio(X: np.ndarray) -> float:
"""PR(X) = (tr Σ)² / tr(Σ²). Measures effective dimensionality, from 1 to d."""
"""PR(X) = (tr Sigma)^2 / tr(Sigma^2), an effective dimensionality proxy."""
if X.shape[0] < 2:
return 1.0
Sigma = np.cov(X.T)
if Sigma.ndim == 0:
sigma = np.cov(X.T)
if sigma.ndim == 0:
return 1.0
tr = np.trace(Sigma)
tr_sq = np.trace(Sigma @ Sigma)
tr = np.trace(sigma)
tr_sq = np.trace(sigma @ sigma)
if tr_sq < 1e-12:
return 1.0
return float(tr ** 2 / tr_sq)
return float((tr**2) / tr_sq)
def response_entropy(X: np.ndarray, n_clusters: int = 8) -> float:
"""Entropy of a k-means-like discretization of responses.
Since we have small n per task (~27 responses), we cluster by nearest-
centroid using the top-few PCA directions. Simpler: use normalized
eigenvalues of covariance as a proxy for entropy over principal modes.
"""
def response_entropy(X: np.ndarray) -> float:
"""Entropy over normalized covariance eigenvalues, in bits."""
if X.shape[0] < 2:
return 0.0
Sigma = np.cov(X.T)
eigs = np.linalg.eigvalsh(Sigma)
sigma = np.cov(X.T)
eigs = np.linalg.eigvalsh(sigma)
eigs = np.clip(eigs, 1e-12, None)
eigs = eigs / eigs.sum()
return float(shannon_entropy(eigs, base=2))
probs = eigs / eigs.sum()
return float(-np.sum(probs * np.log2(probs)))
def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> float:
"""BOPS proxy: inter-run cosine similarity within same model.
High similarity = predictable (high BOPS). Low similarity = novel each run.
Returns mean cosine across all pairs within each model, averaged across models.
"""
"""Mean within-model pairwise cosine similarity across repeated runs."""
per_model_means = []
for _model, vecs in run_vecs.items():
for vecs in run_vecs.values():
if len(vecs) < 2:
continue
sims = []
@ -154,91 +136,88 @@ def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> floa
return float(np.mean(per_model_means)) if per_model_means else 0.0
def zscore(value: float, arr: np.ndarray) -> float:
std = arr.std()
return float((value - arr.mean()) / std) if std > 1e-12 else 0.0
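To make the sign convention of C(q) concrete, the following sketch combines per-task components into `C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))` using the same population-standard-deviation z-score as `zscore` above. It is written in plain Python rather than NumPy so it runs without dependencies; the function names `constraint_index` and the list-based `participation_ratio` are illustrative, and the three-task input values are made up for the example, not taken from any archive.

```python
import statistics


def zscore(value: float, values: list[float]) -> float:
    """Population z-score; returns 0.0 when the spread is degenerate."""
    std = statistics.pstdev(values)
    return (value - statistics.fmean(values)) / std if std > 1e-12 else 0.0


def constraint_index(pr: list[float], h: list[float], bops: list[float]) -> list[float]:
    """C(q) = -z(PR) - z(H) + z(BOPS), computed per task across all tasks."""
    return [
        -zscore(pr[q], pr) - zscore(h[q], h) + zscore(bops[q], bops)
        for q in range(len(pr))
    ]


def participation_ratio(rows: list[list[float]]) -> float:
    """PR = (tr Sigma)^2 / tr(Sigma^2) for the sample covariance of the rows."""
    n = len(rows)
    if n < 2:
        return 1.0
    d = len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(d)]
    centered = [[r[k] - means[k] for k in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    trace = sum(cov[k][k] for k in range(d))
    # For a symmetric matrix, tr(Sigma^2) equals the sum of squared entries.
    trace_sq = sum(cov[a][b] ** 2 for a in range(d) for b in range(d))
    return trace * trace / trace_sq if trace_sq >= 1e-12 else 1.0


# Made-up components for three tasks: task 0 is the most constrained
# (low PR, low entropy, high within-model similarity).
scores = constraint_index(pr=[1.0, 2.0, 3.0], h=[0.5, 1.0, 1.5], bops=[0.9, 0.6, 0.3])
print([round(s, 3) for s in scores])

# Responses on a single line use one effective dimension; a symmetric cross uses two.
print(round(participation_ratio([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]), 3))
print(round(participation_ratio([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]), 3))
```

Because PR and H enter with a negative sign, a task whose responses spread over many directions gets a lower C(q), while high run-to-run similarity raises it.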
def main() -> None:
# Gather: per-task list of texts + per-model list of per-run vectors
parser = argparse.ArgumentParser(description="Compute posterior constraint index per task")
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
args = parser.parse_args()
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
if not grouped:
raise SystemExit(f"No cached runs found under {args.archive_dir}")
per_task_texts: dict[str, list[str]] = defaultdict(list)
per_task_model_runs: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
for model in MODELS:
model_dir = ARCH / model
if not model_dir.exists():
continue
for task_dir in model_dir.iterdir():
if not task_dir.is_dir():
continue
task = task_dir.name
for rf in sorted(task_dir.glob("run*.json")):
text = final_assistant_text(rf)
per_task_model_texts: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
use_fallback_messages = False
for model_name, task_runs in grouped.items():
for task_id, runs in task_runs.items():
for run in runs:
text = _assistant_trajectory_text(run)
if text:
per_task_texts[task].append(text)
per_task_model_runs[task][model].append(text)
per_task_texts[task_id].append(text)
per_task_model_texts[task_id][model_name].append(text)
print(f"Tasks with responses: {len(per_task_texts)}")
all_texts = [text for texts in per_task_texts.values() for text in texts]
if not all_texts:
use_fallback_messages = True
for model_name, task_runs in grouped.items():
for task_id, runs in task_runs.items():
for run in runs:
text = _fallback_text_from_any_message(run)
if text:
per_task_texts[task_id].append(text)
per_task_model_texts[task_id][model_name].append(text)
all_texts = [text for texts in per_task_texts.values() for text in texts]
if not all_texts:
raise SystemExit("No usable text found in cached transcripts.")
# Build a GLOBAL vocab across all tasks for comparable vector spaces
all_texts = [t for ts in per_task_texts.values() for t in ts]
vocab = build_vocab(all_texts, top_k=500)
print(f"Global vocab size: {len(vocab)}")
# Compute per-task metrics
per_task: dict[str, dict] = {}
for task, texts in sorted(per_task_texts.items()):
if len(texts) < 5:
continue
X = np.stack([vectorize(t, vocab) for t in texts]) # (n_responses, vocab_dim)
per_task: dict[str, dict[str, float | str]] = {}
for task_id, texts in sorted(per_task_texts.items()):
X = np.stack([vectorize(text, vocab) for text in texts])
pr = participation_ratio(X)
ent = response_entropy(X)
# BOPS: within-model run predictability
model_vecs: dict[str, list[np.ndarray]] = {}
for m, ts in per_task_model_runs[task].items():
model_vecs[m] = [vectorize(t, vocab) for t in ts]
model_vecs = {
model_name: [vectorize(text, vocab) for text in model_texts]
for model_name, model_texts in per_task_model_texts[task_id].items()
}
bops = bops_inter_run_predictability(model_vecs)
per_task[task] = {
per_task[task_id] = {
"n_responses": len(texts),
"PR": pr,
"entropy": ent,
"BOPS": bops,
"data_source": "fallback_any_message" if use_fallback_messages else "assistant_final",
}
# Z-score each component across tasks → combine into C(q)
if not per_task:
raise SystemExit("Not enough data to compute C(q).")
prs = np.array([v["PR"] for v in per_task.values()])
ents = np.array([v["entropy"] for v in per_task.values()])
bopss = np.array([v["BOPS"] for v in per_task.values()])
def z(x, arr):
return float((x - arr.mean()) / (arr.std() or 1.0))
for task_id, v in per_task.items():
z_pr = zscore(v["PR"], prs)
z_ent = zscore(v["entropy"], ents)
z_bops = zscore(v["BOPS"], bopss)
v["z_PR"] = z_pr
v["z_entropy"] = z_ent
v["z_BOPS"] = z_bops
v["C_q"] = -z_pr - z_ent + z_bops
for task, v in per_task.items():
zpr = z(v["PR"], prs)
zent = z(v["entropy"], ents)
zbops = z(v["BOPS"], bopss)
# Paper: higher PR/entropy = MORE open-ended. Higher BOPS = MORE predictable.
# "Constraint" = opposite of openness. C(q) high ⇒ constrained task.
# So: C(q) = -z(PR) - z(entropy) + z(BOPS)
v["z_PR"] = zpr
v["z_entropy"] = zent
v["z_BOPS"] = zbops
v["C_q"] = -zpr - zent + zbops
# Sort + print
ranked = sorted(per_task.items(), key=lambda kv: -kv[1]["C_q"])
print(f"\n{'Task':<38} {'n':>3} {'PR':>5} {'H':>5} {'BOPS':>5} {'C(q)':>6} (constraint level)")
print("-" * 78)
for task, v in ranked:
print(f"{task:<38} {v['n_responses']:>3} {v['PR']:>5.2f} {v['entropy']:>5.2f} "
f"{v['BOPS']:>5.2f} {v['C_q']:>+6.2f}")
out_path = ROOT / "reports" / "constraint_index.json"
out_path.parent.mkdir(exist_ok=True)
out_path.write_text(json.dumps(per_task, indent=2))
print(f"\nWrote: {out_path}")
# Bucket summary
highs = [t for t, v in per_task.items() if v["C_q"] > 0.5]
lows = [t for t, v in per_task.items() if v["C_q"] < -0.5]
mids = [t for t, v in per_task.items() if -0.5 <= v["C_q"] <= 0.5]
print(f"\nHigh-constraint (C>+0.5): {len(highs)} tasks (responses converge)")
print(f"Mid: {len(mids)} tasks")
print(f"Low-constraint (C<-0.5): {len(lows)} tasks (responses diverge — open-ended)")
args.reports_dir.mkdir(parents=True, exist_ok=True)
out_path = args.reports_dir / "constraint_index.json"
out_path.write_text(json.dumps(per_task, indent=2), encoding="utf-8")
print(f"Wrote: {out_path}")
if __name__ == "__main__":
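The quantities this script combines can be illustrated on toy matrices. The block below is a minimal sketch, not the script itself: `participation_ratio` follows the formula shown in the diff, while `eigen_entropy_bits`, `zscores`, `constraint_index`, and the toy inputs are hypothetical names and data chosen for the example. BOPS values are supplied by hand rather than derived from run vectors.

```python
import numpy as np


def participation_ratio(X: np.ndarray) -> float:
    """PR = tr(Sigma)^2 / tr(Sigma^2), with Sigma the covariance of the rows of X."""
    if X.shape[0] < 2:
        return 1.0
    sigma = np.atleast_2d(np.cov(X.T))
    tr_sq = np.trace(sigma @ sigma)
    if tr_sq < 1e-12:
        return 1.0
    return float(np.trace(sigma) ** 2 / tr_sq)


def eigen_entropy_bits(X: np.ndarray) -> float:
    """Shannon entropy (bits) of the normalized covariance eigenvalues."""
    if X.shape[0] < 2:
        return 0.0
    eigs = np.linalg.eigvalsh(np.atleast_2d(np.cov(X.T)))
    eigs = np.clip(eigs, 1e-12, None)  # guard against zero or slightly negative values
    probs = eigs / eigs.sum()
    return float(-np.sum(probs * np.log2(probs)))


def zscores(values: np.ndarray) -> np.ndarray:
    """Z-score every entry against the array itself; all zeros if there is no spread."""
    std = values.std()
    if std <= 1e-12:
        return np.zeros_like(values, dtype=float)
    return (values - values.mean()) / std


def constraint_index(pr: np.ndarray, ent: np.ndarray, bops: np.ndarray) -> np.ndarray:
    """C(q) = -z(PR) - z(H) + z(BOPS), one value per task."""
    return -zscores(pr) - zscores(ent) + zscores(bops)


# Toy data (hypothetical): two equal-variance directions vs. a single direction.
isotropic = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
collinear = np.array([[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]])

print(participation_ratio(isotropic), eigen_entropy_bits(isotropic))
print(participation_ratio(collinear), eigen_entropy_bits(collinear))
print(constraint_index(
    np.array([1.0, 2.0, 3.0]),   # PR per task
    np.array([0.0, 1.0, 2.0]),   # entropy per task
    np.array([0.9, 0.5, 0.1]),   # BOPS per task (hand-picked)
))
```

In this toy setup the collinear responses have a participation ratio of 1 and near-zero entropy, so a task like that ends up with the highest C(q) once its high BOPS is added.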


@ -1,221 +1,144 @@
"""Assemble a combined dynamical-systems report integrating:
- Constraint Index C(q) per task
- Regime classification per run
- Seed vs capability variance
- Survival / hazard analysis
#!/usr/bin/env python3
"""Assemble a combined posterior dynamical-systems markdown report.
Requires: reports/constraint_index.json, reports/regimes.json,
reports/variance_decomposition.json, reports/survival_analysis.json
Inputs:
- constraint_index.json
- regimes.json
- variance_decomposition.json
- survival_analysis.json
- snr_weighted_ranking.json (optional)
Output: reports/EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md
Output:
- EVAL_REPORT_DYNAMICAL.md
The goal is to keep a compact human-readable summary next to the machine
outputs produced by the posterior analysis pipeline.
"""
from __future__ import annotations
import argparse
import json
from collections import Counter, defaultdict
from pathlib import Path
from statistics import mean
ROOT = Path(__file__).resolve().parent.parent
REPORTS = ROOT / "reports"
MODEL_MAP = {
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
}
def _read_json(path: Path):
if not path.exists():
raise SystemExit(f"Missing required report file: {path}")
return json.loads(path.read_text(encoding="utf-8"))
def main() -> None:
cq = json.loads((REPORTS / "constraint_index.json").read_text())
regimes = json.loads((REPORTS / "regimes.json").read_text())
variance = json.loads((REPORTS / "variance_decomposition.json").read_text())
survival = json.loads((REPORTS / "survival_analysis.json").read_text())
lines = []
L = lines.append
L("# ClawBench — Dynamical Systems Analysis (v2026-4-19-full)")
L("")
L("Inspired by *\"When LLMs Are Dreaming, Where Do They Go?\"* — treats")
L("agent runs as dynamical systems and extracts signal ClawBench's flat")
L("run_score can't: task constraint level, per-run regime, noise vs")
L("signal ratio, and per-turn survival curves.")
L("")
# ----------------- 1. Constraint Index summary -----------------
L("## 1. Constraint Index C(q) per task")
L("")
L("C(q) = -z(PR) - z(entropy) + z(BOPS). High C(q) = task is constrained")
L("(responses converge); low C(q) = open-ended (responses diverge).")
L("")
high = sorted([(t, v) for t, v in cq.items() if v["C_q"] > 0.5],
key=lambda kv: -kv[1]["C_q"])
low = sorted([(t, v) for t, v in cq.items() if v["C_q"] < -0.5],
key=lambda kv: kv[1]["C_q"])
mid = [t for t, v in cq.items() if -0.5 <= v["C_q"] <= 0.5]
L(f"- **High-constraint ({len(high)} tasks, C>+0.5):** {', '.join(t for t, _ in high[:5])}, …")
L(f"- **Low-constraint ({len(low)} tasks, C<-0.5):** {', '.join(t for t, _ in low[:5])}, …")
L(f"- **Middle ({len(mid)} tasks):** {', '.join(mid[:5])}, …")
L("")
L("Top 5 most-constrained and most-divergent tasks:")
L("")
L("| Constraint | Task | PR | Entropy | BOPS | C(q) |")
L("|---|---|:---:|:---:|:---:|:---:|")
for t, v in high[:5]:
L(f"| HIGH | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
for t, v in low[:5]:
L(f"| LOW | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
L("")
# ----------------- 2. Regime distribution -----------------
L("## 2. Dynamical regime per run")
L("")
L("Each run's turn-by-turn trajectory classified by drift, recurrence,")
L("and support volume thresholds (quartile-based).")
L("")
pm = defaultdict(Counter)
for key, v in regimes.items():
model_sub = key.split("/")[0]
# Reverse-map to label
label = next((l for l, (s, _) in MODEL_MAP.items() if s == model_sub), None)
if label:
pm[label][v["regime"]] += 1
L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
L("|---|:---:|:---:|:---:|:---:|:---:|")
for label, (_sub, pretty) in MODEL_MAP.items():
c = pm[label]
L(f"| {pretty} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
f"{c['diffusive']} | {c['mixed']} |")
L("")
L("**Interpretation:**")
L("- `trapped` = low drift + small support: agent converges to a point.")
L(" Often good on constrained tasks, sometimes 'stuck'.")
L("- `limit_cycle` = repeats similar states non-consecutively: tool-use loop.")
L("- `diffusive` = keeps exploring without converging. Goal drift risk.")
L("- `mixed` = no strong signature.")
L("")
L("Notable findings:")
L("")
# Find outliers
trap_counts = [(label, pm[label]["trapped"]) for label in MODEL_MAP]
cycle_counts = [(label, pm[label]["limit_cycle"]) for label in MODEL_MAP]
trap_counts.sort(key=lambda x: -x[1])
cycle_counts.sort(key=lambda x: -x[1])
L(f"- Most `trapped` runs: **{MODEL_MAP[trap_counts[0][0]][1]}** ({trap_counts[0][1]} runs) —")
L(f" converges aggressively; often one-shot answer without iteration.")
L(f"- Most `limit_cycle` runs: **{MODEL_MAP[cycle_counts[0][0]][1]}** ({cycle_counts[0][1]} runs) —")
L(f" repeats tool patterns between turns; check for productive vs stuck loops.")
L("")
# ----------------- 3. Variance decomposition -----------------
L("## 3. Seed-noise vs capability-signal")
L("")
agg = variance["aggregate"]
L(f"- **Seed-noise variance** (same model, 3 runs): **{agg['mean_seed_var']:.4f}**")
L(f"- **Capability variance** (across models): **{agg['mean_cap_var']:.4f}**")
L(f"- **Capability fraction: {agg['capability_fraction']:.1%}**")
L(f" (= fraction of benchmark variance that reflects real model differences)")
L("")
L("**The other ~47% is seed noise.** Any ranking gap < √(2·seed_var) ≈")
L(f"0.20 between two models is within noise. Top-5 models' gap is 0.02 →")
L("**statistically indistinguishable.**")
L("")
L("### SNR tiers across 40 tasks")
L("")
per_task = variance["per_task"]
hi = [r for r in per_task if r["snr"] >= 5]
mid = [r for r in per_task if 1 <= r["snr"] < 5]
lo = [r for r in per_task if r["snr"] < 1]
L(f"- **High-SNR ({len(hi)} tasks, SNR ≥ 5):** reliably discriminate models")
for r in hi[:3]:
L(f" - `{r['task']}` (SNR={r['snr']:.1f})")
L(f"- **Mid-SNR ({len(mid)} tasks, 1 ≤ SNR < 5):** moderate signal")
L(f"- **Low-SNR ({len(lo)} tasks, SNR < 1):** seed noise dominates; these")
L(f" tasks give essentially random rankings")
for r in sorted(lo, key=lambda x: x['snr'])[:3]:
L(f" - `{r['task']}` (SNR={r['snr']:.2f}) — random")
L("")
# ----------------- 4. Survival analysis -----------------
L("## 4. Per-turn survival: when do runs fail?")
L("")
L("T_F = first turn where agent emits empty response or run ends in failure.")
L("S(t) = fraction of runs still on-track past turn t. Low = dies early.")
L("")
L("| Model | Median fail turn | S(3) | S(5) | S(8) | S(12) | S(20) |")
L("|---|:---:|:---:|:---:|:---:|:---:|:---:|")
for label, (_sub, pretty) in MODEL_MAP.items():
d = survival.get(label, {})
surv = d.get("survival", [0]*20)
med = d.get("median_fail_turn", "")
med_str = f"{med:.1f}" if isinstance(med, (int, float)) and med != float("inf") else str(med)
L(f"| {pretty} | {med_str} | {surv[2]:.2f} | {surv[4]:.2f} | "
f"{surv[7]:.2f} | {surv[11]:.2f} | {surv[19]:.2f} |")
L("")
# Narrative
surv_rank_t8 = sorted(
[(label, survival[label]["survival"][7])
for label in MODEL_MAP if label in survival],
key=lambda x: -x[1]
parser = argparse.ArgumentParser(description="Generate a combined dynamical report markdown")
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument(
"--output",
type=Path,
default=None,
help="Markdown output path; defaults to <reports-dir>/EVAL_REPORT_DYNAMICAL.md",
)
best = MODEL_MAP[surv_rank_t8[0][0]][1]
worst = MODEL_MAP[surv_rank_t8[-1][0]][1]
L(f"- **{best}** survives longest — {surv_rank_t8[0][1]:.0%} of runs still")
L(f" producing output at turn 8.")
L(f"- **{worst}** dies earliest — only {surv_rank_t8[-1][1]:.0%} make it to turn 8.")
args = parser.parse_args()
reports = args.reports_dir
output_path = args.output or (reports / "EVAL_REPORT_DYNAMICAL.md")
cq = _read_json(reports / "constraint_index.json")
regimes = _read_json(reports / "regimes.json")
variance = _read_json(reports / "variance_decomposition.json")
survival = _read_json(reports / "survival_analysis.json")
ranking_path = reports / "snr_weighted_ranking.json"
ranking = json.loads(ranking_path.read_text(encoding="utf-8")) if ranking_path.exists() else None
lines: list[str] = []
L = lines.append
L("# ClawBench Posterior Dynamical Report")
L("")
L("This is signal invisible in flat run_score: two models can score")
L("similarly but have very different failure profiles. Pick accordingly")
L("for long-horizon deployments.")
L("This report combines posterior-only diagnostics from cached run artifacts.")
L("")
# ----------------- 5. Integrated view -----------------
L("## 5. Integrated view — combining all four lenses")
L("## 1. Constraint Index C(q)")
L("")
L("For a model to be **reliably good** at a task, we need:")
L("- (a) It scores well (run_score high)")
L("- (b) Variance across seeds is low (predictable)")
L("- (c) It doesn't exhibit pathological regime (trapped on wrong answer / cycling)")
L("- (d) It survives multi-turn without dying early")
values = [(task, float(data.get("C_q", 0.0))) for task, data in cq.items()]
values.sort(key=lambda row: row[1], reverse=True)
highs = [row for row in values if row[1] > 0.5]
lows = [row for row in values if row[1] < -0.5]
L(f"- High-constraint tasks (C > 0.5): {len(highs)}")
L(f"- Low-constraint tasks (C < -0.5): {len(lows)}")
L("")
L("These lenses disagree constructively:")
if values:
L("Top tasks by C(q):")
L("")
L("| Task | C(q) |")
L("|---|---:|")
for task, c_q in values[:10]:
L(f"| {task} | {c_q:+.3f} |")
L("")
L("## 2. Regime Classification")
L("")
L("- **Opus 4.6** tops flat run_score but median failure at turn 5.5 (earlier than Opus 4.7's 7).")
L("- **GPT 5.4** is mid-pack on flat score but has highest S(8)=0.60 — long-horizon champion.")
L("- **Sonnet 4.6** most `trapped` runs — it commits early and sticks. Good on")
L(" constrained tasks, bad on open-ended (cf. memory-recall-continuation 0.15).")
L("- **GLM 5.1** most balanced regime distribution; justifies broad performance.")
L("- **Kimi K2.5** median fail at turn 3 — it's not just low-scoring, it's")
L(" specifically fragile under multi-turn execution.")
by_model = defaultdict(Counter)
for key, row in regimes.items():
model = key.split("/")[0]
regime = row.get("regime", "unknown")
by_model[model][regime] += 1
L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
L("|---|---:|---:|---:|---:|---:|")
for model in sorted(by_model):
c = by_model[model]
L(
f"| {model} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
f"{c['diffusive']} | {c['mixed']} |"
)
L("")
# ----------------- 6. What to do next -----------------
L("## 6. Implications for the benchmark")
L("## 3. Variance Decomposition")
L("")
agg = variance.get("aggregate", {})
L(f"- Mean seed variance: {agg.get('mean_seed_var', 0.0):.6f}")
L(f"- Mean capability variance: {agg.get('mean_cap_var', 0.0):.6f}")
L(f"- Capability fraction: {agg.get('capability_fraction', 0.0):.1%}")
L(f"- High-SNR tasks: {agg.get('high_snr_tasks', 0)}")
L(f"- Mid-SNR tasks: {agg.get('mid_snr_tasks', 0)}")
L(f"- Low-SNR tasks: {agg.get('low_snr_tasks', 0)}")
L("")
L("- **47% seed noise** means any gap < 0.02 is meaningless. Treat top-5")
L(" as a statistical tie. Dropping the 21 low-SNR tasks would sharpen")
L(" remaining rankings considerably.")
L("- **Weight tasks by SNR × |C(q)|** instead of flat mean. High-SNR,")
L(" high-|C(q)| tasks give the cleanest capability signal.")
L("- **Report survival curves alongside run_score** to surface long-horizon")
L(" capability that single-number metrics hide.")
L("- **Flag 'trapped' runs that scored high** — the model may have")
L(" guessed-and-committed rather than reasoned; not same reliability.")
L("- **Add a Tier 6 long-horizon (100+ turn) task set** to actually")
L(" measure the dynamical regimes the paper proposes — current")
L(" trajectories are too short (median 6 assistant turns) for clean")
L(" Lyapunov or attractor diagnostics.")
out = REPORTS / "EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md"
out.write_text("\n".join(lines) + "\n")
print(f"Wrote: {out}")
L("## 4. Survival Analysis")
L("")
L("| Model | Runs | Events | Median failure turn | S(3) | S(5) | S(8) |")
L("|---|---:|---:|---:|---:|---:|---:|")
for model in sorted(survival):
row = survival[model]
surv = row.get("survival", [0.0] * 8)
med = row.get("median_fail_turn", "inf")
if isinstance(med, float) and med == float("inf"):
med_display = "inf"
else:
med_display = f"{float(med):.1f}"
L(
f"| {model} | {row.get('n_runs', 0)} | {row.get('n_events', 0)} | "
f"{med_display} | {surv[2] if len(surv) > 2 else 0.0:.2f} | "
f"{surv[4] if len(surv) > 4 else 0.0:.2f} | {surv[7] if len(surv) > 7 else 0.0:.2f} |"
)
L("")
if ranking is not None:
L("## 5. SNR-weighted Ranking")
L("")
# Escape the pipes in |C(q)| so the header keeps six columns when rendered.
L("| Rank | Model | Flat | SNR x \\|C(q)\\| | Winsorized | Coverage |")
L("|---:|---|---:|---:|---:|---:|")
for idx, row in enumerate(ranking.get("results", []), start=1):
L(
f"| {idx} | {row.get('model', '')} | {row.get('flat', 0.0):.4f} | "
f"{row.get('snr_x_abs_cq', 0.0):.4f} | {row.get('snr_x_abs_cq_winsorized', 0.0):.4f} | "
f"{row.get('coverage', 0)} |"
)
L("")
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
print(f"Wrote: {output_path}")
if __name__ == "__main__":
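The Survival Analysis table in this report reads `survival`, `median_fail_turn`, `n_runs`, and `n_events` per model. As a hedged illustration of how a record of that shape could be produced from `(T_F, is_event)` pairs, the sketch below uses a Kaplan-Meier product-limit estimate. This is an assumption made for the example; the diff does not show which estimator `survival_analysis.py` uses, and the names `kaplan_meier`, `median_fail_turn`, and `survival_at` are hypothetical.

```python
import math


def kaplan_meier(times: list[int], events: list[bool], horizon: int) -> list[float]:
    """Return [S(1), ..., S(horizon)] with S(t) = P(T_F > t).

    times[i] is the 1-indexed failure or censoring turn of run i;
    events[i] is True for an observed failure, False for a censored run.
    """
    surv: list[float] = []
    s = 1.0
    for t in range(1, horizon + 1):
        at_risk = sum(1 for T in times if T >= t)
        failed = sum(1 for T, e in zip(times, events) if e and T == t)
        if at_risk > 0:
            s *= (at_risk - failed) / at_risk
        surv.append(s)
    return surv


def median_fail_turn(surv: list[float]) -> float:
    """First turn at which survival drops to 0.5 or below; inf if it never does."""
    for t, s in enumerate(surv, start=1):
        if s <= 0.5:
            return float(t)
    return math.inf


def survival_at(surv: list[float], t: int) -> float:
    """S(t) with the same short-curve guard the report table uses (0.0 if missing)."""
    return surv[t - 1] if len(surv) >= t else 0.0


# Hypothetical runs: three observed failures and two censored (successful) runs.
times = [2, 3, 3, 5, 6]
events = [True, True, True, False, False]
curve = kaplan_meier(times, events, horizon=8)
record = {
    "n_runs": len(times),
    "n_events": sum(events),
    "median_fail_turn": median_fail_turn(curve),
    "survival": curve,
}
print(record)
print(survival_at(curve, 3), survival_at(curve, 5), survival_at(curve, 8))
```

Censored runs leave the risk set without lowering the curve, which is why the two successful runs here do not pull S(t) below 0.4 after turn 3.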


@ -0,0 +1,89 @@
#!/usr/bin/env python3
"""Run the full posterior dynamical analysis pipeline."""
from __future__ import annotations
import argparse
import subprocess
import sys
from pathlib import Path
REPO_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO_ROOT))
from clawbench.dynamics_archive import discover_model_roots, load_task_runs_archive, write_dynamics_report
def _run(cmd: list[str]) -> None:
print("$", " ".join(cmd))
result = subprocess.run(cmd, cwd=REPO_ROOT)
if result.returncode != 0:
raise SystemExit(result.returncode)
def _resolve_path(path: Path) -> Path:
return path if path.is_absolute() else (REPO_ROOT / path)
def _write_dynamics_reports(
archive_dir: Path,
output_dir: Path,
tier: str | None,
) -> None:
roots = discover_model_roots(archive_dir)
if not roots:
raise SystemExit(f"No cached runs found under {archive_dir}")
multiple_models = len(roots) > 1
wrote_any = False
for model_name, model_dir in roots.items():
task_runs = load_task_runs_archive(model_dir, tier=tier)
if not task_runs:
continue
wrote_any = True
model_output_dir = output_dir / model_name if multiple_models else output_dir
report_path, plots = write_dynamics_report(task_runs, model_output_dir)
n_runs = sum(len(runs) for runs in task_runs.values())
print(f"[dynamics] {model_name}: loaded {n_runs} cached runs across {len(task_runs)} tasks")
print(f"[dynamics] {model_name}: wrote {report_path}")
print(f"[dynamics] {model_name}: saved {len(plots)} plots to {model_output_dir}/")
if not wrote_any:
raise SystemExit(f"No cached runs found under {archive_dir}")
def main() -> None:
parser = argparse.ArgumentParser(description="Run posterior dynamics pipeline end to end")
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument("--output-dir", type=Path, default=Path("results/posterior_dynamics"))
parser.add_argument(
"--include-dynamics-report",
action="store_true",
help="Also build per-model dynamics.json files and plots from the archive.",
)
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
args = parser.parse_args()
py = sys.executable
archive_dir = _resolve_path(args.archive_dir)
reports_dir = _resolve_path(args.reports_dir)
output_dir = _resolve_path(args.output_dir)
tier_args = ["--tier", args.tier] if args.tier else []
scripts_dir = REPO_ROOT / "scripts"
_run([py, str(scripts_dir / "compute_constraint_index.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
_run([py, str(scripts_dir / "classify_regimes.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
_run([py, str(scripts_dir / "variance_decomp.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
_run([py, str(scripts_dir / "survival_analysis.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
_run([py, str(scripts_dir / "snr_weighted_ranking.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
_run([py, str(scripts_dir / "generate_dynamical_report.py"), "--reports-dir", str(reports_dir)])
if args.include_dynamics_report:
_write_dynamics_reports(archive_dir, output_dir, args.tier)
if __name__ == "__main__":
main()


@ -1,148 +1,130 @@
"""SNR × |C(q)|-weighted ranking — the dynamical-systems-informed metric.
#!/usr/bin/env python3
"""SNR x |C(q)| weighted ranking from posterior cached runs.
Motivation: from variance_decomp.py we know 47% of run_score variance is
seed noise. From compute_constraint_index.py we know some tasks are
high-constraint (everyone converges) and others are open-ended (responses
diverge for style reasons, not capability).
Weighted headline score:
Weighted mean:
w(task) = SNR(task) × |C(q)(task)|
score(model) = Σ_task w(task) · mean_run_score(task, model) / Σ_task w(task)
w(q) = max(0, SNR(q)) * |C(q)|
score(model) = sum_q w(q) * mean_run_score(model, q) / sum_q w(q)
Why:
- High SNR tasks contribute more than low-SNR tasks (noise-weighted)
- |C(q)| amplifies tasks that are either strongly constrained OR strongly
open-ended (i.e. measures what they're supposed to measure, regardless
of polarity)
- Moderate C(q) tasks (C near 0) are inherently ambiguous, so they are down-weighted
We also report:
Outputs:
- Per-model weighted score
- Comparison against flat-mean ranking
- Published to reports/snr_weighted_ranking.json
snr_only = SNR-weighted mean
snr_x_abs_cq = SNR x |C(q)| weighted mean
snr_x_abs_cq_winsorized = same, but top task weights are clamped at p95
This keeps noisy low-SNR tasks from dominating and upweights tasks whose
response geometry suggests a stronger capability signal.
"""
from __future__ import annotations
import glob
import argparse
import json
import sys
from collections import defaultdict
from pathlib import Path
from statistics import mean
import numpy as np
ROOT = Path(__file__).resolve().parent.parent
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
REPORTS = ROOT / "reports"
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
MODELS = {
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
}
from clawbench.dynamics_archive import load_task_runs_by_model
def main() -> None:
cq = json.loads((REPORTS / "constraint_index.json").read_text())
var = json.loads((REPORTS / "variance_decomposition.json").read_text())
snr_by_task = {r["task"]: r["snr"] for r in var["per_task"]}
parser = argparse.ArgumentParser(description="Compute SNR-weighted posterior model ranking")
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
args = parser.parse_args()
# Per (model, task): mean run_score over the 3 runs
per_mt: dict[str, dict[str, list[float]]] = defaultdict(dict)
for label, (sub, _) in MODELS.items():
for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
try:
d = json.loads(Path(p).read_text())
except Exception:
continue
task = p.split("/")[-2]
per_mt[label].setdefault(task, []).append(d.get("run_score", 0))
per_mt_mean = {
m: {t: mean(v) for t, v in d.items() if v} for m, d in per_mt.items()
cq_path = args.reports_dir / "constraint_index.json"
var_path = args.reports_dir / "variance_decomposition.json"
if not cq_path.exists() or not var_path.exists():
raise SystemExit("Missing prerequisite reports: run compute_constraint_index.py and variance_decomp.py first.")
cq = json.loads(cq_path.read_text(encoding="utf-8"))
var = json.loads(var_path.read_text(encoding="utf-8"))
snr_by_task = {row["task"]: row["snr"] for row in var.get("per_task", [])}
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
if not grouped:
raise SystemExit(f"No cached runs found under {args.archive_dir}")
per_model_task_scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
for model_name, task_runs in grouped.items():
for task_id, runs in task_runs.items():
per_model_task_scores[model_name][task_id] = [float(run.run_score) for run in runs]
per_model_task_mean = {
model_name: {
task_id: mean(vals)
for task_id, vals in task_scores.items()
if vals
}
for model_name, task_scores in per_model_task_scores.items()
}
# Only consider tasks present in both C(q) and SNR
common_tasks = sorted(set(cq) & set(snr_by_task))
print(f"Using {len(common_tasks)} tasks with both C(q) and SNR.")
if not common_tasks:
raise SystemExit("No overlap between constraint_index and variance_decomposition task sets.")
# Compute weights w(task) = SNR × |C(q)|, clamped to [0, ∞)
weights = {}
for t in common_tasks:
w = max(0.0, snr_by_task[t]) * abs(cq[t]["C_q"])
weights[t] = w
# Also: SNR-only weighting (simpler, no C(q))
snr_weights = {t: max(0.0, snr_by_task[t]) for t in common_tasks}
# Also: Winsorize — clamp top-1 task's weight to 95th percentile to
# prevent single task from dominating
import numpy as _np
_w95 = float(_np.percentile(list(weights.values()), 95))
weights_wins = {t: min(w, _w95) for t, w in weights.items()}
wsum = sum(weights.values())
if wsum == 0:
print("All weights zero — bail.")
return
weights = {task: max(0.0, snr_by_task[task]) * abs(cq[task].get("C_q", 0.0)) for task in common_tasks}
snr_weights = {task: max(0.0, snr_by_task[task]) for task in common_tasks}
# Compute per-model scores under 3 variants
results = []
w95 = float(np.percentile(list(weights.values()), 95)) if weights else 0.0
winsorized = {task: min(weight, w95) for task, weight in weights.items()}
w_sum = sum(weights.values())
snr_sum = sum(snr_weights.values())
wins_sum = sum(weights_wins.values())
for label, (sub, pretty) in MODELS.items():
task_means = per_mt_mean.get(label, {})
if not task_means:
wins_sum = sum(winsorized.values())
results = []
for model_name, task_means in per_model_task_mean.items():
covered = [task for task in common_tasks if task in task_means]
if not covered:
continue
num_cq = sum(weights[t] * task_means.get(t, 0) for t in common_tasks)
num_snr = sum(snr_weights[t] * task_means.get(t, 0) for t in common_tasks)
num_wins = sum(weights_wins[t] * task_means.get(t, 0) for t in common_tasks)
wscore = num_cq / wsum
snr_only = num_snr / snr_sum if snr_sum > 0 else 0
wins_score = num_wins / wins_sum if wins_sum > 0 else 0
flat = mean(task_means[t] for t in common_tasks if t in task_means)
results.append((label, pretty, flat, wscore, snr_only, wins_score))
print()
print(f"{'Model':<16} {'Flat':>7} {'SNR×|C|':>8} {'Winsorized':>11} {'SNR-only':>9}")
print("-" * 66)
# Rank by winsorized variant (primary)
for label, pretty, flat, w, snr_only, wins in sorted(results, key=lambda x: -x[5]):
print(f"{pretty:<16} {flat:>7.4f} {w:>8.4f} {wins:>11.4f} {snr_only:>9.4f}")
flat = mean(task_means[task] for task in covered)
weighted = (
sum(weights[task] * task_means.get(task, 0.0) for task in common_tasks) / w_sum
if w_sum > 1e-12
else 0.0
)
snr_only = (
sum(snr_weights[task] * task_means.get(task, 0.0) for task in common_tasks) / snr_sum
if snr_sum > 1e-12
else 0.0
)
wins_score = (
sum(winsorized[task] * task_means.get(task, 0.0) for task in common_tasks) / wins_sum
if wins_sum > 1e-12
else 0.0
)
# Rank comparisons
print("\n=== Ranking shifts vs flat-mean (winsorized) ===")
flat_rank_order = sorted(results, key=lambda x: -x[2])
flat_rank = {r[0]: i + 1 for i, r in enumerate(flat_rank_order)}
wins_rank_order = sorted(results, key=lambda x: -x[5])
print(f"{'Rank':<5}{'Model':<16} {'Flat':>8} {'Winsorized':>11} {'Δrank':>6}")
for i, (label, pretty, flat, _w, _snr, wins) in enumerate(wins_rank_order, 1):
fr = flat_rank[label]
move = ""
if fr > i: move = f"+{fr-i}"
elif fr < i: move = f"-{i-fr}"
print(f"{i:<5}{pretty:<16} {flat:>8.4f} {wins:>11.4f} {move:>6}")
results.append(
{
"model": model_name,
"flat": float(flat),
"snr_x_abs_cq": float(weighted),
"snr_only": float(snr_only),
"snr_x_abs_cq_winsorized": float(wins_score),
"coverage": len(covered),
}
)
results.sort(key=lambda row: row["snr_x_abs_cq_winsorized"], reverse=True)
# Save
out = {
"flat_score": {r[0]: r[2] for r in results},
"snr_x_cq_weighted": {r[0]: r[3] for r in results},
"snr_x_cq_winsorized": {r[0]: r[5] for r in results},
"snr_only_weighted": {r[0]: r[4] for r in results},
"weights_per_task": weights,
"common_tasks": common_tasks,
"weights_per_task": weights,
"results": results,
}
(REPORTS / "snr_weighted_ranking.json").write_text(json.dumps(out, indent=2))
print(f"\nWrote reports/snr_weighted_ranking.json")
# Show top-5 contributing tasks (highest weight) for context
print()
print("Top-10 tasks by weight (SNR × |C(q)|):")
for t, w in sorted(weights.items(), key=lambda kv: -kv[1])[:10]:
print(f" {t:<38} SNR={snr_by_task[t]:>5.1f} |C(q)|={abs(cq[t]['C_q']):>5.2f} w={w:>6.2f}")
out_path = args.reports_dir / "snr_weighted_ranking.json"
out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
print(f"Wrote: {out_path}")
if __name__ == "__main__":
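The weighting scheme in this script can be sketched in a few lines. The block below is illustrative only: the helper names (`task_weights`, `winsorize`, `weighted_score`) and the toy numbers are assumptions, but the arithmetic follows the formulas in the docstring, `w(q) = max(0, SNR(q)) * |C(q)|` with the weights clamped at their 95th percentile.

```python
import numpy as np


def task_weights(snr: dict[str, float], c_q: dict[str, float]) -> dict[str, float]:
    """w(q) = max(0, SNR(q)) * |C(q)| over tasks present in both inputs."""
    common = sorted(snr.keys() & c_q.keys())
    return {task: max(0.0, snr[task]) * abs(c_q[task]) for task in common}


def winsorize(weights: dict[str, float], pct: float = 95.0) -> dict[str, float]:
    """Clamp every weight at the given percentile of the weight distribution."""
    if not weights:
        return {}
    cap = float(np.percentile(list(weights.values()), pct))
    return {task: min(weight, cap) for task, weight in weights.items()}


def weighted_score(task_means: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean of per-task scores; tasks the model did not run count as 0."""
    total = sum(weights.values())
    if total <= 1e-12:
        return 0.0
    return sum(w * task_means.get(task, 0.0) for task, w in weights.items()) / total


# Hypothetical inputs: task "c" has negative SNR, so its weight is zero.
snr = {"a": 4.0, "b": 1.0, "c": -0.5}
c_q = {"a": -1.0, "b": 2.0, "c": 3.0}
weights = task_weights(snr, c_q)

full_coverage = {"a": 1.0, "b": 0.5, "c": 0.0}
partial_coverage = {"b": 0.5}  # this model never ran task "a"

print(weighted_score(full_coverage, weights))
print(weighted_score(partial_coverage, weights))
print(weighted_score(full_coverage, winsorize(weights)))
```

As in the diff, a task that a model did not run contributes zero to the numerator but still counts in the denominator, so partial coverage lowers the weighted score. The `coverage` field in the output is the way to spot this when comparing models.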


@ -1,164 +1,118 @@
"""Per-turn survival analysis: when do agent runs fail?
#!/usr/bin/env python3
"""Per-turn survival analysis on posterior cached runs.
Following paper §Latent-state survival:
T_F = inf { t ≥ 0 : failure at time t }
S(t) = P(T_F > t)   survival function
h(t) = P(T_F = t | T_F ≥ t)   hazard rate
For each run, define a failure time T_F as the first assistant turn where the
agent emits neither text nor tool calls, or the final assistant turn of an
unsuccessful run with delivery outcome in {fail, partial}.
For each run, we define FAILURE as the first turn where:
(a) the assistant emits no text AND no tool calls, OR
(b) the run's delivery_outcome is 'fail'/'partial' AND the transcript
ended at this turn (no more assistant turns follow).
We then estimate:
T_F = assistant-turn index of first failure (starting at 1).
If the run succeeded (run_score >= 0.7), T_F is right-censored at the
final turn count N (i.e. survived the whole trajectory).
S(t) = P(T_F > t)
h(t) = P(T_F = t | T_F >= t)
Output per model:
- Median turn-to-failure
- Empirical survival curve S(t) for t = 1..20
- Hazard profile h(t)
- Stratified by task-constraint bucket (using C(q) from earlier)
Usage:
.venv/bin/python3 scripts/survival_analysis.py
This exposes long-horizon fragility that is easy to hide in flat mean scores.
"""
from __future__ import annotations
import glob
import argparse
import json
import re
from collections import defaultdict
import sys
from pathlib import Path
from statistics import median
import numpy as np
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
ROOT = Path(__file__).resolve().parent.parent
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
MODELS = {
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
}
from clawbench.dynamics_archive import load_task_runs_by_model
SUCCESS_THRESHOLD = 0.7
def assistant_turns(d: dict) -> list[dict]:
return [m for m in d.get("transcript", {}).get("messages", [])
if m.get("role") == "assistant"]
def assistant_turns(run) -> list:
return run.transcript.assistant_messages
def find_failure_turn(d: dict) -> tuple[int, bool]:
"""Return (T_F, is_event). T_F is 1-indexed turn of failure.
is_event=True means failure actually happened; False means the run was
censored (survived to end without failing).
"""
turns = assistant_turns(d)
def find_failure_turn(run) -> tuple[int, bool]:
"""Return (failure_turn, is_event) with 1-indexed assistant turns."""
turns = assistant_turns(run)
n = len(turns)
run_score = d.get("run_score", 0) or 0
delivery = d.get("delivery_outcome", "")
# Scan for first empty-turn
for i, t in enumerate(turns, 1):
has_text = bool((t.get("text") or "").strip())
has_tool_call = bool(t.get("tool_calls"))
for idx, turn in enumerate(turns, 1):
has_text = bool((turn.text or "").strip())
has_tool_call = bool(turn.tool_calls)
if not has_text and not has_tool_call:
return i, True # failure event
return idx, True
# If run was unsuccessful and ended early, mark last turn as failure
if run_score < SUCCESS_THRESHOLD and delivery in ("fail", "partial"):
if run.run_score < SUCCESS_THRESHOLD and run.delivery_outcome.value in {"fail", "partial"}:
return max(n, 1), True
# Survived: right-censored at n
return max(n, 1), False
def empirical_survival(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
"""Kaplan-Meier-like survival curve, non-parametric.
S(t) = fraction of runs that survived past turn t.
"""
survival = []
"""Empirical survival curve S(t) over assistant-turn index."""
total = len(times_events)
if total == 0:
return [0.0] * max_t
survival = []
for t in range(1, max_t + 1):
# Survived past t = either censored at ≥t or event at >t
survived = sum(1 for tf, is_event in times_events
if (not is_event and tf >= t) or (is_event and tf > t))
survival.append(survived / total if total > 0 else 0.0)
survived = sum(
1
for tf, is_event in times_events
if (not is_event and tf >= t) or (is_event and tf > t)
)
survival.append(survived / total)
return survival
def hazard(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
"""Hazard rate h(t) = events at t / at-risk at t."""
h = []
"""Discrete hazard h(t) = events_at_t / at_risk_at_t."""
hazard_vals = []
for t in range(1, max_t + 1):
at_risk = sum(1 for tf, _ in times_events if tf >= t)
events_at_t = sum(1 for tf, is_event in times_events
if is_event and tf == t)
h.append(events_at_t / at_risk if at_risk > 0 else 0.0)
return h
events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t)
hazard_vals.append(events_at_t / at_risk if at_risk > 0 else 0.0)
return hazard_vals
def main() -> None:
per_model: dict[str, list[tuple[int, bool]]] = defaultdict(list)
for label, (sub, _) in MODELS.items():
for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
try:
d = json.loads(Path(p).read_text())
except Exception:
continue
tf, is_event = find_failure_turn(d)
per_model[label].append((tf, is_event))
parser = argparse.ArgumentParser(description="Survival analysis on cached runs")
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
parser.add_argument("--max-turn", type=int, default=20)
args = parser.parse_args()
# Load C(q) to stratify
cq_path = ROOT / "reports" / "constraint_index.json"
cq_by_task = {}
if cq_path.exists():
cq = json.loads(cq_path.read_text())
cq_by_task = {t: v["C_q"] for t, v in cq.items()}
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
if not grouped:
raise SystemExit(f"No cached runs found under {args.archive_dir}")
# Print summary
print(f"{'Model':<14} {'n_runs':>6} {'events':>6} {'med_tf':>8} "
f"{'S(3)':>6} {'S(5)':>6} {'S(8)':>6} {'S(12)':>6} {'S(20)':>6}")
print("-" * 90)
out = {}
for label, (_sub, pretty) in MODELS.items():
evs = per_model[label]
n = len(evs)
n_events = sum(1 for _, e in evs if e)
tfs_events = [tf for tf, e in evs if e]
med = median(tfs_events) if tfs_events else float("inf")
surv = empirical_survival(evs, max_t=20)
haz = hazard(evs, max_t=20)
print(f"{pretty:<14} {n:>6} {n_events:>6} {med:>8.1f} "
f"{surv[2]:>6.2f} {surv[4]:>6.2f} {surv[7]:>6.2f} "
f"{surv[11]:>6.2f} {surv[19]:>6.2f}")
out[label] = {
"pretty": pretty,
"n_runs": n,
for model_name, task_runs in grouped.items():
events = []
for runs in task_runs.values():
for run in runs:
events.append(find_failure_turn(run))
n_runs = len(events)
n_events = sum(1 for _, is_event in events if is_event)
event_times = [t for t, is_event in events if is_event]
med = median(event_times) if event_times else float("inf")
out[model_name] = {
"pretty": model_name,
"n_runs": n_runs,
"n_events": n_events,
"median_fail_turn": med,
"survival": surv,
"hazard": haz,
"survival": empirical_survival(events, max_t=args.max_turn),
"hazard": hazard(events, max_t=args.max_turn),
}
print("\n(Interpretation: S(t) = fraction of runs still on-track past turn t.")
print(" Lower values = more frequent early failure.)")
out_path = ROOT / "reports" / "survival_analysis.json"
out_path.write_text(json.dumps(out, indent=2))
print(f"\nWrote: {out_path}")
args.reports_dir.mkdir(parents=True, exist_ok=True)
out_path = args.reports_dir / "survival_analysis.json"
out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
print(f"Wrote: {out_path}")
if __name__ == "__main__":
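
As a worked check of the definitions in this script, the sketch below re-implements the `empirical_survival` and `hazard` rules shown in the diff and applies them to four made-up runs. The `toy_runs` data is hypothetical; each pair is `(failure_turn, is_event)` as returned by `find_failure_turn`.

```python
from __future__ import annotations


def empirical_survival(times_events: list[tuple[int, bool]], max_t: int) -> list[float]:
    """S(t): fraction of runs counted as surviving past assistant turn t."""
    total = len(times_events)
    if total == 0:
        return [0.0] * max_t
    curve = []
    for t in range(1, max_t + 1):
        # Same rule as the script: censored runs count while tf >= t,
        # failed runs count while tf > t.
        survived = sum(
            1
            for tf, is_event in times_events
            if (not is_event and tf >= t) or (is_event and tf > t)
        )
        curve.append(survived / total)
    return curve


def hazard(times_events: list[tuple[int, bool]], max_t: int) -> list[float]:
    """Discrete hazard h(t) = events_at_t / at_risk_at_t."""
    vals = []
    for t in range(1, max_t + 1):
        at_risk = sum(1 for tf, _ in times_events if tf >= t)
        events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t)
        vals.append(events_at_t / at_risk if at_risk > 0 else 0.0)
    return vals


# Hypothetical runs: two failures (turns 2 and 4), two censored (turns 4 and 6).
toy_runs = [(2, True), (4, True), (4, False), (6, False)]
survival_curve = empirical_survival(toy_runs, max_t=6)
hazard_curve = hazard(toy_runs, max_t=6)
# survival_curve -> [1.0, 0.75, 0.75, 0.5, 0.25, 0.25]
```

Note that the run censored at turn 4 stops counting as a survivor from t = 5 onward, so this curve is a plain empirical fraction rather than the Kaplan-Meier product-limit estimate, and it tends to understate survival once short successful runs drop out.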


@@ -1,132 +1,118 @@
"""Decompose run_score variance into seed-noise vs capability-signal.
#!/usr/bin/env python3
"""Decompose posterior run_score variance into seed noise and capability signal.
Each task has 3 runs per model (same prompt, different random seed).
σ²_seed(task, model) = variance across the 3 runs of (task, model)
σ²_capability(task) = variance across model means for the task
Each task has repeated runs per model.
sigma^2_seed(task, model) = variance across repeated runs for one model
sigma^2_capability(task) = variance across model means for that task
Signal-to-noise ratio per task:
SNR(task) = σ²_capability / σ²_seed
High SNR -> differences between models on this task are REAL (not noise).
Low SNR -> the 3-run variance per model is so large that cross-model gaps
are indistinguishable from seed noise. These tasks don't
discriminate models reliably.
SNR(task) = sigma^2_capability / mean_model sigma^2_seed
Aggregated over all 40 tasks, we also decompose TOTAL variance:
total_var = mean_capability_var + mean_seed_var
capability_fraction = mean_capability_var / total_var
High SNR means cross-model differences are likely real. Low SNR means the
benchmark signal is dominated by run-to-run variance rather than capability.
This answers "what fraction of the benchmark signal is real model
capability vs. run-to-run luck?"
Aggregate decomposition:
Usage:
.venv/bin/python3 scripts/variance_decomp.py
total_var = mean_task seed_var + mean_task cap_var
capability_fraction = mean_task cap_var / total_var
This script keeps the posterior/archive-based workflow used by the current
pipeline, but the statistical meaning is the same as the earlier analysis.
"""
from __future__ import annotations
import glob
import argparse
import json
import re
import sys
from collections import defaultdict
from pathlib import Path
from statistics import mean, variance
import numpy as np
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
ROOT = Path(__file__).resolve().parent.parent
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
MODELS = {
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
}
from clawbench.dynamics_archive import load_task_runs_by_model
def main() -> None:
# {task: {model: [run_scores]}}
scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
for label, (sub, _) in MODELS.items():
for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
task = p.split("/")[-2]
try:
d = json.loads(Path(p).read_text())
except Exception:
continue
scores[task].setdefault(label, []).append(d.get("run_score", 0))
parser = argparse.ArgumentParser(description="Variance decomposition on cached runs")
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
args = parser.parse_args()
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
if not grouped:
raise SystemExit(f"No cached runs found under {args.archive_dir}")
# Collect repeated run scores as {task -> {model -> [run_scores]}}.
scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
for model_name, task_runs in grouped.items():
for task_id, runs in task_runs.items():
vals = [float(run.run_score) for run in runs]
if vals:
scores[task_id][model_name] = vals
# Per-task: seed var per model, cross-model var of means, SNR
task_stats = []
for task, per_model in scores.items():
# Only use models with at least 2 runs for the seed-variance estimate
for task_id, per_model in scores.items():
model_vars = []
model_means = []
for m, runs in per_model.items():
for runs in per_model.values():
if len(runs) >= 2:
model_vars.append(variance(runs))
if runs:
model_means.append(mean(runs))
if len(model_means) < 2 or not model_vars:
continue
mean_seed_var = mean(model_vars) # noise
cap_var = variance(model_means) # signal
# Mean within-model variance is the seed-noise term.
mean_seed_var = mean(model_vars) if model_vars else 0.0
# Variance of model means is the capability-signal term.
cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
snr = cap_var / (mean_seed_var + 1e-9)
task_stats.append({
"task": task,
"seed_var": mean_seed_var,
"cap_var": cap_var,
"snr": snr,
"n_models": len(model_means),
})
task_stats.append(
{
"task": task_id,
"seed_var": float(mean_seed_var),
"cap_var": float(cap_var),
"snr": float(snr),
"n_models": len(model_means),
"limited_model_diversity": len(model_means) < 2,
}
)
# Sort by SNR
task_stats.sort(key=lambda x: -x["snr"])
task_stats.sort(key=lambda row: row["snr"], reverse=True)
if not task_stats:
raise SystemExit("No task-level scores found in archive.")
print(f"{'Task':<38} {'seed_var':>9} {'cap_var':>9} {'SNR':>8}")
print("-" * 70)
for r in task_stats:
print(f"{r['task']:<38} {r['seed_var']:>9.4f} {r['cap_var']:>9.4f} "
f"{r['snr']:>8.2f}")
# Aggregate decomposition
total_seed = mean(r["seed_var"] for r in task_stats)
total_cap = mean(r["cap_var"] for r in task_stats)
# Aggregate over tasks to estimate how much of benchmark variance is real
# capability signal versus run-to-run noise.
total_seed = mean(row["seed_var"] for row in task_stats)
total_cap = mean(row["cap_var"] for row in task_stats)
total = total_seed + total_cap
cap_frac = total_cap / (total + 1e-9)
capability_fraction = total_cap / total if total > 1e-12 else 0.0
print("\n=== AGGREGATE VARIANCE DECOMPOSITION ===")
print(f" Mean seed variance (noise): {total_seed:.5f}")
print(f" Mean capability variance (signal): {total_cap:.5f}")
print(f" Capability fraction: {cap_frac:.1%}")
print(f" (= what % of run_score variance comes from real model differences)")
# Coarse SNR buckets help downstream reporting and task weighting.
high_snr = [row for row in task_stats if row["snr"] >= 5]
mid_snr = [row for row in task_stats if 1 <= row["snr"] < 5]
low_snr = [row for row in task_stats if row["snr"] < 1]
# Classify tasks by SNR tiers
high_snr = [r for r in task_stats if r["snr"] >= 5]
mid_snr = [r for r in task_stats if 1 <= r["snr"] < 5]
low_snr = [r for r in task_stats if r["snr"] < 1]
print(f"\n=== SNR TIERS ===")
print(f" High SNR (≥5): {len(high_snr)} tasks — differentiate models reliably")
print(f" Mid SNR (1-5): {len(mid_snr)} tasks — moderate signal")
print(f" Low SNR (<1): {len(low_snr)} tasks — seed noise ≥ capability signal")
print(f" (these tasks give random-ish results; weight down)")
# Write output
out_path = ROOT / "reports" / "variance_decomposition.json"
out_path.write_text(json.dumps({
out = {
"per_task": task_stats,
"aggregate": {
"mean_seed_var": total_seed,
"mean_cap_var": total_cap,
"capability_fraction": cap_frac,
"mean_seed_var": float(total_seed),
"mean_cap_var": float(total_cap),
"capability_fraction": float(capability_fraction),
"high_snr_tasks": len(high_snr),
"mid_snr_tasks": len(mid_snr),
"low_snr_tasks": len(low_snr),
},
}, indent=2))
print(f"\nWrote: {out_path}")
}
args.reports_dir.mkdir(parents=True, exist_ok=True)
out_path = args.reports_dir / "variance_decomposition.json"
out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
print(f"Wrote: {out_path}")
if __name__ == "__main__":
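
To make the decomposition concrete, here is a small worked example for a single task, following the same steps as the loop above: sample variance within each model, sample variance of the model means, then their ratio. The function name `decompose_task` and the scores are made up for illustration.

```python
from __future__ import annotations

from statistics import mean, variance


def decompose_task(per_model: dict[str, list[float]]) -> dict[str, float]:
    """Seed-noise vs capability-signal variance for one task (illustrative helper)."""
    # Within-model sample variance needs at least two repeated runs.
    model_vars = [variance(runs) for runs in per_model.values() if len(runs) >= 2]
    model_means = [mean(runs) for runs in per_model.values() if runs]
    seed_var = mean(model_vars) if model_vars else 0.0
    cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
    # Small epsilon mirrors the script's guard against a zero denominator.
    return {"seed_var": seed_var, "cap_var": cap_var, "snr": cap_var / (seed_var + 1e-9)}


# Hypothetical run scores: two models, three repeated runs each.
toy_scores = {
    "model_a": [0.9, 0.8, 1.0],
    "model_b": [0.2, 0.3, 0.1],
}
stats = decompose_task(toy_scores)
capability_fraction = stats["cap_var"] / (stats["seed_var"] + stats["cap_var"])
```

In this toy case each model has a within-model variance of about 0.01 and the model means (0.9 and 0.2) have a variance of about 0.245, so the SNR is roughly 24.5 and about 96% of the variance is attributed to capability. A task like this would land in the high-SNR bucket (SNR >= 5).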

tests/test_dynamics.py Normal file

@@ -0,0 +1,356 @@
"""Tests for clawbench.dynamics."""
from __future__ import annotations
import math
import numpy as np
import pytest
from clawbench.dynamics import (
TOOL_FAMILIES,
Dynamics,
Regime,
Sensitivity,
SurvivalPoint,
StratumStats,
StratifiedAssessment,
_classify_tool,
_cosine_dist,
_entropy,
_js_divergence,
_levenshtein,
build_strata,
compute_dynamics,
compute_sensitivity,
find_event_step,
kaplan_meier,
stratify_by_regime,
stratify_by_tier,
)
from clawbench.schemas import (
TokenUsage,
ToolCall,
Transcript,
TranscriptMessage,
TaskRunResult,
)
# ── helpers ──────────────────────────────────────────────────────────
def _msg(role, text="", family=None, success=True, error="", ts=0, tok=100):
tcs = []
if family:
tcs.append(ToolCall(
name=f"tool_{family}", family=family,
success=success, error=error, mutating=family == "edit",
))
return TranscriptMessage(
role=role, text=text, tool_calls=tcs, timestamp_ms=ts,
usage=TokenUsage(input_tokens=tok, output_tokens=tok // 2,
total_tokens=tok + tok // 2),
)
def _simple_transcript(families, errors=None):
if errors is None:
errors = [False] * len(families)
msgs = [_msg("user", "task")]
for i, (fam, err) in enumerate(zip(families, errors)):
msgs.append(_msg("assistant", f"step {i}", family=fam,
success=not err, error="err" if err else "",
ts=(i + 1) * 1000, tok=100 + i * 10))
return Transcript(messages=msgs)
def _run(transcript, score=0.5, task_id="t1"):
return TaskRunResult(
task_id=task_id, run_index=0, transcript=transcript,
run_score=score, duration_ms=10000,
token_usage=transcript.total_usage,
)
# ── _cosine_dist ─────────────────────────────────────────────────────
def test_cosine_dist_identical():
a = np.array([1.0, 0.0, 0.5])
assert _cosine_dist(a, a) == pytest.approx(0.0, abs=1e-9)
def test_cosine_dist_orthogonal():
assert _cosine_dist(np.array([1, 0, 0.0]), np.array([0, 1, 0.0])) == pytest.approx(1.0)
def test_cosine_dist_zero_vector():
assert _cosine_dist(np.zeros(3), np.array([1, 2, 3.0])) == 1.0
# ── _entropy ─────────────────────────────────────────────────────────
def test_entropy_uniform():
assert _entropy({"a": 10, "b": 10}) == pytest.approx(1.0)
def test_entropy_single():
assert _entropy({"a": 100}) == pytest.approx(0.0)
def test_entropy_empty():
assert _entropy({}) == 0.0
# ── _js_divergence ───────────────────────────────────────────────────
def test_jsd_identical():
d = {"a": 5, "b": 5}
assert _js_divergence(d, d) == pytest.approx(0.0, abs=1e-9)
def test_jsd_disjoint():
assert _js_divergence({"a": 10}, {"b": 10}) > 0.5
# ── _levenshtein ────────────────────────────────────────────────────
def test_levenshtein_equal():
assert _levenshtein([1, 2, 3], [1, 2, 3]) == 0
def test_levenshtein_empty():
assert _levenshtein([], [1, 2]) == 2
def test_levenshtein_different():
assert _levenshtein(["a", "b"], ["c", "d"]) == 2
# ── _classify_tool ──────────────────────────────────────────────────
@pytest.mark.parametrize("name,expected", [
("bash_execute", "execute"),
("file_read", "read"),
("tool_edit", "edit"),
("web_browser", "browser"),
("grep_search", "search"),
("write_file", "edit"),
("run_tests", "execute"),
])
def test_classify_tool(name, expected):
assert _classify_tool(name) == expected
# ── compute_dynamics ─────────────────────────────────────────────────
def test_dynamics_basic():
t = _simple_transcript(["read", "edit", "execute", "read", "edit"])
d = compute_dynamics(t)
assert d.n_steps == 5
assert len(d.drift) == 5
assert len(d.step_size) == 5
assert len(d.entropy_series) == 5
assert len(d.tool_sequence) == 5
assert d.tool_entropy > 0
def test_dynamics_empty():
t = Transcript(messages=[_msg("user", "hi")])
d = compute_dynamics(t)
assert d.n_steps == 0
assert d.regime == Regime.unknown
def test_dynamics_trapped():
t = _simple_transcript(["execute"] * 15, errors=[True] * 15)
d = compute_dynamics(t)
assert d.regime == Regime.trapped
assert d.error_rate > 0.5
def test_dynamics_convergent():
cycle = ["read", "search", "edit", "read", "execute"] * 6
t = _simple_transcript(cycle[:30])
d = compute_dynamics(t)
assert d.regime in (Regime.convergent, Regime.limit_cycle, Regime.diffusive, Regime.unknown)
assert d.error_rate == 0.0
def test_dynamics_markov_keys():
t = _simple_transcript(["read", "edit", "read"])
d = compute_dynamics(t)
assert "read" in d.markov
assert "edit" in d.markov["read"]
def test_dynamics_constraint_index_range():
t = _simple_transcript(["read", "edit", "search", "execute", "browser", "memory"] * 3)
d = compute_dynamics(t)
assert 0 <= d.constraint_index <= 1
def test_dynamics_memory_depth():
t = _simple_transcript(["read", "edit", "read", "edit", "read", "edit"] * 3)
d = compute_dynamics(t)
assert d.memory_depth >= 0
def test_dynamics_normalizes_unknown_tool_family():
transcript = Transcript(
messages=[
_msg("user", "task"),
TranscriptMessage(
role="assistant",
text="searching",
tool_calls=[
ToolCall(
name="grep_search",
family="unknown",
success=True,
error="",
mutating=False,
)
],
timestamp_ms=1000,
usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
),
_msg("assistant", "next", family="read", ts=2000),
_msg("assistant", "done", family="edit", ts=3000),
]
)
dynamics = compute_dynamics(transcript)
assert dynamics.tool_sequence[0] == "search"
assert "search" in dynamics.markov
# ── compute_sensitivity ──────────────────────────────────────────────
def test_sensitivity_identical_runs():
t = _simple_transcript(["read", "edit", "execute"])
ra = _run(t, score=0.8)
rb = _run(t, score=0.8)
s = compute_sensitivity(ra, rb)
assert s.score_delta == pytest.approx(0.0)
assert s.tool_edit_distance == 0
def test_sensitivity_different_runs():
ta = _simple_transcript(["read", "edit", "execute"])
tb = _simple_transcript(["search", "browser", "memory"])
ra = _run(ta, score=0.9)
rb = _run(tb, score=0.3)
s = compute_sensitivity(ra, rb)
assert s.score_delta == pytest.approx(0.6)
assert s.tool_edit_distance > 0
assert s.family_js_divergence > 0
# ── kaplan_meier ─────────────────────────────────────────────────────
def test_km_basic():
pts = kaplan_meier([1, 2, 3])
assert pts[0].time == 0.0
assert pts[0].survival == 1.0
assert pts[-1].survival == pytest.approx(0.0)
def test_km_with_censoring():
pts = kaplan_meier([1, 5, 3], censored=[False, True, False])
assert len(pts) == 3
assert pts[-1].survival > 0
def test_km_empty():
assert kaplan_meier([]) == []
# ── find_event_step ──────────────────────────────────────────────────
def test_find_first_correct_write():
t = _simple_transcript(["read", "search", "edit", "execute"])
assert find_event_step(t, "first_correct_write") == 2.0
def test_find_first_error_recovery():
t = _simple_transcript(
["read", "execute", "read"],
errors=[False, True, False],
)
assert find_event_step(t, "first_error_recovery") == 2.0
def test_find_task_completion():
t = _simple_transcript(["read", "edit"])
assert find_event_step(t, "task_completion") == 1.0
def test_find_event_none():
t = _simple_transcript(["read", "read"])
assert find_event_step(t, "first_correct_write") is None
# ── build_strata + reweight ──────────────────────────────────────────
def test_build_strata_by_tier():
runs, dyns, scores = [], [], []
for tid, sc in [("t1-a", 0.8), ("t1-b", 0.6), ("t2-a", 0.4), ("t2-b", 0.3)]:
t = _simple_transcript(["read", "edit", "execute"])
r = _run(t, score=sc, task_id=tid)
runs.append(r)
dyns.append(compute_dynamics(t))
scores.append(sc)
sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
assert sa.total_runs == 4
names = sa.stratum_names()
assert "tier1" in names
assert "tier2" in names
for s in sa.strata:
assert s.n_runs == 2
assert s.weight == pytest.approx(0.5)
def test_reweight_shifts_mean():
runs, dyns, scores = [], [], []
for tid, sc in [("t1-a", 0.9), ("t1-b", 0.8), ("t2-a", 0.2), ("t2-b", 0.1)]:
t = _simple_transcript(["read", "edit", "execute"])
r = _run(t, score=sc, task_id=tid)
runs.append(r)
dyns.append(compute_dynamics(t))
scores.append(sc)
sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
# Reweight towards tier1 (high scores)
high = sa.reweight({"tier1": 0.9, "tier2": 0.1})
# Reweight towards tier2 (low scores)
low = sa.reweight({"tier1": 0.1, "tier2": 0.9})
assert high["score_mean"] > low["score_mean"]
def test_reweight_unknown_stratum():
runs, dyns, scores = [], [], []
t = _simple_transcript(["read", "edit"])
r = _run(t, score=0.5, task_id="t1-x")
runs.append(r)
dyns.append(compute_dynamics(t))
scores.append(0.5)
sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
# Reweight with a stratum that doesn't exist — should fall back
result = sa.reweight({"nonexistent": 1.0})
assert "score_mean" in result


@@ -0,0 +1,115 @@
"""Tests for offline dynamics archive helpers."""
from __future__ import annotations
import json
from pathlib import Path
from clawbench.dynamics_archive import build_dynamics_report, load_task_runs_archive, safe_model_name, write_dynamics_report
from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
tool_calls = []
if family is not None:
tool_calls.append(
ToolCall(
name=f"tool_{family}",
family=family,
success=True,
error="",
mutating=family == "edit",
)
)
return TranscriptMessage(
role=role,
text=text,
tool_calls=tool_calls,
timestamp_ms=ts,
usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
)
def _run(task_id: str, score: float = 0.5, run_index: int = 0) -> TaskRunResult:
transcript = Transcript(
messages=[
_msg("user", f"Solve {task_id}"),
_msg("assistant", "inspect", family="read", ts=1000),
_msg("assistant", "edit", family="edit", ts=2000),
_msg("assistant", "verify", family="execute", ts=3000),
]
)
return TaskRunResult(
task_id=task_id,
run_index=run_index,
transcript=transcript,
run_score=score,
duration_ms=3000,
token_usage=transcript.total_usage,
)
def test_load_task_runs_archive_filters_model_and_tier(tmp_path: Path):
model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b")
other_dir = tmp_path / safe_model_name("openai/gpt-5.4")
for root, task_id in ((model_dir, "t1-demo-task"), (other_dir, "t2-other-task")):
task_dir = root / task_id
task_dir.mkdir(parents=True)
run = _run(task_id)
(task_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
loaded = load_task_runs_archive(
archive_dir=tmp_path,
model="ollama/gpt-oss:20b",
tier="tier1",
)
assert list(loaded) == ["t1-demo-task"]
assert loaded["t1-demo-task"][0].task_id == "t1-demo-task"
def test_write_dynamics_report_creates_report_without_plots(tmp_path: Path):
task_runs = {
"t1-demo-task": [_run("t1-demo-task", score=0.8)],
"t2-demo-task": [_run("t2-demo-task", score=0.4)],
}
report_path, plots = write_dynamics_report(task_runs, tmp_path, generate_plots=False)
assert report_path.exists()
assert report_path.name == "dynamics.json"
assert plots == []
report = json.loads(report_path.read_text(encoding="utf-8"))
assert "sensitivity" in report
assert report["sensitivity"]["same_task"]["n_pairs"] == 0
def test_build_dynamics_report_includes_pairwise_sensitivity():
task_runs = {
"t1-demo-task": [
_run("t1-demo-task", score=0.8, run_index=0),
TaskRunResult(
task_id="t1-demo-task",
run_index=1,
transcript=Transcript(
messages=[
_msg("user", "Solve t1-demo-task"),
_msg("assistant", "inspect", family="search", ts=1000),
_msg("assistant", "edit", family="edit", ts=2000),
_msg("assistant", "verify", family="execute", ts=3000),
]
),
run_score=0.5,
duration_ms=3000,
token_usage=TokenUsage(input_tokens=30, output_tokens=15, total_tokens=45),
),
]
}
report, _plotter, _plot_data = build_dynamics_report(task_runs, include_pca=False)
same_task = report["sensitivity"]["same_task"]
assert same_task["n_pairs"] == 1
assert "t1-demo-task" in same_task["per_task"]
assert same_task["per_task"]["t1-demo-task"]["mean_score_delta"] > 0


@@ -0,0 +1,76 @@
from pathlib import Path
from click.testing import CliRunner
from clawbench.cli import cli
from clawbench.dynamics_archive import safe_model_name
from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
tool_calls = []
if family is not None:
tool_calls.append(
ToolCall(
name=f"tool_{family}",
family=family,
success=True,
error="",
mutating=family == "edit",
)
)
return TranscriptMessage(
role=role,
text=text,
tool_calls=tool_calls,
timestamp_ms=ts,
usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
)
def _run(task_id: str, run_index: int = 0) -> TaskRunResult:
transcript = Transcript(
messages=[
_msg("user", f"Solve {task_id}"),
_msg("assistant", "inspect", family="read", ts=1000),
_msg("assistant", "edit", family="edit", ts=2000),
_msg("assistant", "verify", family="execute", ts=3000),
]
)
return TaskRunResult(
task_id=task_id,
run_index=run_index,
transcript=transcript,
run_score=0.8,
duration_ms=3000,
token_usage=transcript.total_usage,
)
def test_dynamics_report_cli_supports_no_plots(tmp_path: Path):
model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b") / "t1-demo-task"
model_dir.mkdir(parents=True)
run = _run("t1-demo-task")
(model_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
runner = CliRunner()
output_dir = tmp_path / "out"
result = runner.invoke(
cli,
[
"dynamics-report",
"--archive-dir",
str(tmp_path),
"--model",
"ollama/gpt-oss:20b",
"--output-dir",
str(output_dir),
"--no-plots",
],
)
assert result.exit_code == 0, result.output
assert "Loaded 1 cached runs across 1 tasks" in result.output
assert "Saved 0 plots" in result.output
assert (output_dir / "dynamics.json").exists()
assert list(output_dir.glob("*.png")) == []


@@ -0,0 +1,44 @@
from clawbench.submission_models import (
CUSTOM_PRESET_LABEL,
PRESET_AUDIENCE_BUDGET,
PRESET_AUDIENCE_CLAW,
infer_provider,
preset_labels_for_audience,
resolve_model_selection,
)
def test_budget_audience_keeps_budget_friendly_presets():
labels = preset_labels_for_audience(PRESET_AUDIENCE_BUDGET)
assert "GPT-OSS 20B (Ollama)" in labels
assert "Qwen 3.5 27B (Ollama)" in labels
assert "Claude Opus 4.6" not in labels
def test_claw_audience_keeps_full_catalog():
labels = preset_labels_for_audience(PRESET_AUDIENCE_CLAW)
assert "GPT-OSS 20B (Ollama)" in labels
assert "Claude Opus 4.6" in labels
def test_resolve_model_selection_prefers_preset_provider():
model_id, provider = resolve_model_selection("", "GPT-OSS 20B (Ollama)")
assert model_id == "ollama/gpt-oss:20b"
assert provider == "ollama"
def test_resolve_model_selection_infers_custom_provider():
model_id, provider = resolve_model_selection(
"huggingface/Qwen/Qwen3-32B",
CUSTOM_PRESET_LABEL,
)
assert model_id == "huggingface/Qwen/Qwen3-32B"
assert provider == "huggingface"
def test_infer_provider_requires_provider_prefix():
assert infer_provider("qwen3.5:27b") == ""