Add archive dynamics pipeline and audience-based model presets

2026-04-21 20:24:41 -07:00 · c209612d46
commit c209612d46
parent 5b50814dfc
21 changed files with 3446 additions and 928 deletions
--- a/README.md
+++ b/README.md
@ -104,20 +104,56 @@ Core v1 drops the noisy tasks and reports variance decomposition alongside ranki

 Inspired by *"When LLMs Are Dreaming, Where Do They Go?"* — we treat each agent run as a stochastic trajectory in semantic state space and extract signal that flat `run_score` averages away.

-| Diagnostic | Formula / Method | Reveals |
-|---|---|---|
-| **Constraint Index C(q)** | `-z(PR) - z(entropy) + z(BOPS)` over response embeddings | Which tasks converge to one answer vs diverge openly |
-| **Regime classification** | Trajectory drift / recurrence / support-volume thresholds | Per-run dynamical signature (trapped / limit-cycle / diffusive) |
-| **Survival analysis** | `S(t) = P(T_F > t)` where T_F = first empty assistant turn | Per-turn failure rates; long-horizon capability |
-| **SNR-weighted ranking** | `w(task) = SNR × |C(q)|`, winsorized at p95 | Headline metric that weights tasks by their signal density |
-| **Variance decomposition** | `Var(score) = Var_seeds + Var_models` per task | Separate capability signal from coin-flip noise |
+Current code-path formulas:
+
+```text
+Per assistant step t:
+x_t = [tool_family_proportions(6), error_flag, normalized_tokens, normalized_text_len, progress]
+drift_t = cosine_distance(x_0, x_t)
+step_t = cosine_distance(x_{t-1}, x_t)
+
+Task-level Constraint Index:
+PR(q) = tr(Σ_q)^2 / tr(Σ_q^2)
+H(q) = -Σ_i p_i log2 p_i,   p_i = λ_i / Σ_j λ_j,   λ = eigvals(Σ_q)
+BOPS(q) = mean_m mean_{i<j} cos(v_{q,m,i}, v_{q,m,j})
+C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))
+
+Per-run constraint index used inside the regime classifier:
+PR_run = 1 / Σ_i p_i^2
+constraint_index_run = 1 - (PR_run - 1) / (d - 1)
+
+Variance decomposition:
+seed_var(q) = mean_m Var(run_score_{q,m,*})
+cap_var(q) = Var_m Mean(run_score_{q,m,*})
+SNR(q) = cap_var(q) / (seed_var(q) + 1e-9)
+capability_fraction = mean_q cap_var(q) / (mean_q cap_var(q) + mean_q seed_var(q))
+
+Survival:
+T_F = first assistant turn with empty text and no tool calls,
+      else final assistant turn if run_score < 0.7 and delivery_outcome in {fail, partial}
+S(t) = P(T_F > t)
+h(t) = P(T_F = t | T_F >= t)
+```
+
+Implemented regime classifier in `clawbench/dynamics.py`:
+
+```text
+trapped      if H_tools < 0.5 or (error_rate > 0.6 and std(drift) < 0.05)
+convergent   if std(drift_last_quartile) < 0.1 and mean(step_last_quartile) < 0.15 and error_rate < 0.2
+diffusive    if H_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8
+chaotic      if H_tools > 2.0 and var(step[1:]) > 0.02
+limit_cycle  if max autocorr(centered step[1:], lags 2..5) > 0.3
+unknown      otherwise, or <3 assistant turns
+```
+
+The task-level `C(q)` uses a normalized bag-of-words response vector built from the full assistant trajectory text plus tool-call names and compacted inputs, not just the last assistant turn.
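
The Constraint Index formulas above can be sketched in plain numpy. This is a minimal illustration, not the code in `clawbench/dynamics.py` or `scripts/compute_constraint_index.py`: the function names are hypothetical, `cov` is assumed to be the per-task covariance of response vectors, and `p` in the per-run index is assumed to be a probability vector of length `d`.

```python
import numpy as np


def participation_ratio(cov: np.ndarray) -> float:
    """PR(q) = tr(S)^2 / tr(S^2): effective number of dimensions in use."""
    tr = float(np.trace(cov))
    tr_sq = float(np.trace(cov @ cov))
    return tr * tr / tr_sq if tr_sq > 0 else 0.0


def spectral_entropy(cov: np.ndarray) -> float:
    """H(q) in bits over the normalized eigenvalue spectrum of S."""
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    total = eig.sum()
    if total <= 0:
        return 0.0
    p = eig[eig > 0] / total
    return float(-(p * np.log2(p)).sum())


def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize across tasks; a constant input maps to all zeros."""
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x, dtype=float)


def constraint_index(pr: np.ndarray, h: np.ndarray, bops: np.ndarray) -> np.ndarray:
    """C(q) = -z(PR) - z(H) + z(BOPS), one value per task."""
    return -zscore(pr) - zscore(h) + zscore(bops)


def constraint_index_run(p: np.ndarray) -> float:
    """Per-run index: 1 when mass sits on one component, 0 when uniform."""
    d = len(p)
    if d < 2:
        return 1.0
    pr_run = 1.0 / float(np.sum(p ** 2))
    return 1.0 - (pr_run - 1.0) / (d - 1)
```

Under these assumptions an isotropic covariance gives the maximum PR and entropy (low constraint), while a rank-one covariance gives PR = 1 and zero entropy (high constraint).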

 From the v4-19 sweep data:
 - **Gemini 3.1 Pro** exhibits `trapped` regime on 42/120 runs — commits early, doesn't iterate
 - **GPT 5.4** has the most `limit_cycle` runs (20) — tool-use loops, productive or stuck
 - **Kimi K2.5** dies at median turn 3 (worst survival); **GPT 5.4** survives to turn 8 at 60% rate (best)

-All scripts under `scripts/` — pure numpy + scipy, no torch / sentence-transformers required, runs on any archive dir.
+All scripts under `scripts/` run on cached per-run JSONs with plain numpy-based tooling; no torch or sentence-transformers required.

 ### 4. We ablate configurations, not just models

@ -264,9 +300,12 @@ The `1/y_i^2` term means the worst score dominates. A configuration scoring 0.85
 Flat-mean compresses frontier model gaps. An alternative that weights tasks by their signal density:

 ```
-weight(task) = max(0, SNR(task)) × |C(q)(task)|            # unbounded
-weight_winsorized(task) = min(weight(task), p95)            # prevent single-task dominance
-score(model) = Σ weight × mean_run_score / Σ weight
+w_q = max(0, SNR(q)) × |C(q)|
+w_q^wins = min(w_q, p95({w_q}))
+
+flat_score(model) = mean_q mean_run_score(model, q) over covered tasks
+weighted_score(model) = Σ_q w_q mean_run_score(model, q) / Σ_q w_q
+winsorized_score(model) = Σ_q w_q^wins mean_run_score(model, q) / Σ_q w_q^wins
 ```

 Under SNR × |C(q)| winsorized on the same 1,080-run archive, **Opus 4.7 ranks #1** (instead of Opus 4.6 under flat mean) and **GPT 5.4 drops from #3 to #7** — its task-specific cliffs (0.16 on `t3-feature-export`) fall on the highest-signal tasks. This exposes what the flat mean averages away.
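
As a rough sketch of how the weights above could be computed, the following combines the variance decomposition with the winsorized weighting. It is an illustration only: the function names are hypothetical, variances are assumed to be population variances (`ddof=0`), and `p95` is assumed to use numpy's default linear-interpolation percentile, which may differ from `scripts/snr_weighted_ranking.py`.

```python
import numpy as np


def variance_decomposition(scores: np.ndarray) -> tuple[float, float, float]:
    """scores has shape (n_models, n_seeds) for a single task q."""
    seed_var = float(np.mean(np.var(scores, axis=1)))  # mean_m Var over seeds
    cap_var = float(np.var(np.mean(scores, axis=1)))   # Var_m of per-model means
    snr = cap_var / (seed_var + 1e-9)
    return seed_var, cap_var, snr


def task_weights(snr: np.ndarray, c_q: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (w_q, w_q^wins) for all tasks."""
    w = np.maximum(0.0, snr) * np.abs(c_q)
    cap = np.percentile(w, 95)  # p95 over the task weights
    return w, np.minimum(w, cap)


def weighted_score(mean_scores: np.ndarray, weights: np.ndarray) -> float:
    """Weighted mean of per-task mean_run_score for one model."""
    total = float(weights.sum())
    if total <= 0:
        return float("nan")
    return float((weights * mean_scores).sum() / total)
```

Winsorizing only clips the top of the weight distribution, so a single task with a very large SNR cannot dominate the weighted score on its own.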
@ -349,27 +388,48 @@ clawbench run \
  -o results/opus46_core_v1.json
 ```

-### Analyze an archive with the diagnostic suite
+### Analyze a real archive

 ```bash
-# 1. Aggregate coverage + fair-comparison audit
+# Fair-comparison audit
 python3 scripts/audit_runs.py
-
-# 2. Rejudge any judge-infrastructure failures via direct Anthropic API
-python3 scripts/rejudge_all.py \
-  --drift-dir data/drift_2026-04-19-full \
-  --archive-dir data/run_cache_archive/v2026-4-19-full
-
-# 3. Generate the fair comparison report
 python3 scripts/generate_fair_report.py --tag v2026-4-19-full

-# 4. Dynamical-systems diagnostics (C(q), regimes, survival, SNR-weighted)
-.venv/bin/python3 scripts/compute_constraint_index.py
-.venv/bin/python3 scripts/classify_regimes.py
-.venv/bin/python3 scripts/variance_decomp.py
-.venv/bin/python3 scripts/survival_analysis.py
-.venv/bin/python3 scripts/snr_weighted_ranking.py
-.venv/bin/python3 scripts/generate_dynamical_report.py
+# Posterior dynamics + ranking from cached per-run JSONs
+python3 scripts/run_posterior_dynamics_pipeline.py \
+  --archive-dir .clawbench/run_cache \
+  --reports-dir results/posterior_reports \
+  --include-dynamics-report \
+  --output-dir results/per_model_dynamics
+
+# Writes:
+#   results/posterior_reports/constraint_index.json
+#   results/posterior_reports/regimes.json
+#   results/posterior_reports/variance_decomposition.json
+#   results/posterior_reports/survival_analysis.json
+#   results/posterior_reports/snr_weighted_ranking.json
+#   results/posterior_reports/EVAL_REPORT_DYNAMICAL.md
+#   results/per_model_dynamics/<safe_model_name>/dynamics.json
+#   results/per_model_dynamics/<safe_model_name>/*.png
+```
+
+If you only want one model's offline dynamics bundle:
+
+```bash
+clawbench dynamics-report \
+  --archive-dir .clawbench/run_cache \
+  --model ollama/gpt-oss:20b \
+  --output-dir results/gptoss_dynamics
+
+# Quick CI path: skip plot rendering
+clawbench dynamics-report \
+  --archive-dir .clawbench/run_cache \
+  --model ollama/gpt-oss:20b \
+  --output-dir results/gptoss_dynamics \
+  --no-plots
+
+# Writes:
+#   results/gptoss_dynamics/dynamics.json
 ```

 ### Running locally with small models (Ollama)
@ -379,7 +439,24 @@ A single consumer GPU running an open-weight model is enough to develop plugin p
 ```bash
 ollama pull gpt-oss:20b
 export OPENCLAW_GATEWAY_TOKEN=<your-gateway-token>
-clawbench run --model ollama/gpt-oss:20b --task t1-fs-quick-note --runs 1
+export CLAWBENCH_RUN_CACHE_DIR=$PWD/.clawbench/run_cache
+
+# Real benchmark run + immediate per-run dynamics bundle
+clawbench run \
+  --model ollama/gpt-oss:20b \
+  --task t1-fs-quick-note \
+  --runs 1 \
+  --dynamics \
+  -o results/ollama_smoke.json
+
+# Optional second local model
+ollama pull qwen3.5:27b
+
+# Offline posterior analysis reads CLAWBENCH_RUN_CACHE_DIR
+python3 scripts/run_posterior_dynamics_pipeline.py \
+  --archive-dir .clawbench/run_cache \
+  --reports-dir results/posterior_reports
+
 clawbench diagnose profiles/local_ollama_gpt_oss.yaml
 ```

@ -415,6 +492,9 @@ clawbench/
 │   ├── profile.py                  # v0.5 plugin fingerprinting
 │   ├── diagnostic.py               # Configuration Diagnostic report
 │   ├── factor_analysis.py          # fANOVA factor importance
+│   ├── dynamics.py                 # Trajectory metrics + sensitivity analysis
+│   ├── dynamics_archive.py         # Cached-run loading + offline report assembly
+│   ├── dynamics_plots.py           # Offline dynamics visualizations
 │   └── cli.py                      # CLI entry points
 │
 ├── tasks-public/                   # Core v1 PUBLIC release (19 tasks)
@ -431,6 +511,7 @@ clawbench/
 │   ├── audit_per_run.py            # Per-run cross-model audit
 │   ├── rejudge_all.py              # Direct-API rejudge for broken gateway judges
 │   ├── generate_fair_report.py     # Fair N-model comparison report
+│   ├── run_posterior_dynamics_pipeline.py # One-shot posterior analysis driver
 │   ├── compute_constraint_index.py # C(q) per task
 │   ├── classify_regimes.py         # Per-run dynamical regime classifier
 │   ├── variance_decomp.py          # Seed-noise vs capability-signal decomposition
@ -439,7 +520,7 @@ clawbench/
 │   └── generate_dynamical_report.py # Combined dynamical-systems report
 │
 ├── profiles/                       # v0.5 plugin profile YAMLs
-├── tests/                          # 107 tests
+├── tests/                          # Test suite
 ├── Dockerfile                      # Layered on ghcr.io/openclaw/openclaw:latest
 ├── CLAWBENCH_V0_4_SPEC.md          # Full specification
 └── PARTNER_TRACE_SPEC.md           # Trace interchange format
@ -469,7 +550,7 @@ clawbench/
 ## Testing

 ```bash
-python -m pytest -q     # 107 tests
+python -m pytest -q
 ```

 Key test invariants:
--- a/SPACE_README.md
+++ b/SPACE_README.md
@ -136,6 +136,15 @@ submission

 Important rule: browser tasks stay serialized on one dedicated lane to avoid Chromium and port-range collisions.

+## Submission presets
+
+The Submit tab now exposes two preset audiences so the Space can serve both general Claw users and lower-budget exploratory runs:
+
+- `Claw Users` keeps the full preset catalog, including provider-backed frontier models.
+- `Budget Researchers` narrows the list to local or lower-cost presets such as `ollama/gpt-oss:20b`, `ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and `huggingface/google/gemma-4-26B-A4B-it`.
+
+You can still enter any custom model ID directly; the preset audience only filters the shortcut catalog and the bulk-submit action.
+
 ## Task inventory

 | Task | Tier | Family | Main verification |
--- a/app.py
+++ b/app.py
@ -26,6 +26,15 @@ from clawbench.hub import (
    load_submission_rows_from_parquet,
    resolve_dataset_repo,
 )
+from clawbench.submission_models import (
+    CUSTOM_PRESET_LABEL,
+    PRESET_AUDIENCE_ALL,
+    PRESET_AUDIENCE_CHOICES,
+    PRESET_MODEL_MAP,
+    preset_labels_for_audience,
+    preset_models_for_audience,
+    resolve_model_selection,
+)

 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
 logger = logging.getLogger("clawbench.app")
@ -51,31 +60,6 @@ def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int:
 DEFAULT_RUNS_PER_TASK = _env_int("CLAWBENCH_DEFAULT_RUNS_PER_TASK", 3, minimum=1, maximum=10)
 DEFAULT_PARALLEL_LANES = _env_int("CLAWBENCH_DEFAULT_PARALLEL_LANES", 1, minimum=1, maximum=4)

-# ---------------------------------------------------------------------------
-# Preset models for quick submission
-# ---------------------------------------------------------------------------
-
-PRESET_MODELS = {
-    # All models verified working on HF Inference API (free with HF_TOKEN)
-    # Tested 2026-04-07 via router.huggingface.co/v1/chat/completions
-    #
-    # --- Chinese open-source ---
-    "GLM 5.1 (754B MoE)": "huggingface/zai-org/GLM-5.1",
-    "GLM 5 (400B MoE)": "huggingface/zai-org/GLM-5",
-    "Qwen3 32B": "huggingface/Qwen/Qwen3-32B",
-    "DeepSeek R1": "huggingface/deepseek-ai/DeepSeek-R1",
-    "Kimi K2 Instruct": "huggingface/moonshotai/Kimi-K2-Instruct",
-    "MiniMax M2.5": "huggingface/MiniMaxAI/MiniMax-M2.5",
-    # --- Google open-source ---
-    "Gemma 4 26B MoE": "huggingface/google/gemma-4-26B-A4B-it",
-    # --- Meta open-source ---
-    "Llama 3.3 70B": "huggingface/meta-llama/Llama-3.3-70B-Instruct",
-    "Llama 3.1 70B": "huggingface/meta-llama/Llama-3.1-70B-Instruct",
-    # --- Proprietary models (require runtime auth configured for the model provider) ---
-    "Claude Sonnet 4.6": "anthropic/claude-sonnet-4-6",
-    "Claude Opus 4.6": "anthropic/claude-opus-4-6",
-}
-
 # ---------------------------------------------------------------------------
 # Background worker (starts in a thread)
 # ---------------------------------------------------------------------------
@ -271,15 +255,14 @@ def submit_model(
    prompt_variant: str,
    submitter: str,
 ) -> str:
-    # Use preset if selected, otherwise use custom model ID
-    model_id = PRESET_MODELS.get(preset, "") or model.strip()
+    model_id, provider_id = resolve_model_selection(model, preset, provider)
    if not model_id:
        return "Please enter a model ID or select a preset."

    selected_tier = tier if tier != "all" else None
    request = SubmissionRequest(
        model=model_id,
-        provider=provider.strip(),
+        provider=provider_id,
        judge_model=judge_model.strip(),
        runs_per_task=int(runs),
        max_parallel_lanes=int(max_parallel_lanes),
@ -292,20 +275,38 @@ def submit_model(
    return f"Submitted [{model_id}]! Job ID: {job.job_id}. Check the Queue tab."


-def submit_all_presets(runs: int, max_parallel_lanes: int, submitter: str) -> str:
-    """Submit all preset models at once."""
+def submit_all_presets(
+    preset_audience: str,
+    runs: int,
+    max_parallel_lanes: int,
+    submitter: str,
+) -> str:
+    """Submit all preset models from the selected audience track."""
+    presets = preset_models_for_audience(preset_audience)
+    if not presets:
+        return f"No presets configured for {preset_audience}."
+
    submitted = []
-    for name, model_id in PRESET_MODELS.items():
+    for preset in presets:
        request = SubmissionRequest(
-            model=model_id,
-            provider="",
+            model=preset.model_id,
+            provider=preset.provider,
            runs_per_task=int(runs),
            max_parallel_lanes=int(max_parallel_lanes),
            submitter=submitter.strip(),
        )
        job = asyncio.run(queue.submit(request))
-        submitted.append(f"{name} ({job.job_id})")
-    return f"Submitted {len(submitted)} models:\n" + "\n".join(f"  - {s}" for s in submitted)
+        submitted.append(f"{preset.label} ({job.job_id})")
+    return f"Submitted {len(submitted)} models from {preset_audience}:\n" + "\n".join(
+        f"  - {item}" for item in submitted
+    )
+
+
+def update_preset_choices(preset_audience: str):
+    return gr.update(
+        choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(preset_audience),
+        value=CUSTOM_PRESET_LABEL,
+    )


 # ---------------------------------------------------------------------------
@ -952,7 +953,7 @@ STAT_JUDGE = (
 )
 STAT_PRESETS = (
    '<div class="stat-pill"><div class="label">Presets</div><div class="value teal">'
-    + str(len(PRESET_MODELS))
+    + str(len(PRESET_MODEL_MAP))
    + "</div></div>"
 )

@ -986,12 +987,28 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
            "run via HuggingFace Inference API. You can also use locally hosted models "
            "(for example Ollama) when your OpenClaw runtime has them configured."
        )
+        gr.Markdown(
+            "Use `Preset Audience` to switch between the full Claw catalog and a smaller budget track. "
+            "The budget track keeps local and lower-cost options upfront, including `ollama/gpt-oss:20b`, "
+            "`ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and "
+            "`huggingface/google/gemma-4-26B-A4B-it`."
+        )

+        preset_audience_input = gr.Dropdown(
+            choices=list(PRESET_AUDIENCE_CHOICES),
+            value=PRESET_AUDIENCE_ALL,
+            label="Preset Audience",
+        )
        preset_input = gr.Dropdown(
-            choices=["(custom)"] + list(PRESET_MODELS.keys()),
-            value="(custom)",
+            choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(PRESET_AUDIENCE_ALL),
+            value=CUSTOM_PRESET_LABEL,
            label="Preset models",
        )
+        preset_audience_input.change(
+            fn=update_preset_choices,
+            inputs=preset_audience_input,
+            outputs=preset_input,
+        )
        with gr.Row():
            model_input = gr.Textbox(
                label="Custom Model ID (if not using preset)",
@ -1074,26 +1091,35 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
        )
        submit_all_btn.click(
            fn=submit_all_presets,
-            inputs=[runs_input, max_parallel_lanes_input, submitter_input],
+            inputs=[preset_audience_input, runs_input, max_parallel_lanes_input, submitter_input],
            outputs=submit_output,
        )

        gr.Markdown("""
-**All presets verified working on HF Inference API (free):**
+**Preset audiences:**

-| Model | Provider | Size | Runtime |
-|-------|----------|------|---------|
-| GLM 5.1 | Z.ai | 754B MoE | HF free |
-| GLM 5 | Z.ai | 400B MoE | HF free |
-| Qwen3 32B | Alibaba | 32B | HF free |
-| DeepSeek R1 | DeepSeek | 671B MoE | HF free |
-| Kimi K2 Instruct | Moonshot AI | MoE | HF free |
-| MiniMax M2.5 | MiniMax | MoE | HF free |
-| Gemma 4 26B MoE | Google | 26B MoE | HF free |
-| Llama 3.3 70B | Meta | 70B | HF free |
-| Llama 3.1 70B | Meta | 70B | HF free |
-| Claude Sonnet 4.6 | Anthropic | - | configured auth |
-| Claude Opus 4.6 | Anthropic | - | configured auth |
+| Audience | What it optimizes for | Presets |
+|---|---|---|
+| Claw Users | Full preset catalog, including provider-backed frontier options | Anthropic, HF open-weight, and Ollama presets |
+| Budget Researchers | Smaller local/free-friendly track | GPT-OSS 20B, Qwen 3.5 27B, Qwen3 32B, Gemma 4 26B |
+
+**Current preset catalog:**
+
+| Model | Provider | Audience |
+|---|---|---|
+| GPT-OSS 20B (Ollama) | Ollama | Claw Users, Budget Researchers |
+| Qwen 3.5 27B (Ollama) | Ollama | Claw Users, Budget Researchers |
+| Qwen3 32B | HuggingFace | Claw Users, Budget Researchers |
+| Gemma 4 26B MoE | HuggingFace | Claw Users, Budget Researchers |
+| GLM 5.1 | HuggingFace | Claw Users |
+| GLM 5 | HuggingFace | Claw Users |
+| DeepSeek R1 | HuggingFace | Claw Users |
+| Kimi K2 Instruct | HuggingFace | Claw Users |
+| MiniMax M2.5 | HuggingFace | Claw Users |
+| Llama 3.3 70B | HuggingFace | Claw Users |
+| Llama 3.1 70B | HuggingFace | Claw Users |
+| Claude Sonnet 4.6 | Anthropic | Claw Users |
+| Claude Opus 4.6 | Anthropic | Claw Users |
 """)

    with gr.Tab("Queue"):
--- a/clawbench/cli.py
+++ b/clawbench/cli.py
@ -116,6 +116,11 @@ def cli(verbose: bool) -> None:
    show_default=True,
    help="Where to write ecosystem insight files after a --profile run.",
 )
+@click.option(
+    "--dynamics",
+    is_flag=True,
+    help="Run quick post-benchmark dynamics analysis. Prefer dynamics-report for offline cache/archive analysis.",
+)
 def run(
    model: str,
    gateway_token: str,
@ -137,6 +142,7 @@ def run(
    browser_concurrency: int,
    profile: Path | None,
    insights_dir: Path,
+    dynamics: bool,
 ) -> None:
    gateway_config = GatewayConfig(token=gateway_token)
    harness = BenchmarkHarness(
@ -165,6 +171,9 @@ def run(
        json.dump(result.model_dump(), handle, indent=2)
    click.echo(f"\nResults saved to {out_path}")

+    if dynamics:
+        _run_dynamics_analysis(harness.last_task_runs, out_path)
+
    if profile is not None:
        _run_v05_diagnostic(
            profile_path=profile,
@ -179,6 +188,83 @@ def run(
        asyncio.run(upload_result(result))


+@cli.command("dynamics-report")
+@click.option(
+    "--archive-dir",
+    type=click.Path(exists=True, file_okay=False, path_type=Path),
+    required=True,
+    help="Path to a run cache/archive root or a single model cache directory.",
+)
+@click.option(
+    "--model",
+    default=None,
+    help="Model id to select when the archive root contains multiple model directories.",
+)
+@click.option("--tier", type=click.Choice(["tier1", "tier2", "tier3", "tier4", "tier5"]))
+@click.option("--task", "task_ids", multiple=True, help="Specific task IDs to include from the archive.")
+@click.option(
+    "--output-dir",
+    type=click.Path(path_type=Path),
+    default=Path("results/offline_dynamics"),
+    show_default=True,
+    help="Directory where dynamics.json and plots will be written.",
+)
+@click.option(
+    "--no-plots",
+    is_flag=True,
+    help="Write only dynamics.json and skip plot rendering.",
+)
+def dynamics_report(
+    archive_dir: Path,
+    model: str | None,
+    tier: str | None,
+    task_ids: tuple[str, ...],
+    output_dir: Path,
+    no_plots: bool,
+) -> None:
+    """Generate dynamics plots and a JSON report from cached TaskRunResult archives."""
+    from clawbench.dynamics_archive import load_task_runs_archive
+
+    try:
+        task_runs = load_task_runs_archive(
+            archive_dir=archive_dir,
+            model=model,
+            task_ids=task_ids,
+            tier=tier,
+        )
+    except ValueError as exc:
+        raise click.ClickException(str(exc)) from exc
+
+    if not task_runs:
+        raise click.ClickException(f"No cached runs found under {archive_dir}")
+
+    report_path, plots, n_runs = _write_dynamics_report(
+        task_runs,
+        output_dir,
+        generate_plots=not no_plots,
+    )
+    click.echo(f"Loaded {n_runs} cached runs across {len(task_runs)} tasks")
+    click.echo(f"Dynamics report saved to {report_path}")
+    if not no_plots:
+        click.echo(f"Saved {len(plots)} plots to {output_dir}/")
+
+
+def _write_dynamics_report(
+    task_runs: dict[str, list],
+    output_dir: Path,
+    *,
+    generate_plots: bool = True,
+) -> tuple[Path, list[Path], int]:
+    from clawbench.dynamics_archive import write_dynamics_report
+
+    report_path, plots = write_dynamics_report(
+        task_runs,
+        output_dir,
+        generate_plots=generate_plots,
+    )
+    n_runs = sum(len(runs) for runs in task_runs.values())
+    return report_path, plots, n_runs
+
+
 def _run_v05_diagnostic(
    *,
    profile_path: Path,
@ -693,5 +779,23 @@ def show(result_file: str) -> None:
        )


+def _run_dynamics_analysis(
+    task_runs: dict[str, list],
+    result_path: str,
+) -> None:
+    """Compute stratified dynamics from raw TaskRunResult objects."""
+    run_stem = Path(result_path).stem
+    dyn_dir = Path(result_path).parent / f"{run_stem}_dynamics"
+    try:
+        dyn_path, plots, n_runs = _write_dynamics_report(task_runs, dyn_dir)
+    except ValueError as exc:
+        click.echo(str(exc))
+        return
+
+    click.echo(f"\n[dynamics] Analysed {n_runs} runs")
+    click.echo(f"  Dynamics report saved to {dyn_path}")
+    click.echo(f"  Saved {len(plots)} plots to {dyn_dir}/")
+
+
 def main() -> None:
    cli()
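
The output location used by `--dynamics` follows from the result path handling in `_run_dynamics_analysis`. A small standalone sketch of that derivation (the helper name `dynamics_dir_for` is hypothetical and is not part of the CLI):

```python
from pathlib import Path


def dynamics_dir_for(result_path: str) -> Path:
    """Mirror the '<stem>_dynamics' sibling directory derived from -o."""
    path = Path(result_path)
    return path.parent / f"{path.stem}_dynamics"


# With `-o results/ollama_smoke.json`, the bundle lands next to the result file.
print(dynamics_dir_for("results/ollama_smoke.json"))
```

Keeping the bundle beside the result JSON means repeated runs with different `-o` names do not overwrite each other's dynamics output.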
--- a/clawbench/client.py
+++ b/clawbench/client.py
@ -8,7 +8,9 @@ import logging
 import math
 import os
 import re
+import shutil
 import subprocess
+import sys
 import uuid
 from dataclasses import dataclass, field
 from typing import Any
@ -24,10 +26,10 @@ logger = logging.getLogger(__name__)

 PROTOCOL_VERSION = 3
 DEVICE_IDENTITY_HELPER_JS = r"""
-const crypto = require("node:crypto");
-const fs = require("node:fs");
-const os = require("node:os");
-const path = require("node:path");
+const crypto = require("crypto");
+const fs = require("fs");
+const os = require("os");
+const path = require("path");

 const ED25519_SPKI_PREFIX = Buffer.from("302a300506032b6570032100", "hex");

@ -52,7 +54,7 @@ function fingerprintPublicKey(publicKeyPem) {
 }

 function generateIdentity() {
-  const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
+  const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519", {});
  const publicKeyPem = publicKey.export({ type: "spki", format: "pem" }).toString();
  const privateKeyPem = privateKey.export({ type: "pkcs8", format: "pem" }).toString();
  return {
@ -445,12 +447,48 @@ class GatewayClient:
                    max_wait_seconds=2.0,
                )
            )
+
+            # Some gateway/provider paths persist assistant messages in session
+            # history without emitting complete streaming events. Backfill from
+            # sessions.get if stream capture appears incomplete.
+            history_messages = await self.get_session_messages(session_key)
+            collected_assistant = sum(
+                1 for msg in collected_messages if msg.role == "assistant"
+            )
+            history_assistant = sum(
+                1 for msg in history_messages if msg.role == "assistant"
+            )
+            if history_messages and (
+                len(history_messages) > len(collected_messages)
+                or history_assistant > collected_assistant
+            ):
+                collected_messages = history_messages
        finally:
            self._event_queues.pop(chat_queue_key, None)
            self._event_queues.pop(msg_queue_key, None)

        return _correlate_transcript(Transcript(messages=collected_messages))

+    async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]:
+        try:
+            response = await self._rpc("sessions.get", {"key": session_key})
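+        except Exception as exc:
+            # Backfill is best-effort; keep the streamed transcript on failure.
+            logger.debug("sessions.get backfill failed for %s: %s", session_key, exc)
+            return []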
+
+        payload = response.get("payload", {})
+        raw_messages = payload.get("messages", [])
+        if not isinstance(raw_messages, list):
+            return []
+
+        parsed: list[TranscriptMessage] = []
+        for raw in raw_messages:
+            if not isinstance(raw, dict):
+                continue
+            msg = _parse_single_message(raw)
+            if msg is not None:
+                parsed.append(msg)
+        return parsed
+
    async def _rpc(
        self,
        method: str,
@ -551,9 +589,17 @@ def _build_connect_device(
            "deviceFamily": device_family or "",
        }
    )
+
+    node_executable = _resolve_node_executable()
+    if not node_executable:
+        logger.warning(
+            "Failed to build device identity payload: no Node executable found"
+        )
+        return None
+
    try:
        completed = subprocess.run(
-            ["node", "-e", DEVICE_IDENTITY_HELPER_JS],
+            [node_executable, "-e", DEVICE_IDENTITY_HELPER_JS],
            input=helper_input,
            capture_output=True,
            text=True,
@ -577,6 +623,25 @@ def _build_connect_device(
    return payload


+def _resolve_node_executable() -> str | None:
+    """Resolve Node binary, preferring the active Python/conda environment."""
+    candidates: list[str] = []
+
+    # First try the same environment as the active Python interpreter.
+    candidates.append(os.path.join(os.path.dirname(sys.executable), "node"))
+
+    # Then try CONDA_PREFIX when available.
+    conda_prefix = os.environ.get("CONDA_PREFIX")
+    if conda_prefix:
+        candidates.append(os.path.join(conda_prefix, "bin", "node"))
+
+    for candidate in candidates:
+        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
+            return candidate
+
+    return shutil.which("node")
+
+
 def _is_transient_gateway_connect_error(exc: Exception) -> bool:
    if isinstance(exc, InvalidStatus):
        return exc.response.status_code in {502, 503, 504}
@ -615,6 +680,9 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
            if block_type == "text":
                text_parts.append(block.get("text", ""))
                continue
+            if block_type == "output_text":
+                text_parts.append(block.get("text", ""))
+                continue
            if block_type in {"tool_use", "toolCall"}:
                arguments = block.get("input", block.get("arguments", {}))
                if isinstance(arguments, str):
@ -641,6 +709,16 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
                if tool_result_content:
                    text_parts.append(tool_result_content)

+    # Some providers surface assistant failures in a dedicated error field
+    # with empty content blocks. Preserve that signal in transcript text.
+    error_message = message_data.get("errorMessage", "")
+    if isinstance(error_message, str) and error_message.strip():
+        text_parts.append(error_message.strip())
+
+    direct_text = message_data.get("text", "")
+    if isinstance(direct_text, str) and direct_text.strip():
+        text_parts.append(direct_text.strip())
+
    if not text_parts and not tool_calls and not tool_result_for:
        return None
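
The history backfill rule added to `GatewayClient` earlier in this file can be read as a pure decision function. The sketch below restates it in isolation, assuming only that messages expose a `role` attribute; `Msg` and `should_backfill` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    """Stand-in for TranscriptMessage; only `role` matters here."""
    role: str
    text: str = ""


def should_backfill(collected: list[Msg], history: list[Msg]) -> bool:
    """Prefer session history when the streamed capture looks incomplete."""
    if not history:
        return False  # nothing to backfill from (or sessions.get failed)
    collected_assistant = sum(1 for m in collected if m.role == "assistant")
    history_assistant = sum(1 for m in history if m.role == "assistant")
    return len(history) > len(collected) or history_assistant > collected_assistant


streamed = [Msg("user", "hi")]
stored = [Msg("user", "hi"), Msg("assistant", "hello")]
print(should_backfill(streamed, stored))
```

Note that the rule replaces the streamed messages wholesale rather than merging them, so it relies on session history being a superset of what the stream delivered.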

--- a/clawbench/dynamics.py
+++ b/clawbench/dynamics.py
@ -0,0 +1,695 @@
+"""Dynamics analysis for ClawBench agent trajectories.
+
+Treats each agent run as a discrete dynamical system and computes step
+embeddings, trajectory metrics, sensitivity analysis, regime classification,
+Kaplan-Meier survival, non-Markov memory, and stratified assessment with
+Bayesian importance-weight correction for distribution shift.
+"""
+
+from __future__ import annotations
+
+import math
+from collections import Counter
+from dataclasses import dataclass, field
+from enum import Enum
+from typing import TYPE_CHECKING, Callable
+
+import numpy as np
+
+if TYPE_CHECKING:
+    from clawbench.schemas import TaskRunResult, Transcript
+
+# ── Constants ──────────────────────────────────────────────────────────
+
+TOOL_FAMILIES = ("browser", "edit", "execute", "memory", "read", "search")
+_N_FAM = len(TOOL_FAMILIES)
+
+# ── Types ──────────────────────────────────────────────────────────────
+
+
+class Regime(str, Enum):
+    convergent = "convergent"
+    chaotic = "chaotic"
+    trapped = "trapped"
+    diffusive = "diffusive"
+    limit_cycle = "limit_cycle"
+    unknown = "unknown"
+
+
+@dataclass
+class Dynamics:
+    """Computed dynamics for a single trajectory."""
+
+    n_steps: int
+    embeddings: np.ndarray          # (n_steps, 10)
+    drift: np.ndarray               # cosine distance from step 0
+    step_size: np.ndarray           # cosine distance from step t-1
+    entropy_series: list[float]     # running tool-family entropy
+    error_rate_series: list[float]  # running error fraction
+    tokens_series: list[int]
+    latency_series: list[float]
+    tool_sequence: list[str]        # primary family per step
+    markov: dict[str, dict[str, float]]
+    family_dist: dict[str, float]
+    regime: Regime
+    mean_drift: float
+    mean_step_size: float
+    tool_entropy: float
+    error_rate: float
+    constraint_index: float
+    pca_trajectory: np.ndarray | None = None  # (n_steps, 2)
+    bigram_transitions: dict[str, dict[str, float]] = field(default_factory=dict)
+    memory_depth: float = 0.0       # I(X_t; X_{t-2} | X_{t-1})
+
+
+@dataclass
+class Sensitivity:
+    """Pairwise comparison between two runs of the same task."""
+
+    task_id: str
+    score_delta: float
+    tool_edit_distance: int
+    family_js_divergence: float
+    embedding_divergence: np.ndarray  # (min_steps,)
+    lyapunov_proxy: float
+
+
+@dataclass
+class SurvivalPoint:
+    time: float
+    survival: float
+
+
+# ── Helpers ────────────────────────────────────────────────────────────
+
+
+def _cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
+    na, nb = np.linalg.norm(a), np.linalg.norm(b)
+    if na < 1e-12 or nb < 1e-12:
+        return 1.0
+    return float(1.0 - np.dot(a, b) / (na * nb))
+
+
+def _entropy(counts: dict[str, int]) -> float:
+    total = sum(counts.values())
+    if total == 0:
+        return 0.0
+    return -sum(
+        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
+    )
+
+
+def _js_divergence(p: dict[str, int], q: dict[str, int]) -> float:
+    keys = set(p) | set(q)
+    if not keys:
+        return 0.0
+    tp, tq = sum(p.values()) or 1, sum(q.values()) or 1
+    jsd = 0.0
+    for k in keys:
+        pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq
+        mk = (pk + qk) / 2
+        if pk > 0 and mk > 0:
+            jsd += 0.5 * pk * math.log2(pk / mk)
+        if qk > 0 and mk > 0:
+            jsd += 0.5 * qk * math.log2(qk / mk)
+    return jsd
+
+
+def _levenshtein(a: list, b: list) -> int:
+    if not a:
+        return len(b)
+    if not b:
+        return len(a)
+    prev = list(range(len(b) + 1))
+    for ca in a:
+        curr = [prev[0] + 1] + [0] * len(b)
+        for j, cb in enumerate(b):
+            curr[j + 1] = min(
+                prev[j] + (0 if ca == cb else 1),
+                prev[j + 1] + 1,
+                curr[j] + 1,
+            )
+        prev = curr
+    return prev[-1]
+
+
+def _classify_tool(name: str) -> str:
+    lo = name.lower()
+    for fam in TOOL_FAMILIES:
+        if fam in lo:
+            return fam
+    _ALIASES = {
+        "edit": ("write_file", "create_file", "str_replace", "patch"),
+        "execute": ("bash", "terminal", "shell", "run", "exec"),
+        "browser": ("browse", "click", "navigate", "screenshot"),
+        "search": ("grep", "find", "glob", "semantic"),
+        "read": ("cat", "head", "tail", "view", "list_dir"),
+    }
+    for fam, keywords in _ALIASES.items():
+        if any(k in lo for k in keywords):
+            return fam
+    return "execute"
+
+
+def _normalize_tool_family(name: str, family: str | None) -> str:
+    if family in TOOL_FAMILIES:
+        return family
+    return _classify_tool(name)
+
+
+# ── Feature embedding ──────────────────────────────────────────────────
+
+
+def _embed_transcript(
+    transcript: Transcript,
+) -> tuple[np.ndarray, list[str], list[int], list[float], list[bool]]:
+    """Build (n_steps, 10) feature matrix from assistant turns.
+
+    Features: [0:6] tool-family proportions, [6] error flag,
+    [7] normalised tokens, [8] normalised text length, [9] progress.
+    """
+    msgs = transcript.assistant_messages
+    n = len(msgs)
+    if n == 0:
+        return np.empty((0, _N_FAM + 4)), [], [], [], []
+
+    X = np.zeros((n, _N_FAM + 4))
+    families: list[str] = []
+    tokens: list[int] = []
+    latencies: list[float] = []
+    errors: list[bool] = []
+    raw_tokens = np.zeros(n)
+    raw_text = np.zeros(n)
+
+    for i, msg in enumerate(msgs):
+        fam_counts: Counter = Counter()
+        has_err = False
+        for tc in msg.tool_calls:
+            fam = _normalize_tool_family(tc.name, tc.family)
+            fam_counts[fam] += 1
+            if tc.success is False or tc.error:
+                has_err = True
+        n_tc = sum(fam_counts.values()) or 1
+        for j, fam in enumerate(TOOL_FAMILIES):
+            X[i, j] = fam_counts.get(fam, 0) / n_tc
+        X[i, _N_FAM] = 1.0 if has_err else 0.0
+        X[i, _N_FAM + 3] = i / max(n - 1, 1)
+
+        families.append(
+            max(fam_counts, key=fam_counts.get) if fam_counts else "execute"
+        )
+        errors.append(has_err)
+        tokens.append(msg.usage.total_tokens)
+        raw_tokens[i] = float(msg.usage.total_tokens)
+        raw_text[i] = float(len(msg.text))
+        dt = msg.timestamp_ms - msgs[i - 1].timestamp_ms if i > 0 else 0
+        latencies.append(max(float(dt), 0.0))
+
+    mx_tok = raw_tokens.max() or 1
+    mx_txt = raw_text.max() or 1
+    X[:, _N_FAM + 1] = raw_tokens / mx_tok
+    X[:, _N_FAM + 2] = raw_text / mx_txt
+
+    return X, families, tokens, latencies, errors
+
+
+# ── Non-Markov memory ────────────────────────────────────────────────
+
+
+def _compute_bigram_transitions(seq: list[str]) -> dict[str, dict[str, float]]:
+    """P(family_t | family_{t-1}, family_{t-2}) grouped by bigram context."""
+    if len(seq) < 3:
+        return {}
+    bigrams: dict[str, Counter] = {}
+    for a, b, c in zip(seq[:-2], seq[1:-1], seq[2:]):
+        ctx = f"{a}->{b}"
+        bigrams.setdefault(ctx, Counter())[c] += 1
+    return {
+        ctx: {k: v / sum(cnts.values()) for k, v in cnts.items()}
+        for ctx, cnts in bigrams.items()
+    }
+
+
+def _conditional_mi(seq: list[str]) -> float:
+    """I(X_t ; X_{t-2} | X_{t-1}): non-Markov memory indicator, in bits."""
+    if len(seq) < 3:
+        return 0.0
+    n = len(seq) - 2
+    triple = Counter(zip(seq[:-2], seq[1:-1], seq[2:]))
+    pair_01 = Counter(zip(seq[:-2], seq[1:-1]))
+    pair_12 = Counter(zip(seq[1:-1], seq[2:]))
+    single = Counter(seq[1:-1])
+
+    mi = 0.0
+    for (a, b, c), count in triple.items():
+        p_abc = count / n
+        p_ab, p_bc, p_b = pair_01[(a, b)] / n, pair_12[(b, c)] / n, single[b] / n
+        if p_ab > 0 and p_bc > 0 and p_b > 0:
+            mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc))
+    return max(mi, 0.0)
+
+
+# ── Core analysis ──────────────────────────────────────────────────────
+
+
+def compute_dynamics(transcript: Transcript) -> Dynamics:
+    """Compute trajectory dynamics from a single run transcript."""
+    X, families, tokens, latencies, errors = _embed_transcript(transcript)
+    n = len(families)
+
+    drift = (
+        np.array([_cosine_dist(X[0], X[i]) for i in range(n)])
+        if n else np.array([])
+    )
+    step_sz = np.zeros(n)
+    for i in range(1, n):
+        step_sz[i] = _cosine_dist(X[i - 1], X[i])
+
+    fam_acc: Counter = Counter()
+    err_count = 0
+    entropy_s: list[float] = []
+    error_s: list[float] = []
+    for i, (fam, err) in enumerate(zip(families, errors)):
+        fam_acc[fam] += 1
+        err_count += int(err)
+        entropy_s.append(_entropy(dict(fam_acc)))
+        error_s.append(err_count / (i + 1))
+
+    total = sum(fam_acc.values()) or 1
+    fam_dist = {k: v / total for k, v in fam_acc.items()}
+
+    mc: dict[str, Counter] = {f: Counter() for f in TOOL_FAMILIES}
+    for a, b in zip(families[:-1], families[1:]):
+        mc[a][b] += 1
+    markov = {
+        src: ({dst: c / t for dst, c in cnts.items()} if (t := sum(cnts.values())) else {})
+        for src, cnts in mc.items()
+    }
+
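+    # Per-run constraint index from the participation ratio (PR) of the
+    # feature-covariance spectrum: ci = 1 - (PR - 1) / (d - 1).
+    # 0.5 is the neutral fallback for short or degenerate trajectories.
+    ci = 0.5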
+    if n > 2:
+        cov = np.cov(X.T)
+        eigvals = np.maximum(np.linalg.eigvalsh(cov), 0)
+        tv = eigvals.sum()
+        if tv > 1e-10:
+            p = eigvals / tv
+            pr = 1.0 / np.sum(p**2)
+            ci = 1.0 - (pr - 1) / (X.shape[1] - 1)
+
+    h = _entropy(dict(fam_acc))
+    er = err_count / n if n else 0.0
+    regime = _classify_regime(drift, step_sz, h, er, ci, n)
+
+    return Dynamics(
+        n_steps=n,
+        embeddings=X,
+        drift=drift,
+        step_size=step_sz,
+        entropy_series=entropy_s,
+        error_rate_series=error_s,
+        tokens_series=tokens,
+        latency_series=latencies,
+        tool_sequence=families,
+        markov=markov,
+        family_dist=fam_dist,
+        regime=regime,
+        mean_drift=float(np.mean(drift)) if n else 0.0,
+        mean_step_size=float(np.mean(step_sz)) if n else 0.0,
+        tool_entropy=h,
+        error_rate=er,
+        constraint_index=ci,
+        bigram_transitions=_compute_bigram_transitions(families),
+        memory_depth=_conditional_mi(families),
+    )
+
+
+def _classify_regime(drift, step_sz, entropy, error_rate, ci, n) -> Regime:
+    if n < 3:
+        return Regime.unknown
+    if entropy < 0.5 or (error_rate > 0.6 and float(np.std(drift)) < 0.05):
+        return Regime.trapped
+    q = max(1, n // 4)
+    late_drift_std = float(np.std(drift[-q:]))
+    late_step_mean = float(np.mean(step_sz[-q:]))
+    if late_drift_std < 0.1 and late_step_mean < 0.15 and error_rate < 0.2:
+        return Regime.convergent
+    if entropy > 1.5 and error_rate < 0.15 and ci < 0.8:
+        return Regime.diffusive
+    step_var = float(np.var(step_sz[1:])) if n > 1 else 0
+    if entropy > 2.0 and step_var > 0.02:
+        return Regime.chaotic
+    if n > 6:
+        ss = step_sz[1:]
+        ss_c = ss - ss.mean()
+        norm = np.dot(ss_c, ss_c)
+        if norm > 1e-10:
+            ac = np.correlate(ss_c, ss_c, mode="full")
+            ac = ac[len(ac) // 2:] / norm
+            if len(ac) > 5 and max(ac[2:6]) > 0.3:
+                return Regime.limit_cycle
+    return Regime.unknown
+
+
+# ── Sensitivity ────────────────────────────────────────────────────────
+
+
+def compute_sensitivity(
+    run_a: TaskRunResult,
+    run_b: TaskRunResult,
+    task_id: str = "",
+) -> Sensitivity:
+    """Compare two runs of the same task for prompt sensitivity."""
+    Xa, fam_a, *_ = _embed_transcript(run_a.transcript)
+    Xb, fam_b, *_ = _embed_transcript(run_b.transcript)
+
+    min_n = min(len(Xa), len(Xb))
+    emb_div = (
+        np.array([_cosine_dist(Xa[i], Xb[i]) for i in range(min_n)])
+        if min_n else np.array([])
+    )
+
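+    # Lyapunov proxy: mean over t >= 1 of log(d_t / d_0) / t, with both
+    # distances floored at 1e-6 so the logarithm stays finite.
+    lyap = 0.0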
+    if min_n > 1:
+        d0 = max(_cosine_dist(Xa[0], Xb[0]), 1e-6)
+        lyap = sum(
+            math.log(max(emb_div[t], 1e-6) / d0) / t for t in range(1, min_n)
+        ) / (min_n - 1)
+
+    return Sensitivity(
+        task_id=task_id or run_a.task_id,
+        score_delta=abs(run_a.run_score - run_b.run_score),
+        tool_edit_distance=_levenshtein(fam_a, fam_b),
+        family_js_divergence=_js_divergence(dict(Counter(fam_a)), dict(Counter(fam_b))),
+        embedding_divergence=emb_div,
+        lyapunov_proxy=lyap,
+    )
+
+
+# ── Survival analysis ─────────────────────────────────────────────────
+
+
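+def kaplan_meier(
+    event_times: list[float],
+    censored: list[bool] | None = None,
+) -> list[SurvivalPoint]:
+    """Kaplan-Meier (product-limit) survival estimator.
+
+    ``censored[i]`` is True when observation *i* left the risk set without
+    the event being observed; such entries shrink the risk set but do not
+    lower the survival estimate. Ties are processed one observation at a time.
+    """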
+    n = len(event_times)
+    if n == 0:
+        return []
+    if censored is None:
+        censored = [False] * n
+    pairs = sorted(zip(event_times, censored))
+    pts = [SurvivalPoint(0.0, 1.0)]
+    at_risk = n
+    surv = 1.0
+    for t, cens in pairs:
+        if cens:
+            at_risk -= 1
+            continue
+        if at_risk > 0:
+            surv *= (at_risk - 1) / at_risk
+        at_risk -= 1
+        pts.append(SurvivalPoint(t, surv))
+    return pts
+
+
+def find_event_step(transcript: Transcript, event: str) -> float | None:
+    """Return step index of the first occurrence of *event*, or None."""
+    msgs = transcript.assistant_messages
+    if event == "first_error_recovery":
+        in_err = False
+        for i, m in enumerate(msgs):
+            any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
+            if any_err:
+                in_err = True
+            elif in_err:
+                return float(i)
+    elif event == "first_correct_write":
+        for i, m in enumerate(msgs):
+            for tc in m.tool_calls:
+                # Use the same family normalisation as _embed_transcript.
+                fam = _normalize_tool_family(tc.name, tc.family)
+                if fam == "edit" and tc.success is not False and not tc.error:
+                    return float(i)
+    elif event == "task_completion":
+        if msgs:
+            last = msgs[-1]
+            if not any(tc.success is False or tc.error for tc in last.tool_calls):
+                return float(len(msgs) - 1)
+    elif event == "failure_absorption":
+        err_seen = False
+        for i, m in enumerate(msgs):
+            any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
+            if any_err:
+                err_seen = True
+            elif err_seen and m.tool_calls:
+                return float(i)
+    return None
+
+
+# ── PCA trajectory bundles ─────────────────────────────────────────────
+
+
+def compute_pca_bundle(
+    dynamics_list: list[Dynamics],
+) -> tuple[np.ndarray, list[np.ndarray]]:
+    """Fit PCA on pooled embeddings, project each trajectory into PC1-PC2."""
+    non_empty = [d.embeddings for d in dynamics_list if d.n_steps > 0]
+    if not non_empty:
+        for d in dynamics_list:
+            d.pca_trajectory = np.empty((0, 2))
+        return np.zeros((2, _N_FAM + 4)), []
+    all_emb = np.vstack(non_empty)
+    mean = all_emb.mean(axis=0)
+    centred = all_emb - mean
+    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
+    components = Vt[:2]
+
+    projections: list[np.ndarray] = []
+    for d in dynamics_list:
+        proj = (d.embeddings - mean) @ components.T if d.n_steps else np.empty((0, 2))
+        d.pca_trajectory = proj
+        projections.append(proj)
+    return components, projections
+
+
+# ── Stratified assessment with Bayesian reweighting ───────────────────
+
+
+@dataclass
+class StratumStats:
+    """Distributional statistics for one stratum of runs."""
+
+    name: str
+    n_runs: int
+    weight: float
+
+    # Score distribution
+    scores: np.ndarray
+    score_mean: float
+    score_std: float
+    score_quantiles: dict[str, float]  # q10, q25, q50, q75, q90
+
+    # Dynamics distributions
+    entropy_dist: np.ndarray
+    error_rate_dist: np.ndarray
+    constraint_dist: np.ndarray
+    memory_depth_dist: np.ndarray
+    mean_drift_dist: np.ndarray
+    mean_step_size_dist: np.ndarray
+
+    # Time-series curves (aligned by step index)
+    drift_curve_mean: np.ndarray
+    drift_curve_std: np.ndarray
+    step_curve_mean: np.ndarray
+    step_curve_std: np.ndarray
+
+    regime_counts: dict[str, int]
+    sensitivity_deltas: np.ndarray
+
+
+# Scalar fields on StratumStats that reweight() aggregates.
+_REWEIGHT_FIELDS = [
+    ("entropy", "entropy_dist"),
+    ("error_rate", "error_rate_dist"),
+    ("constraint", "constraint_dist"),
+    ("memory_depth", "memory_depth_dist"),
+    ("mean_drift", "mean_drift_dist"),
+    ("mean_step_size", "mean_step_size_dist"),
+]
+
+
+@dataclass
+class StratifiedAssessment:
+    """Full stratified assessment with Bayesian reweighting.
+
+    Call ``reweight(target_weights)`` with a different task distribution
+    to obtain importance-weighted aggregate estimates.
+    """
+
+    strata: list[StratumStats]
+    stratifier_name: str
+    total_runs: int
+    observed_mean_score: float
+    observed_std_score: float
+
+    def stratum_names(self) -> list[str]:
+        return [s.name for s in self.strata]
+
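+    def reweight(self, target_weights: dict[str, float]) -> dict[str, float]:
+        """Post-stratification reweighting to a target stratum distribution.
+
+        Per-run importance weights are w_k = p_target(k) / p_observed(k).
+        Each stratum is already summarised by its own mean, so the weight on
+        stratum k reduces to p_observed(k) * w_k = p_target(k), renormalised
+        over the strata that are actually present.
+        """
+        t_total = sum(target_weights.values()) or 1.0
+        p_target = {k: v / t_total for k, v in target_weights.items()}
+        by_name = {s.name: s for s in self.strata}
+
+        # Weight stratum-level means by p_target directly; applying the
+        # per-run ratio p_target / p_observed here would correct twice.
+        weights = {
+            name: pt
+            for name, pt in p_target.items()
+            if name in by_name and by_name[name].weight > 1e-12
+        }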
+        if not weights:
+            return {"score_mean": self.observed_mean_score,
+                    "score_std": self.observed_std_score}
+
+        w_total = sum(weights.values())
+        w = {k: v / w_total for k, v in weights.items()}
+
+        # Reweight score (mean + law-of-total-variance)
+        score_mu = sum(w[k] * by_name[k].score_mean for k in w)
+        score_var = sum(
+            w[k] * (by_name[k].score_std ** 2 + (by_name[k].score_mean - score_mu) ** 2)
+            for k in w
+        )
+        result = {"score_mean": score_mu, "score_std": math.sqrt(max(score_var, 0.0))}
+
+        def _safe_mean(arr: np.ndarray) -> float:
+            return float(np.mean(arr)) if len(arr) > 0 else 0.0
+
+        for label, dist_attr in _REWEIGHT_FIELDS:
+            result[f"{label}_mean"] = sum(
+                w[k] * _safe_mean(getattr(by_name[k], dist_attr)) for k in w
+            )
+        return result
+
+
+def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
+    """Mean and std of variable-length arrays aligned at step 0."""
+    if not arrays:
+        return np.array([]), np.array([])
+    max_len = max(len(a) for a in arrays)
+    mat = np.full((len(arrays), max_len), np.nan)
+    for i, a in enumerate(arrays):
+        mat[i, :len(a)] = a
+    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
+
+
+def build_strata(
+    runs: list[TaskRunResult],
+    dynamics_list: list[Dynamics],
+    scores: list[float],
+    stratifier: Callable[[TaskRunResult, Dynamics], str],
+    stratifier_name: str = "custom",
+    sensitivities: list[Sensitivity] | None = None,
+) -> StratifiedAssessment:
+    """Group runs into strata and compute per-stratum distributions."""
+    assert len(runs) == len(dynamics_list) == len(scores)
+
+    groups: dict[str, list[int]] = {}
+    for idx, (r, d) in enumerate(zip(runs, dynamics_list)):
+        groups.setdefault(stratifier(r, d), []).append(idx)
+
+    total = len(runs)
+    all_scores = np.array(scores)
+
+    sens_by_task: dict[str, list[Sensitivity]] = {}
+    if sensitivities:
+        for s in sensitivities:
+            sens_by_task.setdefault(s.task_id, []).append(s)
+
+    strata: list[StratumStats] = []
+    for name, idxs in sorted(groups.items()):
+        n = len(idxs)
+        sc = np.array([scores[i] for i in idxs])
+        dyns = [dynamics_list[i] for i in idxs]
+
+        qs = {f"q{q}": float(np.percentile(sc, q)) if n else 0.0
+              for q in (10, 25, 50, 75, 90)}
+
+        drift_m, drift_s = _aligned_mean_std([d.drift for d in dyns])
+        step_m, step_s = _aligned_mean_std([d.step_size for d in dyns])
+
+        stratum_tasks = {runs[i].task_id for i in idxs}
+        sens_deltas = [
+            s.score_delta
+            for tid in stratum_tasks
+            for s in sens_by_task.get(tid, [])
+        ]
+
+        strata.append(StratumStats(
+            name=name, n_runs=n, weight=n / total if total else 0.0,
+            scores=sc,
+            score_mean=float(np.mean(sc)) if n else 0.0,
+            score_std=float(np.std(sc)) if n else 0.0,
+            score_quantiles=qs,
+            entropy_dist=np.array([d.tool_entropy for d in dyns]),
+            error_rate_dist=np.array([d.error_rate for d in dyns]),
+            constraint_dist=np.array([d.constraint_index for d in dyns]),
+            memory_depth_dist=np.array([d.memory_depth for d in dyns]),
+            mean_drift_dist=np.array([d.mean_drift for d in dyns]),
+            mean_step_size_dist=np.array([d.mean_step_size for d in dyns]),
+            drift_curve_mean=drift_m, drift_curve_std=drift_s,
+            step_curve_mean=step_m, step_curve_std=step_s,
+            regime_counts=dict(Counter(d.regime.value for d in dyns)),
+            sensitivity_deltas=np.array(sens_deltas) if sens_deltas else np.array([]),
+        ))
+
+    return StratifiedAssessment(
+        strata=strata,
+        stratifier_name=stratifier_name,
+        total_runs=total,
+        observed_mean_score=float(np.mean(all_scores)) if total else 0.0,
+        observed_std_score=float(np.std(all_scores)) if total else 0.0,
+    )
+
+
+# ── Built-in stratifiers ──────────────────────────────────────────────
+
+
+def stratify_by_regime(run: TaskRunResult, dyn: Dynamics) -> str:
+    return dyn.regime.value
+
+
+def stratify_by_task(run: TaskRunResult, dyn: Dynamics) -> str:
+    return run.task_id
+
+
+def stratify_by_tier(run: TaskRunResult, dyn: Dynamics) -> str:
+    tid = run.task_id.lower()
+    for i in range(1, 6):
+        if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
+            return f"tier{i}"
+    return "unknown"
+
+
+def stratify_by_tool_mix(run: TaskRunResult, dyn: Dynamics) -> str:
+    if not dyn.family_dist:
+        return "unknown"
+    return max(dyn.family_dist, key=dyn.family_dist.get)
+
+
+def stratify_by_prompt_style(run: TaskRunResult, dyn: Dynamics) -> str:
+    user_msgs = [m for m in run.transcript.messages if m.role == "user"]
+    if not user_msgs:
+        return "unknown"
+    wc = len(user_msgs[0].text.split())
+    return "terse" if wc <= 6 else ("medium" if wc <= 15 else "verbose")
+
+
+def stratify_by_scenario(run: TaskRunResult, dyn: Dynamics) -> str:
+    return run.scenario or "unknown"
+
+
+def stratify_by_family(run: TaskRunResult, dyn: Dynamics) -> str:
+    return run.family or "unknown"
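
Review note on `kaplan_meier` in `clawbench/dynamics.py`: the product-limit update can be checked by hand on a small input. The sketch below is standalone and not part of this commit. `km_curve` and the five sample times are hypothetical, and it assumes `censored=True` means the event was never observed for that run. It mirrors the one-observation-at-a-time update shown in the diff.

```python
from dataclasses import dataclass


@dataclass
class SurvivalPoint:
    time: float
    survival: float


def km_curve(event_times, censored=None):
    """Product-limit estimate, processing one observation at a time."""
    if censored is None:
        censored = [False] * len(event_times)
    at_risk = len(event_times)
    surv = 1.0
    points = [SurvivalPoint(0.0, 1.0)]
    # Sorting (time, censored) pairs puts events before censored
    # observations at tied times, because False < True.
    for t, is_censored in sorted(zip(event_times, censored)):
        if not is_censored and at_risk > 0:
            # An observed event multiplies survival by (n - 1) / n.
            surv *= (at_risk - 1) / at_risk
            points.append(SurvivalPoint(t, surv))
        # Events and censored runs both leave the risk set.
        at_risk -= 1
    return points


# Hypothetical data: step index at which each of five runs hit the event.
# One of the two runs at step 3 was cut off before the event was observed.
times = [2.0, 3.0, 3.0, 5.0, 8.0]
cens = [False, True, False, False, False]
curve = km_curve(times, cens)
for p in curve:
    print(f"t={p.time:g}  S(t)={p.survival:.3f}")
```

The censored run produces no point of its own. It only reduces the risk set, so the later event at step 5 causes a larger proportional drop than it would if all five runs were still at risk.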
--- a/clawbench/dynamics_archive.py
+++ b/clawbench/dynamics_archive.py
@ -0,0 +1,493 @@
+"""Offline dynamics analysis helpers for cached ClawBench runs."""
+
+from __future__ import annotations
+
+import json
+from itertools import combinations
+from pathlib import Path
+from typing import Iterable
+
+import numpy as np
+
+from clawbench.dynamics import (
+    build_strata,
+    compute_dynamics,
+    compute_pca_bundle,
+    compute_sensitivity,
+    find_event_step,
+    kaplan_meier,
+    stratify_by_regime,
+    stratify_by_scenario,
+    stratify_by_tier,
+    stratify_by_tool_mix,
+)
+from clawbench.dynamics_plots import generate_all_plots
+from clawbench.schemas import TaskRunResult
+
+_TIER_PREFIXES = {
+    "tier1": ("t1-", "t1_"),
+    "tier2": ("t2-", "t2_"),
+    "tier3": ("t3-", "t3_"),
+    "tier4": ("t4-", "t4_"),
+    "tier5": ("t5-", "t5_"),
+}
+
+
+def safe_model_name(model: str) -> str:
+    return model.replace("/", "_").replace(":", "_")
+
+
+def _candidate_model_dir_names(model: str) -> set[str]:
+    return {
+        model,
+        safe_model_name(model),
+        model.replace("/", "_"),
+        model.replace("/", "-").replace(":", "-"),
+    }
+
+
+def _has_run_files(path: Path) -> bool:
+    try:
+        for child in path.iterdir():
+            if child.is_file() and child.name.startswith("run") and child.suffix == ".json":
+                return True
+    except FileNotFoundError:
+        return False
+    return False
+
+
+def _is_task_collection_root(path: Path) -> bool:
+    try:
+        for child in path.iterdir():
+            if child.is_dir() and _has_run_files(child):
+                return True
+    except FileNotFoundError:
+        return False
+    return False
+
+
+def _resolve_model_roots(archive_dir: Path, model: str | None) -> list[Path]:
+    if _is_task_collection_root(archive_dir):
+        if model is not None and archive_dir.name not in _candidate_model_dir_names(model):
+            raise ValueError(
+                f"Archive dir {archive_dir} does not match requested model {model}."
+            )
+        return [archive_dir]
+
+    roots = [
+        child
+        for child in sorted(archive_dir.iterdir())
+        if child.is_dir() and _is_task_collection_root(child)
+    ]
+    if model is not None:
+        candidates = _candidate_model_dir_names(model)
+        roots = [root for root in roots if root.name in candidates]
+    elif len(roots) > 1:
+        raise ValueError(
+            "Archive root contains multiple model directories. Pass --model or point "
+            "--archive-dir at a specific model directory."
+        )
+    return roots
+
+
+def discover_model_roots(archive_dir: Path) -> dict[str, Path]:
+    """Discover model directories inside an archive root.
+
+    Returns a mapping of model directory name to its path. If archive_dir is
+    itself a model cache root (contains task directories with run*.json), the
+    mapping contains a single entry.
+    """
+    if not archive_dir.exists():
+        raise ValueError(f"Archive dir does not exist: {archive_dir}")
+
+    if _is_task_collection_root(archive_dir):
+        return {archive_dir.name: archive_dir}
+
+    roots = {
+        child.name: child
+        for child in sorted(archive_dir.iterdir())
+        if child.is_dir() and _is_task_collection_root(child)
+    }
+    return roots
+
+
+def _matches_tier(task_id: str, tier: str | None) -> bool:
+    if tier is None:
+        return True
+    return task_id.lower().startswith(_TIER_PREFIXES[tier])
+
+
+def load_task_runs_archive(
+    archive_dir: Path,
+    model: str | None = None,
+    task_ids: Iterable[str] | None = None,
+    tier: str | None = None,
+) -> dict[str, list[TaskRunResult]]:
+    """Load cached TaskRunResult objects from a run cache/archive directory."""
+    task_filter = set(task_ids or [])
+    task_runs: dict[str, list[TaskRunResult]] = {}
+
+    if not archive_dir.exists():
+        raise ValueError(f"Archive dir does not exist: {archive_dir}")
+
+    roots = _resolve_model_roots(archive_dir, model)
+    if not roots:
+        return {}
+
+    for root in roots:
+        for task_dir in sorted(child for child in root.iterdir() if child.is_dir()):
+            task_id = task_dir.name
+            if task_filter and task_id not in task_filter:
+                continue
+            if not _matches_tier(task_id, tier):
+                continue
+
+            runs = []
+            for run_file in sorted(task_dir.glob("run*.json")):
+                try:
+                    run = TaskRunResult.model_validate_json(
+                        run_file.read_text(encoding="utf-8")
+                    )
+                except Exception:
+                    # Skip unreadable or schema-invalid run files instead of
+                    # aborting the whole archive load.
+                    continue
+                runs.append(run)
+
+            if runs:
+                task_runs.setdefault(task_id, []).extend(runs)
+
+    for runs in task_runs.values():
+        runs.sort(key=lambda run: run.run_index)
+
+    return task_runs
+
+
+def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
+    if not arrays:
+        return np.array([]), np.array([])
+    max_len = max(len(arr) for arr in arrays)
+    if max_len == 0:
+        return np.array([]), np.array([])
+    mat = np.full((len(arrays), max_len), np.nan)
+    for idx, arr in enumerate(arrays):
+        mat[idx, :len(arr)] = arr
+    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
+
+
+def _round_list(values: np.ndarray, digits: int = 4) -> list[float]:
+    return [round(float(value), digits) for value in values.tolist()]
+
+
+def _empty_sensitivity_summary() -> dict[str, object]:
+    return {
+        "n_pairs": 0,
+        "mean_score_delta": 0.0,
+        "mean_tool_edit_distance": 0.0,
+        "mean_family_js_divergence": 0.0,
+        "mean_lyapunov_proxy": 0.0,
+        "mean_initial_divergence": 0.0,
+        "mean_final_divergence": 0.0,
+        "mean_contraction_delta": 0.0,
+        "mean_contraction_ratio": 0.0,
+        "fraction_converging_pairs": 0.0,
+        "mean_divergence_curve": [],
+        "std_divergence_curve": [],
+        "pair_points": [],
+    }
+
+
+def _summarize_sensitivity_group(pairs: list) -> dict[str, object]:
+    if not pairs:
+        return _empty_sensitivity_summary()
+
+    divergence_curves = [pair.embedding_divergence for pair in pairs if len(pair.embedding_divergence) > 0]
+    curve_mean, curve_std = _aligned_mean_std(divergence_curves)
+
+    pair_points = []
+    for pair in pairs:
+        if len(pair.embedding_divergence) > 0:
+            initial_divergence = float(pair.embedding_divergence[0])
+            final_divergence = float(pair.embedding_divergence[-1])
+            contraction_delta = final_divergence - initial_divergence
+            contraction_ratio = final_divergence / max(initial_divergence, 1e-6)
+        else:
+            initial_divergence = 0.0
+            final_divergence = 0.0
+            contraction_delta = 0.0
+            contraction_ratio = 0.0
+        pair_points.append(
+            {
+                "score_delta": round(float(pair.score_delta), 4),
+                "tool_edit_distance": int(pair.tool_edit_distance),
+                "family_js_divergence": round(float(pair.family_js_divergence), 4),
+                "lyapunov_proxy": round(float(pair.lyapunov_proxy), 4),
+                "initial_divergence": round(initial_divergence, 4),
+                "final_divergence": round(final_divergence, 4),
+                "contraction_delta": round(contraction_delta, 4),
+                "contraction_ratio": round(contraction_ratio, 4),
+            }
+        )
+
+    converging_pairs = sum(
+        1 for point in pair_points if point["final_divergence"] < point["initial_divergence"]
+    )
+
+    return {
+        "n_pairs": len(pairs),
+        "mean_score_delta": round(float(np.mean([pair.score_delta for pair in pairs])), 4),
+        "mean_tool_edit_distance": round(float(np.mean([pair.tool_edit_distance for pair in pairs])), 4),
+        "mean_family_js_divergence": round(float(np.mean([pair.family_js_divergence for pair in pairs])), 4),
+        "mean_lyapunov_proxy": round(float(np.mean([pair.lyapunov_proxy for pair in pairs])), 4),
+        "mean_initial_divergence": round(float(np.mean([point["initial_divergence"] for point in pair_points])), 4),
+        "mean_final_divergence": round(float(np.mean([point["final_divergence"] for point in pair_points])), 4),
+        "mean_contraction_delta": round(float(np.mean([point["contraction_delta"] for point in pair_points])), 4),
+        "mean_contraction_ratio": round(float(np.mean([point["contraction_ratio"] for point in pair_points])), 4),
+        "fraction_converging_pairs": round(converging_pairs / len(pair_points), 4),
+        "mean_divergence_curve": _round_list(curve_mean),
+        "std_divergence_curve": _round_list(curve_std),
+        "pair_points": pair_points,
+    }
+
+
+def _build_sensitivity_sections(
+    valid_runs_by_task: dict[str, list[TaskRunResult]],
+) -> tuple[list, dict[str, object]]:
+    same_task_pairs = []
+    per_task: dict[str, object] = {}
+    for task_id, runs in sorted(valid_runs_by_task.items()):
+        if len(runs) < 2:
+            continue
+        task_pairs = [
+            compute_sensitivity(run_a, run_b, task_id=task_id)
+            for run_a, run_b in combinations(runs, 2)
+        ]
+        if task_pairs:
+            same_task_pairs.extend(task_pairs)
+            per_task[task_id] = _summarize_sensitivity_group(task_pairs)
+
+    same_task_summary = _summarize_sensitivity_group(same_task_pairs)
+    same_task_summary["per_task"] = per_task
+
+    perturbation_pairs = []
+    per_variant_group: dict[str, object] = {}
+    runs_by_variant_group: dict[str, list[TaskRunResult]] = {}
+    for runs in valid_runs_by_task.values():
+        for run in runs:
+            runs_by_variant_group.setdefault(run.variant_group or run.task_id, []).append(run)
+
+    for variant_group, runs in sorted(runs_by_variant_group.items()):
+        distinct_members = {
+            (run.task_id, run.prompt_variant, run.variant_id)
+            for run in runs
+        }
+        if len(distinct_members) < 2:
+            continue
+
+        group_pairs = []
+        for run_a, run_b in combinations(runs, 2):
+            if (
+                run_a.task_id == run_b.task_id
+                and run_a.prompt_variant == run_b.prompt_variant
+                and run_a.variant_id == run_b.variant_id
+            ):
+                continue
+            group_pairs.append(compute_sensitivity(run_a, run_b, task_id=variant_group))
+
+        if not group_pairs:
+            continue
+
+        perturbation_pairs.extend(group_pairs)
+        group_summary = _summarize_sensitivity_group(group_pairs)
+        group_summary["members"] = [
+            {
+                "task_id": task_id,
+                "prompt_variant": prompt_variant,
+                "variant_id": variant_id,
+            }
+            for task_id, prompt_variant, variant_id in sorted(distinct_members)
+        ]
+        per_variant_group[variant_group] = group_summary
+
+    perturbation_summary = _summarize_sensitivity_group(perturbation_pairs)
+    perturbation_summary["per_variant_group"] = per_variant_group
+
+    return same_task_pairs, {
+        "same_task": same_task_summary,
+        "prompt_perturbation": perturbation_summary,
+    }
+
+
+def build_dynamics_report(
+    task_runs: dict[str, list[TaskRunResult]],
+    include_pca: bool = True,
+) -> tuple[dict, list]:
+    """Compute stratified dynamics report data from cached runs."""
+    all_runs = [run for runs in task_runs.values() for run in runs]
+    if not all_runs:
+        raise ValueError("No cached runs were loaded.")
+
+    dynamics_list = []
+    scores = []
+    valid_runs = []
+    for run in all_runs:
+        if not run.transcript.messages:
+            continue
+        dynamics_list.append(compute_dynamics(run.transcript))
+        scores.append(run.run_score)
+        valid_runs.append(run)
+
+    if not valid_runs:
+        raise ValueError("No runs with transcripts were found in the archive.")
+
+    valid_runs_by_task: dict[str, list[TaskRunResult]] = {}
+    for run in valid_runs:
+        valid_runs_by_task.setdefault(run.task_id, []).append(run)
+
+    same_task_sensitivities, sensitivity_summary = _build_sensitivity_sections(valid_runs_by_task)
+
+    stratifiers = {
+        "tier": stratify_by_tier,
+        "regime": stratify_by_regime,
+        "tool_mix": stratify_by_tool_mix,
+        "scenario": stratify_by_scenario,
+    }
+
+    report: dict[str, object] = {
+        "n_runs": len(valid_runs),
+        "n_tasks": len(task_runs),
+        "strata": {},
+    }
+
+    stratified = {}
+    for name, fn in stratifiers.items():
+        assessment = build_strata(
+            valid_runs,
+            dynamics_list,
+            scores,
+            fn,
+            name,
+            sensitivities=same_task_sensitivities,
+        )
+        stratified[name] = assessment
+        strata_summary = []
+        for stratum in assessment.strata:
+            strata_summary.append(
+                {
+                    "name": stratum.name,
+                    "n_runs": stratum.n_runs,
+                    "weight": round(stratum.weight, 4),
+                    "score_mean": round(stratum.score_mean, 4),
+                    "score_std": round(stratum.score_std, 4),
+                    "score_quantiles": {
+                        key: round(value, 4)
+                        for key, value in stratum.score_quantiles.items()
+                    },
+                    "entropy_mean": round(float(stratum.entropy_dist.mean()), 4)
+                    if len(stratum.entropy_dist)
+                    else 0.0,
+                    "error_rate_mean": round(float(stratum.error_rate_dist.mean()), 4)
+                    if len(stratum.error_rate_dist)
+                    else 0.0,
+                    "constraint_mean": round(float(stratum.constraint_dist.mean()), 4)
+                    if len(stratum.constraint_dist)
+                    else 0.0,
+                    "memory_depth_mean": round(float(stratum.memory_depth_dist.mean()), 4)
+                    if len(stratum.memory_depth_dist)
+                    else 0.0,
+                    "sensitivity_pairs": int(len(stratum.sensitivity_deltas)),
+                    "sensitivity_mean_score_delta": round(float(stratum.sensitivity_deltas.mean()), 4)
+                    if len(stratum.sensitivity_deltas)
+                    else 0.0,
+                    "regime_counts": stratum.regime_counts,
+                }
+            )
+        report["strata"][name] = {
+            "observed_mean_score": round(assessment.observed_mean_score, 4),
+            "observed_std_score": round(assessment.observed_std_score, 4),
+            "strata": strata_summary,
+        }
+
+    report["per_run"] = [
+        {
+            "task_id": run.task_id,
+            "run_index": run.run_index,
+            "score": round(run.run_score, 4),
+            "regime": dynamics.regime.value,
+            "entropy": round(dynamics.tool_entropy, 4),
+            "error_rate": round(dynamics.error_rate, 4),
+            "constraint_index": round(dynamics.constraint_index, 4),
+            "memory_depth": round(dynamics.memory_depth, 4),
+            "n_steps": dynamics.n_steps,
+            "mean_drift": round(dynamics.mean_drift, 4),
+            "mean_step_size": round(dynamics.mean_step_size, 4),
+        }
+        for run, dynamics in zip(valid_runs, dynamics_list)
+    ]
+    report["sensitivity"] = sensitivity_summary
+
+    if include_pca:
+        compute_pca_bundle(dynamics_list)
+
+    events = []
+    censored = []
+    for run in valid_runs:
+        step = find_event_step(run.transcript, "first_correct_write")
+        if step is not None:
+            events.append(step)
+            censored.append(False)
+        else:
+            events.append(float(len(run.transcript.assistant_messages)))
+            censored.append(True)
+    km_points = kaplan_meier(events, censored)
+    return report, generate_all_plots, {
+        "valid_runs": valid_runs,
+        "dynamics_list": dynamics_list,
+        "stratified": stratified,
+        "km_points": km_points,
+        "sensitivity": sensitivity_summary,
+    }
+
+
+def write_dynamics_report(
+    task_runs: dict[str, list[TaskRunResult]],
+    out_dir: Path,
+    report_name: str = "dynamics.json",
+    generate_plots: bool = True,
+) -> tuple[Path, list[Path]]:
+    """Write the dynamics report JSON and plots to an output directory."""
+    report, plotter, plot_data = build_dynamics_report(task_runs, include_pca=generate_plots)
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    report_path = out_dir / report_name
+    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
+
+    plots: list[Path] = []
+    if generate_plots:
+        plots = plotter(
+            plot_data["dynamics_list"],
+            plot_data["valid_runs"],
+            plot_data["stratified"],
+            km_points=plot_data["km_points"],
+            event_name="first_correct_write",
+            out_dir=out_dir,
+            sensitivity_summary=plot_data["sensitivity"],
+        )
+    return report_path, plots
+
+
+def load_task_runs_by_model(
+    archive_dir: Path,
+    tier: str | None = None,
+    task_ids: Iterable[str] | None = None,
+) -> dict[str, dict[str, list[TaskRunResult]]]:
+    """Load cached TaskRunResult objects grouped by model directory name."""
+    grouped: dict[str, dict[str, list[TaskRunResult]]] = {}
+    for model_name, model_dir in discover_model_roots(archive_dir).items():
+        task_runs = load_task_runs_archive(
+            archive_dir=model_dir,
+            model=None,
+            task_ids=task_ids,
+            tier=tier,
+        )
+        if task_runs:
+            grouped[model_name] = task_runs
+    return grouped
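Reviewer note: the survival step in `build_dynamics_report` passes step counts and censoring flags to `kaplan_meier`, whose body is not part of this diff. The block below is a minimal sketch of a product-limit (Kaplan-Meier) estimator under that calling convention, not the repository's implementation. `kaplan_meier_sketch` and `SurvivalPointSketch` are hypothetical names; the sketch assumes only that each returned point exposes `time` and `survival`, as `plot_survival_curve` reads them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurvivalPointSketch:
    # Hypothetical stand-in for the SurvivalPoint type imported by the plots module.
    time: float
    survival: float


def kaplan_meier_sketch(events, censored):
    """Product-limit estimate of S(t) from event times and censoring flags.

    events[i] is the step at which run i hit the event, or its last
    observed step if censored[i] is True (the event never happened).
    """
    observations = sorted(zip(events, censored))
    at_risk = len(observations)
    survival = 1.0
    points = [SurvivalPointSketch(0.0, 1.0)]
    i = 0
    while i < len(observations):
        t = observations[i][0]
        observed_events = 0
        leaving = 0
        # Group every observation that shares the same time t.
        while i < len(observations) and observations[i][0] == t:
            if not observations[i][1]:
                observed_events += 1
            leaving += 1
            i += 1
        # Only uncensored events lower the curve; censored runs just
        # shrink the risk set for later times.
        if observed_events:
            survival *= 1.0 - observed_events / at_risk
            points.append(SurvivalPointSketch(float(t), survival))
        at_risk -= leaving
    return points


# Four runs: events at steps 2 and 3, two runs censored at steps 3 and 5.
curve = kaplan_meier_sketch([2, 3, 3, 5], [False, False, True, True])
```

With `first_correct_write` as the event, a censored run is one that never produced a correct write, so it contributes to the risk set up to its last assistant step but never pulls the curve down.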
--- a/clawbench/dynamics_plots.py
+++ b/clawbench/dynamics_plots.py
@ -0,0 +1,411 @@
+"""Plotting utilities for dynamics analysis.
+
+Generates publication-ready figures from dynamics data and saves to a
+results directory. All plots use matplotlib with the Agg backend so they
+work headlessly.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+from clawbench.dynamics import (
+    Dynamics,
+    StratifiedAssessment,
+    StratumStats,
+    SurvivalPoint,
+)
+
+
+def _savefig(fig: plt.Figure, path: Path) -> None:
+    fig.savefig(path, dpi=150, bbox_inches="tight")
+    plt.close(fig)
+
+
+def _plot_series_curves(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+    *,
+    series_attr: str,
+    ylabel: str,
+    title: str,
+) -> None:
+    """Plot a step-aligned per-run series coloured by label."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    cmap = plt.cm.tab10
+    unique = sorted(set(labels))
+    colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
+
+    for d, lbl in zip(dynamics_list, labels):
+        series = np.asarray(getattr(d, series_attr), dtype=float)
+        if len(series) < 2:
+            continue
+        ax.plot(series, alpha=0.6, color=colour_map[lbl], linewidth=1)
+
+    for lbl in unique:
+        ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
+    ax.legend(fontsize=8, loc="upper left")
+    ax.set_xlabel("Step")
+    ax.set_ylabel(ylabel)
+    ax.set_title(title)
+    _savefig(fig, out_path)
+
+
+def plot_drift_curves(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+) -> None:
+    """Drift-from-origin curves coloured by label (e.g. task_id or regime)."""
+    _plot_series_curves(
+        dynamics_list,
+        labels,
+        out_path,
+        series_attr="drift",
+        ylabel="Cosine distance from step 0",
+        title="Drift from Origin",
+    )
+
+
+def plot_step_size_curves(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+) -> None:
+    """Step-to-step movement curves coloured by label."""
+    _plot_series_curves(
+        dynamics_list,
+        labels,
+        out_path,
+        series_attr="step_size",
+        ylabel="Cosine distance from previous step",
+        title="Step-to-Step Movement",
+    )
+
+
+def plot_pca_trajectories(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+) -> None:
+    """PCA phase portraits (PC1 vs PC2) coloured by label."""
+    fig, ax = plt.subplots(figsize=(8, 8))
+    cmap = plt.cm.tab10
+    unique = sorted(set(labels))
+    colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
+
+    for d, lbl in zip(dynamics_list, labels):
+        if d.pca_trajectory is None or len(d.pca_trajectory) < 2:
+            continue
+        traj = d.pca_trajectory
+        ax.plot(traj[:, 0], traj[:, 1], alpha=0.5, color=colour_map[lbl], linewidth=1)
+        ax.scatter(traj[0, 0], traj[0, 1], color=colour_map[lbl], marker="o", s=30, zorder=5)
+        ax.scatter(traj[-1, 0], traj[-1, 1], color=colour_map[lbl], marker="x", s=30, zorder=5)
+
+    for lbl in unique:
+        ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
+    ax.legend(fontsize=8)
+    ax.set_xlabel("PC1")
+    ax.set_ylabel("PC2")
+    ax.set_title("PCA Phase Portrait (o=start, x=end)")
+    _savefig(fig, out_path)
+
+
+def plot_regime_distribution(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Stacked bar chart of regime counts per stratum."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    all_regimes = sorted({r for s in strata for r in s.regime_counts})
+    x = np.arange(len(strata))
+    bottom = np.zeros(len(strata))
+    cmap = plt.cm.Set2
+
+    for j, regime in enumerate(all_regimes):
+        counts = [s.regime_counts.get(regime, 0) for s in strata]
+        ax.bar(x, counts, bottom=bottom, label=regime, color=cmap(j / max(len(all_regimes) - 1, 1)))
+        bottom += np.array(counts)
+
+    ax.set_xticks(x)
+    ax.set_xticklabels([s.name for s in strata], rotation=30, ha="right")
+    ax.set_ylabel("Count")
+    ax.set_title(f"Regime Distribution by {stratifier_name}")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+
+
+def plot_score_distributions(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Box plots of score distributions per stratum."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    data = [s.scores for s in strata if len(s.scores) > 0]
+    labels = [s.name for s in strata if len(s.scores) > 0]
+
+    if data:
+        ax.boxplot(data, labels=labels, patch_artist=True,
+                   boxprops=dict(facecolor="lightblue", alpha=0.7))
+    ax.set_ylabel("Score")
+    ax.set_title(f"Score Distribution by {stratifier_name}")
+    plt.setp(ax.get_xticklabels(), rotation=30, ha="right")
+    _savefig(fig, out_path)
+
+
+def plot_survival_curve(
+    km_points: list[SurvivalPoint],
+    event_name: str,
+    out_path: Path,
+) -> None:
+    """Kaplan-Meier survival curve."""
+    if not km_points:
+        return
+    fig, ax = plt.subplots(figsize=(8, 5))
+    times = [p.time for p in km_points]
+    surv = [p.survival for p in km_points]
+    ax.step(times, surv, where="post", linewidth=2, color="steelblue")
+    ax.fill_between(times, surv, step="post", alpha=0.15, color="steelblue")
+    ax.set_xlabel("Step")
+    ax.set_ylabel("Survival probability")
+    ax.set_title(f"Kaplan-Meier: {event_name}")
+    ax.set_ylim(-0.05, 1.05)
+    _savefig(fig, out_path)
+
+
+def plot_stratum_dynamics_heatmap(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Heatmap of mean dynamics metrics across strata."""
+    metrics = ["entropy", "error_rate", "constraint", "memory_depth", "mean_drift", "mean_step_size"]
+    data = np.zeros((len(strata), len(metrics)))
+    for i, s in enumerate(strata):
+        arrays = [s.entropy_dist, s.error_rate_dist, s.constraint_dist,
+                  s.memory_depth_dist, s.mean_drift_dist, s.mean_step_size_dist]
+        for j, arr in enumerate(arrays):
+            data[i, j] = float(np.mean(arr)) if len(arr) > 0 else 0.0
+
+    fig, ax = plt.subplots(figsize=(10, max(3, len(strata) * 0.6)))
+    im = ax.imshow(data, aspect="auto", cmap="YlOrRd")
+    ax.set_xticks(range(len(metrics)))
+    ax.set_xticklabels(metrics, rotation=30, ha="right")
+    ax.set_yticks(range(len(strata)))
+    ax.set_yticklabels([s.name for s in strata])
+    for i in range(len(strata)):
+        for j in range(len(metrics)):
+            ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", fontsize=8)
+    fig.colorbar(im, ax=ax, shrink=0.8)
+    ax.set_title(f"Dynamics Metrics by {stratifier_name}")
+    _savefig(fig, out_path)
+
+
+def plot_pairwise_divergence_curves(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Plot mean pairwise trajectory divergence over aligned steps."""
+    if not per_task_sensitivity:
+        return False
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    cmap = plt.cm.tab10
+    tasks = sorted(per_task_sensitivity)
+    colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
+
+    plotted = False
+    for task in tasks:
+        summary = per_task_sensitivity[task]
+        mean_curve = np.asarray(summary.get("mean_divergence_curve", []), dtype=float)
+        std_curve = np.asarray(summary.get("std_divergence_curve", []), dtype=float)
+        if len(mean_curve) == 0:
+            continue
+        steps = np.arange(len(mean_curve))
+        ax.plot(steps, mean_curve, linewidth=2, color=colour_map[task], label=task)
+        if len(std_curve) == len(mean_curve):
+            ax.fill_between(steps, mean_curve - std_curve, mean_curve + std_curve, color=colour_map[task], alpha=0.12)
+        plotted = True
+
+    if not plotted:
+        plt.close(fig)
+        return False
+
+    ax.set_xlabel("Aligned step")
+    ax.set_ylabel("Pairwise embedding divergence")
+    ax.set_title("Do Repeated Trajectories Converge or Diverge?")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+    return True
+
+
+def plot_pairwise_contraction_scatter(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Scatter initial vs final pairwise divergence; below diagonal means convergence."""
+    if not per_task_sensitivity:
+        return False
+
+    fig, ax = plt.subplots(figsize=(7, 6))
+    cmap = plt.cm.tab10
+    tasks = sorted(per_task_sensitivity)
+    colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
+
+    max_seen = 0.0
+    plotted = False
+    for task in tasks:
+        points = per_task_sensitivity[task].get("pair_points", [])
+        if not points:
+            continue
+        xs = [point["initial_divergence"] for point in points]
+        ys = [point["final_divergence"] for point in points]
+        max_seen = max(max_seen, *(xs + ys))
+        ax.scatter(xs, ys, s=60, alpha=0.8, color=colour_map[task], label=task)
+        plotted = True
+
+    if not plotted:
+        plt.close(fig)
+        return False
+
+    limit = max(max_seen, 0.1)
+    ax.plot([0, limit], [0, limit], linestyle="--", color="black", linewidth=1)
+    ax.set_xlabel("Initial pairwise divergence")
+    ax.set_ylabel("Final pairwise divergence")
+    ax.set_title("Pairwise Trajectory Contraction")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+    return True
+
+
+def plot_sensitivity_heatmap(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Heatmap of per-task sensitivity metrics."""
+    if not per_task_sensitivity:
+        return False
+
+    metrics = [
+        ("mean_score_delta", "score_delta"),
+        ("mean_tool_edit_distance", "tool_edit"),
+        ("mean_family_js_divergence", "js_div"),
+        ("mean_lyapunov_proxy", "lyapunov"),
+        ("fraction_converging_pairs", "frac_converging"),
+    ]
+    tasks = sorted(per_task_sensitivity)
+    data = np.zeros((len(tasks), len(metrics)))
+    for row_idx, task in enumerate(tasks):
+        summary = per_task_sensitivity[task]
+        for col_idx, (key, _label) in enumerate(metrics):
+            data[row_idx, col_idx] = float(summary.get(key, 0.0))
+
+    fig, ax = plt.subplots(figsize=(9, max(3, len(tasks) * 0.7)))
+    im = ax.imshow(data, aspect="auto", cmap="Blues")
+    ax.set_xticks(range(len(metrics)))
+    ax.set_xticklabels([label for _key, label in metrics], rotation=30, ha="right")
+    ax.set_yticks(range(len(tasks)))
+    ax.set_yticklabels(tasks)
+    for row_idx in range(len(tasks)):
+        for col_idx in range(len(metrics)):
+            ax.text(col_idx, row_idx, f"{data[row_idx, col_idx]:.2f}", ha="center", va="center", fontsize=8)
+    fig.colorbar(im, ax=ax, shrink=0.8)
+    ax.set_title("Pairwise Sensitivity by Task")
+    _savefig(fig, out_path)
+    return True
+
+
+def generate_all_plots(
+    dynamics_list: list[Dynamics],
+    runs: list,
+    stratified: dict[str, StratifiedAssessment],
+    km_points: list[SurvivalPoint] | None = None,
+    event_name: str = "first_correct_write",
+    out_dir: Path = Path("results"),
+    sensitivity_summary: dict[str, dict] | None = None,
+) -> list[Path]:
+    """Generate all dynamics plots and return list of saved paths."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    saved: list[Path] = []
+
+    # Labels by regime and by tier (tier parsed from a "t<N>_" / "t<N>-" task-id prefix)
+    regime_labels = [d.regime.value for d in dynamics_list]
+    tier_labels = []
+    for r in runs:
+        tid = r.task_id.lower()
+        tier = "unknown"
+        for i in range(1, 6):
+            if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
+                tier = f"tier{i}"
+                break
+        tier_labels.append(tier)
+
+    # Drift curves by regime
+    p = out_dir / "drift_by_regime.png"
+    plot_drift_curves(dynamics_list, regime_labels, p)
+    saved.append(p)
+
+    # Drift curves by tier
+    p = out_dir / "drift_by_tier.png"
+    plot_drift_curves(dynamics_list, tier_labels, p)
+    saved.append(p)
+
+    p = out_dir / "step_size_by_regime.png"
+    plot_step_size_curves(dynamics_list, regime_labels, p)
+    saved.append(p)
+
+    p = out_dir / "step_size_by_tier.png"
+    plot_step_size_curves(dynamics_list, tier_labels, p)
+    saved.append(p)
+
+    # PCA trajectories
+    has_pca = any(d.pca_trajectory is not None for d in dynamics_list)
+    if has_pca:
+        p = out_dir / "pca_by_regime.png"
+        plot_pca_trajectories(dynamics_list, regime_labels, p)
+        saved.append(p)
+        p = out_dir / "pca_by_tier.png"
+        plot_pca_trajectories(dynamics_list, tier_labels, p)
+        saved.append(p)
+
+    # Per-stratifier plots
+    for name, sa in stratified.items():
+        p = out_dir / f"regimes_by_{name}.png"
+        plot_regime_distribution(sa.strata, name, p)
+        saved.append(p)
+
+        p = out_dir / f"scores_by_{name}.png"
+        plot_score_distributions(sa.strata, name, p)
+        saved.append(p)
+
+        p = out_dir / f"dynamics_heatmap_{name}.png"
+        plot_stratum_dynamics_heatmap(sa.strata, name, p)
+        saved.append(p)
+
+    # Survival curve
+    if km_points:
+        p = out_dir / f"survival_{event_name}.png"
+        plot_survival_curve(km_points, event_name, p)
+        saved.append(p)
+
+    per_task_sensitivity = (sensitivity_summary or {}).get("same_task", {}).get("per_task", {})
+    p = out_dir / "pairwise_divergence_by_task.png"
+    if plot_pairwise_divergence_curves(per_task_sensitivity, p):
+        saved.append(p)
+
+    p = out_dir / "pairwise_contraction_scatter.png"
+    if plot_pairwise_contraction_scatter(per_task_sensitivity, p):
+        saved.append(p)
+
+    p = out_dir / "sensitivity_heatmap.png"
+    if plot_sensitivity_heatmap(per_task_sensitivity, p):
+        saved.append(p)
+
+    return saved
--- a/clawbench/harness.py
+++ b/clawbench/harness.py
@ -103,6 +103,7 @@ class BenchmarkHarness:
        self.concurrency = max(1, int(concurrency))
        self.browser_concurrency = max(1, int(browser_concurrency))
        self.repo_root = Path(__file__).parent.parent
+        self.last_task_runs: dict[str, list[TaskRunResult]] = {}

    async def run(self) -> BenchmarkResult:
        tasks = load_all_tasks(
@ -148,6 +149,7 @@ class BenchmarkHarness:
                f"({mean_run:.1f}s avg, concurrency={self.concurrency})[/dim]"
            )

+        self.last_task_runs = all_results
        return self._aggregate(tasks, all_results)

    async def _execute_runs(
--- a/clawbench/submission_models.py
+++ b/clawbench/submission_models.py
@ -0,0 +1,147 @@
+"""Preset model catalog and selection helpers for the Space submit UI."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+CUSTOM_PRESET_LABEL = "(custom)"
+
+PRESET_AUDIENCE_ALL = "All Presets"
+PRESET_AUDIENCE_CLAW = "Claw Users"
+PRESET_AUDIENCE_BUDGET = "Budget Researchers"
+
+PRESET_AUDIENCE_CHOICES = (
+    PRESET_AUDIENCE_ALL,
+    PRESET_AUDIENCE_CLAW,
+    PRESET_AUDIENCE_BUDGET,
+)
+
+
+@dataclass(frozen=True)
+class PresetModel:
+    label: str
+    model_id: str
+    provider: str
+    audiences: tuple[str, ...]
+
+
+PRESET_MODELS = (
+    PresetModel(
+        label="GPT-OSS 20B (Ollama)",
+        model_id="ollama/gpt-oss:20b",
+        provider="ollama",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Qwen 3.5 27B (Ollama)",
+        model_id="ollama/qwen3.5:27b",
+        provider="ollama",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Qwen3 32B",
+        model_id="huggingface/Qwen/Qwen3-32B",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Gemma 4 26B MoE",
+        model_id="huggingface/google/gemma-4-26B-A4B-it",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="GLM 5.1 (754B MoE)",
+        model_id="huggingface/zai-org/GLM-5.1",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="GLM 5 (400B MoE)",
+        model_id="huggingface/zai-org/GLM-5",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="DeepSeek R1",
+        model_id="huggingface/deepseek-ai/DeepSeek-R1",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Kimi K2 Instruct",
+        model_id="huggingface/moonshotai/Kimi-K2-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="MiniMax M2.5",
+        model_id="huggingface/MiniMaxAI/MiniMax-M2.5",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Llama 3.3 70B",
+        model_id="huggingface/meta-llama/Llama-3.3-70B-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Llama 3.1 70B",
+        model_id="huggingface/meta-llama/Llama-3.1-70B-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Claude Sonnet 4.6",
+        model_id="anthropic/claude-sonnet-4-6",
+        provider="anthropic",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Claude Opus 4.6",
+        model_id="anthropic/claude-opus-4-6",
+        provider="anthropic",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+)
+
+PRESET_MODEL_MAP = {preset.label: preset.model_id for preset in PRESET_MODELS}
+_PRESET_BY_LABEL = {preset.label: preset for preset in PRESET_MODELS}
+
+
+def infer_provider(model_id: str) -> str:
+    normalized = model_id.strip()
+    if not normalized or "/" not in normalized:
+        return ""
+    return normalized.split("/", 1)[0].strip().lower()
+
+
+def preset_models_for_audience(audience: str | None) -> list[PresetModel]:
+    if not audience or audience == PRESET_AUDIENCE_ALL:
+        return list(PRESET_MODELS)
+    return [preset for preset in PRESET_MODELS if audience in preset.audiences]
+
+
+def preset_labels_for_audience(audience: str | None) -> list[str]:
+    return [preset.label for preset in preset_models_for_audience(audience)]
+
+
+def resolve_model_selection(
+    model: str,
+    preset_label: str,
+    provider: str = "",
+) -> tuple[str, str]:
+    selected_model = model.strip()
+    selected_provider = provider.strip()
+
+    preset = _PRESET_BY_LABEL.get(preset_label)
+    if preset is not None:
+        selected_model = preset.model_id
+        if not selected_provider:
+            selected_provider = preset.provider
+
+    if not selected_provider:
+        selected_provider = infer_provider(selected_model)
+
+    return selected_model, selected_provider
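Reviewer note: the precedence in `resolve_model_selection` is easy to misread, so here is a self-contained sketch of it. It re-declares a simplified `Preset` (a hypothetical stand-in for `PresetModel`, without the `audiences` field) and copies one catalog entry from the diff; `resolve`, `my-gateway`, and `Example-Org/example-model` are illustrative names, not part of the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    # Simplified stand-in for PresetModel (audiences omitted).
    label: str
    model_id: str
    provider: str


# One entry copied from the catalog in the diff; the rest are omitted.
PRESETS = {
    preset.label: preset
    for preset in (Preset("GPT-OSS 20B (Ollama)", "ollama/gpt-oss:20b", "ollama"),)
}


def infer_provider(model_id: str) -> str:
    """Take the text before the first "/" as the provider, lowercased."""
    normalized = model_id.strip()
    if not normalized or "/" not in normalized:
        return ""
    return normalized.split("/", 1)[0].strip().lower()


def resolve(model: str, preset_label: str, provider: str = "") -> tuple[str, str]:
    """Same precedence as resolve_model_selection in the diff."""
    selected_model = model.strip()
    selected_provider = provider.strip()

    # A known preset label always replaces the typed model id.
    preset = PRESETS.get(preset_label)
    if preset is not None:
        selected_model = preset.model_id
        # An explicitly supplied provider still beats the preset's provider.
        if not selected_provider:
            selected_provider = preset.provider

    # Last resort: infer the provider from the model id prefix.
    if not selected_provider:
        selected_provider = infer_provider(selected_model)
    return selected_model, selected_provider


from_preset = resolve("ignored/typed-id", "GPT-OSS 20B (Ollama)")
custom = resolve("  Example-Org/example-model ", "(custom)")
overridden = resolve("", "GPT-OSS 20B (Ollama)", provider="my-gateway")
bare = resolve("no-slash-id", "(custom)")
```

In this sketch an unrecognized label such as `(custom)` falls through to the typed model id, and a model id with no `/` yields an empty provider rather than a guess.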
--- a/scripts/classify_regimes.py
+++ b/scripts/classify_regimes.py
@ -1,140 +1,112 @@
-"""Classify each archived run's dynamical regime from its turn trajectory.
+#!/usr/bin/env python3
+"""Classify posterior run trajectories into dynamical regimes.

-Following "When LLMs Are Dreaming..." §What We Expect to See:
+We embed each assistant turn using bag-of-words text plus tool-call summaries,
+then compute simple geometric proxies:

-  TRAPPED/ATTRACTOR   — low support (Vol_log), high recurrence, high BOPS.
-                        Agent converged to a point; may be good (solved it)
-                        or bad (got stuck in a loop on a single idea).
+    drift_mean = mean ||x_t - x_{t-1}||
+    from_start = max ||x_t - x_0||
+    recurrence = max cosine(x_i, x_j) for non-adjacent turns
+    vol_log    = log det(Sigma + eps I)

-  LIMIT-CYCLE         — high recurrence + bounded drift + quasi-periodic revisits.
-                        Agent loops between a few states.
-
-  DIFFUSIVE/WANDERING — growing support, rising drift, low recurrence.
-                        Agent explores without converging; often "goal drift".
-
-  SENSITIVE           — (requires paraphrased-pair runs; skip here.)
-
-  TOO-SHORT           — trajectory < 3 assistant turns; can't classify dynamics.
-
-We work in a TF-IDF bag-of-words embedding space (same vocab as C(q)),
-with each turn's state vector = its assistant text + tool-call args.
-
-Metrics per run:
-  - drift_mean:  mean ||e_t − e_{t−1}|| across turns
-  - from_start:  max ||e_t − e_0||  (farthest the run drifted from origin)
-  - recurrence:  max_{i<j, j−i≥2} cos(e_i, e_j)  — best return-after-gap match
-  - vol_log:     log det(Σ + εI) over turn states — support volume proxy
-
-Classifier rules (tuned empirically on the distribution):
-  if n_turns < 3                              → too_short
-  elif drift_mean < 0.15 and vol_log < −6     → trapped
-  elif recurrence > 0.80 and drift_mean < 0.25 → limit_cycle
-  elif drift_mean > 0.35 and vol_log > −3     → diffusive
-  else                                         → mixed
-
-Output: reports/regimes.json with per-run classification.
-
-Usage:
-    .venv/bin/python3 scripts/classify_regimes.py
+Runs are then bucketed into coarse regimes such as trapped, limit_cycle, and
+diffusive using quartile-based thresholds estimated from the observed archive.
 """

 from __future__ import annotations

+import argparse
 import json
 import re
+import sys
 from collections import Counter, defaultdict
 from pathlib import Path

 import numpy as np

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-MODELS = [
-    "anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
-    "anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
-    "google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
-    "openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
-    "openrouter_qwen_qwen3.6-plus",
-]
+from clawbench.dynamics_archive import load_task_runs_by_model

 WORD_RE = re.compile(r"[a-z]{3,}")
-STOPWORDS = set("the and that with this have from what your will can but not "
-                "was will are been one would there been they will their has "
-                "had its were only some than about these which into also each "
-                "when where them how who them very much more most other then "
-                "here such does like just make many like want need take".split())
+STOPWORDS = set(
+    "the and that with this have from what your will can but not "
+    "was are been one would there they their has had its were only some "
+    "than about these which into also each when where them how who very "
+    "much more most other then here such does like just make many want need take".split()
+)


 def tokenize(text: str) -> list[str]:
    return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]


-def build_vocab(all_turn_texts: list[str], top_k: int = 500) -> dict[str, int]:
-    c = Counter()
-    for t in all_turn_texts:
-        c.update(set(tokenize(t)))
-    return {w: i for i, (w, _) in enumerate(c.most_common(top_k))}
+def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
+    counter = Counter()
+    for text in texts:
+        counter.update(set(tokenize(text)))
+    return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}


 def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
-    v = np.zeros(len(vocab), dtype=np.float32)
-    for w, c in Counter(tokenize(text)).items():
-        if w in vocab:
-            v[vocab[w]] = c
-    n = np.linalg.norm(v)
-    return v / n if n > 0 else v
+    vec = np.zeros(len(vocab), dtype=np.float32)
+    for word, cnt in Counter(tokenize(text)).items():
+        if word in vocab:
+            vec[vocab[word]] = cnt
+    norm = np.linalg.norm(vec)
+    return vec / norm if norm > 0 else vec


-def turn_texts(run_data: dict) -> list[str]:
-    """Extract one text string per assistant turn (text + tool-call summary)."""
+def turn_texts(run, fallback_any_message: bool = False) -> list[str]:
+    source = run.transcript.messages if fallback_any_message else run.transcript.assistant_messages
    out = []
-    for m in run_data.get("transcript", {}).get("messages", []):
-        if m.get("role") != "assistant":
-            continue
+    for msg in source:
        parts = []
-        if m.get("text"):
-            parts.append(m["text"])
-        for tc in (m.get("tool_calls") or []):
-            name = tc.get("name", "")
-            args_str = json.dumps(tc.get("arguments", {}))[:200]
-            parts.append(f"{name} {args_str}")
+        if msg.text:
+            parts.append(msg.text)
+        for tc in msg.tool_calls:
+            parts.append(tc.name)
+            if tc.input:
+                parts.append(json.dumps(tc.input, sort_keys=True)[:200])
        if parts:
            out.append(" ".join(parts))
    return out


-def trajectory_metrics(vecs: np.ndarray) -> dict:
-    """Compute dynamical metrics over a (n_turns, d) trajectory matrix."""
+def trajectory_metrics(vecs: np.ndarray) -> dict[str, float]:
+    """Compute drift, recurrence, and support-volume proxies for one run."""
    n = vecs.shape[0]
    if n < 2:
-        return {"n_turns": n, "drift_mean": 0.0, "from_start": 0.0,
-                "recurrence": 0.0, "vol_log": -12.0}
-    # Drift: consecutive distances
+        return {
+            "n_turns": float(n),
+            "drift_mean": 0.0,
+            "from_start": 0.0,
+            "recurrence": 0.0,
+            "vol_log": -12.0,
+        }
+
    diffs = np.linalg.norm(np.diff(vecs, axis=0), axis=1)
    drift_mean = float(diffs.mean())
-    # From start: max distance from turn 0
-    dists_from_0 = np.linalg.norm(vecs - vecs[0:1], axis=1)
-    from_start = float(dists_from_0.max())
-    # Recurrence: best non-adjacent cosine similarity (ignoring immediate neighbors)
+    from_start = float(np.linalg.norm(vecs - vecs[0:1], axis=1).max())
+
    recurrence = 0.0
    for i in range(n):
        for j in range(i + 2, n):
-            ni, nj = np.linalg.norm(vecs[i]), np.linalg.norm(vecs[j])
+            ni = np.linalg.norm(vecs[i])
+            nj = np.linalg.norm(vecs[j])
            if ni > 0 and nj > 0:
-                c = float(vecs[i] @ vecs[j] / (ni * nj))
-                if c > recurrence:
-                    recurrence = c
-    # Vol_log: log det of turn-state covariance
+                sim = float(vecs[i] @ vecs[j] / (ni * nj))
+                recurrence = max(recurrence, sim)
+
    if n >= 3:
-        Sigma = np.cov(vecs.T)
-        # Use log|Σ + εI|; since d is large (500) we take eigenvalues + clip
-        eigs = np.linalg.eigvalsh(Sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
+        sigma = np.cov(vecs.T)
+        eigs = np.linalg.eigvalsh(sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
        vol_log = float(np.log(np.clip(eigs, 1e-12, None)).sum())
    else:
        vol_log = -12.0
+
    return {
-        "n_turns": n,
+        "n_turns": float(n),
        "drift_mean": drift_mean,
        "from_start": from_start,
        "recurrence": recurrence,
@ -142,109 +114,105 @@ def trajectory_metrics(vecs: np.ndarray) -> dict:
    }


-def classify(m: dict, thresholds: dict) -> str:
-    """Classify based on quartile thresholds of the actual distribution.
-
-    Thresholds (set empirically from observed distribution):
-      drift_low  = p25  drift_hi = p75
-      vol_low    = p25  vol_hi   = p75
-      rec_hi     = p75
-
-    Rules (priority order):
-      n_turns < 3             → too_short
-      drift < drift_low AND vol < vol_low  → trapped
-      rec > rec_hi AND drift < median       → limit_cycle
-      drift > drift_hi AND vol > vol_hi     → diffusive
-      else                                  → mixed
-    """
-    n = m["n_turns"]
-    if n < 3:
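+def classify(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
+    """Map trajectory metrics to a coarse regime label.
+
+    Rules are checked in priority order:
+      n_turns < 3                               -> too_short
+      drift < drift_low and vol < vol_low       -> trapped
+      recurrence > rec_hi and drift < drift_med -> limit_cycle
+      drift > drift_hi and vol > vol_hi         -> diffusive
+      otherwise                                 -> mixed
+    """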
+    n_turns = int(metrics["n_turns"])
+    if n_turns < 3:
        return "too_short"
-    d = m["drift_mean"]
-    rec = m["recurrence"]
-    vol = m["vol_log"]
-    if d < thresholds["drift_low"] and vol < thresholds["vol_low"]:
+    drift = metrics["drift_mean"]
+    recurrence = metrics["recurrence"]
+    vol = metrics["vol_log"]
+
+    if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
        return "trapped"
-    if rec > thresholds["rec_hi"] and d < thresholds["drift_med"]:
+    if recurrence > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
        return "limit_cycle"
-    if d > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
+    if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
        return "diffusive"
    return "mixed"


 def main() -> None:
-    # First pass: collect turn texts to build vocab
+    parser = argparse.ArgumentParser(description="Classify cached run regimes")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
    all_turn_texts: list[str] = []
-    run_turns: dict[tuple, list[str]] = {}
-    for model in MODELS:
-        for rf in (ARCH / model).rglob("run*.json"):
-            try:
-                d = json.loads(rf.read_text())
-            except Exception:
-                continue
-            task = rf.parent.name
-            run_idx = int(re.match(r"run(\d+)", rf.stem).group(1))
-            ts = turn_texts(d)
-            run_turns[(model, task, run_idx)] = ts
-            all_turn_texts.extend(ts)
+    run_turns: dict[str, list[str]] = {}
+
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            for run in runs:
+                ts = turn_texts(run, fallback_any_message=False)
+                key = f"{model_name}/{task_id}/run{run.run_index}"
+                run_turns[key] = ts
+                all_turn_texts.extend(ts)
+
+    used_fallback_messages = False
+    if not all_turn_texts:
+        used_fallback_messages = True
+        all_turn_texts = []
+        run_turns = {}
+        for model_name, task_runs in grouped.items():
+            for task_id, runs in task_runs.items():
+                for run in runs:
+                    ts = turn_texts(run, fallback_any_message=True)
+                    key = f"{model_name}/{task_id}/run{run.run_index}"
+                    run_turns[key] = ts
+                    all_turn_texts.extend(ts)
+
+    if not all_turn_texts:
+        raise SystemExit("No usable turn text found in archive.")

    vocab = build_vocab(all_turn_texts, top_k=500)
-    print(f"Runs collected: {len(run_turns)}  vocab size: {len(vocab)}")

-    # Second pass: vectorize + compute metrics
-    per_run: dict[str, dict] = {}
+    per_run: dict[str, dict[str, float | str]] = {}
    for key, ts in run_turns.items():
-        model, task, run_idx = key
        if not ts:
            continue
-        vecs = np.stack([vectorize(t, vocab) for t in ts])
-        m = trajectory_metrics(vecs)
-        per_run[f"{model}/{task}/run{run_idx}"] = m
+        vecs = np.stack([vectorize(text, vocab) for text in ts])
+        per_run[key] = trajectory_metrics(vecs)

-    # Derive thresholds from actual distribution of n_turns>=3 runs
-    drifts = np.array([v["drift_mean"] for v in per_run.values() if v["n_turns"] >= 3])
-    recs = np.array([v["recurrence"] for v in per_run.values() if v["n_turns"] >= 3])
-    vols = np.array([v["vol_log"] for v in per_run.values() if v["n_turns"] >= 3])
-    thresholds = {
-        "drift_low": float(np.percentile(drifts, 25)),
-        "drift_med": float(np.percentile(drifts, 50)),
-        "drift_hi":  float(np.percentile(drifts, 75)),
-        "vol_low":   float(np.percentile(vols, 25)),
-        "vol_hi":    float(np.percentile(vols, 75)),
-        "rec_hi":    float(np.percentile(recs, 75)),
-    }
-    print(f"\nThresholds (quartile-based from observed distribution):")
-    for k, v in thresholds.items():
-        print(f"  {k:<12}  {v:>10.3f}")
+    eligible = [r for r in per_run.values() if int(r["n_turns"]) >= 3]
+    if eligible:
+        drifts = np.array([float(v["drift_mean"]) for v in eligible])
+        recs = np.array([float(v["recurrence"]) for v in eligible])
+        vols = np.array([float(v["vol_log"]) for v in eligible])
+        thresholds = {
+            "drift_low": float(np.percentile(drifts, 25)),
+            "drift_med": float(np.percentile(drifts, 50)),
+            "drift_hi": float(np.percentile(drifts, 75)),
+            "vol_low": float(np.percentile(vols, 25)),
+            "vol_hi": float(np.percentile(vols, 75)),
+            "rec_hi": float(np.percentile(recs, 75)),
+        }
+    else:
+        thresholds = {
+            "drift_low": 0.15,
+            "drift_med": 0.25,
+            "drift_hi": 0.35,
+            "vol_low": -6.0,
+            "vol_hi": -3.0,
+            "rec_hi": 0.8,
+        }

-    # Apply classifier with thresholds
-    for key in per_run:
-        per_run[key]["regime"] = classify(per_run[key], thresholds)
+    for key, metrics in per_run.items():
+        metrics["regime"] = classify(metrics, thresholds)
+        metrics["turn_source"] = "any_message" if used_fallback_messages else "assistant"

-    # Summary by regime
-    counts = Counter(v["regime"] for v in per_run.values())
-    print(f"\nRegime distribution (n={len(per_run)} runs):")
-    for regime, n in counts.most_common():
-        print(f"  {regime:<14} {n:>4}  ({100*n/len(per_run):>4.1f}%)")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out = args.reports_dir / "regimes.json"
+    out.write_text(json.dumps(per_run, indent=2), encoding="utf-8")

-    # Per-model regime breakdown
-    print(f"\n{'Model':<10}  " + " ".join(f"{r:>11}" for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]))
-    print("-" * 70)
-    pm_counts = defaultdict(Counter)
-    for key, v in per_run.items():
-        model = key.split("/")[0]
-        pm_counts[model][v["regime"]] += 1
-    for model in MODELS:
-        row = [f"{model.split('_')[-1][:9]:<10}"]
-        for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]:
-            row.append(f"{pm_counts[model][r]:>11}")
-        print("  ".join(row))
-
-    # Write output
-    out = ROOT / "reports" / "regimes.json"
-    out.parent.mkdir(exist_ok=True)
-    out.write_text(json.dumps(per_run, indent=2))
-    print(f"\nWrote: {out}")
+    counts = Counter(str(v["regime"]) for v in per_run.values())
+    print(f"Wrote: {out}")
+    print(f"Regime counts: {dict(counts)}")


 if __name__ == "__main__":
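
The drift and recurrence proxies computed in `trajectory_metrics` above can be illustrated on a toy trajectory. This is a minimal sketch, not the script itself: the helper name `toy_trajectory_metrics` and the 2-D turn states are hypothetical, and the support-volume term is left out for brevity.

```python
import numpy as np


def toy_trajectory_metrics(vecs: np.ndarray) -> dict[str, float]:
    """Illustrative re-statement of mean drift and recurrence for one run."""
    # Drift: mean Euclidean distance between consecutive turn states.
    diffs = np.linalg.norm(np.diff(vecs, axis=0), axis=1)
    # Recurrence: best cosine similarity between non-adjacent turns.
    recurrence = 0.0
    n = vecs.shape[0]
    for i in range(n):
        for j in range(i + 2, n):  # skip immediate neighbours
            ni, nj = np.linalg.norm(vecs[i]), np.linalg.norm(vecs[j])
            if ni > 0 and nj > 0:
                recurrence = max(recurrence, float(vecs[i] @ vecs[j] / (ni * nj)))
    return {"drift_mean": float(diffs.mean()), "recurrence": recurrence}


# Hypothetical run that alternates between two states (A, B, A, B).
looping = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
# Hypothetical run that never leaves its first state.
stuck = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

loop_metrics = toy_trajectory_metrics(looping)
stuck_metrics = toy_trajectory_metrics(stuck)
print(loop_metrics, stuck_metrics)
```

Both toy runs revisit an earlier state exactly, so recurrence alone does not separate them; they differ only in drift. That is one reason the classifier combines several metrics rather than thresholding a single one.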
--- a/scripts/compute_constraint_index.py
+++ b/scripts/compute_constraint_index.py
@ -1,145 +1,127 @@
-"""Compute Constraint Index C(q) per task from existing v4-19-full archive.
+#!/usr/bin/env python3
+"""Compute posterior Constraint Index C(q) from cached runs.

-Following "When LLMs Are Dreaming..." paper §Query-design:
+Task-level constraint index:

-  C(q) = z(PR(q)) + z(entropy(q)) + z(BOPS(q))
+    C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))

 Where:
-  - PR(q): participation ratio = (tr Σ)² / tr(Σ²) of response embeddings
-           across all (model, run) responses to query q. Low PR = everyone
-           writes similar thing (prompt is constrained). High PR = responses
-           spread out (prompt is open-ended).
-  - entropy(q): Shannon entropy of (discretized) response-feature distribution.
-  - BOPS(q): Bayesian Optimal Prediction Score — how well can we predict
-             response given q? Proxied here as inter-run cosine similarity
-             for the same model (high similarity = high predictability).

-Since we don't have sentence-transformers, we use TF-IDF-style bag-of-words
-from the final assistant message per run. This is crude but measures the
-same signal — whether models produce similar vs divergent output.
+    PR(q)   = participation ratio of the task response covariance
+    H(q)    = Shannon entropy of the covariance eigenspectrum
+    BOPS(q) = within-model inter-run predictability proxy

-Output: reports/constraint_index.json with per-task C(q) components +
-        combined z-score.
+High C(q) means a task is more constrained: models and repeated runs tend to
+land in a narrower response manifold. Low C(q) means the task is more open or
+stylistically underconstrained.

-Usage:
-    .venv/bin/python3 scripts/compute_constraint_index.py
+This implementation uses a normalized bag-of-words representation built from
+the full assistant trajectory text plus tool-call names and compacted inputs.
 """

 from __future__ import annotations

+import argparse
 import json
 import re
-import glob
+import sys
 from collections import Counter, defaultdict
 from pathlib import Path

 import numpy as np
-from scipy.stats import entropy as shannon_entropy

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-MODELS = [
-    "anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
-    "anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
-    "google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
-    "openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
-    "openrouter_qwen_qwen3.6-plus",
-]
+from clawbench.dynamics_archive import load_task_runs_by_model

 WORD_RE = re.compile(r"[a-z]{3,}")
-STOPWORDS = set("the and that with this have from what your will can but not "
-                "was will are been one would there been they will their has "
-                "had its were only some than about these which into also each "
-                "when where them how who them very much more most other then "
-                "here such does like just make many like want need take".split())
+STOPWORDS = set(
+    "the and that with this have from what your will can but not "
+    "was are been one would there they their has had its were only some "
+    "than about these which into also each when where them how who very "
+    "much more most other then here such does like just make many want need take".split()
+)


-def final_assistant_text(run_path: Path, max_chars: int = 4000) -> str:
-    """Extract the last assistant message text + tool-call arg summary."""
-    try:
-        d = json.loads(run_path.read_text())
-    except Exception:
-        return ""
-    msgs = d.get("transcript", {}).get("messages", [])
-    texts = []
-    for m in msgs:
-        if m.get("role") != "assistant":
-            continue
-        if m.get("text"):
-            texts.append(m["text"])
-        for tc in (m.get("tool_calls") or []):
-            name = tc.get("name", "")
-            args_str = json.dumps(tc.get("arguments", {}))[:200]
-            texts.append(f"{name} {args_str}")
-    blob = " ".join(texts)[:max_chars]
-    return blob
+def _assistant_trajectory_text(run, max_chars: int = 4000) -> str:
+    """Join all assistant text, tool-call names, and truncated JSON inputs, capped at max_chars."""
+    parts = []
+    for message in run.transcript.assistant_messages:
+        if message.text:
+            parts.append(message.text)
+        for call in message.tool_calls:
+            parts.append(call.name)
+            if call.input:
+                parts.append(json.dumps(call.input, sort_keys=True)[:200])
+    return " ".join(p for p in parts if p).strip()[:max_chars]
+
+
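+def _fallback_text_from_any_message(run) -> str:
+    """Return the content of the last message of any role that has text or tool calls, or an empty string."""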
+    for msg in reversed(run.transcript.messages):
+        parts = []
+        if msg.text:
+            parts.append(msg.text)
+        for call in msg.tool_calls:
+            parts.append(call.name)
+            if call.input:
+                parts.append(json.dumps(call.input, sort_keys=True)[:200])
+        if parts:
+            return " ".join(parts).strip()
+    return ""


 def tokenize(text: str) -> list[str]:
-    return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]
+    return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]


 def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
-    """Build a vocab of the top-k most common tokens across all texts."""
-    counter = Counter()
-    for t in texts:
-        counter.update(set(tokenize(t)))
-    return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
+    counts = Counter()
+    for text in texts:
+        counts.update(set(tokenize(text)))
+    return {word: idx for idx, (word, _) in enumerate(counts.most_common(top_k))}


 def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
-    """TF-IDF-ish: token frequency normalized to unit L2 for cosine geometry."""
-    v = np.zeros(len(vocab), dtype=np.float32)
+    vec = np.zeros(len(vocab), dtype=np.float32)
    toks = tokenize(text)
    if not toks:
-        return v
+        return vec
    counts = Counter(toks)
-    for w, c in counts.items():
-        if w in vocab:
-            v[vocab[w]] = c
-    n = np.linalg.norm(v)
-    return v / n if n > 0 else v
+    for word, cnt in counts.items():
+        if word in vocab:
+            vec[vocab[word]] = cnt
+    norm = np.linalg.norm(vec)
+    return vec / norm if norm > 0 else vec


 def participation_ratio(X: np.ndarray) -> float:
-    """PR(X) = (tr Σ)² / tr(Σ²). Measures effective dimensionality 1–d."""
+    """PR(X) = (tr Sigma)^2 / tr(Sigma^2), an effective dimensionality proxy."""
    if X.shape[0] < 2:
        return 1.0
-    Sigma = np.cov(X.T)
-    if Sigma.ndim == 0:
+    sigma = np.cov(X.T)
+    if sigma.ndim == 0:
        return 1.0
-    tr = np.trace(Sigma)
-    tr_sq = np.trace(Sigma @ Sigma)
+    tr = np.trace(sigma)
+    tr_sq = np.trace(sigma @ sigma)
    if tr_sq < 1e-12:
        return 1.0
-    return float(tr ** 2 / tr_sq)
+    return float((tr**2) / tr_sq)


-def response_entropy(X: np.ndarray, n_clusters: int = 8) -> float:
-    """Entropy of a k-means-like discretization of responses.
-
-    Since we have small n per task (~27 responses), we cluster by nearest-
-    centroid using the top-few PCA directions. Simpler: use normalized
-    eigenvalues of covariance as a proxy for entropy over principal modes.
-    """
+def response_entropy(X: np.ndarray) -> float:
+    """Entropy over normalized covariance eigenvalues, in bits."""
    if X.shape[0] < 2:
        return 0.0
-    Sigma = np.cov(X.T)
-    eigs = np.linalg.eigvalsh(Sigma)
+    sigma = np.cov(X.T)
+    eigs = np.linalg.eigvalsh(sigma)
    eigs = np.clip(eigs, 1e-12, None)
-    eigs = eigs / eigs.sum()
-    return float(shannon_entropy(eigs, base=2))
+    probs = eigs / eigs.sum()
+    return float(-np.sum(probs * np.log2(probs)))


 def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> float:
-    """BOPS proxy: inter-run cosine similarity within same model.
-
-    High similarity = predictable (high BOPS). Low similarity = novel each run.
-    Returns mean cosine across all pairs within each model, averaged across models.
-    """
+    """Mean within-model pairwise cosine similarity across repeated runs."""
    per_model_means = []
-    for _model, vecs in run_vecs.items():
+    for vecs in run_vecs.values():
        if len(vecs) < 2:
            continue
        sims = []
@ -154,91 +136,88 @@ def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> floa
    return float(np.mean(per_model_means)) if per_model_means else 0.0


+def zscore(value: float, arr: np.ndarray) -> float:
+    """Z-score of value against arr (population std); 0.0 when arr has near-zero spread."""
+    std = arr.std()
+    return float((value - arr.mean()) / std) if std > 1e-12 else 0.0
+
+
 def main() -> None:
-    # Gather: per-task list of texts + per-model list of per-run vectors
+    parser = argparse.ArgumentParser(description="Compute posterior constraint index per task")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
    per_task_texts: dict[str, list[str]] = defaultdict(list)
-    per_task_model_runs: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
-    for model in MODELS:
-        model_dir = ARCH / model
-        if not model_dir.exists():
-            continue
-        for task_dir in model_dir.iterdir():
-            if not task_dir.is_dir():
-                continue
-            task = task_dir.name
-            for rf in sorted(task_dir.glob("run*.json")):
-                text = final_assistant_text(rf)
+    per_task_model_texts: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
+
+    use_fallback_messages = False
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            for run in runs:
+                text = _assistant_trajectory_text(run)
                if text:
-                    per_task_texts[task].append(text)
-                    per_task_model_runs[task][model].append(text)
+                    per_task_texts[task_id].append(text)
+                    per_task_model_texts[task_id][model_name].append(text)

-    print(f"Tasks with responses: {len(per_task_texts)}")
+    all_texts = [text for texts in per_task_texts.values() for text in texts]
+    if not all_texts:
+        use_fallback_messages = True
+        for model_name, task_runs in grouped.items():
+            for task_id, runs in task_runs.items():
+                for run in runs:
+                    text = _fallback_text_from_any_message(run)
+                    if text:
+                        per_task_texts[task_id].append(text)
+                        per_task_model_texts[task_id][model_name].append(text)
+        all_texts = [text for texts in per_task_texts.values() for text in texts]
+
+    if not all_texts:
+        raise SystemExit("No usable text found in cached transcripts.")

-    # Build a GLOBAL vocab across all tasks for comparable vector spaces
-    all_texts = [t for ts in per_task_texts.values() for t in ts]
    vocab = build_vocab(all_texts, top_k=500)
-    print(f"Global vocab size: {len(vocab)}")
-
-    # Compute per-task metrics
-    per_task: dict[str, dict] = {}
-    for task, texts in sorted(per_task_texts.items()):
-        if len(texts) < 5:
-            continue
-        X = np.stack([vectorize(t, vocab) for t in texts])  # (n_responses, vocab_dim)
+    per_task: dict[str, dict[str, float | str]] = {}
+    for task_id, texts in sorted(per_task_texts.items()):
+        X = np.stack([vectorize(text, vocab) for text in texts])
        pr = participation_ratio(X)
        ent = response_entropy(X)
-        # BOPS: within-model run predictability
-        model_vecs: dict[str, list[np.ndarray]] = {}
-        for m, ts in per_task_model_runs[task].items():
-            model_vecs[m] = [vectorize(t, vocab) for t in ts]
+        model_vecs = {
+            model_name: [vectorize(text, vocab) for text in model_texts]
+            for model_name, model_texts in per_task_model_texts[task_id].items()
+        }
        bops = bops_inter_run_predictability(model_vecs)
-        per_task[task] = {
+        per_task[task_id] = {
            "n_responses": len(texts),
            "PR": pr,
            "entropy": ent,
            "BOPS": bops,
+            "data_source": "fallback_any_message" if use_fallback_messages else "assistant_trajectory",
        }

-    # Z-score each component across tasks → combine into C(q)
+    if not per_task:
+        raise SystemExit("Not enough data to compute C(q).")
+
    prs = np.array([v["PR"] for v in per_task.values()])
    ents = np.array([v["entropy"] for v in per_task.values()])
    bopss = np.array([v["BOPS"] for v in per_task.values()])

-    def z(x, arr):
-        return float((x - arr.mean()) / (arr.std() or 1.0))
+    for task_id, v in per_task.items():
+        z_pr = zscore(v["PR"], prs)
+        z_ent = zscore(v["entropy"], ents)
+        z_bops = zscore(v["BOPS"], bopss)
+        v["z_PR"] = z_pr
+        v["z_entropy"] = z_ent
+        v["z_BOPS"] = z_bops
+        v["C_q"] = -z_pr - z_ent + z_bops

-    for task, v in per_task.items():
-        zpr = z(v["PR"], prs)
-        zent = z(v["entropy"], ents)
-        zbops = z(v["BOPS"], bopss)
-        # Paper: higher PR/entropy = MORE open-ended. Higher BOPS = MORE predictable.
-        # "Constraint" = opposite of openness. C(q) high ⇒ constrained task.
-        # So: C(q) = −z(PR) − z(entropy) + z(BOPS)
-        v["z_PR"] = zpr
-        v["z_entropy"] = zent
-        v["z_BOPS"] = zbops
-        v["C_q"] = -zpr - zent + zbops
-
-    # Sort + print
-    ranked = sorted(per_task.items(), key=lambda kv: -kv[1]["C_q"])
-    print(f"\n{'Task':<38} {'n':>3}  {'PR':>5}  {'H':>5}  {'BOPS':>5}  {'C(q)':>6}  (constraint level)")
-    print("-" * 78)
-    for task, v in ranked:
-        print(f"{task:<38} {v['n_responses']:>3}  {v['PR']:>5.2f}  {v['entropy']:>5.2f}  "
-              f"{v['BOPS']:>5.2f}  {v['C_q']:>+6.2f}")
-
-    out_path = ROOT / "reports" / "constraint_index.json"
-    out_path.parent.mkdir(exist_ok=True)
-    out_path.write_text(json.dumps(per_task, indent=2))
-    print(f"\nWrote: {out_path}")
-
-    # Bucket summary
-    highs = [t for t, v in per_task.items() if v["C_q"] > 0.5]
-    lows = [t for t, v in per_task.items() if v["C_q"] < -0.5]
-    mids = [t for t, v in per_task.items() if -0.5 <= v["C_q"] <= 0.5]
-    print(f"\nHigh-constraint (C>+0.5): {len(highs)} tasks  (responses converge)")
-    print(f"Mid:                       {len(mids)} tasks")
-    print(f"Low-constraint (C<-0.5):   {len(lows)} tasks  (responses diverge — open-ended)")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out_path = args.reports_dir / "constraint_index.json"
+    out_path.write_text(json.dumps(per_task, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
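
To make the sign convention in `C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))` concrete, here is a small sketch on two made-up tasks. The response vectors, the task names, and the BOPS values are all hypothetical; the helpers mirror the formulas in the script above but are re-stated so that the block runs on its own.

```python
import numpy as np


def participation_ratio(X: np.ndarray) -> float:
    """(tr Sigma)^2 / tr(Sigma^2) of the response covariance."""
    sigma = np.cov(X.T)
    tr_sq = np.trace(sigma @ sigma)
    return float(np.trace(sigma) ** 2 / tr_sq) if tr_sq > 1e-12 else 1.0


def eigen_entropy_bits(X: np.ndarray) -> float:
    """Shannon entropy (bits) of the normalized covariance eigenvalues."""
    eigs = np.clip(np.linalg.eigvalsh(np.cov(X.T)), 1e-12, None)
    probs = eigs / eigs.sum()
    return float(-np.sum(probs * np.log2(probs)))


def zscore(value: float, arr: np.ndarray) -> float:
    std = arr.std()
    return float((value - arr.mean()) / std) if std > 1e-12 else 0.0


# Hypothetical response vectors (rows = responses, columns = features).
constrained = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]])
open_ended = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

# BOPS values are invented for illustration only.
tasks = {
    "constrained": {
        "PR": participation_ratio(constrained),
        "H": eigen_entropy_bits(constrained),
        "BOPS": 0.9,
    },
    "open_ended": {
        "PR": participation_ratio(open_ended),
        "H": eigen_entropy_bits(open_ended),
        "BOPS": 0.3,
    },
}

prs = np.array([t["PR"] for t in tasks.values()])
ents = np.array([t["H"] for t in tasks.values()])
bopss = np.array([t["BOPS"] for t in tasks.values()])

c_q = {
    name: -zscore(t["PR"], prs) - zscore(t["H"], ents) + zscore(t["BOPS"], bopss)
    for name, t in tasks.items()
}
print(c_q)
```

In this toy setup the task whose responses vary along a single direction gets the higher C(q), and the task whose responses spread evenly over two directions gets the lower one, which matches the "high C(q) means more constrained" reading in the module docstring. With only two tasks the z-scores are necessarily symmetric, so the magnitudes carry no meaning here.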
--- a/scripts/generate_dynamical_report.py
+++ b/scripts/generate_dynamical_report.py
@ -1,221 +1,144 @@
-"""Assemble a combined dynamical-systems report integrating:
-  - Constraint Index C(q) per task
-  - Regime classification per run
-  - Seed vs capability variance
-  - Survival / hazard analysis
+#!/usr/bin/env python3
+"""Assemble a combined posterior dynamical-systems markdown report.

-Requires: reports/constraint_index.json, reports/regimes.json,
-          reports/variance_decomposition.json, reports/survival_analysis.json
+Inputs:
+    - constraint_index.json
+    - regimes.json
+    - variance_decomposition.json
+    - survival_analysis.json
+    - snr_weighted_ranking.json (optional)

-Output: reports/EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md
+Output:
+    - EVAL_REPORT_DYNAMICAL.md
+
+The goal is to keep a compact human-readable summary next to the machine
+outputs produced by the posterior analysis pipeline.
 """

 from __future__ import annotations

+import argparse
 import json
 from collections import Counter, defaultdict
 from pathlib import Path
-from statistics import mean

-ROOT = Path(__file__).resolve().parent.parent
-REPORTS = ROOT / "reports"

-MODEL_MAP = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+def _read_json(path: Path):
+    if not path.exists():
+        raise SystemExit(f"Missing required report file: {path}")
+    return json.loads(path.read_text(encoding="utf-8"))


 def main() -> None:
-    cq = json.loads((REPORTS / "constraint_index.json").read_text())
-    regimes = json.loads((REPORTS / "regimes.json").read_text())
-    variance = json.loads((REPORTS / "variance_decomposition.json").read_text())
-    survival = json.loads((REPORTS / "survival_analysis.json").read_text())
-
-    lines = []
-    L = lines.append
-    L("# ClawBench — Dynamical Systems Analysis (v2026-4-19-full)")
-    L("")
-    L("Inspired by *\"When LLMs Are Dreaming, Where Do They Go?\"* — treats")
-    L("agent runs as dynamical systems and extracts signal ClawBench's flat")
-    L("run_score can't: task constraint level, per-run regime, noise vs")
-    L("signal ratio, and per-turn survival curves.")
-    L("")
-
-    # ----------------- 1. Constraint Index summary -----------------
-    L("## 1. Constraint Index C(q) per task")
-    L("")
-    L("C(q) = −z(PR) − z(entropy) + z(BOPS). High C(q) = task is constrained")
-    L("(responses converge); low C(q) = open-ended (responses diverge).")
-    L("")
-    high = sorted([(t, v) for t, v in cq.items() if v["C_q"] > 0.5],
-                  key=lambda kv: -kv[1]["C_q"])
-    low = sorted([(t, v) for t, v in cq.items() if v["C_q"] < -0.5],
-                 key=lambda kv: kv[1]["C_q"])
-    mid = [t for t, v in cq.items() if -0.5 <= v["C_q"] <= 0.5]
-    L(f"- **High-constraint ({len(high)} tasks, C>+0.5):** {', '.join(t for t, _ in high[:5])}, …")
-    L(f"- **Low-constraint ({len(low)} tasks, C<−0.5):** {', '.join(t for t, _ in low[:5])}, …")
-    L(f"- **Middle ({len(mid)} tasks):** {', '.join(mid[:5])}, …")
-    L("")
-    L("Top 5 most-constrained and most-divergent tasks:")
-    L("")
-    L("| Constraint | Task | PR | Entropy | BOPS | C(q) |")
-    L("|---|---|:---:|:---:|:---:|:---:|")
-    for t, v in high[:5]:
-        L(f"| HIGH | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
-    for t, v in low[:5]:
-        L(f"| LOW | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
-    L("")
-
-    # ----------------- 2. Regime distribution -----------------
-    L("## 2. Dynamical regime per run")
-    L("")
-    L("Each run's turn-by-turn trajectory classified by drift, recurrence,")
-    L("and support volume thresholds (quartile-based).")
-    L("")
-    pm = defaultdict(Counter)
-    for key, v in regimes.items():
-        model_sub = key.split("/")[0]
-        # Reverse-map to label
-        label = next((l for l, (s, _) in MODEL_MAP.items() if s == model_sub), None)
-        if label:
-            pm[label][v["regime"]] += 1
-    L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
-    L("|---|:---:|:---:|:---:|:---:|:---:|")
-    for label, (_sub, pretty) in MODEL_MAP.items():
-        c = pm[label]
-        L(f"| {pretty} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
-          f"{c['diffusive']} | {c['mixed']} |")
-    L("")
-    L("**Interpretation:**")
-    L("- `trapped` = low drift + small support: agent converges to a point.")
-    L("  Often good on constrained tasks, sometimes 'stuck'.")
-    L("- `limit_cycle` = repeats similar states non-consecutively: tool-use loop.")
-    L("- `diffusive` = keeps exploring without converging. Goal drift risk.")
-    L("- `mixed` = no strong signature.")
-    L("")
-    L("Notable findings:")
-    L("")
-    # Find outliers
-    trap_counts = [(label, pm[label]["trapped"]) for label in MODEL_MAP]
-    cycle_counts = [(label, pm[label]["limit_cycle"]) for label in MODEL_MAP]
-    trap_counts.sort(key=lambda x: -x[1])
-    cycle_counts.sort(key=lambda x: -x[1])
-    L(f"- Most `trapped` runs: **{MODEL_MAP[trap_counts[0][0]][1]}** ({trap_counts[0][1]} runs) —")
-    L(f"  converges aggressively; often one-shot answer without iteration.")
-    L(f"- Most `limit_cycle` runs: **{MODEL_MAP[cycle_counts[0][0]][1]}** ({cycle_counts[0][1]} runs) —")
-    L(f"  repeats tool patterns between turns; check for productive vs stuck loops.")
-    L("")
-
-    # ----------------- 3. Variance decomposition -----------------
-    L("## 3. Seed-noise vs capability-signal")
-    L("")
-    agg = variance["aggregate"]
-    L(f"- **Seed-noise variance** (same model, 3 runs): **{agg['mean_seed_var']:.4f}**")
-    L(f"- **Capability variance** (across models): **{agg['mean_cap_var']:.4f}**")
-    L(f"- **Capability fraction: {agg['capability_fraction']:.1%}**")
-    L(f"  (= fraction of benchmark variance that reflects real model differences)")
-    L("")
-    L("**The other ~47% is seed noise.** Any ranking gap < √(2·seed_var) ≈")
-    L(f"0.20 between two models is within noise. Top-5 models' gap is 0.02 →")
-    L("**statistically indistinguishable.**")
-    L("")
-    L("### SNR tiers across 40 tasks")
-    L("")
-    per_task = variance["per_task"]
-    hi = [r for r in per_task if r["snr"] >= 5]
-    mid = [r for r in per_task if 1 <= r["snr"] < 5]
-    lo = [r for r in per_task if r["snr"] < 1]
-    L(f"- **High-SNR ({len(hi)} tasks, SNR ≥ 5):** reliably discriminate models")
-    for r in hi[:3]:
-        L(f"  - `{r['task']}` (SNR={r['snr']:.1f})")
-    L(f"- **Mid-SNR ({len(mid)} tasks, 1 ≤ SNR < 5):** moderate signal")
-    L(f"- **Low-SNR ({len(lo)} tasks, SNR < 1):** seed noise dominates; these")
-    L(f"  tasks give essentially random rankings")
-    for r in sorted(lo, key=lambda x: x['snr'])[:3]:
-        L(f"  - `{r['task']}` (SNR={r['snr']:.2f}) — random")
-    L("")
-
-    # ----------------- 4. Survival analysis -----------------
-    L("## 4. Per-turn survival: when do runs fail?")
-    L("")
-    L("T_F = first turn where agent emits empty response or run ends in failure.")
-    L("S(t) = fraction of runs still on-track past turn t. Low = dies early.")
-    L("")
-    L("| Model | Median fail turn | S(3) | S(5) | S(8) | S(12) | S(20) |")
-    L("|---|:---:|:---:|:---:|:---:|:---:|:---:|")
-    for label, (_sub, pretty) in MODEL_MAP.items():
-        d = survival.get(label, {})
-        surv = d.get("survival", [0]*20)
-        med = d.get("median_fail_turn", "—")
-        med_str = f"{med:.1f}" if isinstance(med, (int, float)) and med != float("inf") else str(med)
-        L(f"| {pretty} | {med_str} | {surv[2]:.2f} | {surv[4]:.2f} | "
-          f"{surv[7]:.2f} | {surv[11]:.2f} | {surv[19]:.2f} |")
-    L("")
-    # Narrative
-    surv_rank_t8 = sorted(
-        [(label, survival[label]["survival"][7])
-         for label in MODEL_MAP if label in survival],
-        key=lambda x: -x[1]
+    parser = argparse.ArgumentParser(description="Generate a combined dynamical report markdown")
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=None,
+        help="Markdown output path; defaults to <reports-dir>/EVAL_REPORT_DYNAMICAL.md",
    )
-    best = MODEL_MAP[surv_rank_t8[0][0]][1]
-    worst = MODEL_MAP[surv_rank_t8[-1][0]][1]
-    L(f"- **{best}** survives longest — {surv_rank_t8[0][1]:.0%} of runs still")
-    L(f"  producing output at turn 8.")
-    L(f"- **{worst}** dies earliest — only {surv_rank_t8[-1][1]:.0%} make it to turn 8.")
+    args = parser.parse_args()
+
+    reports = args.reports_dir
+    output_path = args.output or (reports / "EVAL_REPORT_DYNAMICAL.md")
+    cq = _read_json(reports / "constraint_index.json")
+    regimes = _read_json(reports / "regimes.json")
+    variance = _read_json(reports / "variance_decomposition.json")
+    survival = _read_json(reports / "survival_analysis.json")
+    ranking_path = reports / "snr_weighted_ranking.json"
+    ranking = json.loads(ranking_path.read_text(encoding="utf-8")) if ranking_path.exists() else None
+
+    lines: list[str] = []
+    L = lines.append
+
+    L("# ClawBench Posterior Dynamical Report")
    L("")
-    L("This is signal invisible in flat run_score: two models can score")
-    L("similarly but have very different failure profiles. Pick accordingly")
-    L("for long-horizon deployments.")
+    L("This report combines posterior-only diagnostics from cached run artifacts.")
    L("")

-    # ----------------- 5. Integrated view -----------------
-    L("## 5. Integrated view — combining all four lenses")
+    L("## 1. Constraint Index C(q)")
    L("")
-    L("For a model to be **reliably good** at a task, we need:")
-    L("- (a) It scores well (run_score high)")
-    L("- (b) Variance across seeds is low (predictable)")
-    L("- (c) It doesn't exhibit pathological regime (trapped on wrong answer / cycling)")
-    L("- (d) It survives multi-turn without dying early")
+    values = [(task, float(data.get("C_q", 0.0))) for task, data in cq.items()]
+    values.sort(key=lambda row: row[1], reverse=True)
+    highs = [row for row in values if row[1] > 0.5]
+    lows = [row for row in values if row[1] < -0.5]
+    L(f"- High-constraint tasks (C > 0.5): {len(highs)}")
+    L(f"- Low-constraint tasks (C < -0.5): {len(lows)}")
    L("")
-    L("These lenses disagree constructively:")
+    if values:
+        L("Top tasks by C(q):")
+        L("")
+        L("| Task | C(q) |")
+        L("|---|---:|")
+        for task, c_q in values[:10]:
+            L(f"| {task} | {c_q:+.3f} |")
+        L("")
+
+    L("## 2. Regime Classification")
    L("")
-    L("- **Opus 4.6** tops flat run_score but median failure at turn 5.5 (earlier than Opus 4.7's 7).")
-    L("- **GPT 5.4** is mid-pack on flat score but has highest S(8)=0.60 — long-horizon champion.")
-    L("- **Sonnet 4.6** most `trapped` runs — it commits early and sticks. Good on")
-    L("  constrained tasks, bad on open-ended (cf. memory-recall-continuation 0.15).")
-    L("- **GLM 5.1** most balanced regime distribution; justifies broad performance.")
-    L("- **Kimi K2.5** median fail at turn 3 — it's not just low-scoring, it's")
-    L("  specifically fragile under multi-turn execution.")
+    by_model = defaultdict(Counter)
+    for key, row in regimes.items():
+        model = key.split("/")[0]
+        regime = row.get("regime", "unknown")
+        by_model[model][regime] += 1
+
+    L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
+    L("|---|---:|---:|---:|---:|---:|")
+    for model in sorted(by_model):
+        c = by_model[model]
+        L(
+            f"| {model} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
+            f"{c['diffusive']} | {c['mixed']} |"
+        )
    L("")

-    # ----------------- 6. What to do next -----------------
-    L("## 6. Implications for the benchmark")
+    L("## 3. Variance Decomposition")
+    L("")
+    agg = variance.get("aggregate", {})
+    L(f"- Mean seed variance: {agg.get('mean_seed_var', 0.0):.6f}")
+    L(f"- Mean capability variance: {agg.get('mean_cap_var', 0.0):.6f}")
+    L(f"- Capability fraction: {agg.get('capability_fraction', 0.0):.1%}")
+    L(f"- High-SNR tasks: {agg.get('high_snr_tasks', 0)}")
+    L(f"- Mid-SNR tasks: {agg.get('mid_snr_tasks', 0)}")
+    L(f"- Low-SNR tasks: {agg.get('low_snr_tasks', 0)}")
    L("")
-    L("- **47% seed noise** means any gap < 0.02 is meaningless. Treat top-5")
-    L("  as a statistical tie. Dropping the 21 low-SNR tasks would sharpen")
-    L("  remaining rankings considerably.")
-    L("- **Weight tasks by SNR × |C(q)|** instead of flat mean. High-SNR,")
-    L("  high-|C(q)| tasks give the cleanest capability signal.")
-    L("- **Report survival curves alongside run_score** to surface long-horizon")
-    L("  capability that single-number metrics hide.")
-    L("- **Flag 'trapped' runs that scored high** — the model may have")
-    L("  guessed-and-committed rather than reasoned; not same reliability.")
-    L("- **Add a Tier 6 long-horizon (100+ turn) task set** to actually")
-    L("  measure the dynamical regimes the paper proposes — current")
-    L("  trajectories are too short (median 6 assistant turns) for clean")
-    L("  Lyapunov or attractor diagnostics.")

-    out = REPORTS / "EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md"
-    out.write_text("\n".join(lines) + "\n")
-    print(f"Wrote: {out}")
+    L("## 4. Survival Analysis")
+    L("")
+    L("| Model | Runs | Events | Median failure turn | S(3) | S(5) | S(8) |")
+    L("|---|---:|---:|---:|---:|---:|---:|")
+    for model in sorted(survival):
+        row = survival[model]
+        surv = row.get("survival", [0.0] * 8)
+        med = row.get("median_fail_turn", "inf")
+        if isinstance(med, float) and med == float("inf"):
+            med_display = "inf"
+        else:
+            med_display = f"{float(med):.1f}"
+        L(
+            f"| {model} | {row.get('n_runs', 0)} | {row.get('n_events', 0)} | "
+            f"{med_display} | {surv[2] if len(surv) > 2 else 0.0:.2f} | "
+            f"{surv[4] if len(surv) > 4 else 0.0:.2f} | {surv[7] if len(surv) > 7 else 0.0:.2f} |"
+        )
+    L("")
+
+    if ranking is not None:
+        L("## 5. SNR-weighted Ranking")
+        L("")
+        L("| Rank | Model | Flat | SNR x \\|C(q)\\| | Winsorized | Coverage |")
+        L("|---:|---|---:|---:|---:|---:|")
+        for idx, row in enumerate(ranking.get("results", []), start=1):
+            L(
+                f"| {idx} | {row.get('model', '')} | {row.get('flat', 0.0):.4f} | "
+                f"{row.get('snr_x_abs_cq', 0.0):.4f} | {row.get('snr_x_abs_cq_winsorized', 0.0):.4f} | "
+                f"{row.get('coverage', 0)} |"
+            )
+        L("")
+
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+    print(f"Wrote: {output_path}")


 if __name__ == "__main__":
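
Review note on the report tables: in GitHub-flavored Markdown a literal `|` inside a cell (as in a column titled `SNR x |C(q)|`) has to be written `\|`, and a table is only recognized when the header row and the delimiter row have the same number of cells. The sketch below checks that property for the ranking-table header. `md_cell` and `count_cells` are hypothetical helpers introduced here for illustration; they are not part of the repository.

```python
import re


def md_cell(text: str) -> str:
    """Escape literal pipes so the text stays inside one table cell."""
    return text.replace("|", r"\|")


def count_cells(row: str) -> int:
    """Count cells in a row written with leading and trailing pipes."""
    unescaped = re.findall(r"(?<!\\)\|", row)
    return len(unescaped) - 1


# Header with unescaped pipes, and the delimiter row that follows it.
raw_header = "| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |"
delimiter = "|---:|---|---:|---:|---:|---:|"

# Building the header from escaped cells keeps the column count aligned.
columns = ["Rank", "Model", "Flat", "SNR x |C(q)|", "Winsorized", "Coverage"]
safe_header = "| " + " | ".join(md_cell(c) for c in columns) + " |"

print(count_cells(raw_header), count_cells(delimiter), count_cells(safe_header))
```

With the unescaped header the cell counts disagree, so a GFM renderer would treat the block as plain text; the escaped header matches the six-cell delimiter row.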
--- a/scripts/run_posterior_dynamics_pipeline.py
+++ b/scripts/run_posterior_dynamics_pipeline.py
@ -0,0 +1,89 @@
+#!/usr/bin/env python3
+"""Run the full posterior dynamical analysis pipeline."""
+
+from __future__ import annotations
+
+import argparse
+import subprocess
+import sys
+from pathlib import Path
+
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(REPO_ROOT))
+
+from clawbench.dynamics_archive import discover_model_roots, load_task_runs_archive, write_dynamics_report
+
+
+def _run(cmd: list[str]) -> None:
+    print("$", " ".join(cmd))
+    result = subprocess.run(cmd, cwd=REPO_ROOT)
+    if result.returncode != 0:
+        raise SystemExit(result.returncode)
+
+
+def _resolve_path(path: Path) -> Path:
+    return path if path.is_absolute() else (REPO_ROOT / path)
+
+
+def _write_dynamics_reports(
+    archive_dir: Path,
+    output_dir: Path,
+    tier: str | None,
+) -> None:
+    roots = discover_model_roots(archive_dir)
+    if not roots:
+        raise SystemExit(f"No cached runs found under {archive_dir}")
+
+    multiple_models = len(roots) > 1
+    wrote_any = False
+    for model_name, model_dir in roots.items():
+        task_runs = load_task_runs_archive(model_dir, tier=tier)
+        if not task_runs:
+            continue
+
+        wrote_any = True
+        model_output_dir = output_dir / model_name if multiple_models else output_dir
+        report_path, plots = write_dynamics_report(task_runs, model_output_dir)
+        n_runs = sum(len(runs) for runs in task_runs.values())
+
+        print(f"[dynamics] {model_name}: loaded {n_runs} cached runs across {len(task_runs)} tasks")
+        print(f"[dynamics] {model_name}: wrote {report_path}")
+        print(f"[dynamics] {model_name}: saved {len(plots)} plots to {model_output_dir}/")
+
+    if not wrote_any:
+        raise SystemExit(f"No cached runs found under {archive_dir}")
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run posterior dynamics pipeline end to end")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--output-dir", type=Path, default=Path("results/posterior_dynamics"))
+    parser.add_argument(
+        "--include-dynamics-report",
+        action="store_true",
+        help="Also build per-model dynamics.json files and plots from the archive.",
+    )
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    py = sys.executable
+    archive_dir = _resolve_path(args.archive_dir)
+    reports_dir = _resolve_path(args.reports_dir)
+    output_dir = _resolve_path(args.output_dir)
+    tier_args = ["--tier", args.tier] if args.tier else []
+    scripts_dir = REPO_ROOT / "scripts"
+
+    _run([py, str(scripts_dir / "compute_constraint_index.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "classify_regimes.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "variance_decomp.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "survival_analysis.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "snr_weighted_ranking.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "generate_dynamical_report.py"), "--reports-dir", str(reports_dir)])
+    if args.include_dynamics_report:
+        _write_dynamics_reports(archive_dir, output_dir, args.tier)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/snr_weighted_ranking.py
+++ b/scripts/snr_weighted_ranking.py
@ -1,148 +1,130 @@
-"""SNR × |C(q)|-weighted ranking — the dynamical-systems-informed metric.
+#!/usr/bin/env python3
+"""SNR x |C(q)| weighted ranking from posterior cached runs.

-Motivation: from variance_decomp.py we know 47% of run_score variance is
-seed noise. From compute_constraint_index.py we know some tasks are
-high-constraint (everyone converges) and others are open-ended (responses
-diverge for style reasons, not capability).
+Weighted headline score:

-Weighted mean:
-    w(task) = SNR(task) × |C(q)(task)|
-    score(model) = Σ_task w(task) · mean_run_score(task, model) / Σ_task w(task)
+    w(q) = max(0, SNR(q)) * |C(q)|
+    score(model) = sum_q w(q) * mean_run_score(model, q) / sum_q w(q)

-Why:
- High SNR tasks contribute more than low-SNR tasks (noise-weighted)
- |C(q)| amplifies tasks that are either strongly constrained OR strongly
-  open-ended (i.e. measures what they're supposed to measure, regardless
-  of polarity)
- Moderate C(q) tasks (C near 0) are inherently ambiguous — down-weighted
+We also report:

-Outputs:
-  - Per-model weighted score
-  - Comparison against flat-mean ranking
-  - Published to reports/snr_weighted_ranking.json
+    snr_only              = SNR-weighted mean
+    snr_x_abs_cq          = SNR x |C(q)| weighted mean
+    snr_x_abs_cq_winsorized = same, but top task weights are clamped at p95
+
+This keeps noisy low-SNR tasks from dominating and upweights tasks whose
+response geometry suggests a stronger capability signal.
 """

 from __future__ import annotations

-import glob
+import argparse
 import json
+import sys
 from collections import defaultdict
 from pathlib import Path
 from statistics import mean

 import numpy as np

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
-REPORTS = ROOT / "reports"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-MODELS = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+from clawbench.dynamics_archive import load_task_runs_by_model


 def main() -> None:
-    cq = json.loads((REPORTS / "constraint_index.json").read_text())
-    var = json.loads((REPORTS / "variance_decomposition.json").read_text())
-    snr_by_task = {r["task"]: r["snr"] for r in var["per_task"]}
+    parser = argparse.ArgumentParser(description="Compute SNR-weighted posterior model ranking")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()

-    # Per (model, task): mean run_score over the 3 runs
-    per_mt: dict[str, dict[str, list[float]]] = defaultdict(dict)
-    for label, (sub, _) in MODELS.items():
-        for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
-            try:
-                d = json.loads(Path(p).read_text())
-            except Exception:
-                continue
-            task = p.split("/")[-2]
-            per_mt[label].setdefault(task, []).append(d.get("run_score", 0))
-    per_mt_mean = {
-        m: {t: mean(v) for t, v in d.items() if v} for m, d in per_mt.items()
+    cq_path = args.reports_dir / "constraint_index.json"
+    var_path = args.reports_dir / "variance_decomposition.json"
+    if not cq_path.exists() or not var_path.exists():
+        raise SystemExit("Missing prerequisite reports: run compute_constraint_index.py and variance_decomp.py first.")
+
+    cq = json.loads(cq_path.read_text(encoding="utf-8"))
+    var = json.loads(var_path.read_text(encoding="utf-8"))
+    snr_by_task = {row["task"]: row["snr"] for row in var.get("per_task", [])}
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
+    per_model_task_scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            per_model_task_scores[model_name][task_id] = [float(run.run_score) for run in runs]
+
+    per_model_task_mean = {
+        model_name: {
+            task_id: mean(vals)
+            for task_id, vals in task_scores.items()
+            if vals
+        }
+        for model_name, task_scores in per_model_task_scores.items()
    }

-    # Only consider tasks present in both C(q) and SNR
    common_tasks = sorted(set(cq) & set(snr_by_task))
-    print(f"Using {len(common_tasks)} tasks with both C(q) and SNR.")
+    if not common_tasks:
+        raise SystemExit("No overlap between constraint_index and variance_decomposition task sets.")

-    # Compute weights w(task) = SNR × |C(q)|, clamped to [0, ∞)
-    weights = {}
-    for t in common_tasks:
-        w = max(0.0, snr_by_task[t]) * abs(cq[t]["C_q"])
-        weights[t] = w
-    # Also: SNR-only weighting (simpler, no C(q))
-    snr_weights = {t: max(0.0, snr_by_task[t]) for t in common_tasks}
-    # Also: Winsorize — clamp top-1 task's weight to 95th percentile to
-    # prevent single task from dominating
-    import numpy as _np
-    _w95 = float(_np.percentile(list(weights.values()), 95))
-    weights_wins = {t: min(w, _w95) for t, w in weights.items()}
-    wsum = sum(weights.values())
-    if wsum == 0:
-        print("All weights zero — bail.")
-        return
+    weights = {task: max(0.0, snr_by_task[task]) * abs(cq[task].get("C_q", 0.0)) for task in common_tasks}
+    snr_weights = {task: max(0.0, snr_by_task[task]) for task in common_tasks}

-    # Compute per-model scores under 3 variants
-    results = []
+    w95 = float(np.percentile(list(weights.values()), 95)) if weights else 0.0
+    winsorized = {task: min(weight, w95) for task, weight in weights.items()}
+
+    w_sum = sum(weights.values())
    snr_sum = sum(snr_weights.values())
-    wins_sum = sum(weights_wins.values())
-    for label, (sub, pretty) in MODELS.items():
-        task_means = per_mt_mean.get(label, {})
-        if not task_means:
+    wins_sum = sum(winsorized.values())
+
+    results = []
+    for model_name, task_means in per_model_task_mean.items():
+        covered = [task for task in common_tasks if task in task_means]
+        if not covered:
            continue
-        num_cq = sum(weights[t] * task_means.get(t, 0) for t in common_tasks)
-        num_snr = sum(snr_weights[t] * task_means.get(t, 0) for t in common_tasks)
-        num_wins = sum(weights_wins[t] * task_means.get(t, 0) for t in common_tasks)
-        wscore = num_cq / wsum
-        snr_only = num_snr / snr_sum if snr_sum > 0 else 0
-        wins_score = num_wins / wins_sum if wins_sum > 0 else 0
-        flat = mean(task_means[t] for t in common_tasks if t in task_means)
-        results.append((label, pretty, flat, wscore, snr_only, wins_score))

-    print()
-    print(f"{'Model':<16}  {'Flat':>7}  {'SNR×|C|':>8}  {'Winsorized':>11}  {'SNR-only':>9}")
-    print("-" * 66)
-    # Rank by winsorized variant (primary)
-    for label, pretty, flat, w, snr_only, wins in sorted(results, key=lambda x: -x[5]):
-        print(f"{pretty:<16}  {flat:>7.4f}  {w:>8.4f}  {wins:>11.4f}  {snr_only:>9.4f}")
+        flat = mean(task_means[task] for task in covered)
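+        # Tasks the model never ran contribute 0.0 but keep their weight in the
+        # denominator, so partial coverage lowers the weighted scores; `flat`
+        # above averages covered tasks only. Compare with `coverage`.
+        weighted = (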
+            sum(weights[task] * task_means.get(task, 0.0) for task in common_tasks) / w_sum
+            if w_sum > 1e-12
+            else 0.0
+        )
+        snr_only = (
+            sum(snr_weights[task] * task_means.get(task, 0.0) for task in common_tasks) / snr_sum
+            if snr_sum > 1e-12
+            else 0.0
+        )
+        wins_score = (
+            sum(winsorized[task] * task_means.get(task, 0.0) for task in common_tasks) / wins_sum
+            if wins_sum > 1e-12
+            else 0.0
+        )

-    # Rank comparisons
-    print("\n=== Ranking shifts vs flat-mean (winsorized) ===")
-    flat_rank_order = sorted(results, key=lambda x: -x[2])
-    flat_rank = {r[0]: i + 1 for i, r in enumerate(flat_rank_order)}
-    wins_rank_order = sorted(results, key=lambda x: -x[5])
-    print(f"{'Rank':<5}{'Model':<16} {'Flat':>8}  {'Winsorized':>11}  {'Δrank':>6}")
-    for i, (label, pretty, flat, _w, _snr, wins) in enumerate(wins_rank_order, 1):
-        fr = flat_rank[label]
-        move = ""
-        if fr > i: move = f"↑{fr-i}"
-        elif fr < i: move = f"↓{i-fr}"
-        print(f"{i:<5}{pretty:<16} {flat:>8.4f}  {wins:>11.4f}  {move:>6}")
+        results.append(
+            {
+                "model": model_name,
+                "flat": float(flat),
+                "snr_x_abs_cq": float(weighted),
+                "snr_only": float(snr_only),
+                "snr_x_abs_cq_winsorized": float(wins_score),
+                "coverage": len(covered),
+            }
+        )
+
+    results.sort(key=lambda row: row["snr_x_abs_cq_winsorized"], reverse=True)

-    # Save
    out = {
-        "flat_score": {r[0]: r[2] for r in results},
-        "snr_x_cq_weighted": {r[0]: r[3] for r in results},
-        "snr_x_cq_winsorized": {r[0]: r[5] for r in results},
-        "snr_only_weighted": {r[0]: r[4] for r in results},
-        "weights_per_task": weights,
        "common_tasks": common_tasks,
+        "weights_per_task": weights,
+        "results": results,
    }
-    (REPORTS / "snr_weighted_ranking.json").write_text(json.dumps(out, indent=2))
-    print(f"\nWrote reports/snr_weighted_ranking.json")

-    # Show top-5 contributing tasks (highest weight) for context
-    print()
-    print("Top-10 tasks by weight (SNR × |C(q)|):")
-    for t, w in sorted(weights.items(), key=lambda kv: -kv[1])[:10]:
-        print(f"  {t:<38}  SNR={snr_by_task[t]:>5.1f}  |C(q)|={abs(cq[t]['C_q']):>5.2f}  w={w:>6.2f}")
+    out_path = args.reports_dir / "snr_weighted_ranking.json"
+    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
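
Review note on the weighting scheme: a small worked example may make the formulas in the docstring concrete. This is a sketch on made-up numbers, assuming the same definitions as the script (`w(q) = max(0, SNR(q)) * |C(q)|`, weights clamped at the 95th percentile). The task names, scores, and the pure-Python `percentile` helper are illustrative; the helper mirrors the linear interpolation that `np.percentile` uses by default.

```python
def percentile(values, q):
    """Linear-interpolation percentile, mirroring numpy's default method."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


# Toy inputs (made up for illustration).
snr = {"a": 4.0, "b": 0.5, "c": 2.0, "d": -1.0}
cq = {"a": 1.5, "b": -0.4, "c": 1.0, "d": 2.0}
task_means = {"a": 0.9, "b": 0.5, "d": 0.8}  # this model never ran task "c"

tasks = sorted(set(snr) & set(cq))
weights = {t: max(0.0, snr[t]) * abs(cq[t]) for t in tasks}
w95 = percentile(weights.values(), 95)
winsorized = {t: min(w, w95) for t, w in weights.items()}


def weighted_mean(w):
    """Weighted mean over all common tasks; missing tasks score 0.0."""
    total = sum(w.values())
    if total <= 1e-12:
        return 0.0
    return sum(w[t] * task_means.get(t, 0.0) for t in tasks) / total


flat = sum(task_means[t] for t in tasks if t in task_means) / len(task_means)
score = weighted_mean(weights)
score_wins = weighted_mean(winsorized)
print(weights, w95, flat, score, score_wins)
```

In this toy case the negative-SNR task `d` gets weight zero, the dominant task `a` is clamped from 6.0 to about 5.4, and the uncovered task `c` pulls both weighted scores below the flat mean because its weight stays in the denominator.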
--- a/scripts/survival_analysis.py
+++ b/scripts/survival_analysis.py
@ -1,164 +1,118 @@
-"""Per-turn survival analysis: when do agent runs fail?
+#!/usr/bin/env python3
+"""Per-turn survival analysis on posterior cached runs.

-Following paper §Latent-state survival:
-  T_F = inf { t ≥ 0 : failure at time t }
-  S(t) = P(T_F > t)   — survival function
-  h(t) = P(T_F = t | T_F ≥ t)  — hazard rate
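+For each run, define a failure time T_F as the first assistant turn where the
+agent emits neither text nor tool calls, or the final assistant turn of an
+unsuccessful run (run_score < SUCCESS_THRESHOLD) with delivery outcome in
+{fail, partial}. Runs that meet neither condition are right-censored at their
+final assistant turn.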

-For each run, we define FAILURE as the first turn where:
-  (a) the assistant emits no text AND no tool calls, OR
-  (b) the run's delivery_outcome is 'fail'/'partial' AND the transcript
-      ended at this turn (no more assistant turns follow).
+We then estimate:

-T_F = assistant-turn index of first failure (starting at 1).
-If the run succeeded (run_score ≥ 0.7), T_F is right-censored at the
-final turn count N (i.e. survived the whole trajectory).
+    S(t) = P(T_F > t)
+    h(t) = P(T_F = t | T_F >= t)

-Output per model:
-  - Median turn-to-failure
-  - Empirical survival curve S(t) for t = 1..20
-  - Hazard profile h(t)
-  - Stratified by task-constraint bucket (using C(q) from earlier)
-
-Usage:
-    .venv/bin/python3 scripts/survival_analysis.py
+This exposes long-horizon fragility that is easy to hide in flat mean scores.
 """

 from __future__ import annotations

-import glob
+import argparse
 import json
-import re
-from collections import defaultdict
+import sys
 from pathlib import Path
 from statistics import median

-import numpy as np
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
-
-MODELS = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+from clawbench.dynamics_archive import load_task_runs_by_model

 SUCCESS_THRESHOLD = 0.7


-def assistant_turns(d: dict) -> list[dict]:
-    return [m for m in d.get("transcript", {}).get("messages", [])
-            if m.get("role") == "assistant"]
+def assistant_turns(run) -> list:
+    return run.transcript.assistant_messages


-def find_failure_turn(d: dict) -> tuple[int, bool]:
-    """Return (T_F, is_event). T_F is 1-indexed turn of failure.
-
-    is_event=True means failure actually happened; False means the run was
-    censored (survived to end without failing).
-    """
-    turns = assistant_turns(d)
+def find_failure_turn(run) -> tuple[int, bool]:
+    """Return (failure_turn, is_event) with 1-indexed assistant turns."""
+    turns = assistant_turns(run)
    n = len(turns)
-    run_score = d.get("run_score", 0) or 0
-    delivery = d.get("delivery_outcome", "")

-    # Scan for first empty-turn
-    for i, t in enumerate(turns, 1):
-        has_text = bool((t.get("text") or "").strip())
-        has_tool_call = bool(t.get("tool_calls"))
+    for idx, turn in enumerate(turns, 1):
+        has_text = bool((turn.text or "").strip())
+        has_tool_call = bool(turn.tool_calls)
        if not has_text and not has_tool_call:
-            return i, True  # failure event
+            return idx, True

-    # If run was unsuccessful and ended early, mark last turn as failure
-    if run_score < SUCCESS_THRESHOLD and delivery in ("fail", "partial"):
+    if run.run_score < SUCCESS_THRESHOLD and run.delivery_outcome.value in {"fail", "partial"}:
        return max(n, 1), True

-    # Survived: right-censored at n
    return max(n, 1), False


 def empirical_survival(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
-    """Kaplan-Meier-like survival curve, non-parametric.
-
-    S(t) = fraction of runs that survived past turn t.
-    """
-    survival = []
+    """Empirical survival curve S(t) over assistant-turn index."""
    total = len(times_events)
+    if total == 0:
+        return [0.0] * max_t
+
+    survival = []
    for t in range(1, max_t + 1):
-        # Survived past t = either censored at ≥t or event at >t
-        survived = sum(1 for tf, is_event in times_events
-                       if (not is_event and tf >= t) or (is_event and tf > t))
-        survival.append(survived / total if total > 0 else 0.0)
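+        # Censored runs count as survivors only through their final turn, so
+        # this is a naive fraction rather than a Kaplan-Meier estimate.
+        survived = sum(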
+            1
+            for tf, is_event in times_events
+            if (not is_event and tf >= t) or (is_event and tf > t)
+        )
+        survival.append(survived / total)
    return survival


 def hazard(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
-    """Hazard rate h(t) = events at t / at-risk at t."""
-    h = []
+    """Discrete hazard h(t) = events_at_t / at_risk_at_t."""
+    hazard_vals = []
    for t in range(1, max_t + 1):
        at_risk = sum(1 for tf, _ in times_events if tf >= t)
-        events_at_t = sum(1 for tf, is_event in times_events
-                           if is_event and tf == t)
-        h.append(events_at_t / at_risk if at_risk > 0 else 0.0)
-    return h
+        events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t)
+        hazard_vals.append(events_at_t / at_risk if at_risk > 0 else 0.0)
+    return hazard_vals


 def main() -> None:
-    per_model: dict[str, list[tuple[int, bool]]] = defaultdict(list)
-    for label, (sub, _) in MODELS.items():
-        for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
-            try:
-                d = json.loads(Path(p).read_text())
-            except Exception:
-                continue
-            tf, is_event = find_failure_turn(d)
-            per_model[label].append((tf, is_event))
+    parser = argparse.ArgumentParser(description="Survival analysis on cached runs")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    parser.add_argument("--max-turn", type=int, default=20)
+    args = parser.parse_args()

-    # Load C(q) to stratify
-    cq_path = ROOT / "reports" / "constraint_index.json"
-    cq_by_task = {}
-    if cq_path.exists():
-        cq = json.loads(cq_path.read_text())
-        cq_by_task = {t: v["C_q"] for t, v in cq.items()}
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")

-    # Print summary
-    print(f"{'Model':<14}  {'n_runs':>6}  {'events':>6}  {'med_tf':>8}  "
-          f"{'S(3)':>6}  {'S(5)':>6}  {'S(8)':>6}  {'S(12)':>6}  {'S(20)':>6}")
-    print("-" * 90)
    out = {}
-    for label, (_sub, pretty) in MODELS.items():
-        evs = per_model[label]
-        n = len(evs)
-        n_events = sum(1 for _, e in evs if e)
-        tfs_events = [tf for tf, e in evs if e]
-        med = median(tfs_events) if tfs_events else float("inf")
-        surv = empirical_survival(evs, max_t=20)
-        haz = hazard(evs, max_t=20)
-        print(f"{pretty:<14}  {n:>6}  {n_events:>6}  {med:>8.1f}  "
-              f"{surv[2]:>6.2f}  {surv[4]:>6.2f}  {surv[7]:>6.2f}  "
-              f"{surv[11]:>6.2f}  {surv[19]:>6.2f}")
-        out[label] = {
-            "pretty": pretty,
-            "n_runs": n,
+    for model_name, task_runs in grouped.items():
+        events = []
+        for runs in task_runs.values():
+            for run in runs:
+                events.append(find_failure_turn(run))
+
+        n_runs = len(events)
+        n_events = sum(1 for _, is_event in events if is_event)
+        event_times = [t for t, is_event in events if is_event]
+        med = median(event_times) if event_times else float("inf")
+
+        out[model_name] = {
+            "pretty": model_name,
+            "n_runs": n_runs,
            "n_events": n_events,
            "median_fail_turn": med,
-            "survival": surv,
-            "hazard": haz,
+            "survival": empirical_survival(events, max_t=args.max_turn),
+            "hazard": hazard(events, max_t=args.max_turn),
        }

-    print("\n(Interpretation: S(t) = fraction of runs still on-track past turn t.")
-    print(" Lower values = more frequent early failure.)")
-
-    out_path = ROOT / "reports" / "survival_analysis.json"
-    out_path.write_text(json.dumps(out, indent=2))
-    print(f"\nWrote: {out_path}")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out_path = args.reports_dir / "survival_analysis.json"
+    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
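Reviewer note on the survival script above: the bodies of `empirical_survival` and `hazard` are not shown in this diff, so the following is only a minimal sketch of what they plausibly compute. It assumes each event is a `(turn, is_event)` tuple as returned by `find_failure_turn`, that the result lists are indexed so that position `t - 1` holds the value for turn `t` (as the removed `surv[2]` / `S(3)` print suggests), and that runs without a failure count as surviving every turn. The `_sketch` names are hypothetical.

```python
from __future__ import annotations


def empirical_survival_sketch(events: list[tuple[int, bool]], max_t: int) -> list[float]:
    """Return [S(1), ..., S(max_t)]: share of runs with no failure at or before turn t."""
    n = len(events)
    if n == 0:
        return [1.0] * max_t
    return [
        # A run counts as failed by turn t only if it has an event at a turn <= t.
        sum(1 for t_obs, is_event in events if not (is_event and t_obs <= t)) / n
        for t in range(1, max_t + 1)
    ]


def hazard_sketch(events: list[tuple[int, bool]], max_t: int) -> list[float]:
    """Return [h(1), ..., h(max_t)]: failures at turn t over runs still observed at turn t."""
    out = []
    for t in range(1, max_t + 1):
        at_risk = sum(1 for t_obs, _ in events if t_obs >= t)
        failed = sum(1 for t_obs, is_event in events if is_event and t_obs == t)
        out.append(failed / at_risk if at_risk else 0.0)
    return out


# Toy data (an assumption): two runs fail at turns 2 and 3, two runs end at turn 5 without failing.
toy_events = [(2, True), (3, True), (5, False), (5, False)]
surv = empirical_survival_sketch(toy_events, max_t=5)
haz = hazard_sketch(toy_events, max_t=5)
```

With this toy input, survival drops by a quarter at each of the two failure turns and then stays flat, while the hazard is nonzero only at turns 2 and 3.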
--- a/scripts/variance_decomp.py
+++ b/scripts/variance_decomp.py
@ -1,132 +1,118 @@
-"""Decompose run_score variance into seed-noise vs capability-signal.
+#!/usr/bin/env python3
+"""Decompose posterior run_score variance into seed noise and capability signal.

-Each task has 3 runs per model (same prompt, different random seed).
-  σ²_seed(task, model)  = variance across the 3 runs of (task, model)
-  σ²_capability(task)   = variance across model means for the task
+Each task has repeated runs per model.
+
+    sigma^2_seed(task, model) = variance across repeated runs for one model
+    sigma^2_capability(task)  = variance across model means for that task

 Signal-to-noise ratio per task:
-  SNR(task) = σ²_capability / σ²_seed

-High SNR → differences between models on this task are REAL (not noise).
-Low SNR  → the 3-run variance per model is so large that cross-model gaps
-           are indistinguishable from seed noise. These tasks don't
-           discriminate models reliably.
+    SNR(task) = sigma^2_capability / mean_model sigma^2_seed

-Aggregated over all 40 tasks, we also decompose TOTAL variance:
-  total_var = mean_capability_var + mean_seed_var
-  capability_fraction = mean_capability_var / total_var
+High SNR means cross-model differences are likely real. Low SNR means the
+benchmark signal is dominated by run-to-run variance rather than capability.

-This answers "what fraction of the benchmark signal is real model
-capability vs. run-to-run luck?"
+Aggregate decomposition:

-Usage:
-    .venv/bin/python3 scripts/variance_decomp.py
+    total_var = mean_task seed_var + mean_task cap_var
+    capability_fraction = mean_task cap_var / total_var
+
+This script keeps the posterior/archive-based workflow used by the current
+pipeline, but the statistical meaning is the same as the earlier analysis.
 """

 from __future__ import annotations

-import glob
+import argparse
 import json
-import re
+import sys
 from collections import defaultdict
 from pathlib import Path
 from statistics import mean, variance

-import numpy as np
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
-
-MODELS = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+from clawbench.dynamics_archive import load_task_runs_by_model


 def main() -> None:
-    # {task: {model: [run_scores]}}
-    scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
-    for label, (sub, _) in MODELS.items():
-        for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
-            task = p.split("/")[-2]
-            try:
-                d = json.loads(Path(p).read_text())
-            except Exception:
-                continue
-            scores[task].setdefault(label, []).append(d.get("run_score", 0))
+    parser = argparse.ArgumentParser(description="Variance decomposition on cached runs")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
+    # Collect repeated run scores as {task -> {model -> [run_scores]}}.
+    scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            vals = [float(run.run_score) for run in runs]
+            if vals:
+                scores[task_id][model_name] = vals

-    # Per-task: seed var per model, cross-model var of means, SNR
    task_stats = []
-    for task, per_model in scores.items():
-        # Only use models with all 3 runs for clean seed-variance estimate
+    for task_id, per_model in scores.items():
        model_vars = []
        model_means = []
-        for m, runs in per_model.items():
+        for runs in per_model.values():
            if len(runs) >= 2:
                model_vars.append(variance(runs))
+            if runs:
                model_means.append(mean(runs))
-        if len(model_means) < 2 or not model_vars:
-            continue
-        mean_seed_var = mean(model_vars)        # noise
-        cap_var = variance(model_means)          # signal
+
+        # Seed noise: mean within-model variance (0.0 if no model has >= 2 runs).
+        mean_seed_var = mean(model_vars) if model_vars else 0.0
+        # Capability signal: variance of per-model means (0.0 with fewer than 2 models).
+        cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
        snr = cap_var / (mean_seed_var + 1e-9)
-        task_stats.append({
-            "task": task,
-            "seed_var": mean_seed_var,
-            "cap_var": cap_var,
-            "snr": snr,
-            "n_models": len(model_means),
-        })
+        task_stats.append(
+            {
+                "task": task_id,
+                "seed_var": float(mean_seed_var),
+                "cap_var": float(cap_var),
+                "snr": float(snr),
+                "n_models": len(model_means),
+                "limited_model_diversity": len(model_means) < 2,
+            }
+        )

-    # Sort by SNR
-    task_stats.sort(key=lambda x: -x["snr"])
+    task_stats.sort(key=lambda row: row["snr"], reverse=True)
+    if not task_stats:
+        raise SystemExit("No task-level scores found in archive.")

-    print(f"{'Task':<38}  {'seed_var':>9}  {'cap_var':>9}  {'SNR':>8}")
-    print("-" * 70)
-    for r in task_stats:
-        print(f"{r['task']:<38}  {r['seed_var']:>9.4f}  {r['cap_var']:>9.4f}  "
-              f"{r['snr']:>8.2f}")
-
-    # Aggregate decomposition
-    total_seed = mean(r["seed_var"] for r in task_stats)
-    total_cap = mean(r["cap_var"] for r in task_stats)
+    # Aggregate over tasks to estimate how much of benchmark variance is real
+    # capability signal versus run-to-run noise.
+    total_seed = mean(row["seed_var"] for row in task_stats)
+    total_cap = mean(row["cap_var"] for row in task_stats)
    total = total_seed + total_cap
-    cap_frac = total_cap / (total + 1e-9)
+    capability_fraction = total_cap / total if total > 1e-12 else 0.0

-    print("\n=== AGGREGATE VARIANCE DECOMPOSITION ===")
-    print(f"  Mean seed variance (noise):        {total_seed:.5f}")
-    print(f"  Mean capability variance (signal): {total_cap:.5f}")
-    print(f"  Capability fraction:               {cap_frac:.1%}")
-    print(f"  (= what % of run_score variance comes from real model differences)")
+    # Coarse SNR buckets help downstream reporting and task weighting.
+    high_snr = [row for row in task_stats if row["snr"] >= 5]
+    mid_snr = [row for row in task_stats if 1 <= row["snr"] < 5]
+    low_snr = [row for row in task_stats if row["snr"] < 1]

-    # Classify tasks by SNR tiers
-    high_snr = [r for r in task_stats if r["snr"] >= 5]
-    mid_snr = [r for r in task_stats if 1 <= r["snr"] < 5]
-    low_snr = [r for r in task_stats if r["snr"] < 1]
-    print(f"\n=== SNR TIERS ===")
-    print(f"  High SNR (≥5):       {len(high_snr)} tasks — differentiate models reliably")
-    print(f"  Mid SNR (1–5):       {len(mid_snr)} tasks — moderate signal")
-    print(f"  Low SNR (<1):        {len(low_snr)} tasks — seed noise ≥ capability signal")
-    print(f"     (these tasks give random-ish results; weight down)")
-
-    # Write output
-    out_path = ROOT / "reports" / "variance_decomposition.json"
-    out_path.write_text(json.dumps({
+    out = {
        "per_task": task_stats,
        "aggregate": {
-            "mean_seed_var": total_seed,
-            "mean_cap_var": total_cap,
-            "capability_fraction": cap_frac,
+            "mean_seed_var": float(total_seed),
+            "mean_cap_var": float(total_cap),
+            "capability_fraction": float(capability_fraction),
+            "high_snr_tasks": len(high_snr),
+            "mid_snr_tasks": len(mid_snr),
+            "low_snr_tasks": len(low_snr),
        },
-    }, indent=2))
-    print(f"\nWrote: {out_path}")
+    }
+
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out_path = args.reports_dir / "variance_decomposition.json"
+    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
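Reviewer note on the decomposition above: a worked toy example may make the per-task arithmetic easier to check. The sketch below repeats the per-task logic from the new `main()` in a standalone helper; the helper name `decompose_task` and the toy scores are assumptions for illustration and are not part of the commit.

```python
from __future__ import annotations

from statistics import mean, variance


def decompose_task(per_model: dict[str, list[float]]) -> dict[str, float]:
    """Split one task's run_score variance into seed noise and capability signal."""
    # Seed noise: sample variance within each model, averaged over models.
    model_vars = [variance(runs) for runs in per_model.values() if len(runs) >= 2]
    # Capability signal: sample variance of the per-model mean scores.
    model_means = [mean(runs) for runs in per_model.values() if runs]
    seed_var = mean(model_vars) if model_vars else 0.0
    cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
    return {
        "seed_var": seed_var,
        "cap_var": cap_var,
        "snr": cap_var / (seed_var + 1e-9),
    }


# Toy scores (an assumption): two models, three repeated runs each.
toy_scores = {
    "model_a": [0.9, 0.8, 1.0],
    "model_b": [0.2, 0.3, 0.1],
}
stats = decompose_task(toy_scores)
```

Here each model has a within-model variance of about 0.01, the two model means (0.9 and 0.2) have a variance of about 0.245, and the SNR is therefore close to 24.5, which would land in the high-SNR bucket (`snr >= 5`). Note that when no model has two or more runs, `seed_var` is 0.0 and the `1e-9` floor makes the SNR very large for any nonzero `cap_var`.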
--- a/tests/test_dynamics.py
+++ b/tests/test_dynamics.py
@ -0,0 +1,356 @@
+"""Tests for clawbench.dynamics."""
+
+from __future__ import annotations
+
+import math
+
+import numpy as np
+import pytest
+
+from clawbench.dynamics import (
+    TOOL_FAMILIES,
+    Dynamics,
+    Regime,
+    Sensitivity,
+    SurvivalPoint,
+    StratumStats,
+    StratifiedAssessment,
+    _classify_tool,
+    _cosine_dist,
+    _entropy,
+    _js_divergence,
+    _levenshtein,
+    build_strata,
+    compute_dynamics,
+    compute_sensitivity,
+    find_event_step,
+    kaplan_meier,
+    stratify_by_regime,
+    stratify_by_tier,
+)
+from clawbench.schemas import (
+    TokenUsage,
+    ToolCall,
+    Transcript,
+    TranscriptMessage,
+    TaskRunResult,
+)
+
+
+# ── helpers ──────────────────────────────────────────────────────────
+
+
+def _msg(role, text="", family=None, success=True, error="", ts=0, tok=100):
+    tcs = []
+    if family:
+        tcs.append(ToolCall(
+            name=f"tool_{family}", family=family,
+            success=success, error=error, mutating=family == "edit",
+        ))
+    return TranscriptMessage(
+        role=role, text=text, tool_calls=tcs, timestamp_ms=ts,
+        usage=TokenUsage(input_tokens=tok, output_tokens=tok // 2,
+                         total_tokens=tok + tok // 2),
+    )
+
+
+def _simple_transcript(families, errors=None):
+    if errors is None:
+        errors = [False] * len(families)
+    msgs = [_msg("user", "task")]
+    for i, (fam, err) in enumerate(zip(families, errors)):
+        msgs.append(_msg("assistant", f"step {i}", family=fam,
+                         success=not err, error="err" if err else "",
+                         ts=(i + 1) * 1000, tok=100 + i * 10))
+    return Transcript(messages=msgs)
+
+
+def _run(transcript, score=0.5, task_id="t1"):
+    return TaskRunResult(
+        task_id=task_id, run_index=0, transcript=transcript,
+        run_score=score, duration_ms=10000,
+        token_usage=transcript.total_usage,
+    )
+
+
+# ── _cosine_dist ─────────────────────────────────────────────────────
+
+
+def test_cosine_dist_identical():
+    a = np.array([1.0, 0.0, 0.5])
+    assert _cosine_dist(a, a) == pytest.approx(0.0, abs=1e-9)
+
+
+def test_cosine_dist_orthogonal():
+    assert _cosine_dist(np.array([1, 0, 0.0]), np.array([0, 1, 0.0])) == pytest.approx(1.0)
+
+
+def test_cosine_dist_zero_vector():
+    assert _cosine_dist(np.zeros(3), np.array([1, 2, 3.0])) == 1.0
+
+
+# ── _entropy ─────────────────────────────────────────────────────────
+
+
+def test_entropy_uniform():
+    assert _entropy({"a": 10, "b": 10}) == pytest.approx(1.0)
+
+
+def test_entropy_single():
+    assert _entropy({"a": 100}) == pytest.approx(0.0)
+
+
+def test_entropy_empty():
+    assert _entropy({}) == 0.0
+
+
+# ── _js_divergence ───────────────────────────────────────────────────
+
+
+def test_jsd_identical():
+    d = {"a": 5, "b": 5}
+    assert _js_divergence(d, d) == pytest.approx(0.0, abs=1e-9)
+
+
+def test_jsd_disjoint():
+    assert _js_divergence({"a": 10}, {"b": 10}) > 0.5
+
+
+# ── _levenshtein ────────────────────────────────────────────────────
+
+
+def test_levenshtein_equal():
+    assert _levenshtein([1, 2, 3], [1, 2, 3]) == 0
+
+
+def test_levenshtein_empty():
+    assert _levenshtein([], [1, 2]) == 2
+
+
+def test_levenshtein_different():
+    assert _levenshtein(["a", "b"], ["c", "d"]) == 2
+
+
+# ── _classify_tool ──────────────────────────────────────────────────
+
+
+@pytest.mark.parametrize("name,expected", [
+    ("bash_execute", "execute"),
+    ("file_read", "read"),
+    ("tool_edit", "edit"),
+    ("web_browser", "browser"),
+    ("grep_search", "search"),
+    ("write_file", "edit"),
+    ("run_tests", "execute"),
+])
+def test_classify_tool(name, expected):
+    assert _classify_tool(name) == expected
+
+
+# ── compute_dynamics ─────────────────────────────────────────────────
+
+
+def test_dynamics_basic():
+    t = _simple_transcript(["read", "edit", "execute", "read", "edit"])
+    d = compute_dynamics(t)
+    assert d.n_steps == 5
+    assert len(d.drift) == 5
+    assert len(d.step_size) == 5
+    assert len(d.entropy_series) == 5
+    assert len(d.tool_sequence) == 5
+    assert d.tool_entropy > 0
+
+
+def test_dynamics_empty():
+    t = Transcript(messages=[_msg("user", "hi")])
+    d = compute_dynamics(t)
+    assert d.n_steps == 0
+    assert d.regime == Regime.unknown
+
+
+def test_dynamics_trapped():
+    t = _simple_transcript(["execute"] * 15, errors=[True] * 15)
+    d = compute_dynamics(t)
+    assert d.regime == Regime.trapped
+    assert d.error_rate > 0.5
+
+
+def test_dynamics_convergent():
+    cycle = ["read", "search", "edit", "read", "execute"] * 6
+    t = _simple_transcript(cycle[:30])
+    d = compute_dynamics(t)
+    assert d.regime in (Regime.convergent, Regime.limit_cycle, Regime.diffusive, Regime.unknown)
+    assert d.error_rate == 0.0
+
+
+def test_dynamics_markov_keys():
+    t = _simple_transcript(["read", "edit", "read"])
+    d = compute_dynamics(t)
+    assert "read" in d.markov
+    assert "edit" in d.markov["read"]
+
+
+def test_dynamics_constraint_index_range():
+    t = _simple_transcript(["read", "edit", "search", "execute", "browser", "memory"] * 3)
+    d = compute_dynamics(t)
+    assert 0 <= d.constraint_index <= 1
+
+
+def test_dynamics_memory_depth():
+    t = _simple_transcript(["read", "edit", "read", "edit", "read", "edit"] * 3)
+    d = compute_dynamics(t)
+    assert d.memory_depth >= 0
+
+
+def test_dynamics_normalizes_unknown_tool_family():
+    transcript = Transcript(
+        messages=[
+            _msg("user", "task"),
+            TranscriptMessage(
+                role="assistant",
+                text="searching",
+                tool_calls=[
+                    ToolCall(
+                        name="grep_search",
+                        family="unknown",
+                        success=True,
+                        error="",
+                        mutating=False,
+                    )
+                ],
+                timestamp_ms=1000,
+                usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
+            ),
+            _msg("assistant", "next", family="read", ts=2000),
+            _msg("assistant", "done", family="edit", ts=3000),
+        ]
+    )
+
+    dynamics = compute_dynamics(transcript)
+
+    assert dynamics.tool_sequence[0] == "search"
+    assert "search" in dynamics.markov
+
+
+# ── compute_sensitivity ──────────────────────────────────────────────
+
+
+def test_sensitivity_identical_runs():
+    t = _simple_transcript(["read", "edit", "execute"])
+    ra = _run(t, score=0.8)
+    rb = _run(t, score=0.8)
+    s = compute_sensitivity(ra, rb)
+    assert s.score_delta == pytest.approx(0.0)
+    assert s.tool_edit_distance == 0
+
+
+def test_sensitivity_different_runs():
+    ta = _simple_transcript(["read", "edit", "execute"])
+    tb = _simple_transcript(["search", "browser", "memory"])
+    ra = _run(ta, score=0.9)
+    rb = _run(tb, score=0.3)
+    s = compute_sensitivity(ra, rb)
+    assert s.score_delta == pytest.approx(0.6)
+    assert s.tool_edit_distance > 0
+    assert s.family_js_divergence > 0
+
+
+# ── kaplan_meier ─────────────────────────────────────────────────────
+
+
+def test_km_basic():
+    pts = kaplan_meier([1, 2, 3])
+    assert pts[0].time == 0.0
+    assert pts[0].survival == 1.0
+    assert pts[-1].survival == pytest.approx(0.0)
+
+
+def test_km_with_censoring():
+    pts = kaplan_meier([1, 5, 3], censored=[False, True, False])
+    assert len(pts) == 3
+    assert pts[-1].survival > 0
+
+
+def test_km_empty():
+    assert kaplan_meier([]) == []
+
+
+# ── find_event_step ──────────────────────────────────────────────────
+
+
+def test_find_first_correct_write():
+    t = _simple_transcript(["read", "search", "edit", "execute"])
+    assert find_event_step(t, "first_correct_write") == 2.0
+
+
+def test_find_first_error_recovery():
+    t = _simple_transcript(
+        ["read", "execute", "read"],
+        errors=[False, True, False],
+    )
+    assert find_event_step(t, "first_error_recovery") == 2.0
+
+
+def test_find_task_completion():
+    t = _simple_transcript(["read", "edit"])
+    assert find_event_step(t, "task_completion") == 1.0
+
+
+def test_find_event_none():
+    t = _simple_transcript(["read", "read"])
+    assert find_event_step(t, "first_correct_write") is None
+
+
+# ── build_strata + reweight ──────────────────────────────────────────
+
+
+def test_build_strata_by_tier():
+    runs, dyns, scores = [], [], []
+    for tid, sc in [("t1-a", 0.8), ("t1-b", 0.6), ("t2-a", 0.4), ("t2-b", 0.3)]:
+        t = _simple_transcript(["read", "edit", "execute"])
+        r = _run(t, score=sc, task_id=tid)
+        runs.append(r)
+        dyns.append(compute_dynamics(t))
+        scores.append(sc)
+
+    sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
+    assert sa.total_runs == 4
+    names = sa.stratum_names()
+    assert "tier1" in names
+    assert "tier2" in names
+    for s in sa.strata:
+        assert s.n_runs == 2
+        assert s.weight == pytest.approx(0.5)
+
+
+def test_reweight_shifts_mean():
+    runs, dyns, scores = [], [], []
+    for tid, sc in [("t1-a", 0.9), ("t1-b", 0.8), ("t2-a", 0.2), ("t2-b", 0.1)]:
+        t = _simple_transcript(["read", "edit", "execute"])
+        r = _run(t, score=sc, task_id=tid)
+        runs.append(r)
+        dyns.append(compute_dynamics(t))
+        scores.append(sc)
+
+    sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
+
+    # Reweight towards tier1 (high scores)
+    high = sa.reweight({"tier1": 0.9, "tier2": 0.1})
+    # Reweight towards tier2 (low scores)
+    low = sa.reweight({"tier1": 0.1, "tier2": 0.9})
+
+    assert high["score_mean"] > low["score_mean"]
+
+
+def test_reweight_unknown_stratum():
+    runs, dyns, scores = [], [], []
+    t = _simple_transcript(["read", "edit"])
+    r = _run(t, score=0.5, task_id="t1-x")
+    runs.append(r)
+    dyns.append(compute_dynamics(t))
+    scores.append(0.5)
+
+    sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
+    # Reweighting with a stratum that does not exist should fall back and still return a score_mean.
+    result = sa.reweight({"nonexistent": 1.0})
+    assert "score_mean" in result
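Reviewer note on the `kaplan_meier` tests above: the implementation lives in `clawbench.dynamics` and is not part of this excerpt. The sketch below is a textbook product-limit estimator that would satisfy the three `test_km_*` cases, assuming one point is emitted at time 0 and one per distinct event time. The names `KMPoint` and `kaplan_meier_sketch` are hypothetical, and the real function may differ (for example, it presumably returns `SurvivalPoint` objects).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class KMPoint:
    time: float
    survival: float


def kaplan_meier_sketch(times: list[float], censored: list[bool] | None = None) -> list[KMPoint]:
    """Product-limit survival estimate; censored[i] is True when run i had no event."""
    if not times:
        return []
    if censored is None:
        censored = [False] * len(times)

    points = [KMPoint(0.0, 1.0)]
    survival = 1.0
    # Only uncensored observations create a step in the curve.
    event_times = sorted({t for t, c in zip(times, censored) if not c})
    for t in event_times:
        # Everyone observed up to at least time t is still at risk at t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, c in zip(times, censored) if u == t and not c)
        survival *= 1.0 - deaths / at_risk
        points.append(KMPoint(float(t), survival))
    return points


uncensored = kaplan_meier_sketch([1, 2, 3])
with_censoring = kaplan_meier_sketch([1, 5, 3], censored=[False, True, False])
```

With the censored input, the run observed until time 5 stays in the risk set at times 1 and 3 but never triggers a step, so the curve ends above zero, which is what `test_km_with_censoring` checks.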
--- a/tests/test_dynamics_archive.py
+++ b/tests/test_dynamics_archive.py
@ -0,0 +1,115 @@
+"""Tests for offline dynamics archive helpers."""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+from clawbench.dynamics_archive import build_dynamics_report, load_task_runs_archive, safe_model_name, write_dynamics_report
+from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
+
+
+def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
+    tool_calls = []
+    if family is not None:
+        tool_calls.append(
+            ToolCall(
+                name=f"tool_{family}",
+                family=family,
+                success=True,
+                error="",
+                mutating=family == "edit",
+            )
+        )
+    return TranscriptMessage(
+        role=role,
+        text=text,
+        tool_calls=tool_calls,
+        timestamp_ms=ts,
+        usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
+    )
+
+
+def _run(task_id: str, score: float = 0.5, run_index: int = 0) -> TaskRunResult:
+    transcript = Transcript(
+        messages=[
+            _msg("user", f"Solve {task_id}"),
+            _msg("assistant", "inspect", family="read", ts=1000),
+            _msg("assistant", "edit", family="edit", ts=2000),
+            _msg("assistant", "verify", family="execute", ts=3000),
+        ]
+    )
+    return TaskRunResult(
+        task_id=task_id,
+        run_index=run_index,
+        transcript=transcript,
+        run_score=score,
+        duration_ms=3000,
+        token_usage=transcript.total_usage,
+    )
+
+
+def test_load_task_runs_archive_filters_model_and_tier(tmp_path: Path):
+    model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b")
+    other_dir = tmp_path / safe_model_name("openai/gpt-5.4")
+    for root, task_id in ((model_dir, "t1-demo-task"), (other_dir, "t2-other-task")):
+        task_dir = root / task_id
+        task_dir.mkdir(parents=True)
+        run = _run(task_id)
+        (task_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
+
+    loaded = load_task_runs_archive(
+        archive_dir=tmp_path,
+        model="ollama/gpt-oss:20b",
+        tier="tier1",
+    )
+
+    assert list(loaded) == ["t1-demo-task"]
+    assert loaded["t1-demo-task"][0].task_id == "t1-demo-task"
+
+
+def test_write_dynamics_report_creates_report_without_plots(tmp_path: Path):
+    task_runs = {
+        "t1-demo-task": [_run("t1-demo-task", score=0.8)],
+        "t2-demo-task": [_run("t2-demo-task", score=0.4)],
+    }
+
+    report_path, plots = write_dynamics_report(task_runs, tmp_path, generate_plots=False)
+
+    assert report_path.exists()
+    assert report_path.name == "dynamics.json"
+    assert plots == []
+
+    report = json.loads(report_path.read_text(encoding="utf-8"))
+    assert "sensitivity" in report
+    assert report["sensitivity"]["same_task"]["n_pairs"] == 0
+
+
+def test_build_dynamics_report_includes_pairwise_sensitivity():
+    task_runs = {
+        "t1-demo-task": [
+            _run("t1-demo-task", score=0.8, run_index=0),
+            TaskRunResult(
+                task_id="t1-demo-task",
+                run_index=1,
+                transcript=Transcript(
+                    messages=[
+                        _msg("user", "Solve t1-demo-task"),
+                        _msg("assistant", "inspect", family="search", ts=1000),
+                        _msg("assistant", "edit", family="edit", ts=2000),
+                        _msg("assistant", "verify", family="execute", ts=3000),
+                    ]
+                ),
+                run_score=0.5,
+                duration_ms=3000,
+                token_usage=TokenUsage(input_tokens=30, output_tokens=15, total_tokens=45),
+            ),
+        ]
+    }
+
+    report, _plotter, _plot_data = build_dynamics_report(task_runs, include_pca=False)
+
+    same_task = report["sensitivity"]["same_task"]
+    assert same_task["n_pairs"] == 1
+    assert "t1-demo-task" in same_task["per_task"]
+    assert same_task["per_task"]["t1-demo-task"]["mean_score_delta"] > 0
--- a/tests/test_dynamics_cli.py
+++ b/tests/test_dynamics_cli.py
@ -0,0 +1,76 @@
+from pathlib import Path
+
+from click.testing import CliRunner
+
+from clawbench.cli import cli
+from clawbench.dynamics_archive import safe_model_name
+from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
+
+
+def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
+    tool_calls = []
+    if family is not None:
+        tool_calls.append(
+            ToolCall(
+                name=f"tool_{family}",
+                family=family,
+                success=True,
+                error="",
+                mutating=family == "edit",
+            )
+        )
+    return TranscriptMessage(
+        role=role,
+        text=text,
+        tool_calls=tool_calls,
+        timestamp_ms=ts,
+        usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
+    )
+
+
+def _run(task_id: str, run_index: int = 0) -> TaskRunResult:
+    transcript = Transcript(
+        messages=[
+            _msg("user", f"Solve {task_id}"),
+            _msg("assistant", "inspect", family="read", ts=1000),
+            _msg("assistant", "edit", family="edit", ts=2000),
+            _msg("assistant", "verify", family="execute", ts=3000),
+        ]
+    )
+    return TaskRunResult(
+        task_id=task_id,
+        run_index=run_index,
+        transcript=transcript,
+        run_score=0.8,
+        duration_ms=3000,
+        token_usage=transcript.total_usage,
+    )
+
+
+def test_dynamics_report_cli_supports_no_plots(tmp_path: Path):
+    model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b") / "t1-demo-task"
+    model_dir.mkdir(parents=True)
+    run = _run("t1-demo-task")
+    (model_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
+
+    runner = CliRunner()
+    output_dir = tmp_path / "out"
+    result = runner.invoke(
+        cli,
+        [
+            "dynamics-report",
+            "--archive-dir",
+            str(tmp_path),
+            "--model",
+            "ollama/gpt-oss:20b",
+            "--output-dir",
+            str(output_dir),
+            "--no-plots",
+        ],
+    )
+
+    assert result.exit_code == 0, result.output
+    assert "Loaded 1 cached runs across 1 tasks" in result.output
+    assert "Saved 0 plots" in result.output
+    assert (output_dir / "dynamics.json").exists()
+    assert list(output_dir.glob("*.png")) == []
--- a/tests/test_submission_models.py
+++ b/tests/test_submission_models.py
@ -0,0 +1,44 @@
+from clawbench.submission_models import (
+    CUSTOM_PRESET_LABEL,
+    PRESET_AUDIENCE_BUDGET,
+    PRESET_AUDIENCE_CLAW,
+    infer_provider,
+    preset_labels_for_audience,
+    resolve_model_selection,
+)
+
+
+def test_budget_audience_keeps_budget_friendly_presets():
+    labels = preset_labels_for_audience(PRESET_AUDIENCE_BUDGET)
+
+    assert "GPT-OSS 20B (Ollama)" in labels
+    assert "Qwen 3.5 27B (Ollama)" in labels
+    assert "Claude Opus 4.6" not in labels
+
+
+def test_claw_audience_keeps_full_catalog():
+    labels = preset_labels_for_audience(PRESET_AUDIENCE_CLAW)
+
+    assert "GPT-OSS 20B (Ollama)" in labels
+    assert "Claude Opus 4.6" in labels
+
+
+def test_resolve_model_selection_prefers_preset_provider():
+    model_id, provider = resolve_model_selection("", "GPT-OSS 20B (Ollama)")
+
+    assert model_id == "ollama/gpt-oss:20b"
+    assert provider == "ollama"
+
+
+def test_resolve_model_selection_infers_custom_provider():
+    model_id, provider = resolve_model_selection(
+        "huggingface/Qwen/Qwen3-32B",
+        CUSTOM_PRESET_LABEL,
+    )
+
+    assert model_id == "huggingface/Qwen/Qwen3-32B"
+    assert provider == "huggingface"
+
+
+def test_infer_provider_requires_provider_prefix():
+    assert infer_provider("qwen3.5:27b") == ""
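Reviewer note on the provider tests above: `clawbench.submission_models` is not shown in this excerpt, so the rule behind `infer_provider` is only implied by the assertions. A minimal sketch, assuming the provider is whatever precedes the first `/` in the model id and that ids without a `/` yield an empty string, could look like this. The name `infer_provider_sketch` is hypothetical.

```python
def infer_provider_sketch(model_id: str) -> str:
    """Return the text before the first '/' in model_id, or '' when there is no prefix."""
    prefix, sep, _rest = model_id.partition("/")
    # partition() returns an empty separator when '/' is absent from the string.
    return prefix if sep and prefix else ""


examples = {
    model_id: infer_provider_sketch(model_id)
    for model_id in ("ollama/gpt-oss:20b", "huggingface/Qwen/Qwen3-32B", "qwen3.5:27b")
}
```

Splitting only on the first `/` keeps multi-segment ids such as `huggingface/Qwen/Qwen3-32B` mapped to a single provider, and the `:` in Ollama-style tags is left untouched.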