Add archive dynamics pipeline and audience-based model presets
parent 5b50814dfc
commit c209612d46
141
README.md
@ -104,20 +104,56 @@ Core v1 drops the noisy tasks and reports variance decomposition alongside ranki
|
||||
|
||||
Inspired by *"When LLMs Are Dreaming, Where Do They Go?"* — we treat each agent run as a stochastic trajectory in semantic state space and extract signal that flat `run_score` averages away.

| Diagnostic | Formula / Method | Reveals |
|---|---|---|
| **Constraint Index C(q)** | `-z(PR) - z(entropy) + z(BOPS)` over response embeddings | Which tasks converge to one answer vs diverge openly |
| **Regime classification** | Trajectory drift / recurrence / support-volume thresholds | Per-run dynamical signature (trapped / limit-cycle / diffusive) |
| **Survival analysis** | `S(t) = P(T_F > t)` where T_F = first empty assistant turn | Per-turn failure rates; long-horizon capability |
| **SNR-weighted ranking** | `w(task) = SNR × \|C(q)\|`, winsorized at p95 | Headline metric that weights tasks by their signal density |
| **Variance decomposition** | `Var(score) = Var_seeds + Var_models` per task | Separate capability signal from coin-flip noise |

Current code-path formulas:
|
||||
|
||||
```text
Per assistant step t:
  x_t = [tool_family_proportions(6), error_flag, normalized_tokens, normalized_text_len, progress]
  drift_t = cosine_distance(x_0, x_t)
  step_t = cosine_distance(x_{t-1}, x_t)

Task-level Constraint Index:
  PR(q) = tr(Σ_q)^2 / tr(Σ_q^2)
  H(q) = -Σ_i p_i log2 p_i, p_i = λ_i / Σ_j λ_j, λ = eigvals(Σ_q)
  BOPS(q) = mean_m mean_{i<j} cos(v_{q,m,i}, v_{q,m,j})
  C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))

Per-run constraint index used inside the regime classifier:
  PR_run = 1 / Σ_i p_i^2
  constraint_index_run = 1 - (PR_run - 1) / (d - 1)

Variance decomposition:
  seed_var(q) = mean_m Var(run_score_{q,m,*})
  cap_var(q) = Var_m Mean(run_score_{q,m,*})
  SNR(q) = cap_var(q) / (seed_var(q) + 1e-9)
  capability_fraction = mean_q cap_var(q) / (mean_q cap_var(q) + mean_q seed_var(q))

Survival:
  T_F = first assistant turn with empty text and no tool calls,
        else final assistant turn if run_score < 0.7 and delivery_outcome in {fail, partial}
  S(t) = P(T_F > t)
  h(t) = P(T_F = t | T_F >= t)
```
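
The variance-decomposition lines above can be sketched in a few lines of numpy. This is an illustration only, not the code in `scripts/variance_decomp.py`: the function name `variance_decomposition`, the nested-dict input layout, and the use of population variance (`ddof=0`) are all assumptions.

```python
import numpy as np


def variance_decomposition(scores):
    """Hypothetical helper: scores[task][model] is a list of run_score values (one per seed)."""
    per_task = {}
    for task, by_model in scores.items():
        runs = [np.asarray(r, dtype=float) for r in by_model.values()]
        # seed_var(q): within-model variance across seeds, averaged over models.
        seed_var = float(np.mean([r.var() for r in runs]))
        # cap_var(q): variance across models of each model's mean score.
        cap_var = float(np.var([r.mean() for r in runs]))
        per_task[task] = {
            "seed_var": seed_var,
            "cap_var": cap_var,
            "snr": cap_var / (seed_var + 1e-9),
        }
    mean_cap = float(np.mean([v["cap_var"] for v in per_task.values()]))
    mean_seed = float(np.mean([v["seed_var"] for v in per_task.values()]))
    denom = mean_cap + mean_seed
    # Assumption: report 0.0 when there is no variance at all.
    capability_fraction = mean_cap / denom if denom > 0 else 0.0
    return per_task, capability_fraction
```

A task where models differ but seeds agree gets a very large SNR; a task where every model has the same mean but seeds disagree gets an SNR near zero.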
|
||||
|
||||
Implemented regime classifier in `clawbench/dynamics.py`:
|
||||
|
||||
```text
trapped      if H_tools < 0.5 or (error_rate > 0.6 and std(drift) < 0.05)
convergent   if std(drift_last_quartile) < 0.1 and mean(step_last_quartile) < 0.15 and error_rate < 0.2
diffusive    if H_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8
chaotic      if H_tools > 2.0 and var(step[1:]) > 0.02
limit_cycle  if max autocorr(centered step[1:], lags 2..5) > 0.3
unknown      otherwise, or <3 assistant turns
```
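
The rule table above can be read as a first-match cascade. The sketch below is one possible reading, not the implementation in `clawbench/dynamics.py`: it assumes the rules are checked top to bottom, that the "last quartile" is the final `n // 4` steps (at least one), and that autocorrelation is normalized by the zero-lag energy. The names `classify_regime_sketch` and `max_autocorr` are hypothetical.

```python
import numpy as np


def max_autocorr(x, lags=range(2, 6)):
    """Largest normalized autocorrelation of the centered series over the given lags."""
    c = np.asarray(x, dtype=float)
    c = c - c.mean()
    denom = float(np.dot(c, c))
    if denom < 1e-12:
        return 0.0  # constant series: no periodic structure to detect
    vals = [float(np.dot(c[:-lag], c[lag:])) / denom for lag in lags if lag < len(c)]
    return max(vals, default=0.0)


def classify_regime_sketch(drift, step, h_tools, error_rate, constraint_index_run):
    """Apply the thresholds in order; the first matching rule wins (an assumption)."""
    drift = np.asarray(drift, dtype=float)
    step = np.asarray(step, dtype=float)
    n = len(drift)
    if n < 3:
        return "unknown"
    if h_tools < 0.5 or (error_rate > 0.6 and drift.std() < 0.05):
        return "trapped"
    q = max(1, n // 4)  # assumed definition of the last quartile
    if drift[-q:].std() < 0.1 and step[-q:].mean() < 0.15 and error_rate < 0.2:
        return "convergent"
    if h_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8:
        return "diffusive"
    tail = step[1:]  # step[0] has no predecessor
    if h_tools > 2.0 and tail.var() > 0.02:
        return "chaotic"
    if max_autocorr(tail) > 0.3:
        return "limit_cycle"
    return "unknown"
```

Because `trapped` is tested first, a run with very low tool entropy is labeled `trapped` even if its step sizes are also periodic.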
|
||||
|
||||
The task-level `C(q)` uses a normalized bag-of-words response vector built from the full assistant trajectory text plus tool-call names and compacted inputs, not just the last assistant turn.
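
As a rough illustration of how `C(q)` could be assembled from bag-of-words vectors, here is a minimal numpy sketch. It is not the code in `scripts/compute_constraint_index.py`: whitespace tokenization, L2 normalization, the sample covariance from `np.cov`, and every function name below are assumptions made for the example.

```python
import itertools
from collections import Counter

import numpy as np


def bow_vectors(texts):
    """L2-normalized bag-of-words rows, one per run (whitespace tokens assumed)."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for tok, count in Counter(text.lower().split()).items():
            X[row, index[tok]] = count
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)


def spectrum_stats(X):
    """Participation ratio PR and spectral entropy H of the covariance of X (needs >= 2 rows)."""
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    lam = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = lam.sum()
    if total <= 0:
        return 1.0, 0.0  # all runs identical: a single effective direction
    p = lam / total
    pr = float(total**2 / (lam**2).sum())  # tr(Σ)^2 / tr(Σ^2)
    nz = p[p > 0]
    return pr, float(-(nz * np.log2(nz)).sum())


def bops(vectors_by_model):
    """Mean over models of the mean pairwise cosine between that model's runs."""
    per_model = []
    for V in vectors_by_model.values():
        sims = [float(V[i] @ V[j]) for i, j in itertools.combinations(range(len(V)), 2)]
        if sims:
            per_model.append(np.mean(sims))
    return float(np.mean(per_model)) if per_model else 0.0


def zscore(values):
    x = np.asarray(values, dtype=float)
    sd = x.std()
    return (x - x.mean()) / sd if sd > 0 else np.zeros_like(x)


def constraint_index(pr, h, b):
    """C(q) = -z(PR) - z(H) + z(BOPS), z-scored across tasks."""
    return -zscore(pr) - zscore(h) + zscore(b)
```

Tasks whose runs collapse onto one direction (low PR, low H, high BOPS) end up with high `C(q)`; open-ended tasks end up low.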
|
||||
|
||||
From the v4-19 sweep data:
|
||||
- **Gemini 3.1 Pro** lands in the `trapped` regime on 42/120 runs: it commits early and doesn't iterate.
- **GPT 5.4** has the most `limit_cycle` runs (20): tool-use loops, whether productive or stuck.
- **Kimi K2.5** has the worst survival, with a median failure turn of 3; **GPT 5.4** has the best, with 60% of runs surviving to turn 8.
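
Survival numbers like these follow from the `S(t)` and `h(t)` definitions in the formulas block. The sketch below computes both from a list of per-run failure turns. It is a simplified illustration, not `scripts/survival_analysis.py`: the name `survival_curve` is hypothetical, and runs that never fail (`None`) are assumed to stay at risk through the whole horizon, so no censoring correction is applied.

```python
from __future__ import annotations


def survival_curve(failure_turns: list[int | None], horizon: int) -> list[tuple[int, float, float]]:
    """Return (t, S(t), h(t)) for t = 1..horizon.

    failure_turns holds T_F per run, or None when the run never failed.
    """
    n = len(failure_turns)
    curve = []
    for t in range(1, horizon + 1):
        # Runs still alive at the start of turn t (T_F >= t).
        at_risk = sum(1 for tf in failure_turns if tf is None or tf >= t)
        # Runs failing exactly at turn t (T_F = t).
        failed_now = sum(1 for tf in failure_turns if tf == t)
        # Runs surviving past turn t (T_F > t).
        alive_after = sum(1 for tf in failure_turns if tf is None or tf > t)
        s_t = alive_after / n if n else 0.0
        h_t = failed_now / at_risk if at_risk else 0.0
        curve.append((t, s_t, h_t))
    return curve
```

For example, `survival_curve([1, 3, None, None], horizon=3)` gives `S(1) = 0.75` and `S(3) = 0.5`.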
|
||||
|
||||
All scripts under `scripts/` — pure numpy + scipy, no torch / sentence-transformers required, runs on any archive dir.
|
||||
All scripts under `scripts/` run on cached per-run JSONs with plain numpy-based tooling; no torch or sentence-transformers required.
|
||||
|
||||
### 4. We ablate configurations, not just models
|
||||
|
||||
@ -264,9 +300,12 @@ The `1/y_i^2` term means the worst score dominates. A configuration scoring 0.85
|
||||
Flat-mean compresses frontier model gaps. An alternative that weights tasks by their signal density:
|
||||
|
||||
```
|
||||
weight(task) = max(0, SNR(task)) × |C(q)(task)| # unbounded
|
||||
weight_winsorized(task) = min(weight(task), p95) # prevent single-task dominance
|
||||
score(model) = Σ weight × mean_run_score / Σ weight
|
||||
w_q = max(0, SNR(q)) × |C(q)|
|
||||
w_q^wins = min(w_q, p95({w_q}))
|
||||
|
||||
flat_score(model) = mean_q mean_run_score(model, q) over covered tasks
|
||||
weighted_score(model) = Σ_q w_q mean_run_score(model, q) / Σ_q w_q
|
||||
winsorized_score(model) = Σ_q w_q^wins mean_run_score(model, q) / Σ_q w_q^wins
|
||||
```
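
The three scores can be compared side by side with a short numpy sketch. It assumes the p95 cutoff is taken over all task weights with `np.percentile`'s default linear interpolation and that each model's sums run only over the tasks it covers; `rank_models_sketch` and its input layout are hypothetical, not the interface of `scripts/snr_weighted_ranking.py`.

```python
import numpy as np


def rank_models_sketch(mean_scores, snr, cq):
    """mean_scores[model][task] -> mean run_score; snr[task], cq[task] -> floats."""
    tasks = sorted(snr)
    # w_q = max(0, SNR(q)) * |C(q)|
    w = np.array([max(0.0, snr[q]) * abs(cq[q]) for q in tasks])
    # Winsorize at the 95th percentile so no single task dominates.
    w_wins = np.minimum(w, np.percentile(w, 95))

    results = {}
    for model, by_task in mean_scores.items():
        covered = [i for i, q in enumerate(tasks) if q in by_task]
        if not covered:
            continue  # nothing to score for this model
        s = np.array([by_task[tasks[i]] for i in covered])

        def weighted_mean(weights):
            ww = weights[covered]
            total = ww.sum()
            return float((ww * s).sum() / total) if total > 0 else float("nan")

        results[model] = {
            "flat": float(s.mean()),
            "weighted": weighted_mean(w),
            "winsorized": weighted_mean(w_wins),
        }
    return results
```

A model that scores poorly on a high-weight task loses more under `weighted` and `winsorized` than under `flat`, which is the mechanism behind rank changes between the two views.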
|
||||
|
||||
Under the winsorized SNR × |C(q)| weighting on the same 1,080-run archive, **Opus 4.7 ranks #1** (Opus 4.6 leads under the flat mean) and **GPT 5.4 drops from #3 to #7**: its task-specific cliffs (0.16 on `t3-feature-export`) fall on the highest-signal tasks. The flat mean averages this away.
|
||||
@ -349,27 +388,48 @@ clawbench run \
|
||||
-o results/opus46_core_v1.json
|
||||
```
|
||||
|
||||
### Analyze an archive with the diagnostic suite
|
||||
### Analyze a real archive
|
||||
|
||||
```bash
|
||||
# 1. Aggregate coverage + fair-comparison audit
|
||||
# Fair-comparison audit
|
||||
python3 scripts/audit_runs.py
|
||||
|
||||
# 2. Rejudge any judge-infrastructure failures via direct Anthropic API
|
||||
python3 scripts/rejudge_all.py \
|
||||
--drift-dir data/drift_2026-04-19-full \
|
||||
--archive-dir data/run_cache_archive/v2026-4-19-full
|
||||
|
||||
# 3. Generate the fair comparison report
|
||||
python3 scripts/generate_fair_report.py --tag v2026-4-19-full
|
||||
|
||||
# 4. Dynamical-systems diagnostics (C(q), regimes, survival, SNR-weighted)
|
||||
.venv/bin/python3 scripts/compute_constraint_index.py
|
||||
.venv/bin/python3 scripts/classify_regimes.py
|
||||
.venv/bin/python3 scripts/variance_decomp.py
|
||||
.venv/bin/python3 scripts/survival_analysis.py
|
||||
.venv/bin/python3 scripts/snr_weighted_ranking.py
|
||||
.venv/bin/python3 scripts/generate_dynamical_report.py
|
||||
# Posterior dynamics + ranking from cached per-run JSONs
|
||||
python3 scripts/run_posterior_dynamics_pipeline.py \
|
||||
--archive-dir .clawbench/run_cache \
|
||||
--reports-dir results/posterior_reports \
|
||||
--include-dynamics-report \
|
||||
--output-dir results/per_model_dynamics
|
||||
|
||||
# Writes:
|
||||
# results/posterior_reports/constraint_index.json
|
||||
# results/posterior_reports/regimes.json
|
||||
# results/posterior_reports/variance_decomposition.json
|
||||
# results/posterior_reports/survival_analysis.json
|
||||
# results/posterior_reports/snr_weighted_ranking.json
|
||||
# results/posterior_reports/EVAL_REPORT_DYNAMICAL.md
|
||||
# results/per_model_dynamics/<safe_model_name>/dynamics.json
|
||||
# results/per_model_dynamics/<safe_model_name>/*.png
|
||||
```
|
||||
|
||||
If you only want one model's offline dynamics bundle:
|
||||
|
||||
```bash
|
||||
clawbench dynamics-report \
|
||||
--archive-dir .clawbench/run_cache \
|
||||
--model ollama/gpt-oss:20b \
|
||||
--output-dir results/gptoss_dynamics
|
||||
|
||||
# Quick CI path: skip plot rendering
|
||||
clawbench dynamics-report \
|
||||
--archive-dir .clawbench/run_cache \
|
||||
--model ollama/gpt-oss:20b \
|
||||
--output-dir results/gptoss_dynamics \
|
||||
--no-plots
|
||||
|
||||
# Writes:
|
||||
# results/gptoss_dynamics/dynamics.json
|
||||
```
|
||||
|
||||
### Running locally with small models (Ollama)
|
||||
@ -379,7 +439,24 @@ A single consumer GPU running an open-weight model is enough to develop plugin p
|
||||
```bash
|
||||
ollama pull gpt-oss:20b
|
||||
export OPENCLAW_GATEWAY_TOKEN=<your-gateway-token>
|
||||
clawbench run --model ollama/gpt-oss:20b --task t1-fs-quick-note --runs 1
|
||||
export CLAWBENCH_RUN_CACHE_DIR=$PWD/.clawbench/run_cache
|
||||
|
||||
# Real benchmark run + immediate per-run dynamics bundle
|
||||
clawbench run \
|
||||
--model ollama/gpt-oss:20b \
|
||||
--task t1-fs-quick-note \
|
||||
--runs 1 \
|
||||
--dynamics \
|
||||
-o results/ollama_smoke.json
|
||||
|
||||
# Optional second local model
|
||||
ollama pull qwen3.5:27b
|
||||
|
||||
# Offline posterior analysis reads CLAWBENCH_RUN_CACHE_DIR
|
||||
python3 scripts/run_posterior_dynamics_pipeline.py \
|
||||
--archive-dir .clawbench/run_cache \
|
||||
--reports-dir results/posterior_reports
|
||||
|
||||
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
|
||||
```
|
||||
|
||||
@ -415,6 +492,9 @@ clawbench/
|
||||
│ ├── profile.py # v0.5 plugin fingerprinting
|
||||
│ ├── diagnostic.py # Configuration Diagnostic report
|
||||
│ ├── factor_analysis.py # fANOVA factor importance
|
||||
│ ├── dynamics.py # Trajectory metrics + sensitivity analysis
|
||||
│ ├── dynamics_archive.py # Cached-run loading + offline report assembly
|
||||
│ ├── dynamics_plots.py # Offline dynamics visualizations
|
||||
│ └── cli.py # CLI entry points
|
||||
│
|
||||
├── tasks-public/ # Core v1 PUBLIC release (19 tasks)
|
||||
@ -431,6 +511,7 @@ clawbench/
|
||||
│ ├── audit_per_run.py # Per-run cross-model audit
|
||||
│ ├── rejudge_all.py # Direct-API rejudge for broken gateway judges
|
||||
│ ├── generate_fair_report.py # Fair N-model comparison report
|
||||
│ ├── run_posterior_dynamics_pipeline.py # One-shot posterior analysis driver
|
||||
│ ├── compute_constraint_index.py # C(q) per task
|
||||
│ ├── classify_regimes.py # Per-run dynamical regime classifier
|
||||
│ ├── variance_decomp.py # Seed-noise vs capability-signal decomposition
|
||||
@ -439,7 +520,7 @@ clawbench/
|
||||
│ └── generate_dynamical_report.py # Combined dynamical-systems report
|
||||
│
|
||||
├── profiles/ # v0.5 plugin profile YAMLs
|
||||
├── tests/ # 107 tests
|
||||
├── tests/ # Test suite
|
||||
├── Dockerfile # Layered on ghcr.io/openclaw/openclaw:latest
|
||||
├── CLAWBENCH_V0_4_SPEC.md # Full specification
|
||||
└── PARTNER_TRACE_SPEC.md # Trace interchange format
|
||||
@ -469,7 +550,7 @@ clawbench/
|
||||
## Testing
|
||||
|
||||
```bash
|
||||
python -m pytest -q # 107 tests
|
||||
python -m pytest -q
|
||||
```
|
||||
|
||||
Key test invariants:
|
||||
|
||||
@ -136,6 +136,15 @@ submission
|
||||
|
||||
Important rule: browser tasks stay serialized on one dedicated lane to avoid Chromium and port-range collisions.
|
||||
|
||||
## Submission presets
|
||||
|
||||
The Submit tab now exposes two preset audiences so the Space can serve both general Claw users and lower-budget exploratory runs:
|
||||
|
||||
- `Claw Users` keeps the full preset catalog, including provider-backed frontier models.
|
||||
- `Budget Researchers` narrows the list to local or lower-cost presets such as `ollama/gpt-oss:20b`, `ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and `huggingface/google/gemma-4-26B-A4B-it`.
|
||||
|
||||
You can still enter any custom model ID directly; the preset audience only filters the shortcut catalog and the bulk-submit action.
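
To make the filtering behaviour concrete, here is a minimal sketch of an audience-tagged catalog. It is not the contents of `clawbench/submission_models.py`: the `Preset` dataclass layout, the `audiences` field, the provider strings, and the helper `presets_for` are assumptions; only the audience names, labels, and model IDs come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    label: str
    model_id: str
    provider: str  # assumed provider identifiers
    audiences: tuple[str, ...]


# Two illustrative entries; a real catalog would list every preset.
CATALOG = (
    Preset(
        "GPT-OSS 20B (Ollama)",
        "ollama/gpt-oss:20b",
        "ollama",
        ("Claw Users", "Budget Researchers"),
    ),
    Preset(
        "Claude Opus 4.6",
        "anthropic/claude-opus-4-6",
        "anthropic",
        ("Claw Users",),
    ),
)


def presets_for(audience: str) -> list[Preset]:
    """Return the presets tagged with the given audience, in catalog order."""
    return [p for p in CATALOG if audience in p.audiences]


budget_labels = [p.label for p in presets_for("Budget Researchers")]
```

Tagging each preset with its audiences, rather than keeping two separate lists, keeps a shared model such as `ollama/gpt-oss:20b` defined in one place.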
|
||||
|
||||
## Task inventory
|
||||
|
||||
| Task | Tier | Family | Main verification |
|
||||
|
||||
132
app.py
@ -26,6 +26,15 @@ from clawbench.hub import (
|
||||
load_submission_rows_from_parquet,
|
||||
resolve_dataset_repo,
|
||||
)
|
||||
from clawbench.submission_models import (
|
||||
CUSTOM_PRESET_LABEL,
|
||||
PRESET_AUDIENCE_ALL,
|
||||
PRESET_AUDIENCE_CHOICES,
|
||||
PRESET_MODEL_MAP,
|
||||
preset_labels_for_audience,
|
||||
preset_models_for_audience,
|
||||
resolve_model_selection,
|
||||
)
|
||||
|
||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
|
||||
logger = logging.getLogger("clawbench.app")
|
||||
@ -51,31 +60,6 @@ def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int:
|
||||
DEFAULT_RUNS_PER_TASK = _env_int("CLAWBENCH_DEFAULT_RUNS_PER_TASK", 3, minimum=1, maximum=10)
|
||||
DEFAULT_PARALLEL_LANES = _env_int("CLAWBENCH_DEFAULT_PARALLEL_LANES", 1, minimum=1, maximum=4)
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Preset models for quick submission
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
PRESET_MODELS = {
|
||||
# All models verified working on HF Inference API (free with HF_TOKEN)
|
||||
# Tested 2026-04-07 via router.huggingface.co/v1/chat/completions
|
||||
#
|
||||
# --- Chinese open-source ---
|
||||
"GLM 5.1 (754B MoE)": "huggingface/zai-org/GLM-5.1",
|
||||
"GLM 5 (400B MoE)": "huggingface/zai-org/GLM-5",
|
||||
"Qwen3 32B": "huggingface/Qwen/Qwen3-32B",
|
||||
"DeepSeek R1": "huggingface/deepseek-ai/DeepSeek-R1",
|
||||
"Kimi K2 Instruct": "huggingface/moonshotai/Kimi-K2-Instruct",
|
||||
"MiniMax M2.5": "huggingface/MiniMaxAI/MiniMax-M2.5",
|
||||
# --- Google open-source ---
|
||||
"Gemma 4 26B MoE": "huggingface/google/gemma-4-26B-A4B-it",
|
||||
# --- Meta open-source ---
|
||||
"Llama 3.3 70B": "huggingface/meta-llama/Llama-3.3-70B-Instruct",
|
||||
"Llama 3.1 70B": "huggingface/meta-llama/Llama-3.1-70B-Instruct",
|
||||
# --- Proprietary models (require runtime auth configured for the model provider) ---
|
||||
"Claude Sonnet 4.6": "anthropic/claude-sonnet-4-6",
|
||||
"Claude Opus 4.6": "anthropic/claude-opus-4-6",
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Background worker (starts in a thread)
|
||||
# ---------------------------------------------------------------------------
|
||||
@ -271,15 +255,14 @@ def submit_model(
|
||||
prompt_variant: str,
|
||||
submitter: str,
|
||||
) -> str:
|
||||
# Use preset if selected, otherwise use custom model ID
|
||||
model_id = PRESET_MODELS.get(preset, "") or model.strip()
|
||||
model_id, provider_id = resolve_model_selection(model, preset, provider)
|
||||
if not model_id:
|
||||
return "Please enter a model ID or select a preset."
|
||||
|
||||
selected_tier = tier if tier != "all" else None
|
||||
request = SubmissionRequest(
|
||||
model=model_id,
|
||||
provider=provider.strip(),
|
||||
provider=provider_id,
|
||||
judge_model=judge_model.strip(),
|
||||
runs_per_task=int(runs),
|
||||
max_parallel_lanes=int(max_parallel_lanes),
|
||||
@ -292,20 +275,38 @@ def submit_model(
|
||||
return f"Submitted [{model_id}]! Job ID: {job.job_id}. Check the Queue tab."
|
||||
|
||||
|
||||
def submit_all_presets(runs: int, max_parallel_lanes: int, submitter: str) -> str:
|
||||
"""Submit all preset models at once."""
|
||||
def submit_all_presets(
|
||||
preset_audience: str,
|
||||
runs: int,
|
||||
max_parallel_lanes: int,
|
||||
submitter: str,
|
||||
) -> str:
|
||||
"""Submit all preset models from the selected audience track."""
|
||||
presets = preset_models_for_audience(preset_audience)
|
||||
if not presets:
|
||||
return f"No presets configured for {preset_audience}."
|
||||
|
||||
submitted = []
|
||||
for name, model_id in PRESET_MODELS.items():
|
||||
for preset in presets:
|
||||
request = SubmissionRequest(
|
||||
model=model_id,
|
||||
provider="",
|
||||
model=preset.model_id,
|
||||
provider=preset.provider,
|
||||
runs_per_task=int(runs),
|
||||
max_parallel_lanes=int(max_parallel_lanes),
|
||||
submitter=submitter.strip(),
|
||||
)
|
||||
job = asyncio.run(queue.submit(request))
|
||||
submitted.append(f"{name} ({job.job_id})")
|
||||
return f"Submitted {len(submitted)} models:\n" + "\n".join(f" - {s}" for s in submitted)
|
||||
submitted.append(f"{preset.label} ({job.job_id})")
|
||||
return f"Submitted {len(submitted)} models from {preset_audience}:\n" + "\n".join(
|
||||
f" - {item}" for item in submitted
|
||||
)
|
||||
|
||||
|
||||
def update_preset_choices(preset_audience: str):
|
||||
return gr.update(
|
||||
choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(preset_audience),
|
||||
value=CUSTOM_PRESET_LABEL,
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
@ -952,7 +953,7 @@ STAT_JUDGE = (
|
||||
)
|
||||
STAT_PRESETS = (
|
||||
'<div class="stat-pill"><div class="label">Presets</div><div class="value teal">'
|
||||
+ str(len(PRESET_MODELS))
|
||||
+ str(len(PRESET_MODEL_MAP))
|
||||
+ "</div></div>"
|
||||
)
|
||||
|
||||
@ -986,12 +987,28 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
|
||||
"run via HuggingFace Inference API. You can also use locally hosted models "
|
||||
"(for example Ollama) when your OpenClaw runtime has them configured."
|
||||
)
|
||||
gr.Markdown(
|
||||
"Use `Preset Audience` to switch between the full Claw catalog and a smaller budget track. "
|
||||
"The budget track keeps local and lower-cost options upfront, including `ollama/gpt-oss:20b`, "
|
||||
"`ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and "
|
||||
"`huggingface/google/gemma-4-26B-A4B-it`."
|
||||
)
|
||||
|
||||
preset_audience_input = gr.Dropdown(
|
||||
choices=list(PRESET_AUDIENCE_CHOICES),
|
||||
value=PRESET_AUDIENCE_ALL,
|
||||
label="Preset Audience",
|
||||
)
|
||||
preset_input = gr.Dropdown(
|
||||
choices=["(custom)"] + list(PRESET_MODELS.keys()),
|
||||
value="(custom)",
|
||||
choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(PRESET_AUDIENCE_ALL),
|
||||
value=CUSTOM_PRESET_LABEL,
|
||||
label="Preset models",
|
||||
)
|
||||
preset_audience_input.change(
|
||||
fn=update_preset_choices,
|
||||
inputs=preset_audience_input,
|
||||
outputs=preset_input,
|
||||
)
|
||||
with gr.Row():
|
||||
model_input = gr.Textbox(
|
||||
label="Custom Model ID (if not using preset)",
|
||||
@ -1074,26 +1091,35 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
|
||||
)
|
||||
submit_all_btn.click(
|
||||
fn=submit_all_presets,
|
||||
inputs=[runs_input, max_parallel_lanes_input, submitter_input],
|
||||
inputs=[preset_audience_input, runs_input, max_parallel_lanes_input, submitter_input],
|
||||
outputs=submit_output,
|
||||
)
|
||||
|
||||
gr.Markdown("""
|
||||
**All presets verified working on HF Inference API (free):**
|
||||
**Preset audiences:**
|
||||
|
||||
| Model | Provider | Size | Runtime |
|
||||
|-------|----------|------|---------|
|
||||
| GLM 5.1 | Z.ai | 754B MoE | HF free |
|
||||
| GLM 5 | Z.ai | 400B MoE | HF free |
|
||||
| Qwen3 32B | Alibaba | 32B | HF free |
|
||||
| DeepSeek R1 | DeepSeek | 671B MoE | HF free |
|
||||
| Kimi K2 Instruct | Moonshot AI | MoE | HF free |
|
||||
| MiniMax M2.5 | MiniMax | MoE | HF free |
|
||||
| Gemma 4 26B MoE | Google | 26B MoE | HF free |
|
||||
| Llama 3.3 70B | Meta | 70B | HF free |
|
||||
| Llama 3.1 70B | Meta | 70B | HF free |
|
||||
| Claude Sonnet 4.6 | Anthropic | - | configured auth |
|
||||
| Claude Opus 4.6 | Anthropic | - | configured auth |
|
||||
| Audience | What it optimizes for | Presets |
|
||||
|---|---|---|
|
||||
| Claw Users | Full preset catalog, including provider-backed frontier options | Anthropic, HF open-weight, and Ollama presets |
|
||||
| Budget Researchers | Smaller local/free-friendly track | GPT-OSS 20B, Qwen 3.5 27B, Qwen3 32B, Gemma 4 26B |
|
||||
|
||||
**Current preset catalog:**
|
||||
|
||||
| Model | Provider | Audience |
|
||||
|---|---|---|
|
||||
| GPT-OSS 20B (Ollama) | Ollama | Claw Users, Budget Researchers |
|
||||
| Qwen 3.5 27B (Ollama) | Ollama | Claw Users, Budget Researchers |
|
||||
| Qwen3 32B | HuggingFace | Claw Users, Budget Researchers |
|
||||
| Gemma 4 26B MoE | HuggingFace | Claw Users, Budget Researchers |
|
||||
| GLM 5.1 | HuggingFace | Claw Users |
|
||||
| GLM 5 | HuggingFace | Claw Users |
|
||||
| DeepSeek R1 | HuggingFace | Claw Users |
|
||||
| Kimi K2 Instruct | HuggingFace | Claw Users |
|
||||
| MiniMax M2.5 | HuggingFace | Claw Users |
|
||||
| Llama 3.3 70B | HuggingFace | Claw Users |
|
||||
| Llama 3.1 70B | HuggingFace | Claw Users |
|
||||
| Claude Sonnet 4.6 | Anthropic | Claw Users |
|
||||
| Claude Opus 4.6 | Anthropic | Claw Users |
|
||||
""")
|
||||
|
||||
with gr.Tab("Queue"):
|
||||
|
||||
104
clawbench/cli.py
@ -116,6 +116,11 @@ def cli(verbose: bool) -> None:
|
||||
show_default=True,
|
||||
help="Where to write ecosystem insight files after a --profile run.",
|
||||
)
|
||||
@click.option(
|
||||
"--dynamics",
|
||||
is_flag=True,
|
||||
help="Run quick post-benchmark dynamics analysis. Prefer dynamics-report for offline cache/archive analysis.",
|
||||
)
|
||||
def run(
|
||||
model: str,
|
||||
gateway_token: str,
|
||||
@ -137,6 +142,7 @@ def run(
|
||||
browser_concurrency: int,
|
||||
profile: Path | None,
|
||||
insights_dir: Path,
|
||||
dynamics: bool,
|
||||
) -> None:
|
||||
gateway_config = GatewayConfig(token=gateway_token)
|
||||
harness = BenchmarkHarness(
|
||||
@ -165,6 +171,9 @@ def run(
|
||||
json.dump(result.model_dump(), handle, indent=2)
|
||||
click.echo(f"\nResults saved to {out_path}")
|
||||
|
||||
if dynamics:
|
||||
_run_dynamics_analysis(harness.last_task_runs, out_path)
|
||||
|
||||
if profile is not None:
|
||||
_run_v05_diagnostic(
|
||||
profile_path=profile,
|
||||
@ -179,6 +188,83 @@ def run(
|
||||
asyncio.run(upload_result(result))
|
||||
|
||||
|
||||
@cli.command("dynamics-report")
|
||||
@click.option(
|
||||
"--archive-dir",
|
||||
type=click.Path(exists=True, file_okay=False, path_type=Path),
|
||||
required=True,
|
||||
help="Path to a run cache/archive root or a single model cache directory.",
|
||||
)
|
||||
@click.option(
|
||||
"--model",
|
||||
default=None,
|
||||
help="Model id to select when the archive root contains multiple model directories.",
|
||||
)
|
||||
@click.option("--tier", type=click.Choice(["tier1", "tier2", "tier3", "tier4", "tier5"]))
|
||||
@click.option("--task", "task_ids", multiple=True, help="Specific task IDs to include from the archive.")
|
||||
@click.option(
|
||||
"--output-dir",
|
||||
type=click.Path(path_type=Path),
|
||||
default=Path("results/offline_dynamics"),
|
||||
show_default=True,
|
||||
help="Directory where dynamics.json and plots will be written.",
|
||||
)
|
||||
@click.option(
|
||||
"--no-plots",
|
||||
is_flag=True,
|
||||
help="Write only dynamics.json and skip plot rendering.",
|
||||
)
|
||||
def dynamics_report(
|
||||
archive_dir: Path,
|
||||
model: str | None,
|
||||
tier: str | None,
|
||||
task_ids: tuple[str, ...],
|
||||
output_dir: Path,
|
||||
no_plots: bool,
|
||||
) -> None:
|
||||
"""Generate dynamics plots and a JSON report from cached TaskRunResult archives."""
|
||||
from clawbench.dynamics_archive import load_task_runs_archive
|
||||
|
||||
try:
|
||||
task_runs = load_task_runs_archive(
|
||||
archive_dir=archive_dir,
|
||||
model=model,
|
||||
task_ids=task_ids,
|
||||
tier=tier,
|
||||
)
|
||||
except ValueError as exc:
|
||||
raise click.ClickException(str(exc)) from exc
|
||||
|
||||
if not task_runs:
|
||||
raise click.ClickException(f"No cached runs found under {archive_dir}")
|
||||
|
||||
report_path, plots, n_runs = _write_dynamics_report(
|
||||
task_runs,
|
||||
output_dir,
|
||||
generate_plots=not no_plots,
|
||||
)
|
||||
click.echo(f"Loaded {n_runs} cached runs across {len(task_runs)} tasks")
|
||||
click.echo(f"Dynamics report saved to {report_path}")
|
||||
click.echo(f"Saved {len(plots)} plots to {output_dir}/")
|
||||
|
||||
|
||||
def _write_dynamics_report(
|
||||
task_runs: dict[str, list],
|
||||
output_dir: Path,
|
||||
*,
|
||||
generate_plots: bool = True,
|
||||
) -> tuple[Path, list[Path], int]:
|
||||
from clawbench.dynamics_archive import write_dynamics_report
|
||||
|
||||
report_path, plots = write_dynamics_report(
|
||||
task_runs,
|
||||
output_dir,
|
||||
generate_plots=generate_plots,
|
||||
)
|
||||
n_runs = sum(len(runs) for runs in task_runs.values())
|
||||
return report_path, plots, n_runs
|
||||
|
||||
|
||||
def _run_v05_diagnostic(
|
||||
*,
|
||||
profile_path: Path,
|
||||
@ -693,5 +779,23 @@ def show(result_file: str) -> None:
|
||||
)
|
||||
|
||||
|
||||
def _run_dynamics_analysis(
|
||||
task_runs: dict[str, list],
|
||||
result_path: str,
|
||||
) -> None:
|
||||
"""Compute stratified dynamics from raw TaskRunResult objects."""
|
||||
run_stem = Path(result_path).stem
|
||||
dyn_dir = Path(result_path).parent / f"{run_stem}_dynamics"
|
||||
try:
|
||||
dyn_path, plots, n_runs = _write_dynamics_report(task_runs, dyn_dir)
|
||||
except ValueError as exc:
|
||||
click.echo(str(exc))
|
||||
return
|
||||
|
||||
click.echo(f"\n[dynamics] Analysed {n_runs} cached runs")
|
||||
click.echo(f" Dynamics report saved to {dyn_path}")
|
||||
click.echo(f" Saved {len(plots)} plots to {dyn_dir}/")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
cli()
|
||||
|
||||
@ -8,7 +8,9 @@ import logging
|
||||
import math
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
import subprocess
|
||||
import sys
|
||||
import uuid
|
||||
from dataclasses import dataclass, field
|
||||
from typing import Any
|
||||
@ -24,10 +26,10 @@ logger = logging.getLogger(__name__)
|
||||
|
||||
PROTOCOL_VERSION = 3
|
||||
DEVICE_IDENTITY_HELPER_JS = r"""
|
||||
const crypto = require("node:crypto");
|
||||
const fs = require("node:fs");
|
||||
const os = require("node:os");
|
||||
const path = require("node:path");
|
||||
const crypto = require("crypto");
|
||||
const fs = require("fs");
|
||||
const os = require("os");
|
||||
const path = require("path");
|
||||
|
||||
const ED25519_SPKI_PREFIX = Buffer.from("302a300506032b6570032100", "hex");
|
||||
|
||||
@ -52,7 +54,7 @@ function fingerprintPublicKey(publicKeyPem) {
|
||||
}
|
||||
|
||||
function generateIdentity() {
|
||||
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
|
||||
const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519", {});
|
||||
const publicKeyPem = publicKey.export({ type: "spki", format: "pem" }).toString();
|
||||
const privateKeyPem = privateKey.export({ type: "pkcs8", format: "pem" }).toString();
|
||||
return {
|
||||
@ -445,12 +447,48 @@ class GatewayClient:
|
||||
max_wait_seconds=2.0,
|
||||
)
|
||||
)
|
||||
|
||||
# Some gateway/provider paths persist assistant messages in session
|
||||
# history without emitting complete streaming events. Backfill from
|
||||
# sessions.get if stream capture appears incomplete.
|
||||
history_messages = await self.get_session_messages(session_key)
|
||||
collected_assistant = sum(
|
||||
1 for msg in collected_messages if msg.role == "assistant"
|
||||
)
|
||||
history_assistant = sum(
|
||||
1 for msg in history_messages if msg.role == "assistant"
|
||||
)
|
||||
if history_messages and (
|
||||
len(history_messages) > len(collected_messages)
|
||||
or history_assistant > collected_assistant
|
||||
):
|
||||
collected_messages = history_messages
|
||||
finally:
|
||||
self._event_queues.pop(chat_queue_key, None)
|
||||
self._event_queues.pop(msg_queue_key, None)
|
||||
|
||||
return _correlate_transcript(Transcript(messages=collected_messages))
|
||||
|
||||
async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]:
|
||||
try:
|
||||
response = await self._rpc("sessions.get", {"key": session_key})
|
||||
except Exception:
|
||||
return []
|
||||
|
||||
payload = response.get("payload", {})
|
||||
raw_messages = payload.get("messages", [])
|
||||
if not isinstance(raw_messages, list):
|
||||
return []
|
||||
|
||||
parsed: list[TranscriptMessage] = []
|
||||
for raw in raw_messages:
|
||||
if not isinstance(raw, dict):
|
||||
continue
|
||||
msg = _parse_single_message(raw)
|
||||
if msg is not None:
|
||||
parsed.append(msg)
|
||||
return parsed
|
||||
|
||||
async def _rpc(
|
||||
self,
|
||||
method: str,
|
||||
@ -551,9 +589,17 @@ def _build_connect_device(
|
||||
"deviceFamily": device_family or "",
|
||||
}
|
||||
)
|
||||
|
||||
node_executable = _resolve_node_executable()
|
||||
if not node_executable:
|
||||
logger.warning(
|
||||
"Failed to build device identity payload: no Node executable found"
|
||||
)
|
||||
return None
|
||||
|
||||
try:
|
||||
completed = subprocess.run(
|
||||
["node", "-e", DEVICE_IDENTITY_HELPER_JS],
|
||||
[node_executable, "-e", DEVICE_IDENTITY_HELPER_JS],
|
||||
input=helper_input,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
@ -577,6 +623,25 @@ def _build_connect_device(
|
||||
return payload
|
||||
|
||||
|
||||
def _resolve_node_executable() -> str | None:
|
||||
"""Resolve Node binary, preferring the active Python/conda environment."""
|
||||
candidates: list[str] = []
|
||||
|
||||
# First try the same environment as the active Python interpreter.
|
||||
candidates.append(os.path.join(os.path.dirname(sys.executable), "node"))
|
||||
|
||||
# Then try CONDA_PREFIX when available.
|
||||
conda_prefix = os.environ.get("CONDA_PREFIX")
|
||||
if conda_prefix:
|
||||
candidates.append(os.path.join(conda_prefix, "bin", "node"))
|
||||
|
||||
for candidate in candidates:
|
||||
if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
|
||||
return candidate
|
||||
|
||||
return shutil.which("node")
|
||||
|
||||
|
||||
def _is_transient_gateway_connect_error(exc: Exception) -> bool:
|
||||
if isinstance(exc, InvalidStatus):
|
||||
return exc.response.status_code in {502, 503, 504}
|
||||
@ -615,6 +680,9 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
|
||||
if block_type == "text":
|
||||
text_parts.append(block.get("text", ""))
|
||||
continue
|
||||
if block_type == "output_text":
|
||||
text_parts.append(block.get("text", ""))
|
||||
continue
|
||||
if block_type in {"tool_use", "toolCall"}:
|
||||
arguments = block.get("input", block.get("arguments", {}))
|
||||
if isinstance(arguments, str):
|
||||
@ -641,6 +709,16 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
|
||||
if tool_result_content:
|
||||
text_parts.append(tool_result_content)
|
||||
|
||||
# Some providers surface assistant failures in a dedicated error field
|
||||
# with empty content blocks. Preserve that signal in transcript text.
|
||||
error_message = message_data.get("errorMessage", "")
|
||||
if isinstance(error_message, str) and error_message.strip():
|
||||
text_parts.append(error_message.strip())
|
||||
|
||||
direct_text = message_data.get("text", "")
|
||||
if isinstance(direct_text, str) and direct_text.strip():
|
||||
text_parts.append(direct_text.strip())
|
||||
|
||||
if not text_parts and not tool_calls and not tool_result_for:
|
||||
return None
|
||||
|
||||
|
||||
695
clawbench/dynamics.py
Normal file
@ -0,0 +1,695 @@
"""Dynamics analysis for ClawBench agent trajectories.

Treats each agent run as a discrete dynamical system and computes step
embeddings, trajectory metrics, sensitivity analysis, regime classification,
Kaplan-Meier survival, non-Markov memory, and stratified assessment with
Bayesian importance-weight correction for distribution shift.
"""

from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import TYPE_CHECKING, Callable

import numpy as np

if TYPE_CHECKING:
    from clawbench.schemas import TaskRunResult, Transcript

# ── Constants ──────────────────────────────────────────────────────────

TOOL_FAMILIES = ("browser", "edit", "execute", "memory", "read", "search")
_N_FAM = len(TOOL_FAMILIES)

# ── Types ──────────────────────────────────────────────────────────────


class Regime(str, Enum):
    convergent = "convergent"
    chaotic = "chaotic"
    trapped = "trapped"
    diffusive = "diffusive"
    limit_cycle = "limit_cycle"
    unknown = "unknown"


@dataclass
class Dynamics:
    """Computed dynamics for a single trajectory."""

    n_steps: int
    embeddings: np.ndarray  # (n_steps, 10)
    drift: np.ndarray  # cosine distance from step 0
    step_size: np.ndarray  # cosine distance from step t-1
    entropy_series: list[float]  # running tool-family entropy
    error_rate_series: list[float]  # running error fraction
    tokens_series: list[int]
    latency_series: list[float]
    tool_sequence: list[str]  # primary family per step
    markov: dict[str, dict[str, float]]
    family_dist: dict[str, float]
    regime: Regime
    mean_drift: float
    mean_step_size: float
    tool_entropy: float
    error_rate: float
    constraint_index: float
    pca_trajectory: np.ndarray | None = None  # (n_steps, 2)
    bigram_transitions: dict[str, dict[str, float]] = field(default_factory=dict)
    memory_depth: float = 0.0  # I(X_t; X_{t-2} | X_{t-1})


@dataclass
class Sensitivity:
    """Pairwise comparison between two runs of the same task."""

    task_id: str
    score_delta: float
    tool_edit_distance: int
    family_js_divergence: float
    embedding_divergence: np.ndarray  # (min_steps,)
    lyapunov_proxy: float


@dataclass
class SurvivalPoint:
    time: float
    survival: float


# ── Helpers ────────────────────────────────────────────────────────────


def _cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na < 1e-12 or nb < 1e-12:
        return 1.0
    return float(1.0 - np.dot(a, b) / (na * nb))


def _entropy(counts: dict[str, int]) -> float:
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def _js_divergence(p: dict[str, int], q: dict[str, int]) -> float:
    keys = set(p) | set(q)
    if not keys:
        return 0.0
    tp, tq = sum(p.values()) or 1, sum(q.values()) or 1
    jsd = 0.0
    for k in keys:
        pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq
        mk = (pk + qk) / 2
        if pk > 0 and mk > 0:
            jsd += 0.5 * pk * math.log2(pk / mk)
        if qk > 0 and mk > 0:
            jsd += 0.5 * qk * math.log2(qk / mk)
    return jsd
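A minimal standalone sketch of the Jensen-Shannon computation used by the helper above, which may help when sanity-checking its range. The function name `js_divergence` and the sample tool-family counts are illustrative assumptions, not part of the module. With base-2 logarithms the value should lie between 0 (identical distributions) and 1 (disjoint support).

```python
import math


def js_divergence(p: dict[str, int], q: dict[str, int]) -> float:
    """Jensen-Shannon divergence (base 2) between two count dictionaries."""
    keys = set(p) | set(q)
    tp, tq = sum(p.values()) or 1, sum(q.values()) or 1
    total = 0.0
    for k in keys:
        pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq
        mk = (pk + qk) / 2  # mixture distribution M = (P + Q) / 2
        if pk > 0:
            total += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            total += 0.5 * qk * math.log2(qk / mk)
    return total


# Hypothetical tool-family counts for two runs.
same = js_divergence({"read": 3, "edit": 1}, {"read": 6, "edit": 2})
disjoint = js_divergence({"read": 4}, {"edit": 4})
```

Because counts are normalised before comparison, two runs with the same family proportions but different lengths are treated as identical.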


def _levenshtein(a: list, b: list) -> int:
    if not a:
        return len(b)
    if not b:
        return len(a)
    prev = list(range(len(b) + 1))
    for ca in a:
        curr = [prev[0] + 1] + [0] * len(b)
        for j, cb in enumerate(b):
            curr[j + 1] = min(
                prev[j] + (0 if ca == cb else 1),
                prev[j + 1] + 1,
                curr[j] + 1,
            )
        prev = curr
    return prev[-1]
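The two-row dynamic programme above can be exercised on short tool sequences to see what `tool_edit_distance` measures. The sketch below is a standalone re-statement under the assumption that each step is reduced to a single family label; `edit_distance` and the sample sequences are hypothetical names chosen for illustration.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two sequences using two DP rows."""
    prev = list(range(len(b) + 1))
    for item_a in a:
        curr = [prev[0] + 1]
        for j, item_b in enumerate(b):
            curr.append(min(
                prev[j] + (item_a != item_b),  # substitute (free on a match)
                prev[j + 1] + 1,               # delete from a
                curr[j] + 1,                   # insert into a
            ))
        prev = curr
    return prev[-1]


# One extra "search" step costs a single insertion.
inserted = edit_distance(
    ["read", "edit", "execute"], ["read", "search", "edit", "execute"]
)
# Swapping two steps costs two edits because transpositions are not modelled.
swapped = edit_distance(["read", "edit"], ["edit", "read"])
```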


def _classify_tool(name: str) -> str:
    lo = name.lower()
    for fam in TOOL_FAMILIES:
        if fam in lo:
            return fam
    _ALIASES = {
        "edit": ("write_file", "create_file", "str_replace", "patch"),
        "execute": ("bash", "terminal", "shell", "run", "exec"),
        "browser": ("browse", "click", "navigate", "screenshot"),
        "search": ("grep", "find", "glob", "semantic"),
        "read": ("cat", "head", "tail", "view", "list_dir"),
    }
    for fam, keywords in _ALIASES.items():
        if any(k in lo for k in keywords):
            return fam
    return "execute"


def _normalize_tool_family(name: str, family: str | None) -> str:
    if family in TOOL_FAMILIES:
        return family
    return _classify_tool(name)


# ── Feature embedding ──────────────────────────────────────────────────


def _embed_transcript(
    transcript: Transcript,
) -> tuple[np.ndarray, list[str], list[int], list[float], list[bool]]:
    """Build (n_steps, 10) feature matrix from assistant turns.

    Features: [0:6] tool-family proportions, [6] error flag,
    [7] normalised tokens, [8] normalised text length, [9] progress.
    """
    msgs = transcript.assistant_messages
    n = len(msgs)
    if n == 0:
        return np.empty((0, _N_FAM + 4)), [], [], [], []

    X = np.zeros((n, _N_FAM + 4))
    families: list[str] = []
    tokens: list[int] = []
    latencies: list[float] = []
    errors: list[bool] = []
    raw_tokens = np.zeros(n)
    raw_text = np.zeros(n)

    for i, msg in enumerate(msgs):
        fam_counts: Counter = Counter()
        has_err = False
        for tc in msg.tool_calls:
            fam = _normalize_tool_family(tc.name, tc.family)
            fam_counts[fam] += 1
            if tc.success is False or tc.error:
                has_err = True
        n_tc = sum(fam_counts.values()) or 1
        for j, fam in enumerate(TOOL_FAMILIES):
            X[i, j] = fam_counts.get(fam, 0) / n_tc
        X[i, _N_FAM] = 1.0 if has_err else 0.0
        X[i, _N_FAM + 3] = i / max(n - 1, 1)

        families.append(
            max(fam_counts, key=fam_counts.get) if fam_counts else "execute"
        )
        errors.append(has_err)
        tokens.append(msg.usage.total_tokens)
        raw_tokens[i] = float(msg.usage.total_tokens)
        raw_text[i] = float(len(msg.text))
        dt = msg.timestamp_ms - msgs[i - 1].timestamp_ms if i > 0 else 0
        latencies.append(max(float(dt), 0.0))

    mx_tok = raw_tokens.max() or 1
    mx_txt = raw_text.max() or 1
    X[:, _N_FAM + 1] = raw_tokens / mx_tok
    X[:, _N_FAM + 2] = raw_text / mx_txt

    return X, families, tokens, latencies, errors


# ── Non-Markov memory ────────────────────────────────────────────────


def _compute_bigram_transitions(seq: list[str]) -> dict[str, dict[str, float]]:
    """P(family_t | family_{t-1}, family_{t-2}) grouped by bigram context."""
    if len(seq) < 3:
        return {}
    bigrams: dict[str, Counter] = {}
    for a, b, c in zip(seq[:-2], seq[1:-1], seq[2:]):
        ctx = f"{a}->{b}"
        bigrams.setdefault(ctx, Counter())[c] += 1
    return {
        ctx: {k: v / sum(cnts.values()) for k, v in cnts.items()}
        for ctx, cnts in bigrams.items()
    }


def _conditional_mi(seq: list[str]) -> float:
    """I(X_t ; X_{t-2} | X_{t-1}): non-Markov memory indicator (bits)."""
    if len(seq) < 3:
        return 0.0
    n = len(seq) - 2
    triple = Counter(zip(seq[:-2], seq[1:-1], seq[2:]))
    pair_01 = Counter(zip(seq[:-2], seq[1:-1]))
    pair_12 = Counter(zip(seq[1:-1], seq[2:]))
    single = Counter(seq[1:-1])

    mi = 0.0
    for (a, b, c), count in triple.items():
        p_abc = count / n
        p_ab, p_bc, p_b = pair_01[(a, b)] / n, pair_12[(b, c)] / n, single[b] / n
        if p_ab > 0 and p_bc > 0 and p_b > 0:
            mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc))
    return max(mi, 0.0)
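To see what `memory_depth` is meant to pick up, the following standalone sketch re-states the plug-in estimator and applies it to two synthetic family sequences. The name `conditional_mi` and both sequences are assumptions made for illustration. A strictly alternating sequence is fully explained by the previous step, so the estimate should be zero, while a sequence with period four needs two steps of history and should score close to one bit.

```python
import math
from collections import Counter


def conditional_mi(seq: list[str]) -> float:
    """Plug-in estimate of I(X_t ; X_{t-2} | X_{t-1}) in bits."""
    if len(seq) < 3:
        return 0.0
    n = len(seq) - 2
    triples = Counter(zip(seq[:-2], seq[1:-1], seq[2:]))
    left = Counter(zip(seq[:-2], seq[1:-1]))    # (X_{t-2}, X_{t-1})
    right = Counter(zip(seq[1:-1], seq[2:]))    # (X_{t-1}, X_t)
    middle = Counter(seq[1:-1])                 # X_{t-1}
    mi = 0.0
    for (a, b, c), count in triples.items():
        p_abc = count / n
        p_ab, p_bc, p_b = left[(a, b)] / n, right[(b, c)] / n, middle[b] / n
        mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc))
    return max(mi, 0.0)


# First-order pattern: the previous step alone determines the next one.
first_order = conditional_mi(["read", "edit"] * 10)
# Second-order pattern: after "read" the next step depends on what came before.
second_order = conditional_mi(["read", "read", "edit", "edit"] * 5)
```

On short transcripts the plug-in estimate is biased upward, so small positive values are probably best read as noise rather than evidence of memory.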


# ── Core analysis ──────────────────────────────────────────────────────


def compute_dynamics(transcript: Transcript) -> Dynamics:
    """Compute trajectory dynamics from a single run transcript."""
    X, families, tokens, latencies, errors = _embed_transcript(transcript)
    n = len(families)

    drift = (
        np.array([_cosine_dist(X[0], X[i]) for i in range(n)])
        if n else np.array([])
    )
    step_sz = np.zeros(n)
    for i in range(1, n):
        step_sz[i] = _cosine_dist(X[i - 1], X[i])

    fam_acc: Counter = Counter()
    err_count = 0
    entropy_s: list[float] = []
    error_s: list[float] = []
    for i, (fam, err) in enumerate(zip(families, errors)):
        fam_acc[fam] += 1
        err_count += int(err)
        entropy_s.append(_entropy(dict(fam_acc)))
        error_s.append(err_count / (i + 1))

    total = sum(fam_acc.values()) or 1
    fam_dist = {k: v / total for k, v in fam_acc.items()}

    mc: dict[str, Counter] = {f: Counter() for f in TOOL_FAMILIES}
    for a, b in zip(families[:-1], families[1:]):
        mc[a][b] += 1
    markov = {
        src: ({dst: c / t for dst, c in cnts.items()} if (t := sum(cnts.values())) else {})
        for src, cnts in mc.items()
    }

    ci = 0.5
    if n > 2:
        cov = np.cov(X.T)
        eigvals = np.maximum(np.linalg.eigvalsh(cov), 0)
        tv = eigvals.sum()
        if tv > 1e-10:
            p = eigvals / tv
            pr = 1.0 / np.sum(p**2)
            ci = 1.0 - (pr - 1) / (X.shape[1] - 1)

    h = _entropy(dict(fam_acc))
    er = err_count / n if n else 0
    regime = _classify_regime(drift, step_sz, h, er, ci, n)

    return Dynamics(
        n_steps=n,
        embeddings=X,
        drift=drift,
        step_size=step_sz,
        entropy_series=entropy_s,
        error_rate_series=error_s,
        tokens_series=tokens,
        latency_series=latencies,
        tool_sequence=families,
        markov=markov,
        family_dist=fam_dist,
        regime=regime,
        mean_drift=float(np.mean(drift)) if n else 0,
        mean_step_size=float(np.mean(step_sz)) if n else 0,
        tool_entropy=h,
        error_rate=er,
        constraint_index=ci,
        bigram_transitions=_compute_bigram_transitions(families),
        memory_depth=_conditional_mi(families),
    )
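The per-run constraint index above maps the participation ratio of the embedding covariance onto a 0 to 1 scale: 1 when all variance sits on one axis, 0 when it is spread evenly over all `d` dimensions. The sketch below isolates that calculation. `run_constraint_index` and the synthetic inputs are illustrative assumptions; the 0.5 fallback mirrors the default used in `compute_dynamics` when the covariance is degenerate.

```python
import numpy as np


def run_constraint_index(X: np.ndarray) -> float:
    """1 - (PR - 1) / (d - 1), where PR is the covariance participation ratio."""
    cov = np.cov(X.T)
    eigvals = np.maximum(np.linalg.eigvalsh(cov), 0)  # clip numerical negatives
    total = eigvals.sum()
    if total <= 1e-10:
        return 0.5  # no variance to analyse, fall back to the neutral value
    p = eigvals / total
    pr = 1.0 / np.sum(p ** 2)  # effective number of active dimensions
    return float(1.0 - (pr - 1) / (X.shape[1] - 1))


rng = np.random.default_rng(0)
# Trajectory moving along a single direction: maximally constrained.
line = np.outer(np.linspace(0.0, 1.0, 50), np.ones(4))
# Isotropic cloud: variance spread over every dimension.
cloud = rng.normal(size=(500, 4))

ci_line = run_constraint_index(line)
ci_cloud = run_constraint_index(cloud)
```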


def _classify_regime(drift, step_sz, entropy, error_rate, ci, n) -> Regime:
    if n < 3:
        return Regime.unknown
    if entropy < 0.5 or (error_rate > 0.6 and float(np.std(drift)) < 0.05):
        return Regime.trapped
    q = max(1, n // 4)
    late_drift_std = float(np.std(drift[-q:]))
    late_step_mean = float(np.mean(step_sz[-q:]))
    if late_drift_std < 0.1 and late_step_mean < 0.15 and error_rate < 0.2:
        return Regime.convergent
    if entropy > 1.5 and error_rate < 0.15 and ci < 0.8:
        return Regime.diffusive
    step_var = float(np.var(step_sz[1:])) if n > 1 else 0
    if entropy > 2.0 and step_var > 0.02:
        return Regime.chaotic
    if n > 6:
        ss = step_sz[1:]
        ss_c = ss - ss.mean()
        norm = np.dot(ss_c, ss_c)
        if norm > 1e-10:
            ac = np.correlate(ss_c, ss_c, mode="full")
            ac = ac[len(ac) // 2:] / norm
            if len(ac) > 5 and max(ac[2:6]) > 0.3:
                return Regime.limit_cycle
    return Regime.unknown
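The limit-cycle branch above relies on the normalised autocorrelation of step sizes at lags 2 to 5. A minimal sketch of that check in isolation follows; `max_short_lag_autocorr` and the two synthetic series are hypothetical, and the 0.3 cut-off mentioned in the comments is simply the threshold that appears in `_classify_regime`.

```python
import numpy as np


def max_short_lag_autocorr(step_sizes: np.ndarray) -> float:
    """Largest normalised autocorrelation over lags 2..5 (0.0 if undefined)."""
    centred = step_sizes - step_sizes.mean()
    norm = np.dot(centred, centred)
    if norm <= 1e-10:
        return 0.0
    ac = np.correlate(centred, centred, mode="full")
    ac = ac[len(ac) // 2:] / norm  # keep non-negative lags, lag 0 == 1.0
    return float(max(ac[2:6])) if len(ac) > 5 else 0.0


# Step sizes alternating between small and large moves (period 2).
periodic = max_short_lag_autocorr(np.tile([0.1, 0.5], 10))
# Uncorrelated step sizes should stay well below the 0.3 threshold.
noise = max_short_lag_autocorr(np.random.default_rng(0).normal(size=400))
```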


# ── Sensitivity ────────────────────────────────────────────────────────


def compute_sensitivity(
    run_a: TaskRunResult,
    run_b: TaskRunResult,
    task_id: str = "",
) -> Sensitivity:
    """Compare two runs of the same task for prompt sensitivity."""
    Xa, fam_a, *_ = _embed_transcript(run_a.transcript)
    Xb, fam_b, *_ = _embed_transcript(run_b.transcript)

    min_n = min(len(Xa), len(Xb))
    emb_div = (
        np.array([_cosine_dist(Xa[i], Xb[i]) for i in range(min_n)])
        if min_n else np.array([])
    )

    lyap = 0.0
    if min_n > 1:
        d0 = max(_cosine_dist(Xa[0], Xb[0]), 1e-6)
        lyap = sum(
            math.log(max(emb_div[t], 1e-6) / d0) / t for t in range(1, min_n)
        ) / (min_n - 1)

    return Sensitivity(
        task_id=task_id or run_a.task_id,
        score_delta=abs(run_a.run_score - run_b.run_score),
        tool_edit_distance=_levenshtein(fam_a, fam_b),
        family_js_divergence=_js_divergence(dict(Counter(fam_a)), dict(Counter(fam_b))),
        embedding_divergence=emb_div,
        lyapunov_proxy=lyap,
    )
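The `lyapunov_proxy` above averages the per-step log growth rate `log(d_t / d_0) / t` of the divergence between two runs. A standalone sketch, assuming the divergence curve is already available as a list of floats (`lyapunov_proxy` here is an illustrative free function, not the module's API):

```python
import math


def lyapunov_proxy(divergence: list[float], eps: float = 1e-6) -> float:
    """Mean of log(d_t / d_0) / t over t >= 1; positive means runs drift apart."""
    if len(divergence) < 2:
        return 0.0
    d0 = max(divergence[0], eps)  # floor avoids log(0) and division by zero
    rates = [
        math.log(max(d, eps) / d0) / t
        for t, d in enumerate(divergence)
        if t > 0
    ]
    return sum(rates) / len(rates)


# Divergence doubling each step gives a rate of ln(2) per step.
growing = lyapunov_proxy([0.01 * 2 ** t for t in range(6)])
# Divergence halving each step gives -ln(2) per step.
shrinking = lyapunov_proxy([0.32 / 2 ** t for t in range(6)])
```

Since both runs usually start from the same prompt, `d_0` is often at or near the floor, so the sign of the proxy is probably more informative than its magnitude.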


# ── Survival analysis ─────────────────────────────────────────────────


def kaplan_meier(
    event_times: list[float],
    censored: list[bool] | None = None,
) -> list[SurvivalPoint]:
    """Kaplan-Meier survival estimator."""
    n = len(event_times)
    if n == 0:
        return []
    if censored is None:
        censored = [False] * n
    pairs = sorted(zip(event_times, censored))
    pts = [SurvivalPoint(0.0, 1.0)]
    at_risk = n
    surv = 1.0
    for t, cens in pairs:
        if cens:
            at_risk -= 1
            continue
        if at_risk > 0:
            surv *= (at_risk - 1) / at_risk
        at_risk -= 1
        pts.append(SurvivalPoint(t, surv))
    return pts
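A small worked example can make the estimator's handling of censored runs concrete. The sketch below re-implements the same product-limit update on plain tuples; `km_curve` and the sample data are assumptions for illustration. A censored run leaves the risk set without reducing the survival estimate.

```python
def km_curve(times: list[float], censored: list[bool]) -> list[tuple[float, float]]:
    """Product-limit survival curve as (time, survival) pairs."""
    pairs = sorted(zip(times, censored))  # at equal times, events sort before censored
    at_risk, surv = len(pairs), 1.0
    curve = [(0.0, 1.0)]
    for t, is_censored in pairs:
        if not is_censored:
            surv *= (at_risk - 1) / at_risk  # one event among `at_risk` runs
            curve.append((t, surv))
        at_risk -= 1  # the run leaves the risk set either way
    return curve


# Four runs: events at steps 2, 3 and 5, plus one run censored at step 3.
curve = km_curve([2, 3, 3, 5], [False, True, False, False])
# Survival drops to 3/4 at t=2, to 1/2 at t=3, and to 0 at t=5.
```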


def find_event_step(transcript: Transcript, event: str) -> float | None:
    """Return step index of the first occurrence of *event*, or None."""
    msgs = transcript.assistant_messages
    if event == "first_error_recovery":
        in_err = False
        for i, m in enumerate(msgs):
            any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
            if any_err:
                in_err = True
            elif in_err:
                return float(i)
    elif event == "first_correct_write":
        for i, m in enumerate(msgs):
            for tc in m.tool_calls:
                fam = tc.family or _classify_tool(tc.name)
                if fam == "edit" and tc.success is not False and not tc.error:
                    return float(i)
    elif event == "task_completion":
        if msgs:
            last = msgs[-1]
            if not any(tc.success is False or tc.error for tc in last.tool_calls):
                return float(len(msgs) - 1)
    elif event == "failure_absorption":
        err_seen = False
        for i, m in enumerate(msgs):
            any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
            if any_err:
                err_seen = True
            elif err_seen and m.tool_calls:
                return float(i)
    return None


# ── PCA trajectory bundles ─────────────────────────────────────────────


def compute_pca_bundle(
    dynamics_list: list[Dynamics],
) -> tuple[np.ndarray, list[np.ndarray]]:
    """Fit PCA on pooled embeddings, project each trajectory into PC1-PC2."""
    non_empty = [d.embeddings for d in dynamics_list if d.n_steps > 0]
    if not non_empty:
        for d in dynamics_list:
            d.pca_trajectory = np.empty((0, 2))
        return np.zeros((2, _N_FAM + 4)), []
    all_emb = np.vstack(non_empty)
    mean = all_emb.mean(axis=0)
    centred = all_emb - mean
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    components = Vt[:2]

    projections: list[np.ndarray] = []
    for d in dynamics_list:
        proj = (d.embeddings - mean) @ components.T if d.n_steps else np.empty((0, 2))
        d.pca_trajectory = proj
        projections.append(proj)
    return components, projections
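The PCA step above uses the SVD of the centred, pooled embeddings; the first rows of `Vt` are the principal axes. A minimal sketch of that projection on synthetic data follows, where `pca_project` and the per-axis scales are assumptions chosen so that two directions dominate.

```python
import numpy as np


def pca_project(points: np.ndarray, k: int = 2) -> tuple[np.ndarray, np.ndarray]:
    """Return the top-k principal axes and the points projected onto them."""
    mean = points.mean(axis=0)
    centred = points - mean
    # Rows of vt are orthonormal directions sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:k]
    return components, centred @ components.T


rng = np.random.default_rng(1)
# 200 synthetic 10-dimensional steps with most variance on the first two axes.
scales = np.array([5.0, 3.0] + [0.1] * 8)
points = rng.normal(size=(200, 10)) * scales

components, projected = pca_project(points)
```

Fitting once on the pooled embeddings, as the module does, keeps every trajectory in the same PC1-PC2 coordinate frame so bundles from different runs can be overlaid.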


# ── Stratified assessment with Bayesian reweighting ───────────────────


@dataclass
class StratumStats:
    """Distributional statistics for one stratum of runs."""

    name: str
    n_runs: int
    weight: float

    # Score distribution
    scores: np.ndarray
    score_mean: float
    score_std: float
    score_quantiles: dict[str, float]  # q10, q25, q50, q75, q90

    # Dynamics distributions
    entropy_dist: np.ndarray
    error_rate_dist: np.ndarray
    constraint_dist: np.ndarray
    memory_depth_dist: np.ndarray
    mean_drift_dist: np.ndarray
    mean_step_size_dist: np.ndarray

    # Time-series curves (aligned by step index)
    drift_curve_mean: np.ndarray
    drift_curve_std: np.ndarray
    step_curve_mean: np.ndarray
    step_curve_std: np.ndarray

    regime_counts: dict[str, int]
    sensitivity_deltas: np.ndarray


# Scalar fields on StratumStats that reweight() aggregates.
_REWEIGHT_FIELDS = [
    ("entropy", "entropy_dist"),
    ("error_rate", "error_rate_dist"),
    ("constraint", "constraint_dist"),
    ("memory_depth", "memory_depth_dist"),
    ("mean_drift", "mean_drift_dist"),
    ("mean_step_size", "mean_step_size_dist"),
]


@dataclass
class StratifiedAssessment:
    """Full stratified assessment with Bayesian reweighting.

    Call ``reweight(target_weights)`` with a different task distribution
    to obtain importance-weighted aggregate estimates.
    """

    strata: list[StratumStats]
    stratifier_name: str
    total_runs: int
    observed_mean_score: float
    observed_std_score: float

    def stratum_names(self) -> list[str]:
        return [s.name for s in self.strata]

    def reweight(self, target_weights: dict[str, float]) -> dict[str, float]:
        """Bayesian importance-weight correction.

        w_k = p_target(k) / p_observed(k), then normalised.
        """
        t_total = sum(target_weights.values()) or 1.0
        p_target = {k: v / t_total for k, v in target_weights.items()}
        by_name = {s.name: s for s in self.strata}

        weights = {
            name: pt / by_name[name].weight
            for name, pt in p_target.items()
            if name in by_name and by_name[name].weight > 1e-12
        }
        if not weights:
            return {"score_mean": self.observed_mean_score,
                    "score_std": self.observed_std_score}

        w_total = sum(weights.values())
        w = {k: v / w_total for k, v in weights.items()}

        # Reweight score (mean + law-of-total-variance)
        score_mu = sum(w[k] * by_name[k].score_mean for k in w)
        score_var = sum(
            w[k] * (by_name[k].score_std ** 2 + (by_name[k].score_mean - score_mu) ** 2)
            for k in w
        )
        result = {"score_mean": score_mu, "score_std": math.sqrt(max(score_var, 0.0))}

        def _safe_mean(arr: np.ndarray) -> float:
            return float(np.mean(arr)) if len(arr) > 0 else 0.0

        for label, dist_attr in _REWEIGHT_FIELDS:
            result[f"{label}_mean"] = sum(
                w[k] * _safe_mean(getattr(by_name[k], dist_attr)) for k in w
            )
        return result
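The score part of `reweight` can be followed with a two-stratum example. The sketch below is a simplified stand-in that takes each stratum as an `(observed_weight, mean, std)` tuple; `reweighted_score` and the tier numbers are hypothetical. The variance line is the law of total variance: within-stratum variance plus the spread of stratum means around the reweighted mean.

```python
import math


def reweighted_score(
    strata: dict[str, tuple[float, float, float]],
    target: dict[str, float],
) -> tuple[float, float]:
    """Reweighted (mean, std) of scores; strata maps name -> (weight, mean, std)."""
    total = sum(target.values()) or 1.0
    # Importance ratio p_target(k) / p_observed(k), skipping unseen strata.
    raw = {
        k: (v / total) / strata[k][0]
        for k, v in target.items()
        if k in strata and strata[k][0] > 1e-12
    }
    norm = sum(raw.values())
    w = {k: v / norm for k, v in raw.items()}
    mu = sum(w[k] * strata[k][1] for k in w)
    # Within-stratum variance plus between-stratum spread around mu.
    var = sum(w[k] * (strata[k][2] ** 2 + (strata[k][1] - mu) ** 2) for k in w)
    return mu, math.sqrt(max(var, 0.0))


# Observed runs are split evenly, but the target workload is 25% / 75%.
observed = {"tier1": (0.5, 0.9, 0.0), "tier2": (0.5, 0.3, 0.0)}
mean, std = reweighted_score(observed, {"tier1": 1.0, "tier2": 3.0})
```

With equal observed weights the normalised weights coincide with the target proportions. When observed weights differ, dividing by them tilts the result further toward strata that were rare in the data, which is worth keeping in mind when interpreting the output.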


def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
    """Mean and std of variable-length arrays aligned at step 0."""
    if not arrays:
        return np.array([]), np.array([])
    max_len = max(len(a) for a in arrays)
    mat = np.full((len(arrays), max_len), np.nan)
    for i, a in enumerate(arrays):
        mat[i, :len(a)] = a
    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
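The NaN-padding trick above lets runs of different lengths share one matrix, with each step averaged only over the runs that reached it. A brief sketch on two toy drift curves, where `aligned_mean_std` and the inputs are illustrative:

```python
import numpy as np


def aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
    """Per-step mean and std over variable-length curves, ignoring missing steps."""
    if not arrays or max(len(a) for a in arrays) == 0:
        return np.array([]), np.array([])
    max_len = max(len(a) for a in arrays)
    mat = np.full((len(arrays), max_len), np.nan)  # NaN marks "run already ended"
    for row, curve in enumerate(arrays):
        mat[row, :len(curve)] = curve
    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)


# A three-step run and a two-step run; step 2 is averaged over one run only.
mean, std = aligned_mean_std([np.array([0.0, 0.2, 0.4]), np.array([0.0, 0.4])])
```

Late steps are therefore estimated from fewer runs, so the tail of the averaged curve is noisier than its head.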


def build_strata(
    runs: list[TaskRunResult],
    dynamics_list: list[Dynamics],
    scores: list[float],
    stratifier: Callable[[TaskRunResult, Dynamics], str],
    stratifier_name: str = "custom",
    sensitivities: list[Sensitivity] | None = None,
) -> StratifiedAssessment:
    """Group runs into strata and compute per-stratum distributions."""
    assert len(runs) == len(dynamics_list) == len(scores)

    groups: dict[str, list[int]] = {}
    for idx, (r, d) in enumerate(zip(runs, dynamics_list)):
        groups.setdefault(stratifier(r, d), []).append(idx)

    total = len(runs)
    all_scores = np.array(scores)

    sens_by_task: dict[str, list[Sensitivity]] = {}
    if sensitivities:
        for s in sensitivities:
            sens_by_task.setdefault(s.task_id, []).append(s)

    strata: list[StratumStats] = []
    for name, idxs in sorted(groups.items()):
        n = len(idxs)
        sc = np.array([scores[i] for i in idxs])
        dyns = [dynamics_list[i] for i in idxs]

        qs = {f"q{q}": float(np.percentile(sc, q)) if n else 0.0
              for q in (10, 25, 50, 75, 90)}

        drift_m, drift_s = _aligned_mean_std([d.drift for d in dyns])
        step_m, step_s = _aligned_mean_std([d.step_size for d in dyns])

        stratum_tasks = {runs[i].task_id for i in idxs}
        sens_deltas = [
            s.score_delta
            for tid in stratum_tasks
            for s in sens_by_task.get(tid, [])
        ]

        strata.append(StratumStats(
            name=name, n_runs=n, weight=n / total if total else 0.0,
            scores=sc,
            score_mean=float(np.mean(sc)) if n else 0.0,
            score_std=float(np.std(sc)) if n else 0.0,
            score_quantiles=qs,
            entropy_dist=np.array([d.tool_entropy for d in dyns]),
            error_rate_dist=np.array([d.error_rate for d in dyns]),
            constraint_dist=np.array([d.constraint_index for d in dyns]),
            memory_depth_dist=np.array([d.memory_depth for d in dyns]),
            mean_drift_dist=np.array([d.mean_drift for d in dyns]),
            mean_step_size_dist=np.array([d.mean_step_size for d in dyns]),
            drift_curve_mean=drift_m, drift_curve_std=drift_s,
            step_curve_mean=step_m, step_curve_std=step_s,
            regime_counts=dict(Counter(d.regime.value for d in dyns)),
            sensitivity_deltas=np.array(sens_deltas) if sens_deltas else np.array([]),
        ))

    return StratifiedAssessment(
        strata=strata,
        stratifier_name=stratifier_name,
        total_runs=total,
        observed_mean_score=float(np.mean(all_scores)) if total else 0.0,
        observed_std_score=float(np.std(all_scores)) if total else 0.0,
    )


# ── Built-in stratifiers ──────────────────────────────────────────────


def stratify_by_regime(run: TaskRunResult, dyn: Dynamics) -> str:
    return dyn.regime.value


def stratify_by_task(run: TaskRunResult, dyn: Dynamics) -> str:
    return run.task_id


def stratify_by_tier(run: TaskRunResult, dyn: Dynamics) -> str:
    tid = run.task_id.lower()
    for i in range(1, 6):
        if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
            return f"tier{i}"
    return "unknown"


def stratify_by_tool_mix(run: TaskRunResult, dyn: Dynamics) -> str:
    if not dyn.family_dist:
        return "unknown"
    return max(dyn.family_dist, key=dyn.family_dist.get)


def stratify_by_prompt_style(run: TaskRunResult, dyn: Dynamics) -> str:
    user_msgs = [m for m in run.transcript.messages if m.role == "user"]
    if not user_msgs:
        return "unknown"
    wc = len(user_msgs[0].text.split())
    return "terse" if wc <= 6 else ("medium" if wc <= 15 else "verbose")


def stratify_by_scenario(run: TaskRunResult, dyn: Dynamics) -> str:
    return run.scenario or "unknown"


def stratify_by_family(run: TaskRunResult, dyn: Dynamics) -> str:
    return run.family or "unknown"
493
clawbench/dynamics_archive.py
Normal file
@ -0,0 +1,493 @@
"""Offline dynamics analysis helpers for cached ClawBench runs."""

from __future__ import annotations

import json
from itertools import combinations
from pathlib import Path
from typing import Iterable

import numpy as np

from clawbench.dynamics import (
    build_strata,
    compute_dynamics,
    compute_pca_bundle,
    compute_sensitivity,
    find_event_step,
    kaplan_meier,
    stratify_by_regime,
    stratify_by_scenario,
    stratify_by_tier,
    stratify_by_tool_mix,
)
from clawbench.dynamics_plots import generate_all_plots
from clawbench.schemas import TaskRunResult

_TIER_PREFIXES = {
    "tier1": ("t1-", "t1_"),
    "tier2": ("t2-", "t2_"),
    "tier3": ("t3-", "t3_"),
    "tier4": ("t4-", "t4_"),
    "tier5": ("t5-", "t5_"),
}


def safe_model_name(model: str) -> str:
    return model.replace("/", "_").replace(":", "_")


def _candidate_model_dir_names(model: str) -> set[str]:
    return {
        model,
        safe_model_name(model),
        model.replace("/", "_"),
        model.replace("/", "-").replace(":", "-"),
    }


def _has_run_files(path: Path) -> bool:
    try:
        for child in path.iterdir():
            if child.is_file() and child.name.startswith("run") and child.suffix == ".json":
                return True
    except FileNotFoundError:
        return False
    return False


def _is_task_collection_root(path: Path) -> bool:
    try:
        for child in path.iterdir():
            if child.is_dir() and _has_run_files(child):
                return True
    except FileNotFoundError:
        return False
    return False


def _resolve_model_roots(archive_dir: Path, model: str | None) -> list[Path]:
    if _is_task_collection_root(archive_dir):
        if model is not None and archive_dir.name not in _candidate_model_dir_names(model):
            raise ValueError(
                f"Archive dir {archive_dir} does not match requested model {model}."
            )
        return [archive_dir]

    roots = [
        child
        for child in sorted(archive_dir.iterdir())
        if child.is_dir() and _is_task_collection_root(child)
    ]
    if model is not None:
        candidates = _candidate_model_dir_names(model)
        roots = [root for root in roots if root.name in candidates]
    elif len(roots) > 1:
        raise ValueError(
            "Archive root contains multiple model directories. Pass --model or point "
            "--archive-dir at a specific model directory."
        )
    return roots


def discover_model_roots(archive_dir: Path) -> dict[str, Path]:
    """Discover model directories inside an archive root.

    Returns a mapping of model directory name to its path. If archive_dir is
    itself a model cache root (contains task directories with run*.json), the
    mapping contains a single entry.
    """
    if not archive_dir.exists():
        raise ValueError(f"Archive dir does not exist: {archive_dir}")

    if _is_task_collection_root(archive_dir):
        return {archive_dir.name: archive_dir}

    roots = {
        child.name: child
        for child in sorted(archive_dir.iterdir())
        if child.is_dir() and _is_task_collection_root(child)
    }
    return roots


def _matches_tier(task_id: str, tier: str | None) -> bool:
    if tier is None:
        return True
    return task_id.lower().startswith(_TIER_PREFIXES[tier])


def load_task_runs_archive(
    archive_dir: Path,
    model: str | None = None,
    task_ids: Iterable[str] | None = None,
    tier: str | None = None,
) -> dict[str, list[TaskRunResult]]:
    """Load cached TaskRunResult objects from a run cache/archive directory."""
    task_filter = set(task_ids or [])
    task_runs: dict[str, list[TaskRunResult]] = {}

    if not archive_dir.exists():
        raise ValueError(f"Archive dir does not exist: {archive_dir}")

    roots = _resolve_model_roots(archive_dir, model)
    if not roots:
        return {}

    for root in roots:
        for task_dir in sorted(child for child in root.iterdir() if child.is_dir()):
            task_id = task_dir.name
            if task_filter and task_id not in task_filter:
                continue
            if not _matches_tier(task_id, tier):
                continue

            runs = []
            for run_file in sorted(task_dir.glob("run*.json")):
                try:
                    run = TaskRunResult.model_validate_json(
                        run_file.read_text(encoding="utf-8")
                    )
                except Exception:
                    continue
                runs.append(run)

            if runs:
                task_runs.setdefault(task_id, []).extend(runs)

    for task_id, runs in task_runs.items():
        runs.sort(key=lambda run: run.run_index)

    return task_runs


def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
    if not arrays:
        return np.array([]), np.array([])
    max_len = max(len(arr) for arr in arrays)
    if max_len == 0:
        return np.array([]), np.array([])
    mat = np.full((len(arrays), max_len), np.nan)
    for idx, arr in enumerate(arrays):
        mat[idx, :len(arr)] = arr
    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)


def _round_list(values: np.ndarray, digits: int = 4) -> list[float]:
    return [round(float(value), digits) for value in values.tolist()]


def _empty_sensitivity_summary() -> dict[str, object]:
    return {
        "n_pairs": 0,
        "mean_score_delta": 0.0,
        "mean_tool_edit_distance": 0.0,
        "mean_family_js_divergence": 0.0,
        "mean_lyapunov_proxy": 0.0,
        "mean_initial_divergence": 0.0,
        "mean_final_divergence": 0.0,
        "mean_contraction_delta": 0.0,
        "mean_contraction_ratio": 0.0,
        "fraction_converging_pairs": 0.0,
        "mean_divergence_curve": [],
        "std_divergence_curve": [],
        "pair_points": [],
    }


def _summarize_sensitivity_group(pairs: list) -> dict[str, object]:
    if not pairs:
        return _empty_sensitivity_summary()

    divergence_curves = [pair.embedding_divergence for pair in pairs if len(pair.embedding_divergence) > 0]
    curve_mean, curve_std = _aligned_mean_std(divergence_curves)

    pair_points = []
    for pair in pairs:
        if len(pair.embedding_divergence) > 0:
            initial_divergence = float(pair.embedding_divergence[0])
            final_divergence = float(pair.embedding_divergence[-1])
            contraction_delta = final_divergence - initial_divergence
            contraction_ratio = final_divergence / max(initial_divergence, 1e-6)
        else:
            initial_divergence = 0.0
            final_divergence = 0.0
            contraction_delta = 0.0
            contraction_ratio = 0.0
        pair_points.append(
            {
                "score_delta": round(float(pair.score_delta), 4),
                "tool_edit_distance": int(pair.tool_edit_distance),
                "family_js_divergence": round(float(pair.family_js_divergence), 4),
                "lyapunov_proxy": round(float(pair.lyapunov_proxy), 4),
                "initial_divergence": round(initial_divergence, 4),
                "final_divergence": round(final_divergence, 4),
                "contraction_delta": round(contraction_delta, 4),
                "contraction_ratio": round(contraction_ratio, 4),
            }
        )

    converging_pairs = sum(
        1 for point in pair_points if point["final_divergence"] < point["initial_divergence"]
    )

    return {
        "n_pairs": len(pairs),
        "mean_score_delta": round(float(np.mean([pair.score_delta for pair in pairs])), 4),
        "mean_tool_edit_distance": round(float(np.mean([pair.tool_edit_distance for pair in pairs])), 4),
        "mean_family_js_divergence": round(float(np.mean([pair.family_js_divergence for pair in pairs])), 4),
        "mean_lyapunov_proxy": round(float(np.mean([pair.lyapunov_proxy for pair in pairs])), 4),
        "mean_initial_divergence": round(float(np.mean([point["initial_divergence"] for point in pair_points])), 4),
        "mean_final_divergence": round(float(np.mean([point["final_divergence"] for point in pair_points])), 4),
        "mean_contraction_delta": round(float(np.mean([point["contraction_delta"] for point in pair_points])), 4),
        "mean_contraction_ratio": round(float(np.mean([point["contraction_ratio"] for point in pair_points])), 4),
        "fraction_converging_pairs": round(converging_pairs / len(pair_points), 4),
        "mean_divergence_curve": _round_list(curve_mean),
        "std_divergence_curve": _round_list(curve_std),
        "pair_points": pair_points,
    }
|
||||
|
||||
|
||||
def _build_sensitivity_sections(
    valid_runs_by_task: dict[str, list[TaskRunResult]],
) -> tuple[list, dict[str, object]]:
    same_task_pairs = []
    per_task: dict[str, object] = {}
    for task_id, runs in sorted(valid_runs_by_task.items()):
        if len(runs) < 2:
            continue
        task_pairs = [
            compute_sensitivity(run_a, run_b, task_id=task_id)
            for run_a, run_b in combinations(runs, 2)
        ]
        if task_pairs:
            same_task_pairs.extend(task_pairs)
            per_task[task_id] = _summarize_sensitivity_group(task_pairs)

    same_task_summary = _summarize_sensitivity_group(same_task_pairs)
    same_task_summary["per_task"] = per_task

    perturbation_pairs = []
    per_variant_group: dict[str, object] = {}
    runs_by_variant_group: dict[str, list[TaskRunResult]] = {}
    for runs in valid_runs_by_task.values():
        for run in runs:
            runs_by_variant_group.setdefault(run.variant_group or run.task_id, []).append(run)

    for variant_group, runs in sorted(runs_by_variant_group.items()):
        distinct_members = {
            (run.task_id, run.prompt_variant, run.variant_id)
            for run in runs
        }
        if len(distinct_members) < 2:
            continue

        group_pairs = []
        for run_a, run_b in combinations(runs, 2):
            if (
                run_a.task_id == run_b.task_id
                and run_a.prompt_variant == run_b.prompt_variant
                and run_a.variant_id == run_b.variant_id
            ):
                continue
            group_pairs.append(compute_sensitivity(run_a, run_b, task_id=variant_group))

        if not group_pairs:
            continue

        perturbation_pairs.extend(group_pairs)
        group_summary = _summarize_sensitivity_group(group_pairs)
        group_summary["members"] = [
            {
                "task_id": task_id,
                "prompt_variant": prompt_variant,
                "variant_id": variant_id,
            }
            for task_id, prompt_variant, variant_id in sorted(distinct_members)
        ]
        per_variant_group[variant_group] = group_summary

    perturbation_summary = _summarize_sensitivity_group(perturbation_pairs)
    perturbation_summary["per_variant_group"] = per_variant_group

    return same_task_pairs, {
        "same_task": same_task_summary,
        "prompt_perturbation": perturbation_summary,
    }
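The prompt-perturbation pass in `_build_sensitivity_sections` pairs runs inside a variant group only when the two runs come from different members, where a member is the `(task_id, prompt_variant, variant_id)` triple. A minimal sketch of that filter follows; `Run`, `member_key`, and `cross_member_pairs` are hypothetical names used only for illustration and are not part of the commit.

```python
from itertools import combinations
from typing import NamedTuple


class Run(NamedTuple):
    """Hypothetical stand-in for TaskRunResult; only the pairing keys are kept."""

    task_id: str
    prompt_variant: str
    variant_id: str
    run_index: int


def member_key(run: Run) -> tuple[str, str, str]:
    """Identity of the group member a run belongs to."""
    return (run.task_id, run.prompt_variant, run.variant_id)


def cross_member_pairs(runs: list[Run]) -> list[tuple[Run, Run]]:
    """Return every unordered pair of runs whose member keys differ."""
    return [
        (run_a, run_b)
        for run_a, run_b in combinations(runs, 2)
        if member_key(run_a) != member_key(run_b)
    ]


runs = [
    Run("t1_fix", "base", "v0", 0),
    Run("t1_fix", "base", "v0", 1),
    Run("t1_fix", "paraphrase", "v1", 0),
]
pairs = cross_member_pairs(runs)
print(len(pairs))  # 2: each base run is paired with the paraphrase run
```

Repeats of the same member are skipped here because they are already covered by the same-task pass, so the perturbation summary only reflects differences between prompt variants.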
|
||||
|
||||
|
||||
def build_dynamics_report(
    task_runs: dict[str, list[TaskRunResult]],
    include_pca: bool = True,
) -> tuple[dict, object, dict]:
    """Compute stratified dynamics report data from cached runs."""
    all_runs = [run for runs in task_runs.values() for run in runs]
    if not all_runs:
        raise ValueError("No cached runs were loaded.")

    dynamics_list = []
    scores = []
    valid_runs = []
    for run in all_runs:
        if not run.transcript.messages:
            continue
        dynamics_list.append(compute_dynamics(run.transcript))
        scores.append(run.run_score)
        valid_runs.append(run)

    if not valid_runs:
        raise ValueError("No runs with transcripts were found in the archive.")

    valid_runs_by_task: dict[str, list[TaskRunResult]] = {}
    for run in valid_runs:
        valid_runs_by_task.setdefault(run.task_id, []).append(run)

    same_task_sensitivities, sensitivity_summary = _build_sensitivity_sections(valid_runs_by_task)

    stratifiers = {
        "tier": stratify_by_tier,
        "regime": stratify_by_regime,
        "tool_mix": stratify_by_tool_mix,
        "scenario": stratify_by_scenario,
    }

    report: dict[str, object] = {
        "n_runs": len(valid_runs),
        "n_tasks": len(task_runs),
        "strata": {},
    }

    stratified = {}
    for name, fn in stratifiers.items():
        assessment = build_strata(
            valid_runs,
            dynamics_list,
            scores,
            fn,
            name,
            sensitivities=same_task_sensitivities,
        )
        stratified[name] = assessment
        strata_summary = []
        for stratum in assessment.strata:
            strata_summary.append(
                {
                    "name": stratum.name,
                    "n_runs": stratum.n_runs,
                    "weight": round(stratum.weight, 4),
                    "score_mean": round(stratum.score_mean, 4),
                    "score_std": round(stratum.score_std, 4),
                    "score_quantiles": {
                        key: round(value, 4)
                        for key, value in stratum.score_quantiles.items()
                    },
                    "entropy_mean": round(float(stratum.entropy_dist.mean()), 4)
                    if len(stratum.entropy_dist)
                    else 0.0,
                    "error_rate_mean": round(float(stratum.error_rate_dist.mean()), 4)
                    if len(stratum.error_rate_dist)
                    else 0.0,
                    "constraint_mean": round(float(stratum.constraint_dist.mean()), 4)
                    if len(stratum.constraint_dist)
                    else 0.0,
                    "memory_depth_mean": round(float(stratum.memory_depth_dist.mean()), 4)
                    if len(stratum.memory_depth_dist)
                    else 0.0,
                    "sensitivity_pairs": int(len(stratum.sensitivity_deltas)),
                    "sensitivity_mean_score_delta": round(float(stratum.sensitivity_deltas.mean()), 4)
                    if len(stratum.sensitivity_deltas)
                    else 0.0,
                    "regime_counts": stratum.regime_counts,
                }
            )
        report["strata"][name] = {
            "observed_mean_score": round(assessment.observed_mean_score, 4),
            "observed_std_score": round(assessment.observed_std_score, 4),
            "strata": strata_summary,
        }

    report["per_run"] = [
        {
            "task_id": run.task_id,
            "run_index": run.run_index,
            "score": round(run.run_score, 4),
            "regime": dynamics.regime.value,
            "entropy": round(dynamics.tool_entropy, 4),
            "error_rate": round(dynamics.error_rate, 4),
            "constraint_index": round(dynamics.constraint_index, 4),
            "memory_depth": round(dynamics.memory_depth, 4),
            "n_steps": dynamics.n_steps,
            "mean_drift": round(dynamics.mean_drift, 4),
            "mean_step_size": round(dynamics.mean_step_size, 4),
        }
        for run, dynamics in zip(valid_runs, dynamics_list)
    ]
    report["sensitivity"] = sensitivity_summary

    if include_pca:
        compute_pca_bundle(dynamics_list)

    # Time-to-event data: runs that never reach the event are right-censored
    # at their number of assistant messages.
    events = []
    censored = []
    for run in valid_runs:
        step = find_event_step(run.transcript, "first_correct_write")
        if step is not None:
            events.append(step)
            censored.append(False)
        else:
            events.append(float(len(run.transcript.assistant_messages)))
            censored.append(True)
    km_points = kaplan_meier(events, censored)
    # Returns (report, plotting callable, plot inputs); the caller unpacks all three.
    return report, generate_all_plots, {
        "valid_runs": valid_runs,
        "dynamics_list": dynamics_list,
        "stratified": stratified,
        "km_points": km_points,
        "sensitivity": sensitivity_summary,
    }
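The `kaplan_meier(events, censored)` helper called in `build_dynamics_report` is not shown in this diff. As a rough illustration only, and assuming it follows the standard product-limit estimator, a stdlib-only sketch could look like this. The name `kaplan_meier_sketch` and the tuple return type are assumptions made here; the plotting code expects `SurvivalPoint` objects with `time` and `survival` fields instead.

```python
def kaplan_meier_sketch(
    times: list[float],
    censored: list[bool],
) -> list[tuple[float, float]]:
    """Product-limit estimate of S(t), reported at each time with an observed event.

    times[i] is the step at which run i either hit the event or stopped;
    censored[i] is True when the run stopped without the event occurring.
    """
    if len(times) != len(censored):
        raise ValueError("times and censored must have the same length")

    at_risk = len(times)
    survival = 1.0
    points: list[tuple[float, float]] = []
    for t in sorted(set(times)):
        flags_at_t = [c for ti, c in zip(times, censored) if ti == t]
        n_events = sum(1 for c in flags_at_t if not c)
        if n_events:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - n_events / at_risk
            points.append((t, survival))
        # Runs with an event and censored runs both leave the risk set.
        at_risk -= len(flags_at_t)
    return points


# Four runs: events at steps 2 and 3, censored at steps 3 and 5.
points = kaplan_meier_sketch([2.0, 3.0, 3.0, 5.0], [False, False, True, True])
for time, surv in points:
    print(f"t={time:g} S={surv:.3f}")
```

With `first_correct_write` as the event, "survival" at step t reads as the estimated fraction of runs that have not yet produced a correct write by that step, so a faster drop means earlier success.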
|
||||
|
||||
|
||||
def write_dynamics_report(
    task_runs: dict[str, list[TaskRunResult]],
    out_dir: Path,
    report_name: str = "dynamics.json",
    generate_plots: bool = True,
) -> tuple[Path, list[Path]]:
    """Write the dynamics report JSON and plots to an output directory."""
    report, plotter, plot_data = build_dynamics_report(task_runs, include_pca=generate_plots)
    out_dir.mkdir(parents=True, exist_ok=True)

    report_path = out_dir / report_name
    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")

    plots: list[Path] = []
    if generate_plots:
        plots = plotter(
            plot_data["dynamics_list"],
            plot_data["valid_runs"],
            plot_data["stratified"],
            km_points=plot_data["km_points"],
            event_name="first_correct_write",
            out_dir=out_dir,
            sensitivity_summary=plot_data["sensitivity"],
        )
    return report_path, plots
|
||||
|
||||
|
||||
def load_task_runs_by_model(
    archive_dir: Path,
    tier: str | None = None,
    task_ids: Iterable[str] | None = None,
) -> dict[str, dict[str, list[TaskRunResult]]]:
    """Load cached TaskRunResult objects grouped by model directory name."""
    grouped: dict[str, dict[str, list[TaskRunResult]]] = {}
    for model_name, model_dir in discover_model_roots(archive_dir).items():
        task_runs = load_task_runs_archive(
            archive_dir=model_dir,
            model=None,
            task_ids=task_ids,
            tier=tier,
        )
        if task_runs:
            grouped[model_name] = task_runs
    return grouped
|
||||
411
clawbench/dynamics_plots.py
Normal file
@ -0,0 +1,411 @@
|
||||
"""Plotting utilities for dynamics analysis.
|
||||
|
||||
Generates publication-ready figures from dynamics data and saves to a
|
||||
results directory. All plots use matplotlib with the Agg backend so they
|
||||
work headlessly.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
from clawbench.dynamics import (
|
||||
Dynamics,
|
||||
StratifiedAssessment,
|
||||
StratumStats,
|
||||
SurvivalPoint,
|
||||
)
|
||||
|
||||
|
||||
def _savefig(fig: plt.Figure, path: Path) -> None:
|
||||
fig.savefig(path, dpi=150, bbox_inches="tight")
|
||||
plt.close(fig)
|
||||
|
||||
|
||||
def _plot_series_curves(
|
||||
dynamics_list: list[Dynamics],
|
||||
labels: list[str],
|
||||
out_path: Path,
|
||||
*,
|
||||
series_attr: str,
|
||||
ylabel: str,
|
||||
title: str,
|
||||
) -> None:
|
||||
"""Plot a step-aligned per-run series coloured by label."""
|
||||
fig, ax = plt.subplots(figsize=(10, 5))
|
||||
cmap = plt.cm.tab10
|
||||
unique = sorted(set(labels))
|
||||
colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
|
||||
|
||||
for d, lbl in zip(dynamics_list, labels):
|
||||
series = np.asarray(getattr(d, series_attr), dtype=float)
|
||||
if len(series) < 2:
|
||||
continue
|
||||
ax.plot(series, alpha=0.6, color=colour_map[lbl], linewidth=1)
|
||||
|
||||
for lbl in unique:
|
||||
ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
|
||||
ax.legend(fontsize=8, loc="upper left")
|
||||
ax.set_xlabel("Step")
|
||||
ax.set_ylabel(ylabel)
|
||||
ax.set_title(title)
|
||||
_savefig(fig, out_path)
|
||||
|
||||
|
||||
def plot_drift_curves(
|
||||
dynamics_list: list[Dynamics],
|
||||
labels: list[str],
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""Drift-from-origin curves coloured by label (e.g. task_id or regime)."""
|
||||
_plot_series_curves(
|
||||
dynamics_list,
|
||||
labels,
|
||||
out_path,
|
||||
series_attr="drift",
|
||||
ylabel="Cosine distance from step 0",
|
||||
title="Drift from Origin",
|
||||
)
|
||||
|
||||
|
||||
def plot_step_size_curves(
|
||||
dynamics_list: list[Dynamics],
|
||||
labels: list[str],
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""Step-to-step movement curves coloured by label."""
|
||||
_plot_series_curves(
|
||||
dynamics_list,
|
||||
labels,
|
||||
out_path,
|
||||
series_attr="step_size",
|
||||
ylabel="Cosine distance from previous step",
|
||||
title="Step-to-Step Movement",
|
||||
)
|
||||
|
||||
|
||||
def plot_pca_trajectories(
|
||||
dynamics_list: list[Dynamics],
|
||||
labels: list[str],
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""PCA phase portraits (PC1 vs PC2) coloured by label."""
|
||||
fig, ax = plt.subplots(figsize=(8, 8))
|
||||
cmap = plt.cm.tab10
|
||||
unique = sorted(set(labels))
|
||||
colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
|
||||
|
||||
for d, lbl in zip(dynamics_list, labels):
|
||||
if d.pca_trajectory is None or len(d.pca_trajectory) < 2:
|
||||
continue
|
||||
traj = d.pca_trajectory
|
||||
ax.plot(traj[:, 0], traj[:, 1], alpha=0.5, color=colour_map[lbl], linewidth=1)
|
||||
ax.scatter(traj[0, 0], traj[0, 1], color=colour_map[lbl], marker="o", s=30, zorder=5)
|
||||
ax.scatter(traj[-1, 0], traj[-1, 1], color=colour_map[lbl], marker="x", s=30, zorder=5)
|
||||
|
||||
for lbl in unique:
|
||||
ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
|
||||
ax.legend(fontsize=8)
|
||||
ax.set_xlabel("PC1")
|
||||
ax.set_ylabel("PC2")
|
||||
ax.set_title("PCA Phase Portrait (o=start, x=end)")
|
||||
_savefig(fig, out_path)
|
||||
|
||||
|
||||
def plot_regime_distribution(
|
||||
strata: list[StratumStats],
|
||||
stratifier_name: str,
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""Stacked bar chart of regime counts per stratum."""
|
||||
fig, ax = plt.subplots(figsize=(10, 5))
|
||||
all_regimes = sorted({r for s in strata for r in s.regime_counts})
|
||||
x = np.arange(len(strata))
|
||||
bottom = np.zeros(len(strata))
|
||||
cmap = plt.cm.Set2
|
||||
|
||||
for j, regime in enumerate(all_regimes):
|
||||
counts = [s.regime_counts.get(regime, 0) for s in strata]
|
||||
ax.bar(x, counts, bottom=bottom, label=regime, color=cmap(j / max(len(all_regimes) - 1, 1)))
|
||||
bottom += np.array(counts)
|
||||
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels([s.name for s in strata], rotation=30, ha="right")
|
||||
ax.set_ylabel("Count")
|
||||
ax.set_title(f"Regime Distribution by {stratifier_name}")
|
||||
ax.legend(fontsize=8)
|
||||
_savefig(fig, out_path)
|
||||
|
||||
|
||||
def plot_score_distributions(
|
||||
strata: list[StratumStats],
|
||||
stratifier_name: str,
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""Box plots of score distributions per stratum."""
|
||||
fig, ax = plt.subplots(figsize=(10, 5))
|
||||
data = [s.scores for s in strata if len(s.scores) > 0]
|
||||
labels = [s.name for s in strata if len(s.scores) > 0]
|
||||
|
||||
if data:
|
||||
ax.boxplot(data, labels=labels, patch_artist=True,
|
||||
boxprops=dict(facecolor="lightblue", alpha=0.7))
|
||||
ax.set_ylabel("Score")
|
||||
ax.set_title(f"Score Distribution by {stratifier_name}")
|
||||
plt.xticks(rotation=30, ha="right")
|
||||
_savefig(fig, out_path)
|
||||
|
||||
|
||||
def plot_survival_curve(
|
||||
km_points: list[SurvivalPoint],
|
||||
event_name: str,
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""Kaplan-Meier survival curve."""
|
||||
if not km_points:
|
||||
return
|
||||
fig, ax = plt.subplots(figsize=(8, 5))
|
||||
times = [p.time for p in km_points]
|
||||
surv = [p.survival for p in km_points]
|
||||
ax.step(times, surv, where="post", linewidth=2, color="steelblue")
|
||||
ax.fill_between(times, surv, step="post", alpha=0.15, color="steelblue")
|
||||
ax.set_xlabel("Step")
|
||||
ax.set_ylabel("Survival probability")
|
||||
ax.set_title(f"Kaplan-Meier: {event_name}")
|
||||
ax.set_ylim(-0.05, 1.05)
|
||||
_savefig(fig, out_path)
|
||||
|
||||
|
||||
def plot_stratum_dynamics_heatmap(
|
||||
strata: list[StratumStats],
|
||||
stratifier_name: str,
|
||||
out_path: Path,
|
||||
) -> None:
|
||||
"""Heatmap of mean dynamics metrics across strata."""
|
||||
metrics = ["entropy", "error_rate", "constraint", "memory_depth", "mean_drift", "mean_step_size"]
|
||||
data = np.zeros((len(strata), len(metrics)))
|
||||
for i, s in enumerate(strata):
|
||||
arrays = [s.entropy_dist, s.error_rate_dist, s.constraint_dist,
|
||||
s.memory_depth_dist, s.mean_drift_dist, s.mean_step_size_dist]
|
||||
for j, arr in enumerate(arrays):
|
||||
data[i, j] = float(np.mean(arr)) if len(arr) > 0 else 0.0
|
||||
|
||||
fig, ax = plt.subplots(figsize=(10, max(3, len(strata) * 0.6)))
|
||||
im = ax.imshow(data, aspect="auto", cmap="YlOrRd")
|
||||
ax.set_xticks(range(len(metrics)))
|
||||
ax.set_xticklabels(metrics, rotation=30, ha="right")
|
||||
ax.set_yticks(range(len(strata)))
|
||||
ax.set_yticklabels([s.name for s in strata])
|
||||
for i in range(len(strata)):
|
||||
for j in range(len(metrics)):
|
||||
ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", fontsize=8)
|
||||
fig.colorbar(im, ax=ax, shrink=0.8)
|
||||
ax.set_title(f"Dynamics Metrics by {stratifier_name}")
|
||||
_savefig(fig, out_path)
|
||||
|
||||
|
||||
def plot_pairwise_divergence_curves(
|
||||
per_task_sensitivity: dict[str, dict],
|
||||
out_path: Path,
|
||||
) -> bool:
|
||||
"""Plot mean pairwise trajectory divergence over aligned steps."""
|
||||
if not per_task_sensitivity:
|
||||
return False
|
||||
|
||||
fig, ax = plt.subplots(figsize=(10, 5))
|
||||
cmap = plt.cm.tab10
|
||||
tasks = sorted(per_task_sensitivity)
|
||||
colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
|
||||
|
||||
plotted = False
|
||||
for task in tasks:
|
||||
summary = per_task_sensitivity[task]
|
||||
mean_curve = np.asarray(summary.get("mean_divergence_curve", []), dtype=float)
|
||||
std_curve = np.asarray(summary.get("std_divergence_curve", []), dtype=float)
|
||||
if len(mean_curve) == 0:
|
||||
continue
|
||||
steps = np.arange(len(mean_curve))
|
||||
ax.plot(steps, mean_curve, linewidth=2, color=colour_map[task], label=task)
|
||||
if len(std_curve) == len(mean_curve):
|
||||
ax.fill_between(steps, mean_curve - std_curve, mean_curve + std_curve, color=colour_map[task], alpha=0.12)
|
||||
plotted = True
|
||||
|
||||
if not plotted:
|
||||
plt.close(fig)
|
||||
return False
|
||||
|
||||
ax.set_xlabel("Aligned step")
|
||||
ax.set_ylabel("Pairwise embedding divergence")
|
||||
ax.set_title("Do Repeated Trajectories Converge or Diverge?")
|
||||
ax.legend(fontsize=8)
|
||||
_savefig(fig, out_path)
|
||||
return True
|
||||
|
||||
|
||||
def plot_pairwise_contraction_scatter(
|
||||
per_task_sensitivity: dict[str, dict],
|
||||
out_path: Path,
|
||||
) -> bool:
|
||||
"""Scatter initial vs final pairwise divergence; below diagonal means convergence."""
|
||||
if not per_task_sensitivity:
|
||||
return False
|
||||
|
||||
fig, ax = plt.subplots(figsize=(7, 6))
|
||||
cmap = plt.cm.tab10
|
||||
tasks = sorted(per_task_sensitivity)
|
||||
colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
|
||||
|
||||
max_seen = 0.0
|
||||
plotted = False
|
||||
for task in tasks:
|
||||
points = per_task_sensitivity[task].get("pair_points", [])
|
||||
if not points:
|
||||
continue
|
||||
xs = [point["initial_divergence"] for point in points]
|
||||
ys = [point["final_divergence"] for point in points]
|
||||
max_seen = max(max_seen, *(xs + ys))
|
||||
ax.scatter(xs, ys, s=60, alpha=0.8, color=colour_map[task], label=task)
|
||||
plotted = True
|
||||
|
||||
if not plotted:
|
||||
plt.close(fig)
|
||||
return False
|
||||
|
||||
limit = max(max_seen, 0.1)
|
||||
ax.plot([0, limit], [0, limit], linestyle="--", color="black", linewidth=1)
|
||||
ax.set_xlabel("Initial pairwise divergence")
|
||||
ax.set_ylabel("Final pairwise divergence")
|
||||
ax.set_title("Pairwise Trajectory Contraction")
|
||||
ax.legend(fontsize=8)
|
||||
_savefig(fig, out_path)
|
||||
return True
|
||||
|
||||
|
||||
def plot_sensitivity_heatmap(
|
||||
per_task_sensitivity: dict[str, dict],
|
||||
out_path: Path,
|
||||
) -> bool:
|
||||
"""Heatmap of per-task sensitivity metrics."""
|
||||
if not per_task_sensitivity:
|
||||
return False
|
||||
|
||||
metrics = [
|
||||
("mean_score_delta", "score_delta"),
|
||||
("mean_tool_edit_distance", "tool_edit"),
|
||||
("mean_family_js_divergence", "js_div"),
|
||||
("mean_lyapunov_proxy", "lyapunov"),
|
||||
("fraction_converging_pairs", "frac_converging"),
|
||||
]
|
||||
tasks = sorted(per_task_sensitivity)
|
||||
data = np.zeros((len(tasks), len(metrics)))
|
||||
for row_idx, task in enumerate(tasks):
|
||||
summary = per_task_sensitivity[task]
|
||||
for col_idx, (key, _label) in enumerate(metrics):
|
||||
data[row_idx, col_idx] = float(summary.get(key, 0.0))
|
||||
|
||||
fig, ax = plt.subplots(figsize=(9, max(3, len(tasks) * 0.7)))
|
||||
im = ax.imshow(data, aspect="auto", cmap="Blues")
|
||||
ax.set_xticks(range(len(metrics)))
|
||||
ax.set_xticklabels([label for _key, label in metrics], rotation=30, ha="right")
|
||||
ax.set_yticks(range(len(tasks)))
|
||||
ax.set_yticklabels(tasks)
|
||||
for row_idx in range(len(tasks)):
|
||||
for col_idx in range(len(metrics)):
|
||||
ax.text(col_idx, row_idx, f"{data[row_idx, col_idx]:.2f}", ha="center", va="center", fontsize=8)
|
||||
fig.colorbar(im, ax=ax, shrink=0.8)
|
||||
ax.set_title("Pairwise Sensitivity by Task")
|
||||
_savefig(fig, out_path)
|
||||
return True
|
||||
|
||||
|
||||
def generate_all_plots(
|
||||
dynamics_list: list[Dynamics],
|
||||
runs: list,
|
||||
stratified: dict[str, StratifiedAssessment],
|
||||
km_points: list[SurvivalPoint] | None = None,
|
||||
event_name: str = "first_correct_write",
|
||||
out_dir: Path = Path("results"),
|
||||
sensitivity_summary: dict[str, dict] | None = None,
|
||||
) -> list[Path]:
|
||||
"""Generate all dynamics plots and return list of saved paths."""
|
||||
out_dir.mkdir(parents=True, exist_ok=True)
|
||||
saved: list[Path] = []
|
||||
|
||||
# Labels by regime
|
||||
regime_labels = [d.regime.value for d in dynamics_list]
|
||||
tier_labels = []
|
||||
for r in runs:
|
||||
tid = r.task_id.lower()
|
||||
tier = "unknown"
|
||||
for i in range(1, 6):
|
||||
if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
|
||||
tier = f"tier{i}"
|
||||
break
|
||||
tier_labels.append(tier)
|
||||
|
||||
# Drift curves by regime
|
||||
p = out_dir / "drift_by_regime.png"
|
||||
plot_drift_curves(dynamics_list, regime_labels, p)
|
||||
saved.append(p)
|
||||
|
||||
# Drift curves by tier
|
||||
p = out_dir / "drift_by_tier.png"
|
||||
plot_drift_curves(dynamics_list, tier_labels, p)
|
||||
saved.append(p)
|
||||
|
||||
p = out_dir / "step_size_by_regime.png"
|
||||
plot_step_size_curves(dynamics_list, regime_labels, p)
|
||||
saved.append(p)
|
||||
|
||||
p = out_dir / "step_size_by_tier.png"
|
||||
plot_step_size_curves(dynamics_list, tier_labels, p)
|
||||
saved.append(p)
|
||||
|
||||
# PCA trajectories
|
||||
has_pca = any(d.pca_trajectory is not None for d in dynamics_list)
|
||||
if has_pca:
|
||||
p = out_dir / "pca_by_regime.png"
|
||||
plot_pca_trajectories(dynamics_list, regime_labels, p)
|
||||
saved.append(p)
|
||||
p = out_dir / "pca_by_tier.png"
|
||||
plot_pca_trajectories(dynamics_list, tier_labels, p)
|
||||
saved.append(p)
|
||||
|
||||
# Per-stratifier plots
|
||||
for name, sa in stratified.items():
|
||||
p = out_dir / f"regimes_by_{name}.png"
|
||||
plot_regime_distribution(sa.strata, name, p)
|
||||
saved.append(p)
|
||||
|
||||
p = out_dir / f"scores_by_{name}.png"
|
||||
plot_score_distributions(sa.strata, name, p)
|
||||
saved.append(p)
|
||||
|
||||
p = out_dir / f"dynamics_heatmap_{name}.png"
|
||||
plot_stratum_dynamics_heatmap(sa.strata, name, p)
|
||||
saved.append(p)
|
||||
|
||||
# Survival curve
|
||||
if km_points:
|
||||
p = out_dir / f"survival_{event_name}.png"
|
||||
plot_survival_curve(km_points, event_name, p)
|
||||
saved.append(p)
|
||||
|
||||
per_task_sensitivity = (sensitivity_summary or {}).get("same_task", {}).get("per_task", {})
|
||||
p = out_dir / "pairwise_divergence_by_task.png"
|
||||
if plot_pairwise_divergence_curves(per_task_sensitivity, p):
|
||||
saved.append(p)
|
||||
|
||||
p = out_dir / "pairwise_contraction_scatter.png"
|
||||
if plot_pairwise_contraction_scatter(per_task_sensitivity, p):
|
||||
saved.append(p)
|
||||
|
||||
p = out_dir / "sensitivity_heatmap.png"
|
||||
if plot_sensitivity_heatmap(per_task_sensitivity, p):
|
||||
saved.append(p)
|
||||
|
||||
return saved
|
||||
@ -103,6 +103,7 @@ class BenchmarkHarness:
|
||||
self.concurrency = max(1, int(concurrency))
|
||||
self.browser_concurrency = max(1, int(browser_concurrency))
|
||||
self.repo_root = Path(__file__).parent.parent
|
||||
self.last_task_runs: dict[str, list[TaskRunResult]] = {}
|
||||
|
||||
async def run(self) -> BenchmarkResult:
|
||||
tasks = load_all_tasks(
|
||||
@ -148,6 +149,7 @@ class BenchmarkHarness:
|
||||
f"({mean_run:.1f}s avg, concurrency={self.concurrency})[/dim]"
|
||||
)
|
||||
|
||||
self.last_task_runs = all_results
|
||||
return self._aggregate(tasks, all_results)
|
||||
|
||||
async def _execute_runs(
|
||||
|
||||
147
clawbench/submission_models.py
Normal file
@ -0,0 +1,147 @@
|
||||
"""Preset model catalog and selection helpers for the Space submit UI."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
|
||||
CUSTOM_PRESET_LABEL = "(custom)"
|
||||
|
||||
PRESET_AUDIENCE_ALL = "All Presets"
|
||||
PRESET_AUDIENCE_CLAW = "Claw Users"
|
||||
PRESET_AUDIENCE_BUDGET = "Budget Researchers"
|
||||
|
||||
PRESET_AUDIENCE_CHOICES = (
|
||||
PRESET_AUDIENCE_ALL,
|
||||
PRESET_AUDIENCE_CLAW,
|
||||
PRESET_AUDIENCE_BUDGET,
|
||||
)
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class PresetModel:
|
||||
label: str
|
||||
model_id: str
|
||||
provider: str
|
||||
audiences: tuple[str, ...]
|
||||
|
||||
|
||||
PRESET_MODELS = (
|
||||
PresetModel(
|
||||
label="GPT-OSS 20B (Ollama)",
|
||||
model_id="ollama/gpt-oss:20b",
|
||||
provider="ollama",
|
||||
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
|
||||
),
|
||||
PresetModel(
|
||||
label="Qwen 3.5 27B (Ollama)",
|
||||
model_id="ollama/qwen3.5:27b",
|
||||
provider="ollama",
|
||||
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
|
||||
),
|
||||
PresetModel(
|
||||
label="Qwen3 32B",
|
||||
model_id="huggingface/Qwen/Qwen3-32B",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
|
||||
),
|
||||
PresetModel(
|
||||
label="Gemma 4 26B MoE",
|
||||
model_id="huggingface/google/gemma-4-26B-A4B-it",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
|
||||
),
|
||||
PresetModel(
|
||||
label="GLM 5.1 (754B MoE)",
|
||||
model_id="huggingface/zai-org/GLM-5.1",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="GLM 5 (400B MoE)",
|
||||
model_id="huggingface/zai-org/GLM-5",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="DeepSeek R1",
|
||||
model_id="huggingface/deepseek-ai/DeepSeek-R1",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="Kimi K2 Instruct",
|
||||
model_id="huggingface/moonshotai/Kimi-K2-Instruct",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="MiniMax M2.5",
|
||||
model_id="huggingface/MiniMaxAI/MiniMax-M2.5",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="Llama 3.3 70B",
|
||||
model_id="huggingface/meta-llama/Llama-3.3-70B-Instruct",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="Llama 3.1 70B",
|
||||
model_id="huggingface/meta-llama/Llama-3.1-70B-Instruct",
|
||||
provider="huggingface",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="Claude Sonnet 4.6",
|
||||
model_id="anthropic/claude-sonnet-4-6",
|
||||
provider="anthropic",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
PresetModel(
|
||||
label="Claude Opus 4.6",
|
||||
model_id="anthropic/claude-opus-4-6",
|
||||
provider="anthropic",
|
||||
audiences=(PRESET_AUDIENCE_CLAW,),
|
||||
),
|
||||
)
|
||||
|
||||
PRESET_MODEL_MAP = {preset.label: preset.model_id for preset in PRESET_MODELS}
|
||||
_PRESET_BY_LABEL = {preset.label: preset for preset in PRESET_MODELS}
|
||||
|
||||
|
||||
def infer_provider(model_id: str) -> str:
    normalized = model_id.strip()
    if not normalized or "/" not in normalized:
        return ""
    return normalized.split("/", 1)[0].strip().lower()


def preset_models_for_audience(audience: str | None) -> list[PresetModel]:
    if not audience or audience == PRESET_AUDIENCE_ALL:
        return list(PRESET_MODELS)
    return [preset for preset in PRESET_MODELS if audience in preset.audiences]


def preset_labels_for_audience(audience: str | None) -> list[str]:
    return [preset.label for preset in preset_models_for_audience(audience)]


def resolve_model_selection(
    model: str,
    preset_label: str,
    provider: str = "",
) -> tuple[str, str]:
    selected_model = model.strip()
    selected_provider = provider.strip()

    preset = _PRESET_BY_LABEL.get(preset_label)
    if preset is not None:
        selected_model = preset.model_id
        if not selected_provider:
            selected_provider = preset.provider

    if not selected_provider:
        selected_provider = infer_provider(selected_model)

    return selected_model, selected_provider
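The precedence in `resolve_model_selection` is easy to miss: a recognised preset label overrides the typed model, an explicit provider overrides the preset's provider, and the provider is inferred from the model id prefix only as a last resort. The sketch below restates that logic in condensed form so it can run on its own; `Preset`, `CATALOG`, and `resolve` are hypothetical names, and the one-entry catalog stands in for the full `PRESET_MODELS` tuple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    """Trimmed-down, hypothetical version of PresetModel."""

    label: str
    model_id: str
    provider: str


# Illustrative one-entry catalog; the real preset list is longer.
CATALOG = {
    "Qwen3 32B": Preset("Qwen3 32B", "huggingface/Qwen/Qwen3-32B", "huggingface"),
}


def infer_provider(model_id: str) -> str:
    """Use the text before the first '/' as the provider, if there is one."""
    normalized = model_id.strip()
    if not normalized or "/" not in normalized:
        return ""
    return normalized.split("/", 1)[0].strip().lower()


def resolve(model: str, preset_label: str, provider: str = "") -> tuple[str, str]:
    selected_model = model.strip()
    selected_provider = provider.strip()
    preset = CATALOG.get(preset_label)
    if preset is not None:
        selected_model = preset.model_id  # a known preset overrides the typed model
        selected_provider = selected_provider or preset.provider
    if not selected_provider:
        selected_provider = infer_provider(selected_model)
    return selected_model, selected_provider


print(resolve("ignored", "Qwen3 32B"))
# ('huggingface/Qwen/Qwen3-32B', 'huggingface')
print(resolve("ignored", "Qwen3 32B", "openrouter"))
# ('huggingface/Qwen/Qwen3-32B', 'openrouter')
print(resolve(" Anthropic/claude-opus-4-6 ", "(custom)"))
# ('Anthropic/claude-opus-4-6', 'anthropic')
print(resolve("local-model", "(custom)"))
# ('local-model', '')
```

Note that a custom model id without a `/` yields an empty provider, so a caller would presumably need to handle that case before submitting.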
|
||||
@ -1,140 +1,112 @@
|
||||
"""Classify each archived run's dynamical regime from its turn trajectory.
|
||||
#!/usr/bin/env python3
|
||||
"""Classify posterior run trajectories into dynamical regimes.
|
||||
|
||||
Following "When LLMs Are Dreaming..." §What We Expect to See:
|
||||
We embed each assistant turn using bag-of-words text plus tool-call summaries,
|
||||
then compute simple geometric proxies:
|
||||
|
||||
TRAPPED/ATTRACTOR — low support (Vol_log), high recurrence, high BOPS.
|
||||
Agent converged to a point; may be good (solved it)
|
||||
or bad (got stuck in a loop on a single idea).
|
||||
drift_mean = mean ||x_t - x_{t-1}||
|
||||
from_start = max ||x_t - x_0||
|
||||
recurrence = max cosine(x_i, x_j) for non-adjacent turns
|
||||
vol_log = log det(Sigma + eps I)
|
||||
|
||||
LIMIT-CYCLE — high recurrence + bounded drift + quasi-periodic revisits.
|
||||
Agent loops between a few states.
|
||||
|
||||
DIFFUSIVE/WANDERING — growing support, rising drift, low recurrence.
|
||||
Agent explores without converging; often "goal drift".
|
||||
|
||||
SENSITIVE — (requires paraphrased-pair runs; skip here.)
|
||||
|
||||
TOO-SHORT — trajectory < 3 assistant turns; can't classify dynamics.
|
||||
|
||||
We work in a TF-IDF bag-of-words embedding space (same vocab as C(q)),
|
||||
with each turn's state vector = its assistant text + tool-call args.
|
||||
|
||||
Metrics per run:
|
||||
- drift_mean: mean ||e_t − e_{t−1}|| across turns
|
||||
- from_start: max ||e_t − e_0|| (farthest the run drifted from origin)
|
||||
- recurrence: max_{i<j, j−i≥2} cos(e_i, e_j) — best return-after-gap match
|
||||
- vol_log: log det(Σ + εI) over turn states — support volume proxy
|
||||
|
||||
Classifier rules (tuned empirically on the distribution):
|
||||
if n_turns < 3 → too_short
|
||||
elif drift_mean < 0.15 and vol_log < −6 → trapped
|
||||
elif recurrence > 0.80 and drift_mean < 0.25 → limit_cycle
|
||||
elif drift_mean > 0.35 and vol_log > −3 → diffusive
|
||||
else → mixed
|
||||
|
||||
Output: reports/regimes.json with per-run classification.
|
||||
|
||||
Usage:
|
||||
.venv/bin/python3 scripts/classify_regimes.py
|
||||
Runs are then bucketed into coarse regimes such as trapped, limit_cycle, and
|
||||
diffusive using quartile-based thresholds estimated from the observed archive.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
MODELS = [
|
||||
"anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
|
||||
"anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
|
||||
"google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
|
||||
"openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
|
||||
"openrouter_qwen_qwen3.6-plus",
|
||||
]
|
||||
from clawbench.dynamics_archive import load_task_runs_by_model
|
||||
|
||||
WORD_RE = re.compile(r"[a-z]{3,}")
|
||||
STOPWORDS = set("the and that with this have from what your will can but not "
|
||||
"was will are been one would there been they will their has "
|
||||
"had its were only some than about these which into also each "
|
||||
"when where them how who them very much more most other then "
|
||||
"here such does like just make many like want need take".split())
|
||||
STOPWORDS = set(
|
||||
"the and that with this have from what your will can but not "
|
||||
"was are been one would there they their has had its were only some "
|
||||
"than about these which into also each when where them how who very "
|
||||
"much more most other then here such does like just make many want need take".split()
|
||||
)
|
||||
|
||||
|
||||
def tokenize(text: str) -> list[str]:
|
||||
return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]
|
||||
|
||||
|
||||
def build_vocab(all_turn_texts: list[str], top_k: int = 500) -> dict[str, int]:
|
||||
c = Counter()
|
||||
for t in all_turn_texts:
|
||||
c.update(set(tokenize(t)))
|
||||
return {w: i for i, (w, _) in enumerate(c.most_common(top_k))}
|
||||
def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
|
||||
counter = Counter()
|
||||
for text in texts:
|
||||
counter.update(set(tokenize(text)))
|
||||
return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
|
||||
|
||||
|
||||
def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
|
||||
v = np.zeros(len(vocab), dtype=np.float32)
|
||||
for w, c in Counter(tokenize(text)).items():
|
||||
if w in vocab:
|
||||
v[vocab[w]] = c
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n > 0 else v
|
||||
vec = np.zeros(len(vocab), dtype=np.float32)
|
||||
for word, cnt in Counter(tokenize(text)).items():
|
||||
if word in vocab:
|
||||
vec[vocab[word]] = cnt
|
||||
norm = np.linalg.norm(vec)
|
||||
return vec / norm if norm > 0 else vec
|
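A minimal sketch of the vocabulary and vectorization path above, rewritten in pure Python so it runs without NumPy. The stopword set is a small illustrative subset and the sample turn strings are made up; only the tokenize, document-frequency vocabulary, and unit-L2 steps mirror the diff.

```python
import math
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z]{3,}")
STOPWORDS = {"the", "and", "with"}  # illustrative subset, not the script's full list


def tokenize(text):
    return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]


def build_vocab(texts, top_k=500):
    # Count each token once per text, so ranking follows document frequency.
    counts = Counter()
    for text in texts:
        counts.update(set(tokenize(text)))
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(top_k))}


def vectorize(text, vocab):
    # Raw term counts scaled to unit L2 norm; empty text stays all-zero.
    vec = [0.0] * len(vocab)
    for word, cnt in Counter(tokenize(text)).items():
        if word in vocab:
            vec[vocab[word]] = float(cnt)
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


# Made-up turn texts for illustration.
turns = ["read the file and grep errors", "grep errors with grep", "write summary"]
vocab = build_vocab(turns)
vecs = [vectorize(t, vocab) for t in turns]
```

Because every non-empty turn lands on the unit sphere, Euclidean step sizes and cosine similarities computed later are directly comparable across turns of very different length.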
||||
|
||||
|
||||
def turn_texts(run_data: dict) -> list[str]:
|
||||
"""Extract one text string per assistant turn (text + tool-call summary)."""
|
||||
def turn_texts(run, fallback_any_message: bool = False) -> list[str]:
|
||||
source = run.transcript.messages if fallback_any_message else run.transcript.assistant_messages
|
||||
out = []
|
||||
for m in run_data.get("transcript", {}).get("messages", []):
|
||||
if m.get("role") != "assistant":
|
||||
continue
|
||||
for msg in source:
|
||||
parts = []
|
||||
if m.get("text"):
|
||||
parts.append(m["text"])
|
||||
for tc in (m.get("tool_calls") or []):
|
||||
name = tc.get("name", "")
|
||||
args_str = json.dumps(tc.get("arguments", {}))[:200]
|
||||
parts.append(f"{name} {args_str}")
|
||||
if msg.text:
|
||||
parts.append(msg.text)
|
||||
for tc in msg.tool_calls:
|
||||
parts.append(tc.name)
|
||||
if tc.input:
|
||||
parts.append(json.dumps(tc.input, sort_keys=True)[:200])
|
||||
if parts:
|
||||
out.append(" ".join(parts))
|
||||
return out
|
||||
|
||||
|
||||
def trajectory_metrics(vecs: np.ndarray) -> dict:
|
||||
"""Compute dynamical metrics over a (n_turns, d) trajectory matrix."""
|
||||
def trajectory_metrics(vecs: np.ndarray) -> dict[str, float]:
|
||||
"""Compute drift, recurrence, and support-volume proxies for one run."""
|
||||
n = vecs.shape[0]
|
||||
if n < 2:
|
||||
return {"n_turns": n, "drift_mean": 0.0, "from_start": 0.0,
|
||||
"recurrence": 0.0, "vol_log": -12.0}
|
||||
# Drift: consecutive distances
|
||||
return {
|
||||
"n_turns": float(n),
|
||||
"drift_mean": 0.0,
|
||||
"from_start": 0.0,
|
||||
"recurrence": 0.0,
|
||||
"vol_log": -12.0,
|
||||
}
|
||||
|
||||
diffs = np.linalg.norm(np.diff(vecs, axis=0), axis=1)
|
||||
drift_mean = float(diffs.mean())
|
||||
# From start: max distance from turn 0
|
||||
dists_from_0 = np.linalg.norm(vecs - vecs[0:1], axis=1)
|
||||
from_start = float(dists_from_0.max())
|
||||
# Recurrence: best non-adjacent cosine similarity (ignoring immediate neighbors)
|
||||
from_start = float(np.linalg.norm(vecs - vecs[0:1], axis=1).max())
|
||||
|
||||
recurrence = 0.0
|
||||
for i in range(n):
|
||||
for j in range(i + 2, n):
|
||||
ni, nj = np.linalg.norm(vecs[i]), np.linalg.norm(vecs[j])
|
||||
ni = np.linalg.norm(vecs[i])
|
||||
nj = np.linalg.norm(vecs[j])
|
||||
if ni > 0 and nj > 0:
|
||||
c = float(vecs[i] @ vecs[j] / (ni * nj))
|
||||
if c > recurrence:
|
||||
recurrence = c
|
||||
# Vol_log: log det of turn-state covariance
|
||||
sim = float(vecs[i] @ vecs[j] / (ni * nj))
|
||||
recurrence = max(recurrence, sim)
|
||||
|
||||
if n >= 3:
|
||||
Sigma = np.cov(vecs.T)
|
||||
# Use log|Σ + εI|; since d is large (500) we take eigenvalues + clip
|
||||
eigs = np.linalg.eigvalsh(Sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
|
||||
sigma = np.cov(vecs.T)
|
||||
eigs = np.linalg.eigvalsh(sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
|
||||
vol_log = float(np.log(np.clip(eigs, 1e-12, None)).sum())
|
||||
else:
|
||||
vol_log = -12.0
|
||||
|
||||
return {
|
||||
"n_turns": n,
|
||||
"n_turns": float(n),
|
||||
"drift_mean": drift_mean,
|
||||
"from_start": from_start,
|
||||
"recurrence": recurrence,
|
||||
@@ -142,109 +114,105 @@ def trajectory_metrics(vecs: np.ndarray) -> dict:
|
||||
}
|
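The drift and recurrence parts of `trajectory_metrics` can be sketched without NumPy as below. This is an illustration under assumptions: `drift_and_recurrence` is a hypothetical helper name, the toy trajectories are made up, and the `vol_log` term is left out because it needs an eigendecomposition.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def drift_and_recurrence(vecs):
    n = len(vecs)
    if n < 2:
        return {"drift_mean": 0.0, "from_start": 0.0, "recurrence": 0.0}
    # Mean Euclidean step between consecutive turn vectors.
    steps = [euclidean(vecs[i], vecs[i + 1]) for i in range(n - 1)]
    # Best cosine similarity between turns at least two steps apart.
    recurrence = 0.0
    for i in range(n):
        for j in range(i + 2, n):
            recurrence = max(recurrence, cosine(vecs[i], vecs[j]))
    return {
        "drift_mean": sum(steps) / len(steps),
        "from_start": max(euclidean(vecs[0], v) for v in vecs),
        "recurrence": recurrence,
    }


# A run that returns to its first state, and one that never revisits a state.
looping = drift_and_recurrence([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
wandering = drift_and_recurrence([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
```

Both toy runs have the same mean drift, so drift alone cannot separate them; the recurrence term is what distinguishes a loop from open-ended wandering.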
||||
|
||||
|
||||
def classify(m: dict, thresholds: dict) -> str:
|
||||
"""Classify based on quartile thresholds of the actual distribution.
|
||||
|
||||
Thresholds (set empirically from observed distribution):
|
||||
drift_low = p25 drift_hi = p75
|
||||
vol_low = p25 vol_hi = p75
|
||||
rec_hi = p75
|
||||
|
||||
Rules (priority order):
|
||||
n_turns < 3 → too_short
|
||||
drift < drift_low AND vol < vol_low → trapped
|
||||
rec > rec_hi AND drift < median → limit_cycle
|
||||
drift > drift_hi AND vol > vol_hi → diffusive
|
||||
else → mixed
|
||||
"""
|
||||
n = m["n_turns"]
|
||||
if n < 3:
|
||||
def classify(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
|
||||
"""Map trajectory metrics to a coarse regime label."""
|
||||
n_turns = int(metrics["n_turns"])
|
||||
if n_turns < 3:
|
||||
return "too_short"
|
||||
d = m["drift_mean"]
|
||||
rec = m["recurrence"]
|
||||
vol = m["vol_log"]
|
||||
if d < thresholds["drift_low"] and vol < thresholds["vol_low"]:
|
||||
drift = metrics["drift_mean"]
|
||||
recurrence = metrics["recurrence"]
|
||||
vol = metrics["vol_log"]
|
||||
|
||||
if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
|
||||
return "trapped"
|
||||
if rec > thresholds["rec_hi"] and d < thresholds["drift_med"]:
|
||||
if recurrence > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
|
||||
return "limit_cycle"
|
||||
if d > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
|
||||
if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
|
||||
return "diffusive"
|
||||
return "mixed"
|
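To see how the quartile thresholds and the priority order in `classify` interact, here is a small standalone sketch. It assumes `statistics.quantiles(..., method="inclusive")` as a stand-in for `np.percentile` (both use linear interpolation between order statistics); `quartile_thresholds` is a hypothetical helper and the metric values are invented.

```python
from statistics import quantiles


def quartile_thresholds(runs):
    # Only runs with at least 3 turns contribute to the thresholds.
    eligible = [r for r in runs if r["n_turns"] >= 3]
    drift_q = quantiles([r["drift_mean"] for r in eligible], n=4, method="inclusive")
    vol_q = quantiles([r["vol_log"] for r in eligible], n=4, method="inclusive")
    rec_q = quantiles([r["recurrence"] for r in eligible], n=4, method="inclusive")
    return {
        "drift_low": drift_q[0], "drift_med": drift_q[1], "drift_hi": drift_q[2],
        "vol_low": vol_q[0], "vol_hi": vol_q[2],
        "rec_hi": rec_q[2],
    }


def classify(metrics, thresholds):
    # Same rule order as the diff: first matching rule wins.
    if metrics["n_turns"] < 3:
        return "too_short"
    drift, rec, vol = metrics["drift_mean"], metrics["recurrence"], metrics["vol_log"]
    if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
        return "trapped"
    if rec > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
        return "limit_cycle"
    if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
        return "diffusive"
    return "mixed"


# Invented metrics for six runs.
runs = [
    {"n_turns": 5, "drift_mean": 0.1, "recurrence": 0.2, "vol_log": -10.0},
    {"n_turns": 5, "drift_mean": 0.3, "recurrence": 0.95, "vol_log": -6.0},
    {"n_turns": 5, "drift_mean": 0.5, "recurrence": 0.3, "vol_log": -5.0},
    {"n_turns": 5, "drift_mean": 0.7, "recurrence": 0.4, "vol_log": -4.0},
    {"n_turns": 5, "drift_mean": 0.9, "recurrence": 0.5, "vol_log": -1.0},
    {"n_turns": 2, "drift_mean": 0.0, "recurrence": 0.0, "vol_log": -12.0},
]
thresholds = quartile_thresholds(runs)
labels = [classify(r, thresholds) for r in runs]
```

Since the comparisons are strict and the thresholds are quantiles of the same data, a run sitting exactly on a quartile falls through to a later rule. With archive-relative thresholds, roughly the outer quarters of the distribution are candidates for `trapped` and `diffusive` regardless of absolute scale.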
||||
|
||||
|
||||
def main() -> None:
|
||||
# First pass: collect turn texts to build vocab
|
||||
parser = argparse.ArgumentParser(description="Classify cached run regimes")
|
||||
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
|
||||
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
|
||||
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
|
||||
if not grouped:
|
||||
raise SystemExit(f"No cached runs found under {args.archive_dir}")
|
||||
|
||||
all_turn_texts: list[str] = []
|
||||
run_turns: dict[tuple, list[str]] = {}
|
||||
for model in MODELS:
|
||||
for rf in (ARCH / model).rglob("run*.json"):
|
||||
try:
|
||||
d = json.loads(rf.read_text())
|
||||
except Exception:
|
||||
continue
|
||||
task = rf.parent.name
|
||||
run_idx = int(re.match(r"run(\d+)", rf.stem).group(1))
|
||||
ts = turn_texts(d)
|
||||
run_turns[(model, task, run_idx)] = ts
|
||||
all_turn_texts.extend(ts)
|
||||
run_turns: dict[str, list[str]] = {}
|
||||
|
||||
for model_name, task_runs in grouped.items():
|
||||
for task_id, runs in task_runs.items():
|
||||
for run in runs:
|
||||
ts = turn_texts(run, fallback_any_message=False)
|
||||
key = f"{model_name}/{task_id}/run{run.run_index}"
|
||||
run_turns[key] = ts
|
||||
all_turn_texts.extend(ts)
|
||||
|
||||
used_fallback_messages = False
|
||||
if not all_turn_texts:
|
||||
used_fallback_messages = True
|
||||
all_turn_texts = []
|
||||
run_turns = {}
|
||||
for model_name, task_runs in grouped.items():
|
||||
for task_id, runs in task_runs.items():
|
||||
for run in runs:
|
||||
ts = turn_texts(run, fallback_any_message=True)
|
||||
key = f"{model_name}/{task_id}/run{run.run_index}"
|
||||
run_turns[key] = ts
|
||||
all_turn_texts.extend(ts)
|
||||
|
||||
if not all_turn_texts:
|
||||
raise SystemExit("No usable turn text found in archive.")
|
||||
|
||||
vocab = build_vocab(all_turn_texts, top_k=500)
|
||||
print(f"Runs collected: {len(run_turns)} vocab size: {len(vocab)}")
|
||||
|
||||
# Second pass: vectorize + compute metrics
|
||||
per_run: dict[str, dict] = {}
|
||||
per_run: dict[str, dict[str, float | str]] = {}
|
||||
for key, ts in run_turns.items():
|
||||
model, task, run_idx = key
|
||||
if not ts:
|
||||
continue
|
||||
vecs = np.stack([vectorize(t, vocab) for t in ts])
|
||||
m = trajectory_metrics(vecs)
|
||||
per_run[f"{model}/{task}/run{run_idx}"] = m
|
||||
vecs = np.stack([vectorize(text, vocab) for text in ts])
|
||||
per_run[key] = trajectory_metrics(vecs)
|
||||
|
||||
# Derive thresholds from actual distribution of n_turns>=3 runs
|
||||
drifts = np.array([v["drift_mean"] for v in per_run.values() if v["n_turns"] >= 3])
|
||||
recs = np.array([v["recurrence"] for v in per_run.values() if v["n_turns"] >= 3])
|
||||
vols = np.array([v["vol_log"] for v in per_run.values() if v["n_turns"] >= 3])
|
||||
thresholds = {
|
||||
"drift_low": float(np.percentile(drifts, 25)),
|
||||
"drift_med": float(np.percentile(drifts, 50)),
|
||||
"drift_hi": float(np.percentile(drifts, 75)),
|
||||
"vol_low": float(np.percentile(vols, 25)),
|
||||
"vol_hi": float(np.percentile(vols, 75)),
|
||||
"rec_hi": float(np.percentile(recs, 75)),
|
||||
}
|
||||
print(f"\nThresholds (quartile-based from observed distribution):")
|
||||
for k, v in thresholds.items():
|
||||
print(f" {k:<12} {v:>10.3f}")
|
||||
eligible = [r for r in per_run.values() if int(r["n_turns"]) >= 3]
|
||||
if eligible:
|
||||
drifts = np.array([float(v["drift_mean"]) for v in eligible])
|
||||
recs = np.array([float(v["recurrence"]) for v in eligible])
|
||||
vols = np.array([float(v["vol_log"]) for v in eligible])
|
||||
thresholds = {
|
||||
"drift_low": float(np.percentile(drifts, 25)),
|
||||
"drift_med": float(np.percentile(drifts, 50)),
|
||||
"drift_hi": float(np.percentile(drifts, 75)),
|
||||
"vol_low": float(np.percentile(vols, 25)),
|
||||
"vol_hi": float(np.percentile(vols, 75)),
|
||||
"rec_hi": float(np.percentile(recs, 75)),
|
||||
}
|
||||
else:
|
||||
thresholds = {
|
||||
"drift_low": 0.15,
|
||||
"drift_med": 0.25,
|
||||
"drift_hi": 0.35,
|
||||
"vol_low": -6.0,
|
||||
"vol_hi": -3.0,
|
||||
"rec_hi": 0.8,
|
||||
}
|
||||
|
||||
# Apply classifier with thresholds
|
||||
for key in per_run:
|
||||
per_run[key]["regime"] = classify(per_run[key], thresholds)
|
||||
for key, metrics in per_run.items():
|
||||
metrics["regime"] = classify(metrics, thresholds)
|
||||
metrics["turn_source"] = "any_message" if used_fallback_messages else "assistant"
|
||||
|
||||
# Summary by regime
|
||||
counts = Counter(v["regime"] for v in per_run.values())
|
||||
print(f"\nRegime distribution (n={len(per_run)} runs):")
|
||||
for regime, n in counts.most_common():
|
||||
print(f" {regime:<14} {n:>4} ({100*n/len(per_run):>4.1f}%)")
|
||||
args.reports_dir.mkdir(parents=True, exist_ok=True)
|
||||
out = args.reports_dir / "regimes.json"
|
||||
out.write_text(json.dumps(per_run, indent=2), encoding="utf-8")
|
||||
|
||||
# Per-model regime breakdown
|
||||
print(f"\n{'Model':<10} " + " ".join(f"{r:>11}" for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]))
|
||||
print("-" * 70)
|
||||
pm_counts = defaultdict(Counter)
|
||||
for key, v in per_run.items():
|
||||
model = key.split("/")[0]
|
||||
pm_counts[model][v["regime"]] += 1
|
||||
for model in MODELS:
|
||||
row = [f"{model.split('_')[-1][:9]:<10}"]
|
||||
for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]:
|
||||
row.append(f"{pm_counts[model][r]:>11}")
|
||||
print(" ".join(row))
|
||||
|
||||
# Write output
|
||||
out = ROOT / "reports" / "regimes.json"
|
||||
out.parent.mkdir(exist_ok=True)
|
||||
out.write_text(json.dumps(per_run, indent=2))
|
||||
print(f"\nWrote: {out}")
|
||||
counts = Counter(str(v["regime"]) for v in per_run.values())
|
||||
print(f"Wrote: {out}")
|
||||
print(f"Regime counts: {dict(counts)}")
|
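The new `main` prints only aggregate regime counts. A per-model breakdown can still be recovered from `regimes.json`, because each key has the form `model/task/runN`. The sketch below assumes model names contain no `/` (true of the names in the removed `MODELS` list); the helper name and the sample entries are hypothetical.

```python
from collections import Counter, defaultdict

REGIMES = ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]


def regime_counts_by_model(per_run):
    # The model name is everything before the first "/" in the run key.
    table = defaultdict(Counter)
    for key, metrics in per_run.items():
        model = key.split("/", 1)[0]
        table[model][metrics["regime"]] += 1
    return table


# Made-up excerpt shaped like regimes.json.
per_run = {
    "model_a/task_1/run0": {"regime": "trapped"},
    "model_a/task_1/run1": {"regime": "trapped"},
    "model_a/task_2/run0": {"regime": "mixed"},
    "model_b/task_1/run0": {"regime": "diffusive"},
}
table = regime_counts_by_model(per_run)
rows = {model: [counts[r] for r in REGIMES] for model, counts in table.items()}
```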
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
@@ -1,145 +1,127 @@
|
||||
"""Compute Constraint Index C(q) per task from existing v4-19-full archive.
|
||||
#!/usr/bin/env python3
|
||||
"""Compute posterior Constraint Index C(q) from cached runs.
|
||||
|
||||
Following "When LLMs Are Dreaming..." paper §Query-design:
|
||||
Task-level constraint index:
|
||||
|
||||
C(q) = z(PR(q)) + z(entropy(q)) + z(BOPS(q))
|
||||
C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))
|
||||
|
||||
Where:
|
||||
- PR(q): participation ratio = (tr Σ)² / tr(Σ²) of response embeddings
|
||||
across all (model, run) responses to query q. Low PR = everyone
|
||||
writes similar thing (prompt is constrained). High PR = responses
|
||||
spread out (prompt is open-ended).
|
||||
- entropy(q): Shannon entropy of (discretized) response-feature distribution.
|
||||
- BOPS(q): Bayesian Optimal Prediction Score — how well can we predict
|
||||
response given q? Proxied here as inter-run cosine similarity
|
||||
for the same model (high similarity = high predictability).
|
||||
|
||||
Since we don't have sentence-transformers, we use TF-IDF-style bag-of-words
|
||||
from the final assistant message per run. This is crude but measures the
|
||||
same signal — whether models produce similar vs divergent output.
|
||||
PR(q) = participation ratio of the task response covariance
|
||||
H(q) = Shannon entropy of the covariance eigenspectrum
|
||||
BOPS(q) = within-model inter-run predictability proxy
|
||||
|
||||
Output: reports/constraint_index.json with per-task C(q) components +
|
||||
combined z-score.
|
||||
High C(q) means a task is more constrained: models and repeated runs tend to
|
||||
land in a narrower response manifold. Low C(q) means the task is more open or
|
||||
stylistically underconstrained.
|
||||
|
||||
Usage:
|
||||
.venv/bin/python3 scripts/compute_constraint_index.py
|
||||
This implementation uses a normalized bag-of-words representation built from
|
||||
the full assistant trajectory text plus tool-call names and compacted inputs.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import glob
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import numpy as np
|
||||
from scipy.stats import entropy as shannon_entropy
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
MODELS = [
|
||||
"anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
|
||||
"anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
|
||||
"google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
|
||||
"openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
|
||||
"openrouter_qwen_qwen3.6-plus",
|
||||
]
|
||||
from clawbench.dynamics_archive import load_task_runs_by_model
|
||||
|
||||
WORD_RE = re.compile(r"[a-z]{3,}")
|
||||
STOPWORDS = set("the and that with this have from what your will can but not "
|
||||
"was will are been one would there been they will their has "
|
||||
"had its were only some than about these which into also each "
|
||||
"when where them how who them very much more most other then "
|
||||
"here such does like just make many like want need take".split())
|
||||
STOPWORDS = set(
|
||||
"the and that with this have from what your will can but not "
|
||||
"was are been one would there they their has had its were only some "
|
||||
"than about these which into also each when where them how who very "
|
||||
"much more most other then here such does like just make many want need take".split()
|
||||
)
|
||||
|
||||
|
||||
def final_assistant_text(run_path: Path, max_chars: int = 4000) -> str:
|
||||
"""Extract the last assistant message text + tool-call arg summary."""
|
||||
try:
|
||||
d = json.loads(run_path.read_text())
|
||||
except Exception:
|
||||
return ""
|
||||
msgs = d.get("transcript", {}).get("messages", [])
|
||||
texts = []
|
||||
for m in msgs:
|
||||
if m.get("role") != "assistant":
|
||||
continue
|
||||
if m.get("text"):
|
||||
texts.append(m["text"])
|
||||
for tc in (m.get("tool_calls") or []):
|
||||
name = tc.get("name", "")
|
||||
args_str = json.dumps(tc.get("arguments", {}))[:200]
|
||||
texts.append(f"{name} {args_str}")
|
||||
blob = " ".join(texts)[:max_chars]
|
||||
return blob
|
||||
def _assistant_trajectory_text(run, max_chars: int = 4000) -> str:
|
||||
parts = []
|
||||
for message in run.transcript.assistant_messages:
|
||||
if message.text:
|
||||
parts.append(message.text)
|
||||
for call in message.tool_calls:
|
||||
parts.append(call.name)
|
||||
if call.input:
|
||||
parts.append(json.dumps(call.input, sort_keys=True)[:200])
|
||||
return " ".join(p for p in parts if p).strip()[:max_chars]
|
||||
|
||||
|
||||
def _fallback_text_from_any_message(run) -> str:
|
||||
for msg in reversed(run.transcript.messages):
|
||||
parts = []
|
||||
if msg.text:
|
||||
parts.append(msg.text)
|
||||
for call in msg.tool_calls:
|
||||
parts.append(call.name)
|
||||
if call.input:
|
||||
parts.append(json.dumps(call.input, sort_keys=True)[:200])
|
||||
if parts:
|
||||
return " ".join(parts).strip()
|
||||
return ""
|
||||
|
||||
|
||||
def tokenize(text: str) -> list[str]:
|
||||
return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]
|
||||
return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]
|
||||
|
||||
|
||||
def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
|
||||
"""Build a vocab of the top-k most common tokens across all texts."""
|
||||
counter = Counter()
|
||||
for t in texts:
|
||||
counter.update(set(tokenize(t)))
|
||||
return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
|
||||
counts = Counter()
|
||||
for text in texts:
|
||||
counts.update(set(tokenize(text)))
|
||||
return {word: idx for idx, (word, _) in enumerate(counts.most_common(top_k))}
|
||||
|
||||
|
||||
def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
|
||||
"""TF-IDF-ish: token frequency normalized to unit L2 for cosine geometry."""
|
||||
v = np.zeros(len(vocab), dtype=np.float32)
|
||||
vec = np.zeros(len(vocab), dtype=np.float32)
|
||||
toks = tokenize(text)
|
||||
if not toks:
|
||||
return v
|
||||
return vec
|
||||
counts = Counter(toks)
|
||||
for w, c in counts.items():
|
||||
if w in vocab:
|
||||
v[vocab[w]] = c
|
||||
n = np.linalg.norm(v)
|
||||
return v / n if n > 0 else v
|
||||
for word, cnt in counts.items():
|
||||
if word in vocab:
|
||||
vec[vocab[word]] = cnt
|
||||
norm = np.linalg.norm(vec)
|
||||
return vec / norm if norm > 0 else vec
|
||||
|
||||
|
||||
def participation_ratio(X: np.ndarray) -> float:
|
||||
"""PR(X) = (tr Σ)² / tr(Σ²). Measures effective dimensionality 1–d."""
|
||||
"""PR(X) = (tr Sigma)^2 / tr(Sigma^2), an effective dimensionality proxy."""
|
||||
if X.shape[0] < 2:
|
||||
return 1.0
|
||||
Sigma = np.cov(X.T)
|
||||
if Sigma.ndim == 0:
|
||||
sigma = np.cov(X.T)
|
||||
if sigma.ndim == 0:
|
||||
return 1.0
|
||||
tr = np.trace(Sigma)
|
||||
tr_sq = np.trace(Sigma @ Sigma)
|
||||
tr = np.trace(sigma)
|
||||
tr_sq = np.trace(sigma @ sigma)
|
||||
if tr_sq < 1e-12:
|
||||
return 1.0
|
||||
return float(tr ** 2 / tr_sq)
|
||||
return float((tr**2) / tr_sq)
|
||||
|
||||
|
||||
def response_entropy(X: np.ndarray, n_clusters: int = 8) -> float:
|
||||
"""Entropy of a k-means-like discretization of responses.
|
||||
|
||||
Since we have small n per task (~27 responses), we cluster by nearest-
|
||||
centroid using the top-few PCA directions. Simpler: use normalized
|
||||
eigenvalues of covariance as a proxy for entropy over principal modes.
|
||||
"""
|
||||
def response_entropy(X: np.ndarray) -> float:
|
||||
"""Entropy over normalized covariance eigenvalues, in bits."""
|
||||
if X.shape[0] < 2:
|
||||
return 0.0
|
||||
Sigma = np.cov(X.T)
|
||||
eigs = np.linalg.eigvalsh(Sigma)
|
||||
sigma = np.cov(X.T)
|
||||
eigs = np.linalg.eigvalsh(sigma)
|
||||
eigs = np.clip(eigs, 1e-12, None)
|
||||
eigs = eigs / eigs.sum()
|
||||
return float(shannon_entropy(eigs, base=2))
|
||||
probs = eigs / eigs.sum()
|
||||
return float(-np.sum(probs * np.log2(probs)))
|
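Both `participation_ratio` and `response_entropy` are summaries of the covariance eigenspectrum, since tr(Σ) = Σ λ_i and tr(Σ²) = Σ λ_i². The sketch below works directly on a list of eigenvalues to show the two extremes; the function names and the example spectra are illustrative, not part of the commit.

```python
import math


def participation_ratio_from_eigs(eigs):
    # (sum of eigenvalues)^2 / (sum of squared eigenvalues).
    s1 = sum(eigs)
    s2 = sum(e * e for e in eigs)
    return (s1 * s1) / s2 if s2 > 1e-12 else 1.0


def spectral_entropy_bits(eigs):
    # Shannon entropy of the normalized spectrum, clipped like the script.
    clipped = [max(e, 1e-12) for e in eigs]
    total = sum(clipped)
    return -sum((e / total) * math.log2(e / total) for e in clipped)


flat = [1.0, 1.0, 1.0, 1.0]    # variance spread evenly over 4 directions
spiked = [1.0, 0.0, 0.0, 0.0]  # all variance in one direction

pr_flat = participation_ratio_from_eigs(flat)
pr_spiked = participation_ratio_from_eigs(spiked)
h_flat = spectral_entropy_bits(flat)
h_spiked = spectral_entropy_bits(spiked)
```

A flat spectrum over d directions gives PR = d and H = log2(d) bits, while a single dominant direction gives PR = 1 and H near 0. This is why both terms enter C(q) with a negative sign.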
||||
|
||||
|
||||
def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> float:
|
||||
"""BOPS proxy: inter-run cosine similarity within same model.
|
||||
|
||||
High similarity = predictable (high BOPS). Low similarity = novel each run.
|
||||
Returns mean cosine across all pairs within each model, averaged across models.
|
||||
"""
|
||||
"""Mean within-model pairwise cosine similarity across repeated runs."""
|
||||
per_model_means = []
|
||||
for _model, vecs in run_vecs.items():
|
||||
for vecs in run_vecs.values():
|
||||
if len(vecs) < 2:
|
||||
continue
|
||||
sims = []
|
||||
@@ -154,91 +136,88 @@ def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> floa
|
||||
return float(np.mean(per_model_means)) if per_model_means else 0.0
|
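The pairwise loop of `bops_inter_run_predictability` is elided from this hunk. The sketch below reconstructs it from the README formula `BOPS(q) = mean_m mean_{i<j} cos(v_{q,m,i}, v_{q,m,j})` and should be read as an assumption about the hidden lines, not a copy of them. Model names and vectors are invented.

```python
import math
from itertools import combinations


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def bops(run_vecs):
    # Average pairwise similarity within each model, then average over models.
    per_model_means = []
    for vecs in run_vecs.values():
        if len(vecs) < 2:
            continue  # a single run has no pair to compare
        sims = [cosine(a, b) for a, b in combinations(vecs, 2)]
        per_model_means.append(sum(sims) / len(sims))
    return sum(per_model_means) / len(per_model_means) if per_model_means else 0.0


run_vecs = {
    "model_a": [[1.0, 0.0], [1.0, 0.0]],  # identical repeated runs
    "model_b": [[1.0, 0.0], [0.0, 1.0]],  # orthogonal repeated runs
    "model_c": [[0.0, 1.0]],              # skipped: only one run
}
score = bops(run_vecs)
```

Averaging within models first keeps a model with many cached runs from dominating the task-level score.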
||||
|
||||
|
||||
def zscore(value: float, arr: np.ndarray) -> float:
|
||||
std = arr.std()
|
||||
return float((value - arr.mean()) / std) if std > 1e-12 else 0.0
|
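A worked example of how `zscore` feeds the combined index `C(q) = -z(PR) - z(H) + z(BOPS)`. The sketch uses `statistics.pstdev` as a stand-in for `ndarray.std()` (both are population standard deviations); the three task entries are invented.

```python
from statistics import fmean, pstdev


def zscore(value, values):
    # Population std; a constant column maps to 0 instead of dividing by ~0.
    std = pstdev(values)
    return (value - fmean(values)) / std if std > 1e-12 else 0.0


# Invented per-task components.
per_task = {
    "task_tight": {"PR": 2.0, "entropy": 1.0, "BOPS": 0.9},
    "task_mid": {"PR": 4.0, "entropy": 2.0, "BOPS": 0.5},
    "task_open": {"PR": 6.0, "entropy": 3.0, "BOPS": 0.1},
}
prs = [v["PR"] for v in per_task.values()]
ents = [v["entropy"] for v in per_task.values()]
bopss = [v["BOPS"] for v in per_task.values()]

for v in per_task.values():
    v["C_q"] = -zscore(v["PR"], prs) - zscore(v["entropy"], ents) + zscore(v["BOPS"], bopss)
```

Here the low-dimensional, highly repeatable task gets the largest C(q) and the high-dimensional, unpredictable one the smallest. Because each component is standardized across tasks, C(q) is relative to the archive being analyzed and is not comparable between archives.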
||||
|
||||
|
||||
def main() -> None:
|
||||
# Gather: per-task list of texts + per-model list of per-run vectors
|
||||
parser = argparse.ArgumentParser(description="Compute posterior constraint index per task")
|
||||
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
|
||||
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
|
||||
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
|
||||
if not grouped:
|
||||
raise SystemExit(f"No cached runs found under {args.archive_dir}")
|
||||
|
||||
per_task_texts: dict[str, list[str]] = defaultdict(list)
|
||||
per_task_model_runs: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
|
||||
for model in MODELS:
|
||||
model_dir = ARCH / model
|
||||
if not model_dir.exists():
|
||||
continue
|
||||
for task_dir in model_dir.iterdir():
|
||||
if not task_dir.is_dir():
|
||||
continue
|
||||
task = task_dir.name
|
||||
for rf in sorted(task_dir.glob("run*.json")):
|
||||
text = final_assistant_text(rf)
|
||||
per_task_model_texts: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
|
||||
|
||||
use_fallback_messages = False
|
||||
for model_name, task_runs in grouped.items():
|
||||
for task_id, runs in task_runs.items():
|
||||
for run in runs:
|
||||
text = _assistant_trajectory_text(run)
|
||||
if text:
|
||||
per_task_texts[task].append(text)
|
||||
per_task_model_runs[task][model].append(text)
|
||||
per_task_texts[task_id].append(text)
|
||||
per_task_model_texts[task_id][model_name].append(text)
|
||||
|
||||
print(f"Tasks with responses: {len(per_task_texts)}")
|
||||
all_texts = [text for texts in per_task_texts.values() for text in texts]
|
||||
if not all_texts:
|
||||
use_fallback_messages = True
|
||||
for model_name, task_runs in grouped.items():
|
||||
for task_id, runs in task_runs.items():
|
||||
for run in runs:
|
||||
text = _fallback_text_from_any_message(run)
|
||||
if text:
|
||||
per_task_texts[task_id].append(text)
|
||||
per_task_model_texts[task_id][model_name].append(text)
|
||||
all_texts = [text for texts in per_task_texts.values() for text in texts]
|
||||
|
||||
if not all_texts:
|
||||
raise SystemExit("No usable text found in cached transcripts.")
|
||||
|
||||
# Build a GLOBAL vocab across all tasks for comparable vector spaces
|
||||
all_texts = [t for ts in per_task_texts.values() for t in ts]
|
||||
vocab = build_vocab(all_texts, top_k=500)
|
||||
print(f"Global vocab size: {len(vocab)}")
|
||||
|
||||
# Compute per-task metrics
|
||||
per_task: dict[str, dict] = {}
|
||||
for task, texts in sorted(per_task_texts.items()):
|
||||
if len(texts) < 5:
|
||||
continue
|
||||
X = np.stack([vectorize(t, vocab) for t in texts]) # (n_responses, vocab_dim)
|
||||
per_task: dict[str, dict[str, float | str]] = {}
|
||||
for task_id, texts in sorted(per_task_texts.items()):
|
||||
X = np.stack([vectorize(text, vocab) for text in texts])
|
||||
pr = participation_ratio(X)
|
||||
ent = response_entropy(X)
|
||||
# BOPS: within-model run predictability
|
||||
model_vecs: dict[str, list[np.ndarray]] = {}
|
||||
for m, ts in per_task_model_runs[task].items():
|
||||
model_vecs[m] = [vectorize(t, vocab) for t in ts]
|
||||
model_vecs = {
|
||||
model_name: [vectorize(text, vocab) for text in model_texts]
|
||||
for model_name, model_texts in per_task_model_texts[task_id].items()
|
||||
}
|
||||
bops = bops_inter_run_predictability(model_vecs)
|
||||
per_task[task] = {
|
||||
per_task[task_id] = {
|
||||
"n_responses": len(texts),
|
||||
"PR": pr,
|
||||
"entropy": ent,
|
||||
"BOPS": bops,
|
||||
"data_source": "fallback_any_message" if use_fallback_messages else "assistant_final",
|
||||
}
|
||||
|
||||
# Z-score each component across tasks → combine into C(q)
|
||||
if not per_task:
|
||||
raise SystemExit("Not enough data to compute C(q).")
|
||||
|
||||
prs = np.array([v["PR"] for v in per_task.values()])
|
||||
ents = np.array([v["entropy"] for v in per_task.values()])
|
||||
bopss = np.array([v["BOPS"] for v in per_task.values()])
|
||||
|
||||
def z(x, arr):
|
||||
return float((x - arr.mean()) / (arr.std() or 1.0))
|
||||
for task_id, v in per_task.items():
|
||||
z_pr = zscore(v["PR"], prs)
|
||||
z_ent = zscore(v["entropy"], ents)
|
||||
z_bops = zscore(v["BOPS"], bopss)
|
||||
v["z_PR"] = z_pr
|
||||
v["z_entropy"] = z_ent
|
||||
v["z_BOPS"] = z_bops
|
||||
v["C_q"] = -z_pr - z_ent + z_bops
|
||||
|
||||
for task, v in per_task.items():
|
||||
zpr = z(v["PR"], prs)
|
||||
zent = z(v["entropy"], ents)
|
||||
zbops = z(v["BOPS"], bopss)
|
||||
# Paper: higher PR/entropy = MORE open-ended. Higher BOPS = MORE predictable.
|
||||
# "Constraint" = opposite of openness. C(q) high ⇒ constrained task.
|
||||
# So: C(q) = −z(PR) − z(entropy) + z(BOPS)
|
||||
v["z_PR"] = zpr
|
||||
v["z_entropy"] = zent
|
||||
v["z_BOPS"] = zbops
|
||||
v["C_q"] = -zpr - zent + zbops
|
||||
|
||||
# Sort + print
|
||||
ranked = sorted(per_task.items(), key=lambda kv: -kv[1]["C_q"])
|
||||
print(f"\n{'Task':<38} {'n':>3} {'PR':>5} {'H':>5} {'BOPS':>5} {'C(q)':>6} (constraint level)")
|
||||
print("-" * 78)
|
||||
for task, v in ranked:
|
||||
print(f"{task:<38} {v['n_responses']:>3} {v['PR']:>5.2f} {v['entropy']:>5.2f} "
|
||||
f"{v['BOPS']:>5.2f} {v['C_q']:>+6.2f}")
|
||||
|
||||
out_path = ROOT / "reports" / "constraint_index.json"
|
||||
out_path.parent.mkdir(exist_ok=True)
|
||||
out_path.write_text(json.dumps(per_task, indent=2))
|
||||
print(f"\nWrote: {out_path}")
|
||||
|
||||
# Bucket summary
|
||||
highs = [t for t, v in per_task.items() if v["C_q"] > 0.5]
|
||||
lows = [t for t, v in per_task.items() if v["C_q"] < -0.5]
|
||||
mids = [t for t, v in per_task.items() if -0.5 <= v["C_q"] <= 0.5]
|
||||
print(f"\nHigh-constraint (C>+0.5): {len(highs)} tasks (responses converge)")
|
||||
print(f"Mid: {len(mids)} tasks")
|
||||
print(f"Low-constraint (C<-0.5): {len(lows)} tasks (responses diverge — open-ended)")
|
||||
args.reports_dir.mkdir(parents=True, exist_ok=True)
|
||||
out_path = args.reports_dir / "constraint_index.json"
|
||||
out_path.write_text(json.dumps(per_task, indent=2), encoding="utf-8")
|
||||
print(f"Wrote: {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
@@ -1,221 +1,144 @@
|
||||
"""Assemble a combined dynamical-systems report integrating:
|
||||
- Constraint Index C(q) per task
|
||||
- Regime classification per run
|
||||
- Seed vs capability variance
|
||||
- Survival / hazard analysis
|
||||
#!/usr/bin/env python3
|
||||
"""Assemble a combined posterior dynamical-systems markdown report.
|
||||
|
||||
Requires: reports/constraint_index.json, reports/regimes.json,
|
||||
reports/variance_decomposition.json, reports/survival_analysis.json
|
||||
Inputs:
|
||||
- constraint_index.json
|
||||
- regimes.json
|
||||
- variance_decomposition.json
|
||||
- survival_analysis.json
|
||||
- snr_weighted_ranking.json (optional)
|
||||
|
||||
Output: reports/EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md
|
||||
Output:
|
||||
- EVAL_REPORT_DYNAMICAL.md
|
||||
|
||||
The goal is to keep a compact human-readable summary next to the machine
|
||||
outputs produced by the posterior analysis pipeline.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
from collections import Counter, defaultdict
|
||||
from pathlib import Path
|
||||
from statistics import mean
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
REPORTS = ROOT / "reports"
|
||||
|
||||
MODEL_MAP = {
|
||||
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
|
||||
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
|
||||
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
|
||||
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
|
||||
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
|
||||
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
|
||||
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
|
||||
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
|
||||
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
|
||||
}
|
||||
def _read_json(path: Path):
|
||||
if not path.exists():
|
||||
raise SystemExit(f"Missing required report file: {path}")
|
||||
return json.loads(path.read_text(encoding="utf-8"))
|
||||
|
||||
|
||||
def main() -> None:
|
||||
cq = json.loads((REPORTS / "constraint_index.json").read_text())
|
||||
regimes = json.loads((REPORTS / "regimes.json").read_text())
|
||||
variance = json.loads((REPORTS / "variance_decomposition.json").read_text())
|
||||
survival = json.loads((REPORTS / "survival_analysis.json").read_text())
|
||||
|
||||
lines = []
|
||||
L = lines.append
|
||||
L("# ClawBench — Dynamical Systems Analysis (v2026-4-19-full)")
|
||||
L("")
|
||||
L("Inspired by *\"When LLMs Are Dreaming, Where Do They Go?\"* — treats")
|
||||
L("agent runs as dynamical systems and extracts signal ClawBench's flat")
|
||||
L("run_score can't: task constraint level, per-run regime, noise vs")
|
||||
L("signal ratio, and per-turn survival curves.")
|
||||
L("")
|
||||
|
||||
# ----------------- 1. Constraint Index summary -----------------
|
||||
L("## 1. Constraint Index C(q) per task")
|
||||
L("")
|
||||
L("C(q) = −z(PR) − z(entropy) + z(BOPS). High C(q) = task is constrained")
|
||||
L("(responses converge); low C(q) = open-ended (responses diverge).")
|
||||
L("")
|
||||
high = sorted([(t, v) for t, v in cq.items() if v["C_q"] > 0.5],
|
||||
key=lambda kv: -kv[1]["C_q"])
|
||||
low = sorted([(t, v) for t, v in cq.items() if v["C_q"] < -0.5],
|
||||
key=lambda kv: kv[1]["C_q"])
|
||||
mid = [t for t, v in cq.items() if -0.5 <= v["C_q"] <= 0.5]
|
||||
L(f"- **High-constraint ({len(high)} tasks, C>+0.5):** {', '.join(t for t, _ in high[:5])}, …")
|
||||
L(f"- **Low-constraint ({len(low)} tasks, C<−0.5):** {', '.join(t for t, _ in low[:5])}, …")
|
||||
L(f"- **Middle ({len(mid)} tasks):** {', '.join(mid[:5])}, …")
|
||||
L("")
|
||||
L("Top 5 most-constrained and most-divergent tasks:")
|
||||
L("")
|
||||
L("| Constraint | Task | PR | Entropy | BOPS | C(q) |")
|
||||
L("|---|---|:---:|:---:|:---:|:---:|")
|
||||
for t, v in high[:5]:
|
||||
L(f"| HIGH | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
|
||||
for t, v in low[:5]:
|
||||
L(f"| LOW | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
|
||||
L("")
|
||||
|
||||
# ----------------- 2. Regime distribution -----------------
|
||||
L("## 2. Dynamical regime per run")
|
||||
L("")
|
||||
L("Each run's turn-by-turn trajectory classified by drift, recurrence,")
|
||||
L("and support volume thresholds (quartile-based).")
|
||||
L("")
|
||||
pm = defaultdict(Counter)
|
||||
for key, v in regimes.items():
|
||||
model_sub = key.split("/")[0]
|
||||
# Reverse-map to label
|
||||
label = next((l for l, (s, _) in MODEL_MAP.items() if s == model_sub), None)
|
||||
if label:
|
||||
pm[label][v["regime"]] += 1
|
||||
L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
|
||||
L("|---|:---:|:---:|:---:|:---:|:---:|")
|
||||
for label, (_sub, pretty) in MODEL_MAP.items():
|
||||
c = pm[label]
|
||||
L(f"| {pretty} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
|
||||
f"{c['diffusive']} | {c['mixed']} |")
|
||||
L("")
|
||||
L("**Interpretation:**")
|
||||
L("- `trapped` = low drift + small support: agent converges to a point.")
|
||||
L(" Often good on constrained tasks, sometimes 'stuck'.")
|
||||
L("- `limit_cycle` = repeats similar states non-consecutively: tool-use loop.")
|
||||
L("- `diffusive` = keeps exploring without converging. Goal drift risk.")
|
||||
L("- `mixed` = no strong signature.")
|
||||
L("")
|
||||
L("Notable findings:")
|
||||
L("")
|
||||
# Find outliers
|
||||
trap_counts = [(label, pm[label]["trapped"]) for label in MODEL_MAP]
|
||||
cycle_counts = [(label, pm[label]["limit_cycle"]) for label in MODEL_MAP]
|
||||
trap_counts.sort(key=lambda x: -x[1])
|
||||
cycle_counts.sort(key=lambda x: -x[1])
|
||||
L(f"- Most `trapped` runs: **{MODEL_MAP[trap_counts[0][0]][1]}** ({trap_counts[0][1]} runs) —")
|
||||
L(f" converges aggressively; often one-shot answer without iteration.")
|
||||
L(f"- Most `limit_cycle` runs: **{MODEL_MAP[cycle_counts[0][0]][1]}** ({cycle_counts[0][1]} runs) —")
|
||||
L(f" repeats tool patterns between turns; check for productive vs stuck loops.")
|
||||
L("")
|
||||
|
||||
# ----------------- 3. Variance decomposition -----------------
|
||||
L("## 3. Seed-noise vs capability-signal")
|
||||
L("")
|
||||
agg = variance["aggregate"]
|
||||
L(f"- **Seed-noise variance** (same model, 3 runs): **{agg['mean_seed_var']:.4f}**")
|
||||
L(f"- **Capability variance** (across models): **{agg['mean_cap_var']:.4f}**")
|
||||
L(f"- **Capability fraction: {agg['capability_fraction']:.1%}**")
|
||||
L(f" (= fraction of benchmark variance that reflects real model differences)")
|
||||
L("")
|
||||
L("**The other ~47% is seed noise.** Any ranking gap < √(2·seed_var) ≈")
|
||||
L(f"0.20 between two models is within noise. Top-5 models' gap is 0.02 →")
|
||||
L("**statistically indistinguishable.**")
|
||||
L("")
|
||||
L("### SNR tiers across 40 tasks")
|
||||
L("")
|
||||
per_task = variance["per_task"]
|
||||
hi = [r for r in per_task if r["snr"] >= 5]
|
||||
mid = [r for r in per_task if 1 <= r["snr"] < 5]
|
||||
lo = [r for r in per_task if r["snr"] < 1]
|
||||
L(f"- **High-SNR ({len(hi)} tasks, SNR ≥ 5):** reliably discriminate models")
|
||||
for r in hi[:3]:
|
||||
L(f" - `{r['task']}` (SNR={r['snr']:.1f})")
|
||||
L(f"- **Mid-SNR ({len(mid)} tasks, 1 ≤ SNR < 5):** moderate signal")
|
||||
L(f"- **Low-SNR ({len(lo)} tasks, SNR < 1):** seed noise dominates; these")
|
||||
L(f" tasks give essentially random rankings")
|
||||
for r in sorted(lo, key=lambda x: x['snr'])[:3]:
|
||||
L(f" - `{r['task']}` (SNR={r['snr']:.2f}) — random")
|
||||
L("")
|
||||
|
||||
# ----------------- 4. Survival analysis -----------------
|
||||
L("## 4. Per-turn survival: when do runs fail?")
|
||||
L("")
|
||||
L("T_F = first turn where agent emits empty response or run ends in failure.")
|
||||
L("S(t) = fraction of runs still on-track past turn t. Low = dies early.")
|
||||
L("")
|
||||
L("| Model | Median fail turn | S(3) | S(5) | S(8) | S(12) | S(20) |")
|
||||
L("|---|:---:|:---:|:---:|:---:|:---:|:---:|")
|
||||
for label, (_sub, pretty) in MODEL_MAP.items():
|
||||
d = survival.get(label, {})
|
||||
surv = d.get("survival", [0]*20)
|
||||
med = d.get("median_fail_turn", "—")
|
||||
med_str = f"{med:.1f}" if isinstance(med, (int, float)) and med != float("inf") else str(med)
|
||||
L(f"| {pretty} | {med_str} | {surv[2]:.2f} | {surv[4]:.2f} | "
|
||||
f"{surv[7]:.2f} | {surv[11]:.2f} | {surv[19]:.2f} |")
|
||||
L("")
|
||||
# Narrative
|
||||
surv_rank_t8 = sorted(
|
||||
[(label, survival[label]["survival"][7])
|
||||
for label in MODEL_MAP if label in survival],
|
||||
key=lambda x: -x[1]
|
||||
parser = argparse.ArgumentParser(description="Generate a combined dynamical report markdown")
|
||||
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
|
||||
parser.add_argument(
|
||||
"--output",
|
||||
type=Path,
|
||||
default=None,
|
||||
help="Markdown output path; defaults to <reports-dir>/EVAL_REPORT_DYNAMICAL.md",
|
||||
)
|
||||
best = MODEL_MAP[surv_rank_t8[0][0]][1]
|
||||
worst = MODEL_MAP[surv_rank_t8[-1][0]][1]
|
||||
L(f"- **{best}** survives longest — {surv_rank_t8[0][1]:.0%} of runs still")
|
||||
L(f" producing output at turn 8.")
|
||||
L(f"- **{worst}** dies earliest — only {surv_rank_t8[-1][1]:.0%} make it to turn 8.")
|
||||
args = parser.parse_args()
|
||||
|
||||
reports = args.reports_dir
|
||||
output_path = args.output or (reports / "EVAL_REPORT_DYNAMICAL.md")
|
||||
cq = _read_json(reports / "constraint_index.json")
|
||||
regimes = _read_json(reports / "regimes.json")
|
||||
variance = _read_json(reports / "variance_decomposition.json")
|
||||
survival = _read_json(reports / "survival_analysis.json")
|
||||
ranking_path = reports / "snr_weighted_ranking.json"
|
||||
ranking = json.loads(ranking_path.read_text(encoding="utf-8")) if ranking_path.exists() else None
|
||||
|
||||
lines: list[str] = []
|
||||
L = lines.append
|
||||
|
||||
L("# ClawBench Posterior Dynamical Report")
|
||||
L("")
|
||||
L("This is signal invisible in flat run_score: two models can score")
|
||||
L("similarly but have very different failure profiles. Pick accordingly")
|
||||
L("for long-horizon deployments.")
|
||||
L("This report combines posterior-only diagnostics from cached run artifacts.")
|
||||
L("")
|
||||
|
||||
# ----------------- 5. Integrated view -----------------
|
||||
L("## 5. Integrated view — combining all four lenses")
|
||||
L("## 1. Constraint Index C(q)")
|
||||
L("")
|
||||
L("For a model to be **reliably good** at a task, we need:")
|
||||
L("- (a) It scores well (run_score high)")
|
||||
L("- (b) Variance across seeds is low (predictable)")
|
||||
L("- (c) It doesn't exhibit pathological regime (trapped on wrong answer / cycling)")
|
||||
L("- (d) It survives multi-turn without dying early")
|
||||
values = [(task, float(data.get("C_q", 0.0))) for task, data in cq.items()]
|
||||
values.sort(key=lambda row: row[1], reverse=True)
|
||||
highs = [row for row in values if row[1] > 0.5]
|
||||
lows = [row for row in values if row[1] < -0.5]
|
||||
L(f"- High-constraint tasks (C > 0.5): {len(highs)}")
|
||||
L(f"- Low-constraint tasks (C < -0.5): {len(lows)}")
|
||||
L("")
|
||||
L("These lenses disagree constructively:")
|
||||
if values:
|
||||
L("Top tasks by C(q):")
|
||||
L("")
|
||||
L("| Task | C(q) |")
|
||||
L("|---|---:|")
|
||||
for task, c_q in values[:10]:
|
||||
L(f"| {task} | {c_q:+.3f} |")
|
||||
L("")
|
||||
|
||||
L("## 2. Regime Classification")
|
||||
L("")
|
||||
L("- **Opus 4.6** tops flat run_score but median failure at turn 5.5 (earlier than Opus 4.7's 7).")
|
||||
L("- **GPT 5.4** is mid-pack on flat score but has highest S(8)=0.60 — long-horizon champion.")
|
||||
L("- **Sonnet 4.6** most `trapped` runs — it commits early and sticks. Good on")
|
||||
L(" constrained tasks, bad on open-ended (cf. memory-recall-continuation 0.15).")
|
||||
L("- **GLM 5.1** most balanced regime distribution; justifies broad performance.")
|
||||
L("- **Kimi K2.5** median fail at turn 3 — it's not just low-scoring, it's")
|
||||
L(" specifically fragile under multi-turn execution.")
|
||||
by_model = defaultdict(Counter)
|
||||
for key, row in regimes.items():
|
||||
model = key.split("/")[0]
|
||||
regime = row.get("regime", "unknown")
|
||||
by_model[model][regime] += 1
|
||||
|
||||
L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
|
||||
L("|---|---:|---:|---:|---:|---:|")
|
||||
for model in sorted(by_model):
|
||||
c = by_model[model]
|
||||
L(
|
||||
f"| {model} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
|
||||
f"{c['diffusive']} | {c['mixed']} |"
|
||||
)
|
||||
L("")
|
||||
|
||||
# ----------------- 6. What to do next -----------------
|
||||
L("## 6. Implications for the benchmark")
|
||||
L("## 3. Variance Decomposition")
|
||||
L("")
|
||||
agg = variance.get("aggregate", {})
|
||||
L(f"- Mean seed variance: {agg.get('mean_seed_var', 0.0):.6f}")
|
||||
L(f"- Mean capability variance: {agg.get('mean_cap_var', 0.0):.6f}")
|
||||
L(f"- Capability fraction: {agg.get('capability_fraction', 0.0):.1%}")
|
||||
L(f"- High-SNR tasks: {agg.get('high_snr_tasks', 0)}")
|
||||
L(f"- Mid-SNR tasks: {agg.get('mid_snr_tasks', 0)}")
|
||||
L(f"- Low-SNR tasks: {agg.get('low_snr_tasks', 0)}")
|
||||
L("")
|
||||
L("- **47% seed noise** means any gap < 0.02 is meaningless. Treat top-5")
|
||||
L(" as a statistical tie. Dropping the 21 low-SNR tasks would sharpen")
|
||||
L(" remaining rankings considerably.")
|
||||
L("- **Weight tasks by SNR × |C(q)|** instead of flat mean. High-SNR,")
|
||||
L(" high-|C(q)| tasks give the cleanest capability signal.")
|
||||
L("- **Report survival curves alongside run_score** to surface long-horizon")
|
||||
L(" capability that single-number metrics hide.")
|
||||
L("- **Flag 'trapped' runs that scored high** — the model may have")
|
||||
L(" guessed-and-committed rather than reasoned; not same reliability.")
|
||||
L("- **Add a Tier 6 long-horizon (100+ turn) task set** to actually")
|
||||
L(" measure the dynamical regimes the paper proposes — current")
|
||||
L(" trajectories are too short (median 6 assistant turns) for clean")
|
||||
L(" Lyapunov or attractor diagnostics.")
|
||||
|
||||
out = REPORTS / "EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md"
|
||||
out.write_text("\n".join(lines) + "\n")
|
||||
print(f"Wrote: {out}")
|
||||
L("## 4. Survival Analysis")
|
||||
L("")
|
||||
L("| Model | Runs | Events | Median failure turn | S(3) | S(5) | S(8) |")
|
||||
L("|---|---:|---:|---:|---:|---:|---:|")
|
||||
for model in sorted(survival):
|
||||
row = survival[model]
|
||||
surv = row.get("survival", [0.0] * 8)
|
||||
med = row.get("median_fail_turn", "inf")
|
||||
if isinstance(med, float) and med == float("inf"):
|
||||
med_display = "inf"
|
||||
else:
|
||||
med_display = f"{float(med):.1f}"
|
||||
L(
|
||||
f"| {model} | {row.get('n_runs', 0)} | {row.get('n_events', 0)} | "
|
||||
f"{med_display} | {surv[2] if len(surv) > 2 else 0.0:.2f} | "
|
||||
f"{surv[4] if len(surv) > 4 else 0.0:.2f} | {surv[7] if len(surv) > 7 else 0.0:.2f} |"
|
||||
)
|
||||
L("")
|
||||
|
||||
if ranking is not None:
|
||||
L("## 5. SNR-weighted Ranking")
|
||||
L("")
|
||||
L("| Rank | Model | Flat | SNR x |C(q)| | Winsorized | Coverage |")
|
||||
L("|---:|---|---:|---:|---:|---:|")
|
||||
for idx, row in enumerate(ranking.get("results", []), start=1):
|
||||
L(
|
||||
f"| {idx} | {row.get('model', '')} | {row.get('flat', 0.0):.4f} | "
|
||||
f"{row.get('snr_x_abs_cq', 0.0):.4f} | {row.get('snr_x_abs_cq_winsorized', 0.0):.4f} | "
|
||||
f"{row.get('coverage', 0)} |"
|
||||
)
|
||||
L("")
|
||||
|
||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
|
||||
print(f"Wrote: {output_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
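The report strings above define C(q) = -z(PR) - z(entropy) + z(BOPS) and bucket tasks at the +0.5 and -0.5 thresholds. A minimal sketch of that combination follows. It assumes the z-scores are taken across tasks with a population standard deviation, which the diff does not confirm. The task names and numbers are invented for illustration, and `zscores` and `constraint_index` are hypothetical helper names rather than functions from the repository.

```python
from statistics import mean, pstdev


def zscores(values: list[float]) -> list[float]:
    """Standardize values across tasks; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population std, not confirmed by the source
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def constraint_index(per_task: dict[str, dict[str, float]]) -> dict[str, float]:
    """C(q) = -z(PR) - z(entropy) + z(BOPS), with z-scores taken across tasks."""
    tasks = sorted(per_task)
    z_pr = zscores([per_task[t]["PR"] for t in tasks])
    z_h = zscores([per_task[t]["entropy"] for t in tasks])
    z_bops = zscores([per_task[t]["BOPS"] for t in tasks])
    return {t: -a - b + c for t, a, b, c in zip(tasks, z_pr, z_h, z_bops)}


# Invented toy inputs: low PR/entropy and high BOPS should read as "constrained".
toy = {
    "task_tight": {"PR": 1.2, "entropy": 0.4, "BOPS": 0.9},
    "task_mid": {"PR": 2.5, "entropy": 1.1, "BOPS": 0.6},
    "task_open": {"PR": 4.1, "entropy": 1.9, "BOPS": 0.3},
}
c_q = constraint_index(toy)
for task, value in sorted(c_q.items(), key=lambda kv: -kv[1]):
    bucket = "high" if value > 0.5 else "low" if value < -0.5 else "middle"
    print(f"{task}: C(q) = {value:+.2f} ({bucket})")
```

Because each z-scored column sums to zero, the C(q) values also sum to zero across tasks, so the index ranks tasks relative to one another rather than on an absolute scale.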
89
scripts/run_posterior_dynamics_pipeline.py
Normal file
@ -0,0 +1,89 @@
|
||||
#!/usr/bin/env python3
"""Run the full posterior dynamical analysis pipeline."""

from __future__ import annotations

import argparse
import subprocess
import sys
from pathlib import Path


REPO_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(REPO_ROOT))

from clawbench.dynamics_archive import discover_model_roots, load_task_runs_archive, write_dynamics_report


def _run(cmd: list[str]) -> None:
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, cwd=REPO_ROOT)
    if result.returncode != 0:
        raise SystemExit(result.returncode)


def _resolve_path(path: Path) -> Path:
    return path if path.is_absolute() else (REPO_ROOT / path)


def _write_dynamics_reports(
    archive_dir: Path,
    output_dir: Path,
    tier: str | None,
) -> None:
    roots = discover_model_roots(archive_dir)
    if not roots:
        raise SystemExit(f"No cached runs found under {archive_dir}")

    multiple_models = len(roots) > 1
    wrote_any = False
    for model_name, model_dir in roots.items():
        task_runs = load_task_runs_archive(model_dir, tier=tier)
        if not task_runs:
            continue

        wrote_any = True
        model_output_dir = output_dir / model_name if multiple_models else output_dir
        report_path, plots = write_dynamics_report(task_runs, model_output_dir)
        n_runs = sum(len(runs) for runs in task_runs.values())

        print(f"[dynamics] {model_name}: loaded {n_runs} cached runs across {len(task_runs)} tasks")
        print(f"[dynamics] {model_name}: wrote {report_path}")
        print(f"[dynamics] {model_name}: saved {len(plots)} plots to {model_output_dir}/")

    if not wrote_any:
        raise SystemExit(f"No cached runs found under {archive_dir}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Run posterior dynamics pipeline end to end")
    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
    parser.add_argument("--output-dir", type=Path, default=Path("results/posterior_dynamics"))
    parser.add_argument(
        "--include-dynamics-report",
        action="store_true",
        help="Also build per-model dynamics.json files and plots from the archive.",
    )
    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
    args = parser.parse_args()

    py = sys.executable
    archive_dir = _resolve_path(args.archive_dir)
    reports_dir = _resolve_path(args.reports_dir)
    output_dir = _resolve_path(args.output_dir)
    tier_args = ["--tier", args.tier] if args.tier else []
    scripts_dir = REPO_ROOT / "scripts"

    _run([py, str(scripts_dir / "compute_constraint_index.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
    _run([py, str(scripts_dir / "classify_regimes.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
    _run([py, str(scripts_dir / "variance_decomp.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
    _run([py, str(scripts_dir / "survival_analysis.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
    _run([py, str(scripts_dir / "snr_weighted_ranking.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
    _run([py, str(scripts_dir / "generate_dynamical_report.py"), "--reports-dir", str(reports_dir)])
    if args.include_dynamics_report:
        _write_dynamics_reports(archive_dir, output_dir, args.tier)


if __name__ == "__main__":
    main()
|
||||
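The pipeline script above runs six stage scripts in a fixed order and forwards `--archive-dir`, `--reports-dir`, and the optional `--tier` flag to the first five. The sketch below rebuilds only that argument-composition step so the ordering and flag forwarding can be inspected without launching any subprocess. `build_stage_commands` is a hypothetical helper introduced for illustration; the script names and flags are the ones that appear in the diff.

```python
import sys
from pathlib import Path
from typing import Optional


def build_stage_commands(
    py: str,
    scripts_dir: Path,
    archive_dir: Path,
    reports_dir: Path,
    tier: Optional[str] = None,
) -> list[list[str]]:
    """Return one argv list per pipeline stage, in execution order."""
    tier_args = ["--tier", tier] if tier else []
    # Stages that read cached runs from the archive and write JSON reports.
    archive_stages = [
        "compute_constraint_index.py",
        "classify_regimes.py",
        "variance_decomp.py",
        "survival_analysis.py",
        "snr_weighted_ranking.py",
    ]
    commands = [
        [py, str(scripts_dir / name), "--archive-dir", str(archive_dir),
         "--reports-dir", str(reports_dir), *tier_args]
        for name in archive_stages
    ]
    # The markdown report generator only reads the JSON reports,
    # so it receives neither the archive path nor the tier filter.
    commands.append([py, str(scripts_dir / "generate_dynamical_report.py"),
                     "--reports-dir", str(reports_dir)])
    return commands


commands = build_stage_commands(
    sys.executable, Path("scripts"), Path(".clawbench/run_cache"), Path("reports"), tier="tier2"
)
for cmd in commands:
    print("$", " ".join(cmd))
```

The order matters because `snr_weighted_ranking.py` exits early when `constraint_index.json` or `variance_decomposition.json` is missing, so the scripts that produce those files have to run before it.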
@ -1,148 +1,130 @@
|
||||
"""SNR × |C(q)|-weighted ranking — the dynamical-systems-informed metric.
|
||||
#!/usr/bin/env python3
|
||||
"""SNR x |C(q)| weighted ranking from posterior cached runs.
|
||||
|
||||
Motivation: from variance_decomp.py we know 47% of run_score variance is
|
||||
seed noise. From compute_constraint_index.py we know some tasks are
|
||||
high-constraint (everyone converges) and others are open-ended (responses
|
||||
diverge for style reasons, not capability).
|
||||
Weighted headline score:
|
||||
|
||||
Weighted mean:
|
||||
w(task) = SNR(task) × |C(q)(task)|
|
||||
score(model) = Σ_task w(task) · mean_run_score(task, model) / Σ_task w(task)
|
||||
w(q) = max(0, SNR(q)) * |C(q)|
|
||||
score(model) = sum_q w(q) * mean_run_score(model, q) / sum_q w(q)
|
||||
|
||||
Why:
|
||||
- High SNR tasks contribute more than low-SNR tasks (noise-weighted)
|
||||
- |C(q)| amplifies tasks that are either strongly constrained OR strongly
|
||||
open-ended (i.e. measures what they're supposed to measure, regardless
|
||||
of polarity)
|
||||
- Moderate C(q) tasks (C near 0) are inherently ambiguous — down-weighted
|
||||
We also report:
|
||||
|
||||
Outputs:
|
||||
- Per-model weighted score
|
||||
- Comparison against flat-mean ranking
|
||||
- Published to reports/snr_weighted_ranking.json
|
||||
snr_only = SNR-weighted mean
|
||||
snr_x_abs_cq = SNR x |C(q)| weighted mean
|
||||
snr_x_abs_cq_winsorized = same, but top task weights are clamped at p95
|
||||
|
||||
This keeps noisy low-SNR tasks from dominating and upweights tasks whose
|
||||
response geometry suggests a stronger capability signal.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import glob
|
||||
import argparse
|
||||
import json
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from statistics import mean
|
||||
|
||||
import numpy as np
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
|
||||
REPORTS = ROOT / "reports"
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
MODELS = {
|
||||
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
|
||||
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
|
||||
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
|
||||
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
|
||||
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
|
||||
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
|
||||
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
|
||||
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
|
||||
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
|
||||
}
|
||||
from clawbench.dynamics_archive import load_task_runs_by_model
|
||||
|
||||
|
||||
def main() -> None:
|
||||
cq = json.loads((REPORTS / "constraint_index.json").read_text())
|
||||
var = json.loads((REPORTS / "variance_decomposition.json").read_text())
|
||||
snr_by_task = {r["task"]: r["snr"] for r in var["per_task"]}
|
||||
parser = argparse.ArgumentParser(description="Compute SNR-weighted posterior model ranking")
|
||||
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
|
||||
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
|
||||
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Per (model, task): mean run_score over the 3 runs
|
||||
per_mt: dict[str, dict[str, list[float]]] = defaultdict(dict)
|
||||
for label, (sub, _) in MODELS.items():
|
||||
for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
|
||||
try:
|
||||
d = json.loads(Path(p).read_text())
|
||||
except Exception:
|
||||
continue
|
||||
task = p.split("/")[-2]
|
||||
per_mt[label].setdefault(task, []).append(d.get("run_score", 0))
|
||||
per_mt_mean = {
|
||||
m: {t: mean(v) for t, v in d.items() if v} for m, d in per_mt.items()
|
||||
cq_path = args.reports_dir / "constraint_index.json"
|
||||
var_path = args.reports_dir / "variance_decomposition.json"
|
||||
if not cq_path.exists() or not var_path.exists():
|
||||
raise SystemExit("Missing prerequisite reports: run compute_constraint_index.py and variance_decomp.py first.")
|
||||
|
||||
cq = json.loads(cq_path.read_text(encoding="utf-8"))
|
||||
var = json.loads(var_path.read_text(encoding="utf-8"))
|
||||
snr_by_task = {row["task"]: row["snr"] for row in var.get("per_task", [])}
|
||||
|
||||
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
|
||||
if not grouped:
|
||||
raise SystemExit(f"No cached runs found under {args.archive_dir}")
|
||||
|
||||
per_model_task_scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
|
||||
for model_name, task_runs in grouped.items():
|
||||
for task_id, runs in task_runs.items():
|
||||
per_model_task_scores[model_name][task_id] = [float(run.run_score) for run in runs]
|
||||
|
||||
per_model_task_mean = {
|
||||
model_name: {
|
||||
task_id: mean(vals)
|
||||
for task_id, vals in task_scores.items()
|
||||
if vals
|
||||
}
|
||||
for model_name, task_scores in per_model_task_scores.items()
|
||||
}
|
||||
|
||||
# Only consider tasks present in both C(q) and SNR
|
||||
common_tasks = sorted(set(cq) & set(snr_by_task))
|
||||
print(f"Using {len(common_tasks)} tasks with both C(q) and SNR.")
|
||||
if not common_tasks:
|
||||
raise SystemExit("No overlap between constraint_index and variance_decomposition task sets.")
|
||||
|
||||
# Compute weights w(task) = SNR × |C(q)|, clamped to [0, ∞)
|
||||
weights = {}
|
||||
for t in common_tasks:
|
||||
w = max(0.0, snr_by_task[t]) * abs(cq[t]["C_q"])
|
||||
weights[t] = w
|
||||
# Also: SNR-only weighting (simpler, no C(q))
|
||||
snr_weights = {t: max(0.0, snr_by_task[t]) for t in common_tasks}
|
||||
# Also: Winsorize — clamp top-1 task's weight to 95th percentile to
|
||||
# prevent single task from dominating
|
||||
import numpy as _np
|
||||
_w95 = float(_np.percentile(list(weights.values()), 95))
|
||||
weights_wins = {t: min(w, _w95) for t, w in weights.items()}
|
||||
wsum = sum(weights.values())
|
||||
if wsum == 0:
|
||||
print("All weights zero — bail.")
|
||||
return
|
||||
weights = {task: max(0.0, snr_by_task[task]) * abs(cq[task].get("C_q", 0.0)) for task in common_tasks}
|
||||
snr_weights = {task: max(0.0, snr_by_task[task]) for task in common_tasks}
|
||||
|
||||
# Compute per-model scores under 3 variants
|
||||
results = []
|
||||
w95 = float(np.percentile(list(weights.values()), 95)) if weights else 0.0
|
||||
winsorized = {task: min(weight, w95) for task, weight in weights.items()}
|
||||
|
||||
w_sum = sum(weights.values())
|
||||
snr_sum = sum(snr_weights.values())
|
||||
wins_sum = sum(weights_wins.values())
|
||||
for label, (sub, pretty) in MODELS.items():
|
||||
task_means = per_mt_mean.get(label, {})
|
||||
if not task_means:
|
||||
wins_sum = sum(winsorized.values())
|
||||
|
||||
results = []
|
||||
for model_name, task_means in per_model_task_mean.items():
|
||||
covered = [task for task in common_tasks if task in task_means]
|
||||
if not covered:
|
||||
continue
|
||||
num_cq = sum(weights[t] * task_means.get(t, 0) for t in common_tasks)
|
||||
num_snr = sum(snr_weights[t] * task_means.get(t, 0) for t in common_tasks)
|
||||
num_wins = sum(weights_wins[t] * task_means.get(t, 0) for t in common_tasks)
|
||||
wscore = num_cq / wsum
|
||||
snr_only = num_snr / snr_sum if snr_sum > 0 else 0
|
||||
wins_score = num_wins / wins_sum if wins_sum > 0 else 0
|
||||
flat = mean(task_means[t] for t in common_tasks if t in task_means)
|
||||
results.append((label, pretty, flat, wscore, snr_only, wins_score))
|
||||
|
||||
print()
|
||||
print(f"{'Model':<16} {'Flat':>7} {'SNR×|C|':>8} {'Winsorized':>11} {'SNR-only':>9}")
|
||||
print("-" * 66)
|
||||
# Rank by winsorized variant (primary)
|
||||
for label, pretty, flat, w, snr_only, wins in sorted(results, key=lambda x: -x[5]):
|
||||
print(f"{pretty:<16} {flat:>7.4f} {w:>8.4f} {wins:>11.4f} {snr_only:>9.4f}")
|
||||
flat = mean(task_means[task] for task in covered)
|
||||
weighted = (
|
||||
sum(weights[task] * task_means.get(task, 0.0) for task in common_tasks) / w_sum
|
||||
if w_sum > 1e-12
|
||||
else 0.0
|
||||
)
|
||||
snr_only = (
|
||||
sum(snr_weights[task] * task_means.get(task, 0.0) for task in common_tasks) / snr_sum
|
||||
if snr_sum > 1e-12
|
||||
else 0.0
|
||||
)
|
||||
wins_score = (
|
||||
sum(winsorized[task] * task_means.get(task, 0.0) for task in common_tasks) / wins_sum
|
||||
if wins_sum > 1e-12
|
||||
else 0.0
|
||||
)
|
||||
|
||||
# Rank comparisons
|
||||
print("\n=== Ranking shifts vs flat-mean (winsorized) ===")
|
||||
flat_rank_order = sorted(results, key=lambda x: -x[2])
|
||||
flat_rank = {r[0]: i + 1 for i, r in enumerate(flat_rank_order)}
|
||||
wins_rank_order = sorted(results, key=lambda x: -x[5])
|
||||
print(f"{'Rank':<5}{'Model':<16} {'Flat':>8} {'Winsorized':>11} {'Δrank':>6}")
|
||||
for i, (label, pretty, flat, _w, _snr, wins) in enumerate(wins_rank_order, 1):
|
||||
fr = flat_rank[label]
|
||||
move = ""
|
||||
if fr > i: move = f"↑{fr-i}"
|
||||
elif fr < i: move = f"↓{i-fr}"
|
||||
print(f"{i:<5}{pretty:<16} {flat:>8.4f} {wins:>11.4f} {move:>6}")
|
||||
results.append(
|
||||
{
|
||||
"model": model_name,
|
||||
"flat": float(flat),
|
||||
"snr_x_abs_cq": float(weighted),
|
||||
"snr_only": float(snr_only),
|
||||
"snr_x_abs_cq_winsorized": float(wins_score),
|
||||
"coverage": len(covered),
|
||||
}
|
||||
)
|
||||
|
||||
results.sort(key=lambda row: row["snr_x_abs_cq_winsorized"], reverse=True)
|
||||
|
||||
# Save
|
||||
out = {
|
||||
"flat_score": {r[0]: r[2] for r in results},
|
||||
"snr_x_cq_weighted": {r[0]: r[3] for r in results},
|
||||
"snr_x_cq_winsorized": {r[0]: r[5] for r in results},
|
||||
"snr_only_weighted": {r[0]: r[4] for r in results},
|
||||
"weights_per_task": weights,
|
||||
"common_tasks": common_tasks,
|
||||
"weights_per_task": weights,
|
||||
"results": results,
|
||||
}
|
||||
(REPORTS / "snr_weighted_ranking.json").write_text(json.dumps(out, indent=2))
|
||||
print(f"\nWrote reports/snr_weighted_ranking.json")
|
||||
|
||||
# Show top-5 contributing tasks (highest weight) for context
|
||||
print()
|
||||
print("Top-10 tasks by weight (SNR × |C(q)|):")
|
||||
for t, w in sorted(weights.items(), key=lambda kv: -kv[1])[:10]:
|
||||
print(f" {t:<38} SNR={snr_by_task[t]:>5.1f} |C(q)|={abs(cq[t]['C_q']):>5.2f} w={w:>6.2f}")
|
||||
out_path = args.reports_dir / "snr_weighted_ranking.json"
|
||||
out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
|
||||
print(f"Wrote: {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
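The ranking script above weights each task by `max(0, SNR) * |C(q)|`, clamps the weights at their 95th percentile for the winsorized variant, and scores a missing task as 0.0. The following sketch works through those three steps on invented numbers. It uses a small pure-Python percentile helper in place of `np.percentile`, on the assumption that linear interpolation between closest ranks matches NumPy's default method. `percentile_linear` and `weighted_mean` are hypothetical names.

```python
def percentile_linear(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks."""
    ordered = sorted(values)
    if not ordered:
        return 0.0
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def weighted_mean(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Weighted mean over tasks; a task the model never ran contributes 0.0."""
    total = sum(weights.values())
    if total <= 1e-12:
        return 0.0
    return sum(w * scores.get(task, 0.0) for task, w in weights.items()) / total


# Invented per-task inputs for one model.
snr = {"a": 5.0, "b": 2.0, "c": -0.5}
c_q = {"a": 2.0, "b": -1.0, "c": 1.0}
mean_score = {"a": 1.0, "b": 0.5, "c": 0.0}

# Negative SNR is clamped to zero, so task "c" drops out of the weighting.
weights = {t: max(0.0, snr[t]) * abs(c_q[t]) for t in snr}
w95 = percentile_linear(list(weights.values()), 95)
winsorized = {t: min(w, w95) for t, w in weights.items()}

flat = sum(mean_score.values()) / len(mean_score)
weighted = weighted_mean(weights, mean_score)
wins_score = weighted_mean(winsorized, mean_score)
print(f"flat={flat:.4f} weighted={weighted:.4f} winsorized={wins_score:.4f}")
```

In this toy case the dominant task `a` has its weight reduced from 10.0 to roughly 9.2, which pulls the winsorized score slightly toward the other tasks. With only three tasks the clamp is mild; it has more effect when one weight is far above the rest of a longer list.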
@ -1,164 +1,118 @@
|
||||
"""Per-turn survival analysis: when do agent runs fail?
|
||||
#!/usr/bin/env python3
|
||||
"""Per-turn survival analysis on posterior cached runs.
|
||||
|
||||
Following paper §Latent-state survival:
|
||||
T_F = inf { t ≥ 0 : failure at time t }
|
||||
S(t) = P(T_F > t) — survival function
|
||||
h(t) = P(T_F = t | T_F ≥ t) — hazard rate
|
||||
For each run, define a failure time T_F as the first assistant turn where the
|
||||
agent emits neither text nor tool calls, or the final assistant turn of an
|
||||
unsuccessful run with delivery outcome in {fail, partial}.
|
||||
|
||||
For each run, we define FAILURE as the first turn where:
|
||||
(a) the assistant emits no text AND no tool calls, OR
|
||||
(b) the run's delivery_outcome is 'fail'/'partial' AND the transcript
|
||||
ended at this turn (no more assistant turns follow).
|
||||
We then estimate:
|
||||
|
||||
T_F = assistant-turn index of first failure (starting at 1).
|
||||
If the run succeeded (run_score ≥ 0.7), T_F is right-censored at the
|
||||
final turn count N (i.e. survived the whole trajectory).
|
||||
S(t) = P(T_F > t)
|
||||
h(t) = P(T_F = t | T_F >= t)
|
||||
|
||||
Output per model:
|
||||
- Median turn-to-failure
|
||||
- Empirical survival curve S(t) for t = 1..20
|
||||
- Hazard profile h(t)
|
||||
- Stratified by task-constraint bucket (using C(q) from earlier)
|
||||
|
||||
Usage:
|
||||
.venv/bin/python3 scripts/survival_analysis.py
|
||||
This exposes long-horizon fragility that is easy to hide in flat mean scores.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import glob
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
from collections import defaultdict
|
||||
import sys
|
||||
from pathlib import Path
|
||||
from statistics import median
|
||||
|
||||
import numpy as np
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
|
||||
|
||||
MODELS = {
|
||||
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
|
||||
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
|
||||
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
|
||||
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
|
||||
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
|
||||
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
|
||||
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
|
||||
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
|
||||
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
|
||||
}
|
||||
from clawbench.dynamics_archive import load_task_runs_by_model
|
||||
|
||||
SUCCESS_THRESHOLD = 0.7
|
||||
|
||||
|
||||
def assistant_turns(d: dict) -> list[dict]:
|
||||
return [m for m in d.get("transcript", {}).get("messages", [])
|
||||
if m.get("role") == "assistant"]
|
||||
def assistant_turns(run) -> list:
|
||||
return run.transcript.assistant_messages
|
||||
|
||||
|
||||
def find_failure_turn(d: dict) -> tuple[int, bool]:
|
||||
"""Return (T_F, is_event). T_F is 1-indexed turn of failure.
|
||||
|
||||
is_event=True means failure actually happened; False means the run was
|
||||
censored (survived to end without failing).
|
||||
"""
|
||||
turns = assistant_turns(d)
|
||||
def find_failure_turn(run) -> tuple[int, bool]:
|
||||
"""Return (failure_turn, is_event) with 1-indexed assistant turns."""
|
||||
turns = assistant_turns(run)
|
||||
n = len(turns)
|
||||
run_score = d.get("run_score", 0) or 0
|
||||
delivery = d.get("delivery_outcome", "")
|
||||
|
||||
# Scan for first empty-turn
|
||||
for i, t in enumerate(turns, 1):
|
||||
has_text = bool((t.get("text") or "").strip())
|
||||
has_tool_call = bool(t.get("tool_calls"))
|
||||
for idx, turn in enumerate(turns, 1):
|
||||
has_text = bool((turn.text or "").strip())
|
||||
has_tool_call = bool(turn.tool_calls)
|
||||
if not has_text and not has_tool_call:
|
||||
return i, True # failure event
|
||||
return idx, True
|
||||
|
||||
# If run was unsuccessful and ended early, mark last turn as failure
|
||||
if run_score < SUCCESS_THRESHOLD and delivery in ("fail", "partial"):
|
||||
if run.run_score < SUCCESS_THRESHOLD and run.delivery_outcome.value in {"fail", "partial"}:
|
||||
return max(n, 1), True
|
||||
|
||||
# Survived: right-censored at n
|
||||
return max(n, 1), False
|
||||
|
||||
|
||||
def empirical_survival(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
|
||||
"""Kaplan-Meier-like survival curve, non-parametric.
|
||||
|
||||
S(t) = fraction of runs that survived past turn t.
|
||||
"""
|
||||
survival = []
|
||||
"""Empirical survival curve S(t) over assistant-turn index."""
|
||||
total = len(times_events)
|
||||
if total == 0:
|
||||
return [0.0] * max_t
|
||||
|
||||
survival = []
|
||||
for t in range(1, max_t + 1):
|
||||
# Survived past t = either censored at ≥t or event at >t
|
||||
survived = sum(1 for tf, is_event in times_events
|
||||
if (not is_event and tf >= t) or (is_event and tf > t))
|
||||
survival.append(survived / total if total > 0 else 0.0)
|
||||
survived = sum(
|
||||
1
|
||||
for tf, is_event in times_events
|
||||
if (not is_event and tf >= t) or (is_event and tf > t)
|
||||
)
|
||||
survival.append(survived / total)
|
||||
return survival
|
||||
|
||||
|
||||
def hazard(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
|
||||
"""Hazard rate h(t) = events at t / at-risk at t."""
|
||||
h = []
|
||||
"""Discrete hazard h(t) = events_at_t / at_risk_at_t."""
|
||||
hazard_vals = []
|
||||
for t in range(1, max_t + 1):
|
||||
at_risk = sum(1 for tf, _ in times_events if tf >= t)
|
||||
events_at_t = sum(1 for tf, is_event in times_events
|
||||
if is_event and tf == t)
|
||||
h.append(events_at_t / at_risk if at_risk > 0 else 0.0)
|
||||
return h
|
||||
events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t)
|
||||
hazard_vals.append(events_at_t / at_risk if at_risk > 0 else 0.0)
|
||||
return hazard_vals
|
||||
|
||||
|
||||
def main() -> None:
|
||||
per_model: dict[str, list[tuple[int, bool]]] = defaultdict(list)
|
||||
for label, (sub, _) in MODELS.items():
|
||||
for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
|
||||
try:
|
||||
d = json.loads(Path(p).read_text())
|
||||
except Exception:
|
||||
continue
|
||||
tf, is_event = find_failure_turn(d)
|
||||
per_model[label].append((tf, is_event))
|
||||
parser = argparse.ArgumentParser(description="Survival analysis on cached runs")
|
||||
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
|
||||
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
|
||||
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
|
||||
parser.add_argument("--max-turn", type=int, default=20)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Load C(q) to stratify
|
||||
cq_path = ROOT / "reports" / "constraint_index.json"
|
||||
cq_by_task = {}
|
||||
if cq_path.exists():
|
||||
cq = json.loads(cq_path.read_text())
|
||||
cq_by_task = {t: v["C_q"] for t, v in cq.items()}
|
||||
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
|
||||
if not grouped:
|
||||
raise SystemExit(f"No cached runs found under {args.archive_dir}")
|
||||
|
||||
# Print summary
|
||||
print(f"{'Model':<14} {'n_runs':>6} {'events':>6} {'med_tf':>8} "
|
||||
f"{'S(3)':>6} {'S(5)':>6} {'S(8)':>6} {'S(12)':>6} {'S(20)':>6}")
|
||||
print("-" * 90)
|
||||
out = {}
|
||||
for label, (_sub, pretty) in MODELS.items():
|
||||
evs = per_model[label]
|
||||
n = len(evs)
|
||||
n_events = sum(1 for _, e in evs if e)
|
||||
tfs_events = [tf for tf, e in evs if e]
|
||||
med = median(tfs_events) if tfs_events else float("inf")
|
||||
surv = empirical_survival(evs, max_t=20)
|
||||
haz = hazard(evs, max_t=20)
|
||||
print(f"{pretty:<14} {n:>6} {n_events:>6} {med:>8.1f} "
|
||||
f"{surv[2]:>6.2f} {surv[4]:>6.2f} {surv[7]:>6.2f} "
|
||||
f"{surv[11]:>6.2f} {surv[19]:>6.2f}")
|
||||
out[label] = {
|
||||
"pretty": pretty,
|
||||
"n_runs": n,
|
||||
for model_name, task_runs in grouped.items():
|
||||
events = []
|
||||
for runs in task_runs.values():
|
||||
for run in runs:
|
||||
events.append(find_failure_turn(run))
|
||||
|
||||
n_runs = len(events)
|
||||
n_events = sum(1 for _, is_event in events if is_event)
|
||||
event_times = [t for t, is_event in events if is_event]
|
||||
med = median(event_times) if event_times else float("inf")
|
||||
|
||||
out[model_name] = {
|
||||
"pretty": model_name,
|
||||
"n_runs": n_runs,
|
||||
"n_events": n_events,
|
||||
"median_fail_turn": med,
|
||||
"survival": surv,
|
||||
"hazard": haz,
|
||||
"survival": empirical_survival(events, max_t=args.max_turn),
|
||||
"hazard": hazard(events, max_t=args.max_turn),
|
||||
}
|
||||
|
||||
print("\n(Interpretation: S(t) = fraction of runs still on-track past turn t.")
|
||||
print(" Lower values = more frequent early failure.)")
|
||||
|
||||
out_path = ROOT / "reports" / "survival_analysis.json"
|
||||
out_path.write_text(json.dumps(out, indent=2))
|
||||
print(f"\nWrote: {out_path}")
|
||||
args.reports_dir.mkdir(parents=True, exist_ok=True)
|
||||
out_path = args.reports_dir / "survival_analysis.json"
|
||||
out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
|
||||
print(f"Wrote: {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
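A note on the survival rule above: `empirical_survival` keeps every run in the denominator, so a censored run stops counting as "survived" once `t` exceeds its length. The sketch below compares that rule with a product-limit (Kaplan-Meier) estimate on hypothetical data. The `product_limit` helper and the sample `runs` list are illustrative assumptions, not part of this commit.

```python
def empirical_survival(times_events, max_t):
    """Script's rule: share of all runs counted as surviving past turn t."""
    total = len(times_events)
    if total == 0:
        return [0.0] * max_t
    curve = []
    for t in range(1, max_t + 1):
        survived = sum(
            1
            for tf, is_event in times_events
            if (not is_event and tf >= t) or (is_event and tf > t)
        )
        curve.append(survived / total)
    return curve


def product_limit(times_events, max_t):
    """Hypothetical helper: S(t) = prod over s <= t of (1 - d_s / n_s)."""
    curve, s = [], 1.0
    for t in range(1, max_t + 1):
        # Runs whose recorded turn is >= t are still at risk at turn t.
        at_risk = sum(1 for tf, _ in times_events if tf >= t)
        events = sum(1 for tf, is_event in times_events if is_event and tf == t)
        if at_risk > 0:
            s *= 1.0 - events / at_risk
        curve.append(s)
    return curve


# Hypothetical (turn, is_event) pairs: two failures and two censored runs.
runs = [(2, True), (3, False), (4, True), (5, False)]
emp = empirical_survival(runs, max_t=5)
km = product_limit(runs, max_t=5)
print(emp)
print(km)
```

The two curves agree until the first censored run drops out. After that point the script's rule reads lower, because the censored run is treated like a failure rather than being removed from the at-risk set.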
@ -1,132 +1,118 @@
|
||||
"""Decompose run_score variance into seed-noise vs capability-signal.
|
||||
#!/usr/bin/env python3
|
||||
"""Decompose posterior run_score variance into seed noise and capability signal.
|
||||
|
||||
Each task has 3 runs per model (same prompt, different random seed).
|
||||
σ²_seed(task, model) = variance across the 3 runs of (task, model)
|
||||
σ²_capability(task) = variance across model means for the task
|
||||
Each task has repeated runs per model.
|
||||
|
||||
sigma^2_seed(task, model) = variance across repeated runs for one model
|
||||
sigma^2_capability(task) = variance across model means for that task
|
||||
|
||||
Signal-to-noise ratio per task:
|
||||
SNR(task) = σ²_capability / σ²_seed
|
||||
|
||||
High SNR → differences between models on this task are REAL (not noise).
|
||||
Low SNR → the 3-run variance per model is so large that cross-model gaps
|
||||
are indistinguishable from seed noise. These tasks don't
|
||||
discriminate models reliably.
|
||||
SNR(task) = sigma^2_capability / mean_model sigma^2_seed
|
||||
|
||||
Aggregated over all 40 tasks, we also decompose TOTAL variance:
|
||||
total_var = mean_capability_var + mean_seed_var
|
||||
capability_fraction = mean_capability_var / total_var
|
||||
High SNR means cross-model differences are likely real. Low SNR means the
|
||||
benchmark signal is dominated by run-to-run variance rather than capability.
|
||||
|
||||
This answers "what fraction of the benchmark signal is real model
|
||||
capability vs. run-to-run luck?"
|
||||
Aggregate decomposition:
|
||||
|
||||
Usage:
|
||||
.venv/bin/python3 scripts/variance_decomp.py
|
||||
total_var = mean_task seed_var + mean_task cap_var
|
||||
capability_fraction = mean_task cap_var / total_var
|
||||
|
||||
This script keeps the posterior/archive-based workflow used by the current
|
||||
pipeline, but the statistical meaning is the same as the earlier analysis.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import glob
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from statistics import mean, variance
|
||||
|
||||
import numpy as np
|
||||
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
|
||||
|
||||
ROOT = Path(__file__).resolve().parent.parent
|
||||
ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
|
||||
|
||||
MODELS = {
|
||||
"opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
|
||||
"opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
|
||||
"sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
|
||||
"gpt54": ("openai_gpt-5.4", "GPT 5.4"),
|
||||
"gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
|
||||
"glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
|
||||
"minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
|
||||
"kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
|
||||
"qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
|
||||
}
|
||||
from clawbench.dynamics_archive import load_task_runs_by_model
|
||||
|
||||
|
||||
def main() -> None:
|
||||
# {task: {model: [run_scores]}}
|
||||
scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
|
||||
for label, (sub, _) in MODELS.items():
|
||||
for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
|
||||
task = p.split("/")[-2]
|
||||
try:
|
||||
d = json.loads(Path(p).read_text())
|
||||
except Exception:
|
||||
continue
|
||||
scores[task].setdefault(label, []).append(d.get("run_score", 0))
|
||||
parser = argparse.ArgumentParser(description="Variance decomposition on cached runs")
|
||||
parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
|
||||
parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
|
||||
parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
|
||||
args = parser.parse_args()
|
||||
|
||||
grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
|
||||
if not grouped:
|
||||
raise SystemExit(f"No cached runs found under {args.archive_dir}")
|
||||
|
||||
# Collect repeated run scores as {task -> {model -> [run_scores]}}.
|
||||
scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
|
||||
for model_name, task_runs in grouped.items():
|
||||
for task_id, runs in task_runs.items():
|
||||
vals = [float(run.run_score) for run in runs]
|
||||
if vals:
|
||||
scores[task_id][model_name] = vals
|
||||
|
||||
# Per-task: seed var per model, cross-model var of means, SNR
|
||||
task_stats = []
|
||||
for task, per_model in scores.items():
|
||||
# Only use models with all 3 runs for clean seed-variance estimate
|
||||
for task_id, per_model in scores.items():
|
||||
model_vars = []
|
||||
model_means = []
|
||||
for m, runs in per_model.items():
|
||||
for runs in per_model.values():
|
||||
if len(runs) >= 2:
|
||||
model_vars.append(variance(runs))
|
||||
if runs:
|
||||
model_means.append(mean(runs))
|
||||
if len(model_means) < 2 or not model_vars:
|
||||
continue
|
||||
mean_seed_var = mean(model_vars) # noise
|
||||
cap_var = variance(model_means) # signal
|
||||
|
||||
# Mean within-model variance is the seed-noise term.
|
||||
mean_seed_var = mean(model_vars) if model_vars else 0.0
|
||||
# Variance of model means is the capability-signal term.
|
||||
cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
|
||||
snr = cap_var / (mean_seed_var + 1e-9)
|
||||
task_stats.append({
|
||||
"task": task,
|
||||
"seed_var": mean_seed_var,
|
||||
"cap_var": cap_var,
|
||||
"snr": snr,
|
||||
"n_models": len(model_means),
|
||||
})
|
||||
task_stats.append(
|
||||
{
|
||||
"task": task_id,
|
||||
"seed_var": float(mean_seed_var),
|
||||
"cap_var": float(cap_var),
|
||||
"snr": float(snr),
|
||||
"n_models": len(model_means),
|
||||
"limited_model_diversity": len(model_means) < 2,
|
||||
}
|
||||
)
|
||||
|
||||
# Sort by SNR
|
||||
task_stats.sort(key=lambda x: -x["snr"])
|
||||
task_stats.sort(key=lambda row: row["snr"], reverse=True)
|
||||
if not task_stats:
|
||||
raise SystemExit("No task-level scores found in archive.")
|
||||
|
||||
print(f"{'Task':<38} {'seed_var':>9} {'cap_var':>9} {'SNR':>8}")
|
||||
print("-" * 70)
|
||||
for r in task_stats:
|
||||
print(f"{r['task']:<38} {r['seed_var']:>9.4f} {r['cap_var']:>9.4f} "
|
||||
f"{r['snr']:>8.2f}")
|
||||
|
||||
# Aggregate decomposition
|
||||
total_seed = mean(r["seed_var"] for r in task_stats)
|
||||
total_cap = mean(r["cap_var"] for r in task_stats)
|
||||
# Aggregate over tasks to estimate how much of benchmark variance is real
|
||||
# capability signal versus run-to-run noise.
|
||||
total_seed = mean(row["seed_var"] for row in task_stats)
|
||||
total_cap = mean(row["cap_var"] for row in task_stats)
|
||||
total = total_seed + total_cap
|
||||
cap_frac = total_cap / (total + 1e-9)
|
||||
capability_fraction = total_cap / total if total > 1e-12 else 0.0
|
||||
|
||||
print("\n=== AGGREGATE VARIANCE DECOMPOSITION ===")
|
||||
print(f" Mean seed variance (noise): {total_seed:.5f}")
|
||||
print(f" Mean capability variance (signal): {total_cap:.5f}")
|
||||
print(f" Capability fraction: {cap_frac:.1%}")
|
||||
print(f" (= what % of run_score variance comes from real model differences)")
|
||||
# Coarse SNR buckets help downstream reporting and task weighting.
|
||||
high_snr = [row for row in task_stats if row["snr"] >= 5]
|
||||
mid_snr = [row for row in task_stats if 1 <= row["snr"] < 5]
|
||||
low_snr = [row for row in task_stats if row["snr"] < 1]
|
||||
|
||||
# Classify tasks by SNR tiers
|
||||
high_snr = [r for r in task_stats if r["snr"] >= 5]
|
||||
mid_snr = [r for r in task_stats if 1 <= r["snr"] < 5]
|
||||
low_snr = [r for r in task_stats if r["snr"] < 1]
|
||||
print(f"\n=== SNR TIERS ===")
|
||||
print(f" High SNR (≥5): {len(high_snr)} tasks — differentiate models reliably")
|
||||
print(f" Mid SNR (1–5): {len(mid_snr)} tasks — moderate signal")
|
||||
print(f" Low SNR (<1): {len(low_snr)} tasks — seed noise ≥ capability signal")
|
||||
print(f" (these tasks give random-ish results; weight down)")
|
||||
|
||||
# Write output
|
||||
out_path = ROOT / "reports" / "variance_decomposition.json"
|
||||
out_path.write_text(json.dumps({
|
||||
out = {
|
||||
"per_task": task_stats,
|
||||
"aggregate": {
|
||||
"mean_seed_var": total_seed,
|
||||
"mean_cap_var": total_cap,
|
||||
"capability_fraction": cap_frac,
|
||||
"mean_seed_var": float(total_seed),
|
||||
"mean_cap_var": float(total_cap),
|
||||
"capability_fraction": float(capability_fraction),
|
||||
"high_snr_tasks": len(high_snr),
|
||||
"mid_snr_tasks": len(mid_snr),
|
||||
"low_snr_tasks": len(low_snr),
|
||||
},
|
||||
}, indent=2))
|
||||
print(f"\nWrote: {out_path}")
|
||||
}
|
||||
|
||||
args.reports_dir.mkdir(parents=True, exist_ok=True)
|
||||
out_path = args.reports_dir / "variance_decomposition.json"
|
||||
out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
|
||||
print(f"Wrote: {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
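The per-task SNR computed by the script above can be checked by hand on a small input. This is a minimal sketch, assuming three hypothetical models with three runs each; the `task_snr` helper and the scores are made up for illustration and mirror the guards used in the new code path. Note that `statistics.variance` is the sample variance (n - 1 denominator).

```python
from statistics import mean, variance


def task_snr(per_model, eps=1e-9):
    """Return (seed_var, cap_var, snr) for one task's {model: [run_scores]}."""
    # Within-model variance needs at least two runs per model.
    model_vars = [variance(runs) for runs in per_model.values() if len(runs) >= 2]
    model_means = [mean(runs) for runs in per_model.values() if runs]
    seed_var = mean(model_vars) if model_vars else 0.0
    # Variance of model means is only defined for two or more models.
    cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
    return seed_var, cap_var, cap_var / (seed_var + eps)


# Hypothetical scores for a single task.
per_model = {
    "model_a": [0.9, 0.8, 1.0],
    "model_b": [0.4, 0.5, 0.3],
    "model_c": [0.6, 0.7, 0.5],
}
seed_var, cap_var, snr = task_snr(per_model)
print(f"seed_var={seed_var:.4f} cap_var={cap_var:.4f} snr={snr:.2f}")
```

Each model here has a within-model variance of 0.01, while the model means (0.9, 0.4, 0.6) have a variance of about 0.063, so the task would land in the script's high-SNR bucket (SNR >= 5).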
356
tests/test_dynamics.py
Normal file
@ -0,0 +1,356 @@
|
||||
"""Tests for clawbench.dynamics."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import math
|
||||
|
||||
import numpy as np
|
||||
import pytest
|
||||
|
||||
from clawbench.dynamics import (
|
||||
TOOL_FAMILIES,
|
||||
Dynamics,
|
||||
Regime,
|
||||
Sensitivity,
|
||||
SurvivalPoint,
|
||||
StratumStats,
|
||||
StratifiedAssessment,
|
||||
_classify_tool,
|
||||
_cosine_dist,
|
||||
_entropy,
|
||||
_js_divergence,
|
||||
_levenshtein,
|
||||
build_strata,
|
||||
compute_dynamics,
|
||||
compute_sensitivity,
|
||||
find_event_step,
|
||||
kaplan_meier,
|
||||
stratify_by_regime,
|
||||
stratify_by_tier,
|
||||
)
|
||||
from clawbench.schemas import (
|
||||
TokenUsage,
|
||||
ToolCall,
|
||||
Transcript,
|
||||
TranscriptMessage,
|
||||
TaskRunResult,
|
||||
)
|
||||
|
||||
|
||||
# ── helpers ──────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def _msg(role, text="", family=None, success=True, error="", ts=0, tok=100):
|
||||
tcs = []
|
||||
if family:
|
||||
tcs.append(ToolCall(
|
||||
name=f"tool_{family}", family=family,
|
||||
success=success, error=error, mutating=family == "edit",
|
||||
))
|
||||
return TranscriptMessage(
|
||||
role=role, text=text, tool_calls=tcs, timestamp_ms=ts,
|
||||
usage=TokenUsage(input_tokens=tok, output_tokens=tok // 2,
|
||||
total_tokens=tok + tok // 2),
|
||||
)
|
||||
|
||||
|
||||
def _simple_transcript(families, errors=None):
|
||||
if errors is None:
|
||||
errors = [False] * len(families)
|
||||
msgs = [_msg("user", "task")]
|
||||
for i, (fam, err) in enumerate(zip(families, errors)):
|
||||
msgs.append(_msg("assistant", f"step {i}", family=fam,
|
||||
success=not err, error="err" if err else "",
|
||||
ts=(i + 1) * 1000, tok=100 + i * 10))
|
||||
return Transcript(messages=msgs)
|
||||
|
||||
|
||||
def _run(transcript, score=0.5, task_id="t1"):
|
||||
return TaskRunResult(
|
||||
task_id=task_id, run_index=0, transcript=transcript,
|
||||
run_score=score, duration_ms=10000,
|
||||
token_usage=transcript.total_usage,
|
||||
)
|
||||
|
||||
|
||||
# ── _cosine_dist ─────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_cosine_dist_identical():
|
||||
a = np.array([1.0, 0.0, 0.5])
|
||||
assert _cosine_dist(a, a) == pytest.approx(0.0, abs=1e-9)
|
||||
|
||||
|
||||
def test_cosine_dist_orthogonal():
|
||||
assert _cosine_dist(np.array([1, 0, 0.0]), np.array([0, 1, 0.0])) == pytest.approx(1.0)
|
||||
|
||||
|
||||
def test_cosine_dist_zero_vector():
|
||||
assert _cosine_dist(np.zeros(3), np.array([1, 2, 3.0])) == 1.0
|
||||
|
||||
|
||||
# ── _entropy ─────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_entropy_uniform():
|
||||
assert _entropy({"a": 10, "b": 10}) == pytest.approx(1.0)
|
||||
|
||||
|
||||
def test_entropy_single():
|
||||
assert _entropy({"a": 100}) == pytest.approx(0.0)
|
||||
|
||||
|
||||
def test_entropy_empty():
|
||||
assert _entropy({}) == 0.0
|
||||
|
||||
|
||||
# ── _js_divergence ───────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_jsd_identical():
|
||||
d = {"a": 5, "b": 5}
|
||||
assert _js_divergence(d, d) == pytest.approx(0.0, abs=1e-9)
|
||||
|
||||
|
||||
def test_jsd_disjoint():
|
||||
assert _js_divergence({"a": 10}, {"b": 10}) > 0.5
|
||||
|
||||
|
||||
# ── _levenshtein ────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_levenshtein_equal():
|
||||
assert _levenshtein([1, 2, 3], [1, 2, 3]) == 0
|
||||
|
||||
|
||||
def test_levenshtein_empty():
|
||||
assert _levenshtein([], [1, 2]) == 2
|
||||
|
||||
|
||||
def test_levenshtein_different():
|
||||
assert _levenshtein(["a", "b"], ["c", "d"]) == 2
|
||||
|
||||
|
||||
# ── _classify_tool ──────────────────────────────────────────────────
|
||||
|
||||
|
||||
@pytest.mark.parametrize("name,expected", [
|
||||
("bash_execute", "execute"),
|
||||
("file_read", "read"),
|
||||
("tool_edit", "edit"),
|
||||
("web_browser", "browser"),
|
||||
("grep_search", "search"),
|
||||
("write_file", "edit"),
|
||||
("run_tests", "execute"),
|
||||
])
|
||||
def test_classify_tool(name, expected):
|
||||
assert _classify_tool(name) == expected
|
||||
|
||||
|
||||
# ── compute_dynamics ─────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_dynamics_basic():
|
||||
t = _simple_transcript(["read", "edit", "execute", "read", "edit"])
|
||||
d = compute_dynamics(t)
|
||||
assert d.n_steps == 5
|
||||
assert len(d.drift) == 5
|
||||
assert len(d.step_size) == 5
|
||||
assert len(d.entropy_series) == 5
|
||||
assert len(d.tool_sequence) == 5
|
||||
assert d.tool_entropy > 0
|
||||
|
||||
|
||||
def test_dynamics_empty():
|
||||
t = Transcript(messages=[_msg("user", "hi")])
|
||||
d = compute_dynamics(t)
|
||||
assert d.n_steps == 0
|
||||
assert d.regime == Regime.unknown
|
||||
|
||||
|
||||
def test_dynamics_trapped():
|
||||
t = _simple_transcript(["execute"] * 15, errors=[True] * 15)
|
||||
d = compute_dynamics(t)
|
||||
assert d.regime == Regime.trapped
|
||||
assert d.error_rate > 0.5
|
||||
|
||||
|
||||
def test_dynamics_convergent():
|
||||
cycle = ["read", "search", "edit", "read", "execute"] * 6
|
||||
t = _simple_transcript(cycle[:30])
|
||||
d = compute_dynamics(t)
|
||||
assert d.regime in (Regime.convergent, Regime.limit_cycle, Regime.diffusive, Regime.unknown)
|
||||
assert d.error_rate == 0.0
|
||||
|
||||
|
||||
def test_dynamics_markov_keys():
|
||||
t = _simple_transcript(["read", "edit", "read"])
|
||||
d = compute_dynamics(t)
|
||||
assert "read" in d.markov
|
||||
assert "edit" in d.markov["read"]
|
||||
|
||||
|
||||
def test_dynamics_constraint_index_range():
|
||||
t = _simple_transcript(["read", "edit", "search", "execute", "browser", "memory"] * 3)
|
||||
d = compute_dynamics(t)
|
||||
assert 0 <= d.constraint_index <= 1
|
||||
|
||||
|
||||
def test_dynamics_memory_depth():
|
||||
t = _simple_transcript(["read", "edit", "read", "edit", "read", "edit"] * 3)
|
||||
d = compute_dynamics(t)
|
||||
assert d.memory_depth >= 0
|
||||
|
||||
|
||||
def test_dynamics_normalizes_unknown_tool_family():
|
||||
transcript = Transcript(
|
||||
messages=[
|
||||
_msg("user", "task"),
|
||||
TranscriptMessage(
|
||||
role="assistant",
|
||||
text="searching",
|
||||
tool_calls=[
|
||||
ToolCall(
|
||||
name="grep_search",
|
||||
family="unknown",
|
||||
success=True,
|
||||
error="",
|
||||
mutating=False,
|
||||
)
|
||||
],
|
||||
timestamp_ms=1000,
|
||||
usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
|
||||
),
|
||||
_msg("assistant", "next", family="read", ts=2000),
|
||||
_msg("assistant", "done", family="edit", ts=3000),
|
||||
]
|
||||
)
|
||||
|
||||
dynamics = compute_dynamics(transcript)
|
||||
|
||||
assert dynamics.tool_sequence[0] == "search"
|
||||
assert "search" in dynamics.markov
|
||||
|
||||
|
||||
# ── compute_sensitivity ──────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_sensitivity_identical_runs():
|
||||
t = _simple_transcript(["read", "edit", "execute"])
|
||||
ra = _run(t, score=0.8)
|
||||
rb = _run(t, score=0.8)
|
||||
s = compute_sensitivity(ra, rb)
|
||||
assert s.score_delta == pytest.approx(0.0)
|
||||
assert s.tool_edit_distance == 0
|
||||
|
||||
|
||||
def test_sensitivity_different_runs():
|
||||
ta = _simple_transcript(["read", "edit", "execute"])
|
||||
tb = _simple_transcript(["search", "browser", "memory"])
|
||||
ra = _run(ta, score=0.9)
|
||||
rb = _run(tb, score=0.3)
|
||||
s = compute_sensitivity(ra, rb)
|
||||
assert s.score_delta == pytest.approx(0.6)
|
||||
assert s.tool_edit_distance > 0
|
||||
assert s.family_js_divergence > 0
|
||||
|
||||
|
||||
# ── kaplan_meier ─────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_km_basic():
|
||||
pts = kaplan_meier([1, 2, 3])
|
||||
assert pts[0].time == 0.0
|
||||
assert pts[0].survival == 1.0
|
||||
assert pts[-1].survival == pytest.approx(0.0)
|
||||
|
||||
|
||||
def test_km_with_censoring():
|
||||
pts = kaplan_meier([1, 5, 3], censored=[False, True, False])
|
||||
assert len(pts) == 3
|
||||
assert pts[-1].survival > 0
|
||||
|
||||
|
||||
def test_km_empty():
|
||||
assert kaplan_meier([]) == []
|
||||
|
||||
|
||||
# ── find_event_step ──────────────────────────────────────────────────
|
||||
|
||||
|
||||
def test_find_first_correct_write():
|
||||
t = _simple_transcript(["read", "search", "edit", "execute"])
|
||||
assert find_event_step(t, "first_correct_write") == 2.0
|
||||
|
||||
|
||||
def test_find_first_error_recovery():
|
||||
t = _simple_transcript(
|
||||
["read", "execute", "read"],
|
||||
errors=[False, True, False],
|
||||
)
|
||||
assert find_event_step(t, "first_error_recovery") == 2.0
|
||||
|
||||
|
||||
def test_find_task_completion():
|
||||
t = _simple_transcript(["read", "edit"])
|
||||
assert find_event_step(t, "task_completion") == 1.0
|
||||
|
||||
|
||||
def test_find_event_none():
|
||||
t = _simple_transcript(["read", "read"])
|
||||
assert find_event_step(t, "first_correct_write") is None
|
||||
|
||||
|
||||
# ── build_strata + reweight ──────────────────────────────────────────
|
||||
|
||||
|
||||
def test_build_strata_by_tier():
|
||||
runs, dyns, scores = [], [], []
|
||||
for tid, sc in [("t1-a", 0.8), ("t1-b", 0.6), ("t2-a", 0.4), ("t2-b", 0.3)]:
|
||||
t = _simple_transcript(["read", "edit", "execute"])
|
||||
r = _run(t, score=sc, task_id=tid)
|
||||
runs.append(r)
|
||||
dyns.append(compute_dynamics(t))
|
||||
scores.append(sc)
|
||||
|
||||
sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
|
||||
assert sa.total_runs == 4
|
||||
names = sa.stratum_names()
|
||||
assert "tier1" in names
|
||||
assert "tier2" in names
|
||||
for s in sa.strata:
|
||||
assert s.n_runs == 2
|
||||
assert s.weight == pytest.approx(0.5)
|
||||
|
||||
|
||||
def test_reweight_shifts_mean():
|
||||
runs, dyns, scores = [], [], []
|
||||
for tid, sc in [("t1-a", 0.9), ("t1-b", 0.8), ("t2-a", 0.2), ("t2-b", 0.1)]:
|
||||
t = _simple_transcript(["read", "edit", "execute"])
|
||||
r = _run(t, score=sc, task_id=tid)
|
||||
runs.append(r)
|
||||
dyns.append(compute_dynamics(t))
|
||||
scores.append(sc)
|
||||
|
||||
sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
|
||||
|
||||
# Reweight towards tier1 (high scores)
|
||||
high = sa.reweight({"tier1": 0.9, "tier2": 0.1})
|
||||
# Reweight towards tier2 (low scores)
|
||||
low = sa.reweight({"tier1": 0.1, "tier2": 0.9})
|
||||
|
||||
assert high["score_mean"] > low["score_mean"]
|
||||
|
||||
|
||||
def test_reweight_unknown_stratum():
|
||||
runs, dyns, scores = [], [], []
|
||||
t = _simple_transcript(["read", "edit"])
|
||||
r = _run(t, score=0.5, task_id="t1-x")
|
||||
runs.append(r)
|
||||
dyns.append(compute_dynamics(t))
|
||||
scores.append(0.5)
|
||||
|
||||
sa = build_strata(runs, dyns, scores, stratify_by_tier, "tier")
|
||||
# Reweight with a stratum that doesn't exist — should fall back
|
||||
result = sa.reweight({"nonexistent": 1.0})
|
||||
assert "score_mean" in result
|
||||
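The reweighting tests above only pin down behaviour: shifting weight toward the higher-scoring stratum raises `score_mean`, and unknown stratum names fall back rather than fail. One way this could work is sketched below. The `reweight_mean` function is hypothetical and is not the `StratifiedAssessment.reweight` implementation, which this diff does not show.

```python
def reweight_mean(stratum_means, observed_weights, target_weights):
    """Weighted mean of stratum means under a target mix (illustrative only)."""
    # Keep only target weights that name an existing stratum.
    usable = {
        name: w
        for name, w in target_weights.items()
        if name in stratum_means and w > 0
    }
    # Assumed fallback: reuse the observed mix when nothing matches.
    if not usable:
        usable = dict(observed_weights)
    total = sum(usable.values())
    return sum(stratum_means[name] * w for name, w in usable.items()) / total


# Stratum means matching the test fixture: tier1 = (0.9 + 0.8) / 2, tier2 = (0.2 + 0.1) / 2.
stratum_means = {"tier1": 0.85, "tier2": 0.15}
observed = {"tier1": 0.5, "tier2": 0.5}

high = reweight_mean(stratum_means, observed, {"tier1": 0.9, "tier2": 0.1})
low = reweight_mean(stratum_means, observed, {"tier1": 0.1, "tier2": 0.9})
fallback = reweight_mean(stratum_means, observed, {"nonexistent": 1.0})
print(high, low, fallback)
```

Normalising by the sum of usable weights means a caller can pass weights that do not sum to one.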
115
tests/test_dynamics_archive.py
Normal file
@ -0,0 +1,115 @@
|
||||
"""Tests for offline dynamics archive helpers."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
|
||||
from clawbench.dynamics_archive import build_dynamics_report, load_task_runs_archive, safe_model_name, write_dynamics_report
|
||||
from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
|
||||
|
||||
|
||||
def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
|
||||
tool_calls = []
|
||||
if family is not None:
|
||||
tool_calls.append(
|
||||
ToolCall(
|
||||
name=f"tool_{family}",
|
||||
family=family,
|
||||
success=True,
|
||||
error="",
|
||||
mutating=family == "edit",
|
||||
)
|
||||
)
|
||||
return TranscriptMessage(
|
||||
role=role,
|
||||
text=text,
|
||||
tool_calls=tool_calls,
|
||||
timestamp_ms=ts,
|
||||
usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
|
||||
)
|
||||
|
||||
|
||||
def _run(task_id: str, score: float = 0.5, run_index: int = 0) -> TaskRunResult:
|
||||
transcript = Transcript(
|
||||
messages=[
|
||||
_msg("user", f"Solve {task_id}"),
|
||||
_msg("assistant", "inspect", family="read", ts=1000),
|
||||
_msg("assistant", "edit", family="edit", ts=2000),
|
||||
_msg("assistant", "verify", family="execute", ts=3000),
|
||||
]
|
||||
)
|
||||
return TaskRunResult(
|
||||
task_id=task_id,
|
||||
run_index=run_index,
|
||||
transcript=transcript,
|
||||
run_score=score,
|
||||
duration_ms=3000,
|
||||
token_usage=transcript.total_usage,
|
||||
)
|
||||
|
||||
|
||||
def test_load_task_runs_archive_filters_model_and_tier(tmp_path: Path):
|
||||
model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b")
|
||||
other_dir = tmp_path / safe_model_name("openai/gpt-5.4")
|
||||
for root, task_id in ((model_dir, "t1-demo-task"), (other_dir, "t2-other-task")):
|
||||
task_dir = root / task_id
|
||||
task_dir.mkdir(parents=True)
|
||||
run = _run(task_id)
|
||||
(task_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
|
||||
|
||||
loaded = load_task_runs_archive(
|
||||
archive_dir=tmp_path,
|
||||
model="ollama/gpt-oss:20b",
|
||||
tier="tier1",
|
||||
)
|
||||
|
||||
assert list(loaded) == ["t1-demo-task"]
|
||||
assert loaded["t1-demo-task"][0].task_id == "t1-demo-task"
|
||||
|
||||
|
||||
def test_write_dynamics_report_creates_report_without_plots(tmp_path: Path):
|
||||
task_runs = {
|
||||
"t1-demo-task": [_run("t1-demo-task", score=0.8)],
|
||||
"t2-demo-task": [_run("t2-demo-task", score=0.4)],
|
||||
}
|
||||
|
||||
report_path, plots = write_dynamics_report(task_runs, tmp_path, generate_plots=False)
|
||||
|
||||
assert report_path.exists()
|
||||
assert report_path.name == "dynamics.json"
|
||||
assert plots == []
|
||||
|
||||
report = json.loads(report_path.read_text(encoding="utf-8"))
|
||||
assert "sensitivity" in report
|
||||
assert report["sensitivity"]["same_task"]["n_pairs"] == 0
|
||||
|
||||
|
||||
def test_build_dynamics_report_includes_pairwise_sensitivity():
|
||||
task_runs = {
|
||||
"t1-demo-task": [
|
||||
_run("t1-demo-task", score=0.8, run_index=0),
|
||||
TaskRunResult(
|
||||
task_id="t1-demo-task",
|
||||
run_index=1,
|
||||
transcript=Transcript(
|
||||
messages=[
|
||||
_msg("user", "Solve t1-demo-task"),
|
||||
_msg("assistant", "inspect", family="search", ts=1000),
|
||||
_msg("assistant", "edit", family="edit", ts=2000),
|
||||
_msg("assistant", "verify", family="execute", ts=3000),
|
||||
]
|
||||
),
|
||||
run_score=0.5,
|
||||
duration_ms=3000,
|
||||
token_usage=TokenUsage(input_tokens=30, output_tokens=15, total_tokens=45),
|
||||
),
|
||||
]
|
||||
}
|
||||
|
||||
report, _plotter, _plot_data = build_dynamics_report(task_runs, include_pca=False)
|
||||
|
||||
same_task = report["sensitivity"]["same_task"]
|
||||
assert same_task["n_pairs"] == 1
|
||||
assert "t1-demo-task" in same_task["per_task"]
|
||||
assert same_task["per_task"]["t1-demo-task"]["mean_score_delta"] > 0
|
||||
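The two sensitivity tests above imply that same-task pairs are formed only among repeated runs of one task: one run per task gives `n_pairs == 0`, two runs give one pair. A minimal sketch of that pairing follows. The function name is hypothetical, and using the absolute score difference is an assumption, since the real report builder is not shown in this diff.

```python
from itertools import combinations
from statistics import mean


def same_task_sensitivity(scores_by_task):
    """Summarise score deltas over all unordered run pairs within each task."""
    per_task = {}
    n_pairs = 0
    for task_id, scores in scores_by_task.items():
        # combinations() yields nothing for a single run, so no pairs are counted.
        deltas = [abs(a - b) for a, b in combinations(scores, 2)]
        if deltas:
            per_task[task_id] = {
                "n_pairs": len(deltas),
                "mean_score_delta": mean(deltas),
            }
            n_pairs += len(deltas)
    return {"n_pairs": n_pairs, "per_task": per_task}


# One run per task: no pairs. Two runs of one task: one pair.
no_pairs = same_task_sensitivity({"t1-demo-task": [0.8], "t2-demo-task": [0.4]})
one_pair = same_task_sensitivity({"t1-demo-task": [0.8, 0.5]})
print(no_pairs, one_pair)
```

With k runs per task this produces k(k-1)/2 pairs, so three repeated runs would contribute three pairs to the task's mean delta.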
76
tests/test_dynamics_cli.py
Normal file
@ -0,0 +1,76 @@
|
||||
from pathlib import Path
|
||||
|
||||
from click.testing import CliRunner
|
||||
|
||||
from clawbench.cli import cli
|
||||
from clawbench.dynamics_archive import safe_model_name
|
||||
from clawbench.schemas import TaskRunResult, TokenUsage, ToolCall, Transcript, TranscriptMessage
|
||||
|
||||
|
||||
def _msg(role: str, text: str = "", family: str | None = None, ts: int = 0) -> TranscriptMessage:
|
||||
tool_calls = []
|
||||
if family is not None:
|
||||
tool_calls.append(
|
||||
ToolCall(
|
||||
name=f"tool_{family}",
|
||||
family=family,
|
||||
success=True,
|
||||
error="",
|
||||
mutating=family == "edit",
|
||||
)
|
||||
)
|
||||
return TranscriptMessage(
|
||||
role=role,
|
||||
text=text,
|
||||
tool_calls=tool_calls,
|
||||
timestamp_ms=ts,
|
||||
usage=TokenUsage(input_tokens=10, output_tokens=5, total_tokens=15),
|
||||
)
|
||||
|
||||
|
||||
def _run(task_id: str, run_index: int = 0) -> TaskRunResult:
|
||||
transcript = Transcript(
|
||||
messages=[
|
||||
_msg("user", f"Solve {task_id}"),
|
||||
_msg("assistant", "inspect", family="read", ts=1000),
|
||||
_msg("assistant", "edit", family="edit", ts=2000),
|
||||
_msg("assistant", "verify", family="execute", ts=3000),
|
||||
]
|
||||
)
|
||||
return TaskRunResult(
|
||||
task_id=task_id,
|
||||
run_index=run_index,
|
||||
transcript=transcript,
|
||||
run_score=0.8,
|
||||
duration_ms=3000,
|
||||
token_usage=transcript.total_usage,
|
||||
)
|
||||
|
||||
|
||||
def test_dynamics_report_cli_supports_no_plots(tmp_path: Path):
|
||||
model_dir = tmp_path / safe_model_name("ollama/gpt-oss:20b") / "t1-demo-task"
|
||||
model_dir.mkdir(parents=True)
|
||||
run = _run("t1-demo-task")
|
||||
(model_dir / "run0.json").write_text(run.model_dump_json(indent=2), encoding="utf-8")
|
||||
|
||||
runner = CliRunner()
|
||||
output_dir = tmp_path / "out"
|
||||
result = runner.invoke(
|
||||
cli,
|
||||
[
|
||||
"dynamics-report",
|
||||
"--archive-dir",
|
||||
str(tmp_path),
|
||||
"--model",
|
||||
"ollama/gpt-oss:20b",
|
||||
"--output-dir",
|
||||
str(output_dir),
|
||||
"--no-plots",
|
||||
],
|
||||
)
|
||||
|
||||
assert result.exit_code == 0, result.output
|
||||
assert "Loaded 1 cached runs across 1 tasks" in result.output
|
||||
assert "Saved 0 plots" in result.output
|
||||
assert (output_dir / "dynamics.json").exists()
|
||||
assert list(output_dir.glob("*.png")) == []
|
||||
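The tests above write runs to `<archive>/<model dir>/<task id>/runN.json` and filter by tier. The sketch below walks that layout using only the standard library. `tier_of` and `load_runs` are hypothetical names; the `t1-` to `tier1` mapping is inferred from the test fixtures, and the real loader in `clawbench.dynamics_archive` may differ.

```python
import json
import tempfile
from pathlib import Path


def tier_of(task_id):
    """Map a task id such as 't1-demo-task' to 'tier1' (assumed convention)."""
    head = task_id.split("-", 1)[0]
    if head.startswith("t") and head[1:].isdigit():
        return f"tier{head[1:]}"
    return "unknown"


def load_runs(model_dir, tier=None):
    """Group parsed run payloads by task id, optionally keeping one tier."""
    runs = {}
    for path in sorted(Path(model_dir).glob("*/run*.json")):
        task_id = path.parent.name
        if tier is not None and tier_of(task_id) != tier:
            continue
        try:
            payload = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or malformed run files
        runs.setdefault(task_id, []).append(payload)
    return runs


# Build a throwaway archive with one task per tier, then load tier1 only.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "example_model"
    for task_id in ("t1-demo-task", "t2-other-task"):
        task_dir = model_dir / task_id
        task_dir.mkdir(parents=True)
        payload = {"task_id": task_id, "run_score": 0.5}
        (task_dir / "run0.json").write_text(json.dumps(payload), encoding="utf-8")
    loaded = load_runs(model_dir, tier="tier1")

print(loaded)
```

Catching only `OSError` and `json.JSONDecodeError` keeps the skip behaviour for bad files without hiding unrelated errors.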
44
tests/test_submission_models.py
Normal file
@ -0,0 +1,44 @@
|
||||
from clawbench.submission_models import (
|
||||
CUSTOM_PRESET_LABEL,
|
||||
PRESET_AUDIENCE_BUDGET,
|
||||
PRESET_AUDIENCE_CLAW,
|
||||
infer_provider,
|
||||
preset_labels_for_audience,
|
||||
resolve_model_selection,
|
||||
)
|
||||
|
||||
|
||||
def test_budget_audience_keeps_budget_friendly_presets():
|
||||
labels = preset_labels_for_audience(PRESET_AUDIENCE_BUDGET)
|
||||
|
||||
assert "GPT-OSS 20B (Ollama)" in labels
|
||||
assert "Qwen 3.5 27B (Ollama)" in labels
|
||||
assert "Claude Opus 4.6" not in labels
|
||||
|
||||
|
||||
def test_claw_audience_keeps_full_catalog():
|
||||
labels = preset_labels_for_audience(PRESET_AUDIENCE_CLAW)
|
||||
|
||||
assert "GPT-OSS 20B (Ollama)" in labels
|
||||
assert "Claude Opus 4.6" in labels
|
||||
|
||||
|
||||
def test_resolve_model_selection_prefers_preset_provider():
|
||||
model_id, provider = resolve_model_selection("", "GPT-OSS 20B (Ollama)")
|
||||
|
||||
assert model_id == "ollama/gpt-oss:20b"
|
||||
assert provider == "ollama"
|
||||
|
||||
|
||||
def test_resolve_model_selection_infers_custom_provider():
|
||||
model_id, provider = resolve_model_selection(
|
||||
"huggingface/Qwen/Qwen3-32B",
|
||||
CUSTOM_PRESET_LABEL,
|
||||
)
|
||||
|
||||
assert model_id == "huggingface/Qwen/Qwen3-32B"
|
||||
assert provider == "huggingface"
|
||||
|
||||
|
||||
def test_infer_provider_requires_provider_prefix():
|
||||
assert infer_provider("qwen3.5:27b") == ""
|
||||
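The provider tests above suggest a simple rule: a preset label carries its own provider, while a custom model id contributes a provider only when it has a `provider/` prefix. The sketch below reproduces that behaviour. The `_sketch` functions, the `PRESETS` table, and the `CUSTOM_LABEL` value are assumptions for illustration; the actual `clawbench.submission_models` code is not shown in this diff.

```python
# Hypothetical preset table: label -> (model_id, provider).
PRESETS = {
    "GPT-OSS 20B (Ollama)": ("ollama/gpt-oss:20b", "ollama"),
}
CUSTOM_LABEL = "Custom"  # placeholder standing in for CUSTOM_PRESET_LABEL


def infer_provider_sketch(model_id):
    """Return the text before the first '/', or '' when there is no prefix."""
    prefix, sep, rest = model_id.partition("/")
    return prefix if sep and prefix and rest else ""


def resolve_model_selection_sketch(custom_model, preset_label):
    """Prefer the preset's (model_id, provider); otherwise infer from the id."""
    if preset_label in PRESETS:
        return PRESETS[preset_label]
    model_id = custom_model.strip()
    return model_id, infer_provider_sketch(model_id)


preset = resolve_model_selection_sketch("", "GPT-OSS 20B (Ollama)")
custom = resolve_model_selection_sketch("huggingface/Qwen/Qwen3-32B", CUSTOM_LABEL)
bare = infer_provider_sketch("qwen3.5:27b")
print(preset, custom, repr(bare))
```

Using `str.partition` splits on the first slash only, so nested ids such as `huggingface/Qwen/Qwen3-32B` still resolve to the leading provider segment.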