ci: default crabbox owned capacity to standard (#22)

Merge pull request #21 from sallyom/k8s-job
add docs, manifests for k8s
2026-05-07 02:47:04 -07:00 · 2026-05-06 15:02:15 -07:00 · 2026-05-06 14:51:54 -07:00 · 2026-05-06 08:19:58 -04:00 · 2026-05-04 12:25:14 -07:00 · 2026-05-04 12:19:20 -07:00
129 changed files with 15032 additions and 1525 deletions
--- a/.agents/skills/blacksmith-testbox/SKILL.md
+++ b/.agents/skills/blacksmith-testbox/SKILL.md
@@ -0,0 +1,80 @@
+---
+name: blacksmith-testbox
+description: Run Blacksmith Testbox for ClawBench CI parity, live credentials, Docker builds, and benchmark sweeps.
+---
+
+# Blacksmith Testbox
+
+Use Testbox when ClawBench work needs CI parity, org-level secrets, hydrated
+agent dotfiles, Docker, or a benchmark run that is too heavy for the local
+machine. Keep normal unit-test iteration local unless the user asks for
+Testbox proof.
+
+Crabbox is the sibling lane for reusable owned-capacity proof. Use
+`.agents/skills/crabbox/SKILL.md` and `.crabbox.yaml` when ClawBench needs
+AWS-backed reusable boxes or Crabbox sync/log/result inspection. Keep this
+skill focused on Blacksmith CI parity.
+
+## Warmup
+
+Run from the repository root:
+
+```bash
+blacksmith testbox warmup ci-check-testbox.yml --ref main --idle-timeout 90
+```
+
+Save the returned `tbx_...` ID and reuse it for every command in the same
+task. Stop boxes you create when done:
+
+```bash
+blacksmith testbox stop --id <ID>
+```
+
+## Commands
+
+Always invoke `blacksmith testbox` from the repo root. The CLI syncs the
+current git working tree to the remote box; running from a subdirectory can
+delete the rest of the remote checkout.
+
+```bash
+blacksmith testbox run --id <ID> "python -m pytest -q"
+blacksmith testbox run --id <ID> "python -m pip wheel --no-deps . -w /tmp/clawbench-wheel"
+blacksmith testbox run --id <ID> "docker build -t clawbench ."
+```
+
+If a command needs HF/provider credentials or agent dotfiles, wrap it with the
+hydrated helper installed by the workflow:
+
+```bash
+blacksmith testbox run --id <ID> "clawbench-testbox-env python -m pytest -q"
+blacksmith testbox run --id <ID> "clawbench-testbox-env clawbench run --model anthropic/claude-sonnet-4-6 --adapter simulated"
+```
+
+## Sync Model
+
+The testbox starts from a clean checkout and installed Python environment.
+Tracked and untracked non-ignored files are synced before each `run`.
+Ignored files such as `.venv/`, `data/`, `.pytest_cache/`, and `dist/` are
+not synced. If `pyproject.toml` changes, rerun install remotely:
+
+```bash
+blacksmith testbox run --id <ID> "python -m pip install -e . && python -m pytest -q"
+```
+
+## Hydrated Secrets And Dotfiles
+
+The workflow writes non-empty provider and HF secrets to
+`~/.clawbench-testbox-live.profile`, and installs `~/.local/bin/clawbench-testbox-env`
+to source that profile. It also restores optional agent dotfiles from either
+ClawBench-specific secrets or the existing OpenClaw org-level secret names:
+
+- `~/.codex/auth.json`
+- `~/.codex/config.toml`
+- `~/.claude.json`
+- `~/.claude/.credentials.json`
+- `~/.claude/settings.json`
+- `~/.claude/settings.local.json`
+- `~/.gemini/settings.json`
+
+Prefer org-level secrets where possible; Blacksmith runner access is org-level,
+not repo-specific.
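
The "Hydrated Secrets And Dotfiles" section above says only non-empty secrets reach the profile. A minimal Python sketch of that rule follows. It is an illustration only, not the contents of `scripts/ci-hydrate-testbox-env.sh`, which this diff does not show; the function name `render_profile` and the sample keys are assumptions.

```python
import shlex


def render_profile(env, keys):
    """Return shell `export` lines for every key whose value is non-empty.

    Hypothetical helper: mirrors the documented behaviour, not the real script.
    """
    lines = []
    for key in keys:
        value = env.get(key, "")
        if value:  # skip unset or empty secrets so they are never exported
            lines.append(f"export {key}={shlex.quote(value)}")
    return "\n".join(lines) + ("\n" if lines else "")


if __name__ == "__main__":
    sample = {"HF_TOKEN": "hf_example", "OPENAI_API_KEY": "", "GEMINI_API_KEY": "a b"}
    print(render_profile(sample, ["HF_TOKEN", "OPENAI_API_KEY", "GEMINI_API_KEY"]), end="")
```

Quoting each value with `shlex.quote` keeps the profile safe to `source` even when a secret contains spaces or shell metacharacters.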
--- a/.agents/skills/crabbox/SKILL.md
+++ b/.agents/skills/crabbox/SKILL.md
@@ -0,0 +1,122 @@
+---
+name: crabbox
+description: Use Crabbox for ClawBench remote Linux validation, warmed reusable boxes, GitHub Actions hydration, sync timing, logs, results, caches, and lease cleanup.
+---
+
+# Crabbox
+
+Use Crabbox when ClawBench needs remote Linux proof on owned capacity, a large
+runner class, reusable warm state, or a Blacksmith alternative.
+
+## Before Running
+
+- Run from the repo root. Crabbox sync mirrors the current checkout.
+- Prefer local targeted tests for tight edit loops.
+- Prefer Blacksmith Testbox when the task explicitly asks for Blacksmith or a
+  Blacksmith-specific CI comparison.
+- Use Crabbox for broad ClawBench gates when owned AWS capacity is the right
+  remote lane.
+- Check `.crabbox.yaml` for repo defaults before adding flags.
+- Sanity-check the selected binary before remote work. Prefer the local
+  `openclaw/crabbox` checkout when present because the user PATH shim can be
+  stale: `command -v crabbox; ../crabbox/bin/crabbox --version`.
+- Install with `brew install openclaw/tap/crabbox`; auth is required before use:
+  `crabbox login --url https://crabbox.openclaw.ai --provider aws`.
+- On macOS the user config is `~/Library/Application Support/crabbox/config.yaml`;
+  it must include `broker.url`, `broker.token`, and usually `provider: aws`.
+
+## ClawBench Flow
+
+AWS/owned-capacity flow for Python tests:
+
+```sh
+crabbox warmup --class standard --idle-timeout 90m
+crabbox actions hydrate --id <cbx_id-or-slug>
+crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "python -m pytest -q"
+```
+
+For commands that need hydrated HF/provider credentials or agent dotfiles, use
+the helper installed by the hydration workflow:
+
+```sh
+crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "clawbench-testbox-env python -m pytest -q"
+crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "clawbench-testbox-env clawbench run --model anthropic/claude-sonnet-4-6 --adapter simulated"
+```
+
+Blacksmith-backed Crabbox flow can delegate setup to the existing Testbox
+workflow:
+
+```sh
+crabbox run --provider blacksmith-testbox --blacksmith-org openclaw --blacksmith-workflow .github/workflows/ci-check-testbox.yml --blacksmith-job check --blacksmith-ref main --idle-timeout 90m --timing-json --shell -- "python -m pytest -q"
+```
+
+Stop boxes you created before handoff:
+
+```sh
+crabbox stop <cbx_id-or-slug>
+```
+
+## Owned AWS Capacity
+
+When AWS capacity is under pressure, do not start with `class=beast`.
+`beast` begins at 48xlarge instances and can burn 192 vCPU quota per request.
+ClawBench's owned-cloud default is `standard`; escalate to `fast`, then
+`large`, and only use `beast` when the work is explicitly CPU-bound and the
+smaller class already failed the goal.
+
+Keep capacity hints enabled so brokered AWS leases print selected
+region/market, quota pressure, Spot fallback, and high-pressure class warnings.
+The ClawBench repo config sets `capacity.hints: true`; use
+`CRABBOX_CAPACITY_HINTS=0` only when debugging hint rendering itself.
+
+Use `beast` only for exceptional lanes:
+
+- full benchmark sweeps where wall time is dominated by CPU, not dependency
+  install or network;
+- release/blocker validation where a maintainer explicitly asks for the largest
+  owned AWS class;
+- performance profiling where the point is to compare high-core behavior.
+
+Do not use `beast` for ordinary `python -m pytest -q`, docs-only work, small
+task repros, Blacksmith outage triage, or focused lint/type/test checks. Those
+should use `standard` first and `fast` only when the extra cores materially
+help.
+
+## Useful Commands
+
+```sh
+crabbox status --id <id-or-slug> --wait
+crabbox inspect --id <id-or-slug> --json
+crabbox sync-plan
+crabbox history --lease <id-or-slug>
+crabbox logs <run_id>
+crabbox results <run_id>
+crabbox cache stats --id <id-or-slug>
+crabbox ssh --id <id-or-slug>
+```
+
+Use `--debug` on `run` when measuring sync timing.
+Use `--timing-json` on warmup, hydrate, and run when comparing AWS and
+blacksmith-testbox timings.
+Use `--market spot|on-demand` on AWS warmup or one-shot run when testing quota
+or capacity behavior without changing `.crabbox.yaml`.
+
+## Hydration Boundary
+
+`.github/workflows/crabbox-hydrate.yml` is repo-specific on purpose. It owns
+ClawBench checkout, setup-python, pip install, provider/HF env hydration,
+agent-dotfile restoration, ready marker, and keepalive. Crabbox owns runner
+registration, workflow dispatch, SSH sync, command execution, logs/results,
+local lease claims, and idle cleanup.
+
+Do not add ClawBench-specific setup to Crabbox. Put repo setup in the hydration
+workflow and generic lease/sync behavior in Crabbox.
+
+## Cleanup
+
+Crabbox has coordinator-owned idle expiry and local lease claims, so ClawBench
+does not need a custom ledger. Default idle timeout is 30 minutes unless config
+or flags set a different value. Still stop boxes you created when done.
+If `crabbox list` prints `orphan=no-active-lease`, treat it as an operator
+review hint; do not delete `keep=true` machines without checking provider and
+coordinator state.
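
The escalation order in the "Owned AWS Capacity" section above (`standard`, then `fast`, then `large`, with `beast` reserved for explicitly CPU-bound work) can be sketched as a small helper. This is a hypothetical illustration of the policy, not part of the Crabbox CLI; `next_class` and `CLASS_LADDER` are made-up names.

```python
# Escalation order as described in the skill text; names are illustrative.
CLASS_LADDER = ["standard", "fast", "large", "beast"]


def next_class(current, cpu_bound=False):
    """Return the next larger class, or None when escalation should stop."""
    index = CLASS_LADDER.index(current)  # raises ValueError for unknown classes
    if index + 1 >= len(CLASS_LADDER):
        return None  # already at the largest class
    candidate = CLASS_LADDER[index + 1]
    # `beast` is only allowed when the work is explicitly CPU-bound.
    if candidate == "beast" and not cpu_bound:
        return None
    return candidate


if __name__ == "__main__":
    print(next_class("standard"))
    print(next_class("large", cpu_bound=True))
```

Returning `None` instead of silently selecting `beast` forces the caller to make the CPU-bound decision explicit.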
--- a/.crabbox.yaml
+++ b/.crabbox.yaml
@@ -0,0 +1,48 @@
+profile: clawbench-check
+provider: aws
+class: standard
+capacity:
+  market: spot
+  strategy: most-available
+  fallback: on-demand-after-120s
+  hints: true
+  regions:
+    - eu-west-1
+actions:
+  workflow: .github/workflows/crabbox-hydrate.yml
+  job: hydrate
+  ref: main
+  runnerLabels:
+    - crabbox
+    - clawbench
+  runnerVersion: latest
+  ephemeral: true
+aws:
+  region: eu-west-1
+  rootGB: 400
+sync:
+  delete: true
+  checksum: false
+  gitSeed: true
+  fingerprint: true
+  baseRef: main
+  exclude:
+    - .artifacts
+    - .codex
+    - .DS_Store
+    - .pytest_cache
+    - .ruff_cache
+    - .venv
+    - dist
+    - htmlcov
+    - playwright-report
+    - test-results
+env:
+  allow:
+    - CI
+    - CLAWBENCH_*
+    - OPENCLAW_*
+    - PYTHON*
+ssh:
+  user: crabbox
+  port: "2222"
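
The `env.allow` list in the config above mixes an exact name (`CI`) with wildcard entries (`CLAWBENCH_*`, `OPENCLAW_*`, `PYTHON*`). The sketch below shows one plausible reading, assuming shell-style glob matching against variable names; how Crabbox actually evaluates the list is not shown in this diff, and `filter_env` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Patterns copied from the `env.allow` list in .crabbox.yaml.
ALLOW = ["CI", "CLAWBENCH_*", "OPENCLAW_*", "PYTHON*"]


def filter_env(env, allow=ALLOW):
    """Keep only variables whose names match an allow pattern (case-sensitive)."""
    return {
        name: value
        for name, value in env.items()
        if any(fnmatchcase(name, pattern) for pattern in allow)
    }


if __name__ == "__main__":
    local_env = {
        "CI": "true",
        "CLAWBENCH_CONCURRENCY": "1",
        "PYTHONPATH": "src",
        "HF_TOKEN": "secret",  # not on the allow list, so it is dropped
    }
    print(sorted(filter_env(local_env)))
```

Under this reading, provider credentials such as `HF_TOKEN` are never forwarded from the local shell and reach the box only through the hydration workflow.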
--- a/.env.example
+++ b/.env.example
@@ -0,0 +1,23 @@
+# Copy to .env for local docker compose or shell-based runs.
+#
+# Do not commit real tokens. Keep placeholder values commented so a fresh
+# checkout cannot accidentally enable a fake provider or tracing config.
+
+# Hugging Face queue/results persistence.
+# HF_TOKEN=
+# CLAWBENCH_QUEUE_DATASET=openclaw/clawbench-results
+
+# OpenClaw gateway auth.
+# OPENCLAW_GATEWAY_TOKEN=local-dev-token-for-testing
+
+# Optional benchmark tuning.
+# CLAWBENCH_RUN_CACHE_DIR=.clawbench/run_cache
+# CLAWBENCH_CONCURRENCY=1
+# CLAWBENCH_JUDGE_MODEL=anthropic/claude-sonnet-4-6
+# CLAWBENCH_JUDGE_AFFECTS_SCORE=0
+
+# Provider credentials for live model runs.
+# ANTHROPIC_API_KEY=
+# OPENAI_API_KEY=
+# OPENROUTER_API_KEY=
+# GEMINI_API_KEY=
--- a/.github/CODEOWNERS
+++ b/.github/CODEOWNERS
@@ -0,0 +1 @@
+* @openclaw/openclaw-evals
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,31 @@
+---
+name: Bug report
+about: Something is broken or producing wrong results
+labels: bug
+---
+
+## What happened
+
+<!-- A clear description of the bug. -->
+
+## Expected behaviour
+
+<!-- What should have happened instead. -->
+
+## Steps to reproduce
+
+```bash
+# Minimal command / code snippet that triggers the bug
+```
+
+## Relevant output
+
+```
+# Full error message, stack trace, or unexpected scoring output
+```
+
+## Environment
+
+- Python version:
+- OS:
+- ClawBench version / commit:
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,21 @@
+---
+name: Feature request
+about: Suggest a new task, scoring improvement, or other enhancement
+labels: enhancement
+---
+
+## Summary
+
+<!-- One or two sentences describing what you want. -->
+
+## Motivation
+
+<!-- Why is this valuable? What problem does it solve, or what gap does it fill? -->
+
+## Proposed approach
+
+<!-- Optional: sketch of how you'd implement it, or what the change would look like. -->
+
+## Alternatives considered
+
+<!-- Any other approaches you thought about and why you ruled them out. -->
--- a/.github/PULL_REQUEST_TEMPLATE.md
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,18 @@
+## What does this PR do?
+
+<!-- One or two sentences. -->
+
+## Why?
+
+<!-- Motivation: what bug does it fix, what gap does it fill? Link related issues with "Fixes #N". -->
+
+## Changes
+
+<!-- Bullet list of the meaningful changes. Skip files touched only for formatting. -->
+
+## Tests
+
+<!-- Describe new or updated tests. If no tests were added, explain why none are needed. -->
+
+- [ ] `python -m pytest -q` passes locally
+- [ ] `python -m ruff check clawbench app.py scripts tests` passes locally, or the change is docs-only
--- a/.github/actionlint.yaml
+++ b/.github/actionlint.yaml
@@ -0,0 +1,14 @@
+# actionlint configuration
+# https://github.com/rhysd/actionlint/blob/main/docs/config.md
+
+self-hosted-runner:
+  labels:
+    - blacksmith-8vcpu-ubuntu-2404
+    - blacksmith-16vcpu-ubuntu-2404
+    - blacksmith-32vcpu-ubuntu-2404
+
+paths:
+  .github/workflows/**/*.yml:
+    ignore:
+      - "shellcheck reported issue.+"
+      - 'label "blacksmith-[0-9]+vcpu-[^"]+" is unknown\.'
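
The two `ignore` patterns above can be sanity-checked against sample messages. The sketch below uses Python's `re` module as a stand-in, assuming the patterns behave the same under actionlint's Go regular expressions and are matched anywhere in the message; the sample messages are illustrative, not captured actionlint output.

```python
import re

# Patterns copied from the `ignore` list in .github/actionlint.yaml.
IGNORE_PATTERNS = [
    r"shellcheck reported issue.+",
    r'label "blacksmith-[0-9]+vcpu-[^"]+" is unknown\.',
]


def is_ignored(message):
    """Return True when any ignore pattern matches somewhere in the message."""
    return any(re.search(pattern, message) for pattern in IGNORE_PATTERNS)


if __name__ == "__main__":
    print(is_ignored('label "blacksmith-8vcpu-ubuntu-2404" is unknown. available labels are ...'))
    print(is_ignored('label "ubuntu-22.05" is unknown. available labels are ...'))
```

The label pattern is deliberately narrow: a mistyped GitHub-hosted label would still be reported, while only `blacksmith-<N>vcpu-...` labels are suppressed.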
--- a/.github/workflows/README.md
+++ b/.github/workflows/README.md
@@ -8,20 +8,54 @@ Runs the repository test suite automatically on:
 - every `pull_request`
 - manual dispatch from the Actions tab

-It uses Python 3.12, installs the package with `pip install -e .`, then
-runs `python -m pytest -q`.
+It uses Python 3.11 and 3.12, installs the package with
+`pip install -e .[dev]`, runs full Ruff lint plus `python -m pytest -q`,
+then builds a wheel and checks that runtime data such as `tasks-public/`,
+`tasks-domain/`, `profiles/`, and `baselines/` are included. Runs under the
+`openclaw` organization use the Blacksmith Ubuntu runner; forks fall back to
+GitHub-hosted `ubuntu-latest`.
+
+## `ci-check-testbox.yml` — Blacksmith Testbox warmup
+
+This workflow exists for the Blacksmith CLI:
+
+```bash
+blacksmith testbox warmup ci-check-testbox.yml --ref main --idle-timeout 90
+blacksmith testbox run --id <tbx_id> "python -m pytest -q"
+```
+
+It installs ClawBench, hydrates provider/HF secrets into
+`~/.clawbench-testbox-live.profile`, restores optional Codex/Claude/Gemini
+dotfiles from repo or org secrets, and installs
+`~/.local/bin/clawbench-testbox-env` for commands that need that live auth.
+
+## `crabbox-hydrate.yml` — Crabbox Actions hydration
+
+This workflow exists for the Crabbox CLI from `openclaw/crabbox`:
+
+```bash
+crabbox warmup --idle-timeout 90m
+crabbox actions hydrate --id <cbx_id-or-slug>
+crabbox run --id <cbx_id-or-slug> --shell -- "python -m pytest -q"
+```
+
+It runs on the dynamic self-hosted runner label registered by Crabbox, installs
+ClawBench, hydrates the same provider/HF secrets and agent dotfiles as the
+Blacksmith Testbox workflow, writes the Crabbox ready marker under
+`~/.crabbox/actions/`, and keeps the job alive for follow-up SSH sync/run
+commands.

 ## `sync-to-hf-space.yml` — auto-mirror main to the HF Space

 Mirrors every push to `main` into the HF Space git remote so
-[huggingface.co/spaces/ScoootScooob/clawbench](https://huggingface.co/spaces/ScoootScooob/clawbench)
+[huggingface.co/spaces/openclaw/clawbench](https://huggingface.co/spaces/openclaw/clawbench)
 always tracks GitHub `main`. GitHub becomes the single source of truth;
 the HF Space is a pure deploy target.

 ## One-time setup (required before the workflow can succeed)

-The workflow needs **two repository secrets**. Neither is checked into
-the repo; you add them via the GitHub UI.
+The workflow needs one repository secret. It can also use an optional
+fallback username secret.

 ### 1. Get a Hugging Face access token

@@ -34,13 +68,13 @@ the repo; you add them via the GitHub UI.

 ### 2. Add the secrets to this repo

-1. Go to <https://github.com/scoootscooob/clawbench/settings/secrets/actions>
-2. Click **"New repository secret"** and add each of these:
+1. Go to <https://github.com/openclaw/clawbench/settings/secrets/actions>
+2. Click **"New repository secret"** and add:

   | Name          | Value                                                      |
   |---------------|------------------------------------------------------------|
   | `HF_TOKEN`    | The write-scoped HF token you created in step 1            |
-   | `HF_USERNAME` | `ScoootScooob` (the owner half of the Space path)          |
+   | `HF_USERNAME` | Optional fallback if token introspection fails             |

 3. Save both.

@@ -68,18 +102,18 @@ status under the Actions tab for any commit.
  workflow mirror it.
 - **Failure modes:**
  - **Missing secrets** → the `Verify required secrets` step fails with
-    a clear error message telling you what to add.
+    a clear error message telling you to add `HF_TOKEN`.
  - **Revoked token** → push fails with a 401; check that `HF_TOKEN`
    still has Write scope on <https://huggingface.co/settings/tokens>.
-  - **Wrong username** → push fails with a repo-not-found error; make
-    sure `HF_USERNAME` matches the Space owner in the URL.
+  - **Missing Space** → the workflow creates the Docker Space before
+    pushing, using `HF_SPACE_ID` or the default `openclaw/clawbench`.

 ## Optional: change the target Space

 If you ever mirror to a different Space (e.g. a staging copy), set a
 repository variable (not a secret) named `HF_SPACE_ID` to the new
 Space ID, for example `yourname/clawbench-staging`. The workflow
-defaults to `ScoootScooob/clawbench` when the variable is unset.
+defaults to `openclaw/clawbench` when the variable is unset.

 ## Why `--force`?

--- a/.github/workflows/ci-check-testbox.yml
+++ b/.github/workflows/ci-check-testbox.yml
@@ -0,0 +1,97 @@
+name: Blacksmith Testbox
+
+on:
+  workflow_dispatch:
+    inputs:
+      testbox_id:
+        type: string
+        description: "Testbox session ID"
+        required: true
+
+permissions:
+  contents: read
+
+jobs:
+  check:
+    name: check
+    runs-on: blacksmith-8vcpu-ubuntu-2404
+    timeout-minutes: 25
+    steps:
+      - name: Begin Testbox
+        uses: useblacksmith/begin-testbox@v2
+        with:
+          testbox_id: ${{ inputs.testbox_id }}
+
+      - name: Checkout
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+
+      - name: Install project
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e .
+
+      - name: Prepare Testbox shell
+        shell: bash
+        run: |
+          set -euo pipefail
+          git fetch --no-tags --depth=50 origin "+refs/heads/main:refs/remotes/origin/main"
+          python_dir="$(dirname "$(python -c 'import sys; print(sys.executable)')")"
+          sudo ln -sf "$python_dir/python" /usr/local/bin/python
+          sudo ln -sf "$python_dir/python" /usr/local/bin/python3
+          sudo ln -sf "$python_dir/pip" /usr/local/bin/pip
+          sudo ln -sf "$python_dir/pip" /usr/local/bin/pip3
+          sudo ln -sf "$python_dir/pytest" /usr/local/bin/pytest
+
+      - name: Hydrate Testbox env helper
+        shell: bash
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+          HF_USERNAME: ${{ secrets.HF_USERNAME }}
+          CLAWBENCH_QUEUE_DATASET: ${{ vars.CLAWBENCH_QUEUE_DATASET || 'openclaw/clawbench-results' }}
+          CLAWBENCH_JUDGE_MODEL: ${{ vars.CLAWBENCH_JUDGE_MODEL }}
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          ANTHROPIC_API_KEY_OLD: ${{ secrets.ANTHROPIC_API_KEY_OLD }}
+          ANTHROPIC_API_TOKEN: ${{ secrets.ANTHROPIC_API_TOKEN }}
+          CEREBRAS_API_KEY: ${{ secrets.CEREBRAS_API_KEY }}
+          DEEPINFRA_API_KEY: ${{ secrets.DEEPINFRA_API_KEY }}
+          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
+          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
+          KIMI_API_KEY: ${{ secrets.KIMI_API_KEY }}
+          MINIMAX_API_KEY: ${{ secrets.MINIMAX_API_KEY }}
+          MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
+          MOONSHOT_API_KEY: ${{ secrets.MOONSHOT_API_KEY }}
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+          OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
+          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
+          QWEN_API_KEY: ${{ secrets.QWEN_API_KEY }}
+          TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
+          XAI_API_KEY: ${{ secrets.XAI_API_KEY }}
+          ZAI_API_KEY: ${{ secrets.ZAI_API_KEY }}
+          Z_AI_API_KEY: ${{ secrets.Z_AI_API_KEY }}
+          OPENCLAW_CODEX_AUTH_JSON: ${{ secrets.OPENCLAW_CODEX_AUTH_JSON }}
+          OPENCLAW_CODEX_CONFIG_TOML: ${{ secrets.OPENCLAW_CODEX_CONFIG_TOML }}
+          OPENCLAW_CLAUDE_JSON: ${{ secrets.OPENCLAW_CLAUDE_JSON }}
+          OPENCLAW_CLAUDE_CREDENTIALS_JSON: ${{ secrets.OPENCLAW_CLAUDE_CREDENTIALS_JSON }}
+          OPENCLAW_CLAUDE_SETTINGS_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_JSON }}
+          OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON }}
+          OPENCLAW_GEMINI_SETTINGS_JSON: ${{ secrets.OPENCLAW_GEMINI_SETTINGS_JSON }}
+          CLAWBENCH_CODEX_AUTH_JSON: ${{ secrets.CLAWBENCH_CODEX_AUTH_JSON }}
+          CLAWBENCH_CODEX_CONFIG_TOML: ${{ secrets.CLAWBENCH_CODEX_CONFIG_TOML }}
+          CLAWBENCH_CLAUDE_JSON: ${{ secrets.CLAWBENCH_CLAUDE_JSON }}
+          CLAWBENCH_CLAUDE_CREDENTIALS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_CREDENTIALS_JSON }}
+          CLAWBENCH_CLAUDE_SETTINGS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_JSON }}
+          CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON }}
+          CLAWBENCH_GEMINI_SETTINGS_JSON: ${{ secrets.CLAWBENCH_GEMINI_SETTINGS_JSON }}
+        run: bash scripts/ci-hydrate-testbox-env.sh
+
+      - name: Run Testbox
+        uses: useblacksmith/run-testbox@v2
+        if: always()
--- a/.github/workflows/ci.yml
+++ b/.github/workflows/ci.yml
@@ -13,24 +13,55 @@ concurrency:

 jobs:
  test:
-    name: Python 3.12 test suite
-    runs-on: ubuntu-latest
+    name: Python ${{ matrix.python-version }} test suite
+    runs-on: ${{ github.repository_owner == 'openclaw' && 'blacksmith-8vcpu-ubuntu-2404' || 'ubuntu-latest' }}
    timeout-minutes: 15
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.11", "3.12"]

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

-      - name: Set up Python 3.12
+      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
-          python-version: "3.12"
+          python-version: ${{ matrix.python-version }}
          cache: pip

      - name: Install project
        run: |
          python -m pip install --upgrade pip
-          python -m pip install -e .
+          python -m pip install -e .[dev]
+
+      - name: Run static lint
+        run: python -m ruff check clawbench app.py scripts tests
+
+      - name: Run runtime contract smoke tests
+        run: python -m pytest -q tests/test_runtime_contracts.py

      - name: Run test suite
        run: python -m pytest -q
+
+      - name: Verify wheel contains runtime data
+        run: |
+          python -m pip wheel --no-deps . -w /tmp/clawbench-wheel
+          python - <<'PY'
+          from pathlib import Path
+          import zipfile
+
+          wheel = next(Path("/tmp/clawbench-wheel").glob("clawbench-*.whl"))
+          with zipfile.ZipFile(wheel) as archive:
+              names = set(archive.namelist())
+          required = [
+              "tasks-public/MANIFEST.yaml",
+              "tasks-domain/MANIFEST.yaml",
+              "profiles/example_research_stack.yaml",
+              "baselines/BASELINE_SOURCES.md",
+          ]
+          missing = [name for name in required if name not in names]
+          if missing:
+              raise SystemExit(f"wheel missing runtime files: {missing}")
+          PY
--- a/.github/workflows/crabbox-hydrate.yml
+++ b/.github/workflows/crabbox-hydrate.yml
@@ -0,0 +1,166 @@
+name: Crabbox Hydrate
+
+on:
+  workflow_dispatch:
+    inputs:
+      crabbox_id:
+        description: "Crabbox lease ID"
+        required: true
+        type: string
+      ref:
+        description: "Git ref to hydrate"
+        required: false
+        type: string
+      crabbox_runner_label:
+        description: "Dynamic Crabbox runner label"
+        required: true
+        type: string
+      crabbox_job:
+        description: "Hydration job identifier expected by Crabbox"
+        required: false
+        default: "hydrate"
+        type: string
+      crabbox_keep_alive_minutes:
+        description: "Minutes to keep the hydrated job alive"
+        required: false
+        default: "90"
+        type: string
+
+permissions:
+  contents: read
+
+jobs:
+  hydrate:
+    name: hydrate
+    runs-on: [self-hosted, "${{ inputs.crabbox_runner_label }}"]
+    timeout-minutes: 120
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v4
+        with:
+          ref: ${{ inputs.ref || github.ref }}
+
+      - name: Set up Python
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+          cache: pip
+
+      - name: Install project
+        run: |
+          python -m pip install --upgrade pip
+          python -m pip install -e .
+
+      - name: Prepare Crabbox shell
+        shell: bash
+        run: |
+          set -euo pipefail
+          git fetch --no-tags --depth=50 origin "+refs/heads/main:refs/remotes/origin/main"
+          python_dir="$(dirname "$(python -c 'import sys; print(sys.executable)')")"
+          sudo ln -sf "$python_dir/python" /usr/local/bin/python
+          sudo ln -sf "$python_dir/python" /usr/local/bin/python3
+          sudo ln -sf "$python_dir/pip" /usr/local/bin/pip
+          sudo ln -sf "$python_dir/pip" /usr/local/bin/pip3
+          sudo ln -sf "$python_dir/pytest" /usr/local/bin/pytest
+
+      - name: Hydrate Crabbox env helper
+        shell: bash
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+          HF_USERNAME: ${{ secrets.HF_USERNAME }}
+          CLAWBENCH_QUEUE_DATASET: ${{ vars.CLAWBENCH_QUEUE_DATASET || 'openclaw/clawbench-results' }}
+          CLAWBENCH_JUDGE_MODEL: ${{ vars.CLAWBENCH_JUDGE_MODEL }}
+          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+          ANTHROPIC_API_KEY_OLD: ${{ secrets.ANTHROPIC_API_KEY_OLD }}
+          ANTHROPIC_API_TOKEN: ${{ secrets.ANTHROPIC_API_TOKEN }}
+          CEREBRAS_API_KEY: ${{ secrets.CEREBRAS_API_KEY }}
+          DEEPINFRA_API_KEY: ${{ secrets.DEEPINFRA_API_KEY }}
+          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
+          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
+          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
+          GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
+          KIMI_API_KEY: ${{ secrets.KIMI_API_KEY }}
+          MINIMAX_API_KEY: ${{ secrets.MINIMAX_API_KEY }}
+          MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
+          MOONSHOT_API_KEY: ${{ secrets.MOONSHOT_API_KEY }}
+          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+          OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
+          OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
+          QWEN_API_KEY: ${{ secrets.QWEN_API_KEY }}
+          TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
+          XAI_API_KEY: ${{ secrets.XAI_API_KEY }}
+          ZAI_API_KEY: ${{ secrets.ZAI_API_KEY }}
+          Z_AI_API_KEY: ${{ secrets.Z_AI_API_KEY }}
+          OPENCLAW_CODEX_AUTH_JSON: ${{ secrets.OPENCLAW_CODEX_AUTH_JSON }}
+          OPENCLAW_CODEX_CONFIG_TOML: ${{ secrets.OPENCLAW_CODEX_CONFIG_TOML }}
+          OPENCLAW_CLAUDE_JSON: ${{ secrets.OPENCLAW_CLAUDE_JSON }}
+          OPENCLAW_CLAUDE_CREDENTIALS_JSON: ${{ secrets.OPENCLAW_CLAUDE_CREDENTIALS_JSON }}
+          OPENCLAW_CLAUDE_SETTINGS_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_JSON }}
+          OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON }}
+          OPENCLAW_GEMINI_SETTINGS_JSON: ${{ secrets.OPENCLAW_GEMINI_SETTINGS_JSON }}
+          CLAWBENCH_CODEX_AUTH_JSON: ${{ secrets.CLAWBENCH_CODEX_AUTH_JSON }}
+          CLAWBENCH_CODEX_CONFIG_TOML: ${{ secrets.CLAWBENCH_CODEX_CONFIG_TOML }}
+          CLAWBENCH_CLAUDE_JSON: ${{ secrets.CLAWBENCH_CLAUDE_JSON }}
+          CLAWBENCH_CLAUDE_CREDENTIALS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_CREDENTIALS_JSON }}
+          CLAWBENCH_CLAUDE_SETTINGS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_JSON }}
+          CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON }}
+          CLAWBENCH_GEMINI_SETTINGS_JSON: ${{ secrets.CLAWBENCH_GEMINI_SETTINGS_JSON }}
+        run: |
+          bash scripts/ci-hydrate-testbox-env.sh
+          sudo ln -sf "$HOME/.local/bin/clawbench-testbox-env" /usr/local/bin/clawbench-testbox-env
+
+      - name: Mark Crabbox ready
+        shell: bash
+        run: |
+          set -euo pipefail
+          job="${{ inputs.crabbox_job }}"
+          if [ -z "$job" ]; then job=hydrate; fi
+          mkdir -p "$HOME/.crabbox/actions"
+          state="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env"
+          env_file="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env.sh"
+          services_file="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.services"
+          write_export() {
+            key="$1"
+            value="${!key-}"
+            if [ -n "$value" ]; then
+              printf 'export %s=%q\n' "$key" "$value"
+            fi
+          }
+          {
+            for key in CI GITHUB_ACTIONS GITHUB_WORKSPACE GITHUB_REPOSITORY GITHUB_RUN_ID GITHUB_RUN_NUMBER GITHUB_RUN_ATTEMPT GITHUB_REF GITHUB_REF_NAME GITHUB_SHA GITHUB_EVENT_NAME GITHUB_ACTOR RUNNER_OS RUNNER_ARCH RUNNER_TEMP RUNNER_TOOL_CACHE; do
+              write_export "$key"
+            done
+          } > "${env_file}.tmp"
+          mv "${env_file}.tmp" "$env_file"
+          {
+            echo "# Docker containers visible from the hydrated runner"
+            docker ps --format '{{.Names}}\t{{.Image}}\t{{.Ports}}' 2>/dev/null || true
+          } > "${services_file}.tmp"
+          mv "${services_file}.tmp" "$services_file"
+          tmp="${state}.tmp"
+          {
+            echo "WORKSPACE=${GITHUB_WORKSPACE}"
+            echo "RUN_ID=${GITHUB_RUN_ID}"
+            echo "JOB=${job}"
+            echo "ENV_FILE=${env_file}"
+            echo "SERVICES_FILE=${services_file}"
+            echo "READY_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
+          } > "$tmp"
+          mv "$tmp" "$state"
+
+      - name: Keep Crabbox job alive
+        shell: bash
+        run: |
+          set -euo pipefail
+          minutes="${{ inputs.crabbox_keep_alive_minutes }}"
+          case "$minutes" in
+            ''|*[!0-9]*) minutes=90 ;;
+          esac
+          stop="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.stop"
+          deadline=$(( $(date +%s) + minutes * 60 ))
+          while [ "$(date +%s)" -lt "$deadline" ]; do
+            if [ -f "$stop" ]; then
+              exit 0
+            fi
+            sleep 15
+          done
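
The "Mark Crabbox ready" step above writes a `<crabbox_id>.env` state file of `KEY=VALUE` lines (`WORKSPACE`, `RUN_ID`, `JOB`, `ENV_FILE`, `SERVICES_FILE`, `READY_AT`). A consumer of that file might look like the sketch below. How Crabbox itself reads the marker is not shown in this diff, so `read_ready_marker` and `is_ready` are hypothetical helpers.

```python
from pathlib import Path


def read_ready_marker(path):
    """Parse KEY=VALUE lines from a ready-marker file into a dict."""
    state = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore blank or malformed lines without '='
            state[key] = value
    return state


def is_ready(path, expected_job="hydrate"):
    """Return True when the marker exists, names the expected job, and has READY_AT."""
    marker = Path(path)
    if not marker.is_file():
        return False
    state = read_ready_marker(marker)
    return state.get("JOB") == expected_job and "READY_AT" in state


if __name__ == "__main__":
    # Assumed location, following the workflow's "$HOME/.crabbox/actions" directory.
    example = Path.home() / ".crabbox" / "actions" / "cbx_example.env"
    print(is_ready(example))
```

Because the workflow writes to a `.tmp` file and then renames it, a reader like this should never observe a half-written marker.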
--- a/.github/workflows/sync-to-hf-space.yml
+++ b/.github/workflows/sync-to-hf-space.yml
@@ -1,19 +1,17 @@
 name: Sync main to HF Space

 # Mirrors every push to `main` on GitHub into the HF Space git remote so
-# that the public ClawBench Space (https://huggingface.co/spaces/ScoootScooob/clawbench)
+# that the public ClawBench Space (https://huggingface.co/spaces/openclaw/clawbench)
 # always tracks the source-of-truth repo.
 #
 # Required repository secrets (Settings -> Secrets and variables -> Actions):
 #   HF_TOKEN     Hugging Face access token with write permission to the Space.
 #                Create at https://huggingface.co/settings/tokens
 #                (token type "Write" is sufficient; no organization scope needed).
-#   HF_USERNAME  Your Hugging Face username, e.g. "ScoootScooob".
-#                (The Space is `ScoootScooob/clawbench`, so the username is
-#                the owner half of that path.)
+#   HF_USERNAME  Optional fallback username if token introspection fails.
 #
 # Optional: set HF_SPACE_ID as a repo variable (not secret) to point the
-# workflow at a different Space; defaults to "ScoootScooob/clawbench".
+# workflow at a different Space; defaults to "openclaw/clawbench".

 on:
  push:
@ -42,20 +40,58 @@ jobs:
      - name: Verify required secrets
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-          HF_USERNAME: ${{ secrets.HF_USERNAME }}
        run: |
-          if [ -z "$HF_TOKEN" ] || [ -z "$HF_USERNAME" ]; then
-            echo "::error::HF_TOKEN and HF_USERNAME repository secrets must both be set."
+          if [ -z "$HF_TOKEN" ]; then
+            echo "::error::HF_TOKEN repository secret must be set."
            echo "  Create HF_TOKEN at https://huggingface.co/settings/tokens (type: Write)"
-            echo "  Set HF_USERNAME to your HF username (the owner of the Space)."
            exit 1
          fi

+      - name: Ensure HF Space exists
+        id: hf
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+          HF_USERNAME: ${{ secrets.HF_USERNAME }}
+          HF_SPACE_ID: ${{ vars.HF_SPACE_ID || 'openclaw/clawbench' }}
+        run: |
+          set -euo pipefail
+          python -m pip install --quiet 'huggingface_hub>=0.24,<2'
+          python - <<'PY'
+          import os
+
+          from huggingface_hub import HfApi
+
+          token = os.environ["HF_TOKEN"]
+          space_id = os.environ["HF_SPACE_ID"]
+          fallback_username = os.environ.get("HF_USERNAME", "").strip()
+
+          api = HfApi(token=token)
+          username = fallback_username
+          try:
+              info = api.whoami(token=token)
+              username = str(info.get("name") or username).strip()
+          except Exception as exc:
+              if not username:
+                  raise RuntimeError("HF_USERNAME fallback is required when token introspection fails") from exc
+
+          api.create_repo(
+              repo_id=space_id,
+              repo_type="space",
+              space_sdk="docker",
+              token=token,
+              exist_ok=True,
+          )
+
+          with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as output:
+              output.write(f"username={username}\n")
+          print(f"HF Space ready: {space_id}")
+          PY
+
      - name: Push to HF Space remote
        env:
          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-          HF_USERNAME: ${{ secrets.HF_USERNAME }}
-          HF_SPACE_ID: ${{ vars.HF_SPACE_ID || 'ScoootScooob/clawbench' }}
+          HF_USERNAME: ${{ steps.hf.outputs.username }}
+          HF_SPACE_ID: ${{ vars.HF_SPACE_ID || 'openclaw/clawbench' }}
        run: |
          set -euo pipefail
          # Authenticate via token in the URL. HF Spaces accept the
@ -83,6 +119,6 @@ jobs:
        run: |
          echo "### HF Space mirror" >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"
-          echo "Pushed \`$(git rev-parse --short HEAD)\` to \`ScoootScooob/clawbench\` Space." >> "$GITHUB_STEP_SUMMARY"
+          echo "Pushed \`$(git rev-parse --short HEAD)\` to \`${{ vars.HF_SPACE_ID || 'openclaw/clawbench' }}\` Space." >> "$GITHUB_STEP_SUMMARY"
          echo "" >> "$GITHUB_STEP_SUMMARY"
-          echo "View the Space: <https://huggingface.co/spaces/ScoootScooob/clawbench>" >> "$GITHUB_STEP_SUMMARY"
+          echo "View the Space: <https://huggingface.co/spaces/${{ vars.HF_SPACE_ID || 'openclaw/clawbench' }}>" >> "$GITHUB_STEP_SUMMARY"
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -0,0 +1,16 @@
+repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.14.14
+    hooks:
+      - id: ruff
+
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v6.0.0
+    hooks:
+      - id: check-added-large-files
+      - id: check-case-conflict
+      - id: check-merge-conflict
+      - id: check-toml
+      - id: check-yaml
+      - id: end-of-file-fixer
+      - id: trailing-whitespace
--- a/.python-version
+++ b/.python-version
@ -0,0 +1 @@
+3.12
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@ -0,0 +1,127 @@
+# Contributing to ClawBench
+
+Thank you for your interest in contributing. This document explains how to get
+set up, what kinds of contributions are welcome, and how the review process
+works.
+
+---
+
+## Getting started
+
+**Requirements:** Python 3.11+, Docker (for full end-to-end runs).
+
+```bash
+git clone https://github.com/openclaw/clawbench.git
+cd clawbench
+python -m venv .venv && source .venv/bin/activate
+python -m pip install -e ".[dev]"
+```
+
+Run the test suite to confirm everything is working:
+
+```bash
+python -m pytest -q
+python -m ruff check clawbench app.py scripts tests
+```
+
+The full local suite should pass before you make any changes.
+
+---
+
+## What we welcome
+
+| Type | Notes |
+|------|-------|
+| **Bug fixes** | Include a test that reproduces the bug before the fix |
+| **New tasks** | See [Adding tasks](#adding-tasks) below |
+| **Scoring improvements** | Changes to `trajectory.py`, `scorer.py`, or `judge.py` must include updated tests and a clear rationale |
+| **Documentation** | Fixes to README, spec docs, or inline comments |
+| **Tooling / CI** | Workflow improvements, linting, dependency updates |
+
+We are unlikely to merge:
+- Large architectural rewrites without prior discussion in an issue
+- New dependencies without justification
+- Changes that reduce test coverage
+
+---
+
+## Making a change
+
+1. **Open an issue first** for anything non-trivial. This lets us align on
+   approach before you invest time writing code.
+
+2. **Create a branch** from `main`:
+   ```bash
+   git checkout -b fix/short-description
+   ```
+   Use a `fix/`, `feat/`, `docs/`, or `chore/` prefix for branch names.
+
+3. **Write tests.** Bug fixes must include a test that fails before the fix
+   and passes after. New features must include tests covering the new
+   behaviour.
+
+4. **Run the test suite:**
+   ```bash
+   python -m pytest -q
+   ```
+
+5. **Open a pull request** against `main`. Fill in the PR template.
+
+---
+
+## Adding tasks
+
+Public tasks live in `tasks-public/tier{1-5}/` as YAML files. Domain and
+partner tasks live under `tasks-domain/`. Each task needs:
+
+- A unique `id` and descriptive `name`
+- The correct `tier` (1 = simple single-tool, 5 = adversarial/multi-step)
+- `completion` checks — at least one deterministic verifier (`execution_checks`,
+  `file_equality`, or a gateway assertion)
+- `trajectory` expectations that reflect how a competent agent should approach
+  the task
+- A `judge` rubric for semantic tasks
+
+Before submitting a new task, run it against at least one agent to verify the
+completion checks fire correctly.
+
+---
+
+## Commit style
+
+```
+type: short imperative summary (≤72 chars)
+
+Optional longer explanation. Wrap at 72 chars. Explain *why*, not what —
+the diff shows what changed.
+```
+
+Types: `fix`, `feat`, `docs`, `test`, `chore`, `refactor`.
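The commit convention above can be checked mechanically. The following is a minimal sketch, assuming the summary line must start with one of the listed types followed by `: ` and stay within 72 characters; `check_summary` and `COMMIT_TYPES` are hypothetical names, not part of the repository.

```python
import re

# Types copied from the list above; the regex and helper are illustrative.
COMMIT_TYPES = ("fix", "feat", "docs", "test", "chore", "refactor")
SUMMARY_RE = re.compile(r"^(%s): \S.*$" % "|".join(COMMIT_TYPES))


def check_summary(line: str) -> bool:
    """Return True if the summary line fits the `type: summary` convention."""
    # Length limit first, then the `type: text` shape.
    return len(line) <= 72 and SUMMARY_RE.match(line) is not None


ok = check_summary("fix: handle empty trajectory in scorer")
missing_type = check_summary("update stuff")
too_long = check_summary("feat: " + "x" * 70)
```

A check like this could run in a local `commit-msg` hook, though whether the project wants to enforce the convention is a separate decision.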
+
+---
+
+## Code style
+
+The project uses Ruff and pre-commit for local guardrails. Please follow the
+style of the surrounding code: 4-space indentation, descriptive variable names,
+and comments only where the logic is not self-evident.
+
+```bash
+python -m ruff check clawbench app.py scripts tests
+pre-commit run --files <changed files>
+```
+
+---
+
+## Reporting bugs
+
+Use the [bug report template](.github/ISSUE_TEMPLATE/bug_report.md). Include:
+- The command you ran
+- The full error output or unexpected behaviour
+- The Python version and OS
+
+---
+
+## Questions
+
+Open an issue for questions that are not bug reports or feature requests.
--- a/16
+++ b/16
@ -1,7 +1,8 @@
 # ClawBench HF Docker Space
-# Layer the benchmark harness on top of the official OpenClaw image.
+# Layer the benchmark harness on top of a pinned OpenClaw image.

-FROM ghcr.io/openclaw/openclaw:latest
+ARG OPENCLAW_IMAGE=ghcr.io/openclaw/openclaw@sha256:2e32f4f2e4f653f12d5dc6e5c93cc71e60f49d1dfaf061b18e53c3e61a38fb48
+FROM ${OPENCLAW_IMAGE}

 USER root

@ -13,7 +14,7 @@ RUN apt-get update && \
 RUN ln -s /app /openclaw

 ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
-RUN npx -y playwright@1.59.1 install --with-deps chromium && \
+RUN cd /tmp && npx -y playwright@1.59.1 install --with-deps chromium && \
    CHROME_PATH="$(find /ms-playwright -path '*/chrome' -type f | sort | head -n 1)" && \
    test -x "$CHROME_PATH" && \
    ln -sf "$CHROME_PATH" /usr/bin/chromium
@ -21,10 +22,13 @@ RUN npx -y playwright@1.59.1 install --with-deps chromium && \
 ENV HOME=/home/node PATH=/home/node/.local/bin:$PATH
 WORKDIR /home/node/app

-COPY --chown=node:node pyproject.toml README.md ./
+COPY --chown=node:node pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
 COPY --chown=node:node clawbench/ clawbench/
-COPY --chown=node:node tasks/ tasks/
+COPY --chown=node:node tasks-public/ tasks-public/
+COPY --chown=node:node tasks-domain/ tasks-domain/
+COPY --chown=node:node profiles/ profiles/
 COPY --chown=node:node baselines/ baselines/
+COPY --chown=node:node scripts/ scripts/
 COPY --chown=node:node app.py .

 RUN python3 -m pip install --break-system-packages --no-cache-dir .
@ -35,7 +39,7 @@ RUN mkdir -p \
    /home/node/.openclaw/agents/dev \
    /home/node/.openclaw/agents/main/agent && \
    chown -R node:node /data /home/node/.openclaw && \
-    chmod -R 777 /data /home/node/.openclaw
+    chmod -R 775 /data /home/node/.openclaw

 USER node

--- a/Dockerfile.main
+++ b/Dockerfile.main
@ -25,9 +25,11 @@ RUN npx -y playwright@1.59.1 install --with-deps chromium && \
 ENV HOME=/home/node PATH=/home/node/.local/bin:$PATH
 WORKDIR /home/node/app

-COPY --chown=node:node pyproject.toml README.md ./
+COPY --chown=node:node pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
 COPY --chown=node:node clawbench/ clawbench/
-COPY --chown=node:node tasks/ tasks/
+COPY --chown=node:node tasks-public/ tasks-public/
+COPY --chown=node:node tasks-domain/ tasks-domain/
+COPY --chown=node:node profiles/ profiles/
 COPY --chown=node:node baselines/ baselines/
 COPY --chown=node:node app.py .

@ -39,7 +41,7 @@ RUN mkdir -p \
    /home/node/.openclaw/agents/dev \
    /home/node/.openclaw/agents/main/agent && \
    chown -R node:node /data /home/node/.openclaw && \
-    chmod -R 777 /data /home/node/.openclaw
+    chmod -R 775 /data /home/node/.openclaw

 USER node

--- a/21
+++ b/21
@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 ClawBench Contributors
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/README.md
+++ b/README.md
@ -13,18 +13,34 @@ license: mit

 # ClawBench

-**The agent benchmark that measures what users actually experience.**
+**Rigorous agent evaluation. Signal-curated tasks. Dynamical-systems diagnostics.**

-[![Python 3.12+](https://img.shields.io/badge/python-3.12+-3776AB.svg?style=flat-square)](https://www.python.org/downloads/)
+[![Python 3.11+](https://img.shields.io/badge/python-3.11+-3776AB.svg?style=flat-square)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg?style=flat-square)](LICENSE)
-[![Tasks: 40](https://img.shields.io/badge/tasks-40-blue.svg?style=flat-square)](#task-suite)
-[![Tests: 107](https://img.shields.io/badge/tests-107-success.svg?style=flat-square)](#testing)
-[![HF Dataset](https://img.shields.io/badge/HF-dataset-yellow.svg?style=flat-square)](https://huggingface.co/datasets/ScoootScooob/clawbench-results)
+[![Core v1: 19 tasks](https://img.shields.io/badge/Core%20v1-19%20tasks-blue.svg?style=flat-square)](tasks-public/)
+[![Diagnostics](https://img.shields.io/badge/diagnostics-dynamical-blueviolet.svg?style=flat-square)](#3-dynamical-systems-diagnostics-how-agents-fail-not-just-whether)
+[![HF Dataset](https://img.shields.io/badge/HF-dataset-yellow.svg?style=flat-square)](https://huggingface.co/datasets/openclaw/clawbench-results)

 </div>

 ---

+## What's new in Core v1 (2026-04-20)
+
+Core v1 is a reproducibility-first public release of the benchmark, informed by a full 8-model, 1,080-run sweep audit. It adds five new methodology layers that most agent benchmarks lack:
+
+| Innovation | What it means | Why it matters |
+|---|---|---|
+| **Signal-curated task set** | 19 tasks selected from the 40-task dev pool by greedy SNR-preserving elimination | Drops tasks where seed noise exceeds capability signal (21 such tasks exist in the raw 40) |
+| **Variance decomposition** | Measures and reports seed-noise vs capability-signal ratio per task | **47% of 40-task variance is seed noise** — we quantify it; most benchmarks hide it |
+| **Dynamical-systems diagnostics** | Per-run regime classification (trapped / convergent / diffusive / chaotic / limit-cycle / unknown) | Reveals *how* agents fail, not just whether. Inspired by a Markov-kernel / attractor-basin framework |
+| **Constraint Index C(q)** | Principled task-weighting via participation ratio + entropy + Bayes prediction | Distinguishes "everyone converges" from "everyone diverges" tasks — enables honest weighted ranking |
+| **Reproducibility-first infrastructure** | Per-container state isolation, judge-infra rejudge pipeline, documented OpenRouter-routing caveats | Eliminates the cascading-failure / silent-judge-error patterns that bias most agent benchmarks |
+
+All of it lives in `scripts/` and `tasks-public/` — auditable code, not opaque numbers.
+
+---
+
 ## The problem with every agent benchmark

 You run a benchmark. Model A scores 73%. Model B scores 71%. You pick Model A.
@ -33,16 +49,14 @@ Then Model A deletes your test fixtures, hallucinates that it ran `pytest` (it d

 **The benchmark told you Model A was better. Your users would disagree.**

-This happens because every agent benchmark shipping today measures the *endpoint* — did the final file look right? — but throws away the *journey*. They treat the agent as a black box that either produces correct output or doesn't. One run, one number, move on.
+Beyond that, most benchmarks don't tell you:
+- Whether the gap is signal or noise
+- Which tasks actually discriminate models and which are coin-flips
+- How the agent *dynamically* fails — attractor, limit-cycle, goal drift
+- Whether re-running gives the same ranking (spoiler: on most benchmarks, no)
+- What's driving your score — the model, the plugin stack, or the harness version

-But that's not how users experience agents. Users experience:
- **Reliability** — does it work 3 out of 3 times, or 1 out of 3?
- **Process quality** — did it read the code before editing, or blind-patch and pray?
- **Safety** — did it `rm -rf` something it shouldn't have?
- **Failure modes** — when it fails, does it fail gracefully or hallucinate success?
- **Configuration sensitivity** — is the score coming from the model, or from the plugins wrapped around it?
-
-No existing benchmark captures any of this. ClawBench captures all of it.
+ClawBench addresses all of this. Below is how.

 ---

@ -52,18 +66,16 @@ No existing benchmark captures any of this. ClawBench captures all of it.

 Every agent run produces a full execution trace: every tool call, every file read, every `pytest` invocation, every retry after failure. Most benchmarks throw this away and check the final state. ClawBench scores *from the trace itself*.

-This is why our scoring has four axes, not one:
-
 | Axis | Weight | What it measures | Where it comes from |
 |------|--------|-----------------|-------------------|
 | **Completion** | 40% | Did the work actually get done? | Deterministic verifiers: `pytest`, exit codes, file equality, DOM assertions, memory state |
 | **Trajectory** | 30% | Did the agent work well? | Trace analysis: read-before-write ratio, self-verification, recovery after failure, tool-family fit |
 | **Behavior** | 20% | Was the agent safe and communicative? | Pattern detection: planning, progress updates, destructive command avoidance |
-| **Judge** | 10% | Is the semantic quality good? | LLM evaluation (gated — only contributes when deterministic completion is already near-perfect) |
+| **Judge** | Advisory | Is the semantic quality good? | LLM evaluation sidecar; opt-in experimental judge-weighted scoring is gated |

-**The key invariant**: the LLM judge can never rescue a failed deterministic check. If `pytest` fails, the judge score is zeroed. This is enforced in code and tested. It means you can't game ClawBench by producing output that *looks* correct to an LLM but doesn't actually work.
+**The key invariant**: the LLM judge can never rescue a failed deterministic check. Official scoring keeps judge results as a sidecar signal. Experimental judge-weighted scoring must be explicitly enabled and still gates judge contribution behind deterministic completion.

-### 2. We measure reliability, not just capability
+### 2. We measure reliability AND quantify noise

 A model that scores 90% on one run and 20% on the next is not a 55% model. It's an unreliable model. Users experience the worst run, not the average.

@ -73,13 +85,81 @@ ClawBench runs every task 3 times and reports:
 - **Taguchi Signal-to-Noise** — asymmetrically penalizes the worst runs, because that's what matters in production
 - **Bootstrap confidence intervals** — 10,000 resamples per task, so you know when a score difference is real vs. noise
 - **Worst-of-n** — the score that actually determines user trust
- **13 failure modes** — not just "pass/fail" but *how* it failed: `hallucinated_completion`, `tool_misuse`, `verification_skipped`, `state_regression`, `graceful_refusal`, and 8 more
+- **13 failure modes** — `hallucinated_completion`, `tool_misuse`, `verification_skipped`, `state_regression`, `graceful_refusal`, and 8 more (not just "pass/fail")

-### 3. We ablate configurations, not just models
+Beyond per-run reliability, we decompose **benchmark-wide variance** into seed-noise vs capability signal:

-Here's a finding that reframes the entire benchmarking conversation: on realistic tasks, **swapping the plugin configuration produces score swings 10x larger than swapping the model**. The same Claude Sonnet can beat Claude Opus when wrapped in better tooling.
+```
+SNR(task) = capability_variance(across models) / mean_seed_variance(per model)
+```

-If the configuration drives 10x more variance than the model, the benchmark should measure it. ClawBench's v0.5 Configuration Diagnostic does exactly this:
+Findings from the v4-19-full sweep audit:
+- **Only 52.7% of run_score variance is real capability signal**; 47.3% is seed noise
+- **2 tasks have SNR ≥ 5** (reliably discriminate models)
+- **21 tasks have SNR < 1** (seed noise ≥ capability signal; rankings on these tasks are essentially random)
+
+Core v1 drops the noisy tasks and reports variance decomposition alongside rankings. This is a level of rigor that most benchmarks don't attempt.
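To make the SNR definition above concrete, here is a minimal sketch of the decomposition on toy data. It assumes population variance (`pvariance`) and a small epsilon in the denominator; the function name, the input layout, and the toy task and model names are hypothetical and are not taken from the scripts in the repository.

```python
from statistics import fmean, pvariance


def variance_decomposition(scores):
    """scores[task][model] is a list of run_score values, one per seed."""
    seed_var, cap_var, snr = {}, {}, {}
    for task, by_model in scores.items():
        runs_per_model = list(by_model.values())
        # Seed noise: variance across seeds, averaged over models.
        seed_var[task] = fmean([pvariance(runs) for runs in runs_per_model])
        # Capability signal: variance across models of the per-model mean.
        cap_var[task] = pvariance([fmean(runs) for runs in runs_per_model])
        # Epsilon avoids division by zero on perfectly repeatable tasks.
        snr[task] = cap_var[task] / (seed_var[task] + 1e-9)
    mean_cap = fmean(list(cap_var.values()))
    mean_seed = fmean(list(seed_var.values()))
    return snr, mean_cap / (mean_cap + mean_seed)


toy = {
    # Models differ, seeds agree: all variance is capability signal.
    "toy-signal": {"model-a": [0.9, 0.9, 0.9], "model-b": [0.1, 0.1, 0.1]},
    # Models have the same mean, seeds disagree: all variance is seed noise.
    "toy-noise": {"model-a": [0.2, 0.8, 0.5], "model-b": [0.5, 0.2, 0.8]},
}
snr, capability_fraction = variance_decomposition(toy)
```

On this toy input the first task has a very large SNR and the second has an SNR of zero, which is the distinction the curation step relies on.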
+
+### 3. Dynamical-systems diagnostics: how agents fail, not just whether
+
+Inspired by *"When LLMs Are Dreaming, Where Do They Go?"* — we treat each agent run as a stochastic trajectory in semantic state space and extract signal that flat `run_score` averages away.
+
+Current code-path formulas:
+
+```text
+Per assistant step t:
+x_t = [tool_family_proportions(6), error_flag, normalized_tokens, normalized_text_len, progress]
+drift_t = cosine_distance(x_0, x_t)
+step_t = cosine_distance(x_{t-1}, x_t)
+
+Task-level Constraint Index:
+PR(q) = tr(Σ_q)^2 / tr(Σ_q^2)
+H(q) = -Σ_i p_i log2 p_i,   p_i = λ_i / Σ_j λ_j,   λ = eigvals(Σ_q)
+BOPS(q) = mean_m mean_{i<j} cos(v_{q,m,i}, v_{q,m,j})
+C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))
+
+Per-run constraint index used inside the regime classifier:
+PR_run = 1 / Σ_i p_i^2
+constraint_index_run = 1 - (PR_run - 1) / (d - 1)
+
+Variance decomposition:
+seed_var(q) = mean_m Var(run_score_{q,m,*})
+cap_var(q) = Var_m Mean(run_score_{q,m,*})
+SNR(q) = cap_var(q) / (seed_var(q) + 1e-9)
+capability_fraction = mean_q cap_var(q) / (mean_q cap_var(q) + mean_q seed_var(q))
+
+Survival:
+T_F = first assistant turn with empty text and no tool calls,
+      else final assistant turn if run_score < 0.7 and delivery_outcome in {fail, partial}
+S(t) = P(T_F > t)
+h(t) = P(T_F = t | T_F >= t)
+```
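The participation ratio and entropy terms above can be illustrated with a short sketch. It relies on the identity `tr(Σ)^2 / tr(Σ^2) = 1 / Σ_i p_i^2` when `p_i` are the normalized eigenvalues, so both the task-level and per-run forms reduce to the same computation on a weight vector. The function names are hypothetical, and what the per-run weights represent in `clawbench/dynamics.py` is not specified here; the inputs below are plain illustrative numbers.

```python
import math


def normalize(weights):
    """Turn non-negative weights (e.g. eigenvalues) into proportions p_i."""
    total = sum(weights)
    return [w / total for w in weights]


def participation_ratio(weights):
    # PR = 1 / sum(p_i^2): 1 when one direction dominates, d when uniform.
    p = normalize(weights)
    return 1.0 / sum(x * x for x in p)


def spectral_entropy_bits(weights):
    # H = -sum(p_i * log2 p_i), skipping zero-probability terms.
    p = normalize(weights)
    return -sum(x * math.log2(x) for x in p if x > 0)


def constraint_index_run(weights):
    # 1 - (PR - 1) / (d - 1); requires at least two dimensions.
    d = len(weights)
    return 1.0 - (participation_ratio(weights) - 1.0) / (d - 1)


concentrated = [1.0, 0.0, 0.0, 0.0]
uniform = [1.0, 1.0, 1.0, 1.0]
```

With these inputs, the concentrated vector gives a constraint index of 1 (fully constrained) and the uniform vector gives 0, which matches the intended reading of the formula.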
+
+Implemented regime classifier in `clawbench/dynamics.py`:
+
+```text
+trapped      if H_tools < 0.5 or (error_rate > 0.6 and std(drift) < 0.05)
+convergent   if std(drift_last_quartile) < 0.1 and mean(step_last_quartile) < 0.15 and error_rate < 0.2
+diffusive    if H_tools > 1.5 and error_rate < 0.15 and constraint_index_run < 0.8
+chaotic      if H_tools > 2.0 and var(step[1:]) > 0.02
+limit_cycle  if max autocorr(centered step[1:], lags 2..5) > 0.3
+unknown      otherwise, or <3 assistant turns
+```
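The rule table above can be read as a simple decision cascade. The sketch below encodes the listed thresholds over precomputed statistics. It is not the code in `clawbench/dynamics.py`: the `RunStats` container, its field names, and the top-to-bottom evaluation order are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Precomputed per-run statistics (field names are hypothetical)."""
    n_turns: int                 # assistant turns in the run
    tool_entropy: float          # H_tools
    error_rate: float
    drift_std: float             # std(drift) over the whole run
    drift_last_q_std: float      # std(drift) over the last quartile
    step_last_q_mean: float      # mean(step) over the last quartile
    step_var: float              # var(step[1:])
    constraint_index_run: float
    max_step_autocorr: float     # max autocorr of centered step[1:], lags 2..5


def classify_regime(s: RunStats) -> str:
    # Assumption: rules are checked in the order listed, first match wins.
    if s.n_turns < 3:
        return "unknown"
    if s.tool_entropy < 0.5 or (s.error_rate > 0.6 and s.drift_std < 0.05):
        return "trapped"
    if s.drift_last_q_std < 0.1 and s.step_last_q_mean < 0.15 and s.error_rate < 0.2:
        return "convergent"
    if s.tool_entropy > 1.5 and s.error_rate < 0.15 and s.constraint_index_run < 0.8:
        return "diffusive"
    if s.tool_entropy > 2.0 and s.step_var > 0.02:
        return "chaotic"
    if s.max_step_autocorr > 0.3:
        return "limit_cycle"
    return "unknown"


single_tool = RunStats(10, 0.2, 0.1, 0.2, 0.2, 0.3, 0.01, 0.9, 0.0)
looping = RunStats(12, 1.0, 0.3, 0.2, 0.2, 0.3, 0.01, 0.9, 0.5)
too_short = RunStats(2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0)
```

If the real classifier evaluates the rules in a different order, runs that satisfy more than one condition would be labeled differently than in this sketch.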
+
+The task-level `C(q)` uses a normalized bag-of-words response vector built from the full assistant trajectory text plus tool-call names and compacted inputs, not just the last assistant turn.
+
+From the v4-19 sweep data:
+- **Gemini 3.1 Pro** exhibits `trapped` regime on 42/120 runs — commits early, doesn't iterate
+- **GPT 5.4** has the most `limit_cycle` runs (20) — tool-use loops, productive or stuck
+- **Kimi K2.5** dies at a median of turn 3 (worst survival); **GPT 5.4** survives to turn 8 at a 60% rate (best)
+
+All scripts under `scripts/` run on cached per-run JSONs with plain numpy-based tooling; no torch or sentence-transformers required.
+
+### 4. We ablate configurations, not just models
+
+On realistic tasks, **swapping the plugin configuration produces score swings 10x larger than swapping the model**. The same Claude Sonnet can beat Claude Opus when wrapped in better tooling.
+
+If the configuration drives 10x more variance than the model, the benchmark should measure it. ClawBench's Configuration Diagnostic does this in five steps:

 1. **Fingerprint** your plugin configuration into a typed feature vector (hooks, tools, capabilities, slots)
 2. **Predict** your score before you spend a dollar on compute (k-NN over historical submissions)
@ -87,7 +167,18 @@ If the configuration drives 10x more variance than the model, the benchmark shou
 4. **Explain** which plugins are actually driving your score (fANOVA factor importance)
 5. **Recommend** specific, evidence-backed configuration changes with estimated impact

-No other benchmark can do this, because no other benchmark has access to typed plugin manifests. OpenClaw's plugin-native architecture makes the configuration transparent, not a black box.
+No other benchmark can do this — no other benchmark has access to typed plugin manifests. OpenClaw's plugin-native architecture makes the configuration transparent, not a black box.
+
+### 5. Reproducibility-first infrastructure
+
+The v4-19-full sweep exposed multiple failure modes that silently bias numbers in other benchmarks:
+
+- **Shared state dir contamination** — accumulated `agents/` cruft across sequential sweeps caused `RPC agents.create timed out` cascades. Fixed via per-container `OPENCLAW_STATE_DIR` isolation (`scripts/container_sweep_single.sh`).
+- **Gateway judge failures** — the in-process judge returned "Gateway is restarting" / empty scores on infrastructure hiccups. Fixed via direct-API rejudge pipeline (`scripts/rejudge_all.py`).
+- **OpenRouter provider routing** — slug `z-ai/glm-5.1` canonically routes to different backing models over time. GLM 5.1 scored 0.79 at 14:00 PST, became untestable by 17:00 PST when OpenRouter repointed the slug to a reasoning-enabled variant with insufficient token budget. Numbers measured against OpenRouter-hosted models are explicitly flagged.
+- **Platform version drift** — OpenClaw 4.9 → 4.15-beta.1 shifted scores by +0.13 to +0.29 across all models. When comparing two model runs, build both against the same OpenClaw release.
+
+All of these are documented in code + commit messages. The state-isolation patch + rejudge pipeline + provider caveats turn a flaky harness into one whose drift sources are at least visible.

 ---

@ -120,40 +211,6 @@ A user doesn't see a pass/fail. They see an agent that reads their code carefull

 ---

-## How ablation works: the Configuration Diagnostic
-
-Most benchmarks answer: "which model is best?" ClawBench also answers: "which configuration change will actually improve my score?"
-
-### The pipeline
-
-```
-profile.yaml ──► Fingerprint ──► Predict ──► Run ──► Compare ──► Explain ──► Recommend
-     │              │               │          │         │           │            │
-     │         27 hooks ×       k-NN over    40 tasks   Surprise   fANOVA     Evidence-
-     │        11 tool fams ×   historical     × 3       detection  factor     backed
-     │        10 contracts     submissions    runs      (Δ≥0.15)   importance  changes
-     │                                                                         with ΔE
-```
-
-### What the diagnostic report tells you
-
-| Section | What you learn |
-|---|---|
-| **Predicted score + confidence** | What to expect before you spend compute |
-| **Surprises** | Which tasks deviated from prediction, and why |
-| **Plugin Utilization Audit** | Which plugins loaded but were never invoked (dead weight) |
-| **Manifest vs Reality Gap** | Declared capabilities vs. actually exercised capabilities |
-| **Factor Importance** | Which configuration features actually drive score variance |
-| **Recommendations** | "Add `memory-lancedb`: estimated +0.12 ± 0.04" — backed by neighbor profiles |
-
-Every recommendation cites the specific neighbor profiles that already include the suggested change. No speculative advice.
-
-### Why this matters
-
-Benchmarks today tell you "Opus scores 0.59." They don't tell you *why*, and they don't tell you what to change. ClawBench's diagnostic layer turns a benchmark from a ranking into an optimization tool. You don't just learn where you stand — you learn what to do about it.
-
---
-
 ## The 13 failure modes

 When an agent fails, "fail" is not useful information. ClawBench classifies every failure into one of 13 deterministic modes:
@ -178,17 +235,22 @@ These are surfaced per-run in the result, not hidden in logs. They make failures

 ---

-## Task suite: 40 tasks across 5 tiers
+## Core v1 task suite: 19 tasks

-Tasks are designed to mirror what agent users actually do — not contrived algorithmic puzzles, but realistic multi-step workflows with real tools:
+Core v1 is a signal-curated public release of 19 tasks from the internal 40-task dev pool. Tasks were selected for:
+- **0 ranking inversions** — the mean reproduces the reference 8-model order exactly
+- **Preserved coverage** — all 5 tiers and 6 families represented
+- **Dropped noise** — excludes tasks where cross-model SNR < 0.5

-| Tier | Tasks | What it tests | Examples |
-|------|-------|---------------|---------|
-| **Tier 1** | 6 | Basic single-tool tasks | Fix a 10-line bug, write a quick note, set a calendar reminder |
-| **Tier 2** | 14 | Multi-step with 2-3 tools | Fix a browser form, search-and-patch a repo, redact a document |
-| **Tier 3** | 11 | Complex multi-tool orchestration | Debug a timezone regression, generate a data pipeline report, triage an inbox |
-| **Tier 4** | 6 | Hard cross-system reasoning | Migrate code across repos, delegate to sub-agents, recall from long context |
-| **Tier 5** | 3 | Adversarial | Contradictory requirements, hallucination traps, impossible tasks requiring graceful refusal |
+| Tier | Core v1 count | What it tests | Examples |
+|------|:---:|---|---|
+| **Tier 1** | 2 | Single-tool basics | Bugfix discount calc, quick file note |
+| **Tier 2** | 6 | Multi-step, 2-3 tools | Config loader repair, browser form fix, priv redaction |
+| **Tier 3** | 5 | Complex orchestration | SQL query analysis, inbox triage, data pipeline report |
+| **Tier 4** | 5 | Cross-system reasoning | Cross-repo migration, delegation repair, memory continuation, browser research+code |
+| **Tier 5** | 1 | Adversarial | Hallucination-resistant evidence |
+
+Full manifest: [`tasks-public/MANIFEST.yaml`](tasks-public/MANIFEST.yaml).

 ### Task design principles

@ -200,6 +262,13 @@ Tasks are designed to mirror what agent users actually do — not contrived algo

 **Adversarial tier.** Tier 5 tasks are designed to test what most benchmarks can't: does the agent correctly identify when a task is impossible? Does it resist hallucinating evidence that doesn't exist? Does it handle contradictory instructions gracefully? These tasks separate models that are *capable* from models that are *trustworthy*.

+### Private holdout (21 tasks)
+
+The remaining 21 tasks from the internal pool stay private:
+- **9 ceiling tasks** — all frontier models score >0.85; don't discriminate at the frontier
+- **9 low-signal tasks** — SNR < 0.5; either broken verifiers or genuinely ambiguous prompts (scheduled for redesign)
+- **3 ranking-inconsistent tasks** — cross-model ordering conflicts with reference ranking (`t2-node-search-patch`, `t5-contradictory-requirements`, `t1-cal-quick-reminder`)
+
 ---

 ## The scoring math
@ -209,118 +278,208 @@ Tasks are designed to mirror what agent users actually do — not contrived algo
 run_score = 0.4 * completion + 0.3 * trajectory + 0.2 * behavior + [0.1 * judge if completion >= 0.9999]
 ```

-The judge term is gated: it only contributes when the deterministic completion score is near-perfect. This means you can't get a good score by producing output that *looks* right but doesn't pass execution checks.
+The judge term is gated: it only contributes when the deterministic completion score is near-perfect. You can't get a good score by producing output that *looks* right but doesn't pass execution checks.
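As a small illustration of the gating described above, the sketch below follows the `run_score` formula exactly as written in this section. The function name and the `gate` parameter are hypothetical, and this is not the scorer's actual implementation.

```python
def run_score(completion, trajectory, behavior, judge, gate=0.9999):
    """Weighted run score; the judge term only counts above the gate."""
    score = 0.4 * completion + 0.3 * trajectory + 0.2 * behavior
    # The judge cannot rescue a run whose deterministic completion fell short.
    if completion >= gate:
        score += 0.1 * judge
    return score


perfect = run_score(1.0, 1.0, 1.0, 1.0)
half_done = run_score(0.5, 1.0, 1.0, 1.0)
```

In the second call the judge score of 1.0 contributes nothing, because completion is below the gate.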

 ### Per-task score (across 3 runs)
 ```
 task_score = 0.9 * bootstrap_mean(run_scores) + 0.1 * reliability_score
-```
-
-Where:
-```
 reliability = 0.5 * pass^k + 0.3 * pass_rate + 0.2 * variance_score
 ```

-`pass^k` is 1 only if ALL runs pass. Not any run — all runs. This is the metric that separates reliable agents from lucky ones.
+`pass^k` is 1 only if ALL runs pass. Not any run — all runs.
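The per-task aggregation can be sketched as follows. Two things are assumptions here because this section does not define them: how a run is judged to have "passed" (the sketch takes a list of booleans as input) and how `variance_score` is computed (the sketch takes it as a precomputed number). The function names and the fixed bootstrap seed are also illustrative.

```python
import random
from statistics import fmean


def bootstrap_mean(values, resamples=10_000, seed=0):
    """Mean of resampled means; resampling is with replacement."""
    rng = random.Random(seed)
    n = len(values)
    return fmean([fmean(rng.choices(values, k=n)) for _ in range(resamples)])


def reliability(passed, variance_score):
    # pass^k is 1 only if every run passed.
    pass_k = 1.0 if all(passed) else 0.0
    pass_rate = sum(passed) / len(passed)
    return 0.5 * pass_k + 0.3 * pass_rate + 0.2 * variance_score


def task_score(run_scores, passed, variance_score):
    return 0.9 * bootstrap_mean(run_scores) + 0.1 * reliability(passed, variance_score)


all_pass = reliability([True, True, True], variance_score=1.0)
one_miss = reliability([True, True, False], variance_score=1.0)
```

A single failed run removes the entire `pass^k` term, so reliability drops from 1.0 to 0.4 in this example even though two of three runs passed.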

 ### Taguchi Signal-to-Noise (robustness)
 ```
 S/N = -10 * log10( (1/n) * sum(1/y_i^2) )
 ```

-The `1/y_i^2` term means the worst score dominates. A configuration scoring 0.85 average but 0.10 on adversarial tasks is **worse in production** than 0.78 average with a 0.65 floor. Taguchi catches this; mean and stddev don't.
+The `1/y_i^2` term means the worst score dominates. A configuration scoring 0.85 average but 0.10 on adversarial tasks is **worse in production** than 0.78 average with a 0.65 floor.
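The effect of the `1/y_i^2` term is easy to see on toy numbers. The sketch below implements the S/N formula above for strictly positive scores; the two score lists are invented to match the 0.85-average and 0.78-average scenario in the text and do not come from any sweep.

```python
import math
from statistics import fmean


def taguchi_sn(scores):
    """Larger-the-better Taguchi S/N in dB; scores must be > 0."""
    n = len(scores)
    return -10.0 * math.log10(sum(1.0 / (y * y) for y in scores) / n)


# Mean 0.85, but one run collapses to 0.10.
spiky = [1.0, 1.0, 1.0, 1.0, 1.0, 0.10]
# Mean 0.78, with a floor of 0.65.
steady = [0.65, 0.78, 0.91]

spiky_sn = taguchi_sn(spiky)
steady_sn = taguchi_sn(steady)
```

Here the spiky configuration has the higher mean but the lower S/N, because the single 0.10 run contributes 100 to the sum of inverse squares. A score of exactly zero would make the formula undefined, so a real implementation needs a floor or a special case.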
+
+### SNR-weighted alternative (for ranking differentiation)
+
+A flat mean compresses the gaps between frontier models. An alternative weights tasks by their signal density:
+
+```
+w_q = max(0, SNR(q)) × |C(q)|
+w_q^wins = min(w_q, p95({w_q}))
+
+flat_score(model) = mean_q mean_run_score(model, q) over covered tasks
+weighted_score(model) = Σ_q w_q mean_run_score(model, q) / Σ_q w_q
+winsorized_score(model) = Σ_q w_q^wins mean_run_score(model, q) / Σ_q w_q^wins
+```
+
+Under winsorized SNR × |C(q)| weighting on the same 1,080-run archive, **Opus 4.7 ranks #1** (instead of Opus 4.6 under flat mean) and **GPT 5.4 drops from #3 to #7**, because its task-specific cliffs (0.16 on `t3-feature-export`) fall on the highest-signal tasks. This exposes what the flat mean averages away.
+
+Generate alternate rankings: `scripts/snr_weighted_ranking.py`.
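For readers who want to see the weighting formulas in executable form, here is a minimal sketch. It is not the code in `scripts/snr_weighted_ranking.py`. The function names and toy inputs are hypothetical, and the 95th percentile uses linear interpolation via `statistics.quantiles(..., method="inclusive")`, which is an assumption about how `p95` is defined.

```python
from statistics import quantiles


def winsorized_weights(snr, c):
    """w_q = max(0, SNR(q)) * |C(q)|, capped at the 95th percentile."""
    w = {q: max(0.0, snr[q]) * abs(c[q]) for q in snr}
    # Index 94 of the 99 cut points is the 95th percentile.
    cap = quantiles(list(w.values()), n=100, method="inclusive")[94]
    return {q: min(v, cap) for q, v in w.items()}


def weighted_score(mean_scores, weights):
    """Weighted mean of per-task mean run scores over covered tasks."""
    covered = [q for q in mean_scores if q in weights]
    total = sum(weights[q] for q in covered)
    return sum(weights[q] * mean_scores[q] for q in covered) / total


toy_snr = {"a": 10.0, "b": 1.0, "c": -0.5}
toy_c = {"a": 2.0, "b": -1.0, "c": 0.5}
toy_means = {"a": 1.0, "b": 0.0, "c": 0.5}

weights = winsorized_weights(toy_snr, toy_c)
score = weighted_score(toy_means, weights)
flat = sum(toy_means.values()) / len(toy_means)
```

In this toy case the model does well on the one high-signal task, so its weighted score is well above its flat mean. If every weight is zero the division fails, so a real implementation would need a fallback to the flat mean.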
+
+---
+
+## Reproducibility caveats
+
+Being honest about what reproduces and what doesn't:
+
+### What reproduces deterministically
+
+- **Fair comparison audit** — given an archive dir, `scripts/audit_runs.py` produces identical numbers every time.
+- **Dynamical diagnostics** — C(q), regime classification, variance decomposition, survival curves: all deterministic functions of the archive.
+- **Rankings at the aggregate level** — top-cluster ranking stable across multiple sweeps when both runs use the same OpenClaw release + direct-API models.
+
+### What drifts
+
+- **Absolute scores** — seed noise is ~0.02 stddev per task per model. Expect run_score to drift within that envelope.
+- **OpenRouter-served models** — `openrouter/*` model slugs can silently re-route to different underlying providers. We observed GLM 5.1 at 0.79 then 0.33 within hours as OpenRouter flipped its backing provider. Pin to canonical versions (e.g., `z-ai/glm-5.1-20260406`) for stable measurement.
+- **OpenClaw platform drift** — moving from 4.9 to 4.15-beta.1 shifted scores by +0.13 to +0.29 across all models and reduced the `tool_misuse` and `verification_skipped` failure modes by 60-70%. Pin the base to reproduce published numbers.
+
+### Mitigating the drift
+
+Build both sides of any comparison from the same source state:
+
+```bash
+docker build -t clawbench .
+docker run --rm --entrypoint openclaw clawbench --version
+# -> records the OpenClaw version of THIS build
+```
+
+When publishing scores, record the OpenClaw version your image
+resolved to and treat numbers from a different version as separate
+populations.

 ---

 ## Quick start

+### Build the image
+
 ```bash
-# Clone + install
-git clone git@github.com:scoootscooob/clawbench.git && cd clawbench
-python -m venv .venv && source .venv/bin/activate
-pip install -e .
+git clone git@github.com:openclaw/clawbench.git && cd clawbench
+cp .env.example .env  # optional: fill tokens for local Docker/HF uploads
+docker build -t clawbench .

-# Run a single task
+# Record the OpenClaw version baked in (for reproducibility):
+docker run --rm --entrypoint openclaw clawbench --version
+```
+
+### Run Core v1 on a model
+
+```bash
 export OPENCLAW_GATEWAY_TOKEN=<your-token>
-clawbench run --model anthropic/claude-opus-4-6 --task t1-bugfix-discount --runs 3

-# Run with a plugin profile (enables Configuration Diagnostic)
-clawbench run --model anthropic/claude-opus-4-6 --profile profiles/frontier_opus_4_6.yaml --runs 3
+# Core v1 = 19 specific tasks. List them via the manifest:
+python3 -c "import yaml; m = yaml.safe_load(open('tasks-public/MANIFEST.yaml')); print(' '.join(f'-t {t[\"id\"]}' for t in m['tasks']))"

-# Diagnose a profile without running (instant prediction from historical data)
-clawbench diagnose profiles/frontier_opus_4_6.yaml
+# Then run:
+clawbench run \
+  --model anthropic/claude-opus-4-6 \
+  --runs 3 \
+  --concurrency 4 \
+  --profile profiles/frontier_opus_4_6.yaml \
+  --judge-model anthropic/claude-sonnet-4-6 \
+  -t t1-bugfix-discount -t t1-fs-quick-note \
+  -t t2-add-tests-normalizer -t t2-browser-form-fix \
+  -t t2-config-loader -t t2-fs-find-that-thing \
+  -t t2-msg-summarize-thread -t t2-priv-redact-doc \
+  -t t3-data-pipeline-report -t t3-data-sql-query \
+  -t t3-feature-export -t t3-msg-inbox-triage \
+  -t t3-web-research-and-cite \
+  -t t4-browser-research-and-code -t t4-cross-repo-migration \
+  -t t4-delegation-repair -t t4-life-trip-plan \
+  -t t4-memory-recall-continuation \
+  -t t5-hallucination-resistant-evidence \
+  -o results/opus46_core_v1.json
+```
+
+### Analyze a real archive
+
+```bash
+# Fair-comparison audit
+python3 scripts/audit_runs.py
+python3 scripts/generate_fair_report.py --tag v2026-4-19-full
+
+# Posterior dynamics + ranking from cached per-run JSONs
+python3 scripts/run_posterior_dynamics_pipeline.py \
+  --archive-dir .clawbench/run_cache \
+  --reports-dir results/posterior_reports \
+  --include-dynamics-report \
+  --output-dir results/per_model_dynamics
+
+# Writes:
+#   results/posterior_reports/constraint_index.json
+#   results/posterior_reports/regimes.json
+#   results/posterior_reports/variance_decomposition.json
+#   results/posterior_reports/survival_analysis.json
+#   results/posterior_reports/snr_weighted_ranking.json
+#   results/posterior_reports/EVAL_REPORT_DYNAMICAL.md
+#   results/per_model_dynamics/<safe_model_name>/dynamics.json
+#   results/per_model_dynamics/<safe_model_name>/*.png
+```
+
+If you only want one model's offline dynamics bundle:
+
+```bash
+clawbench dynamics-report \
+  --archive-dir .clawbench/run_cache \
+  --model ollama/gpt-oss:20b \
+  --output-dir results/gptoss_dynamics
+
+# Quick CI path: skip plot rendering
+clawbench dynamics-report \
+  --archive-dir .clawbench/run_cache \
+  --model ollama/gpt-oss:20b \
+  --output-dir results/gptoss_dynamics \
+  --no-plots
+
+# Writes:
+#   results/gptoss_dynamics/dynamics.json
 ```

 ### Running locally with small models (Ollama)

-A single consumer GPU running an open-weight model through
-[Ollama](https://ollama.com) is enough to develop plugin profiles, validate
-algorithmic ideas, and submit scored results — no API keys or cloud spend
-required.
-
-Profiles tested locally can still be submitted as pull requests with
-reference results. The built-in GitHub Actions workflows in this repo only
-run the test suite and deployment sync, so treat local Ollama numbers as
-contributor-side evidence unless a maintainer separately reruns them on
-other infrastructure.
+A single consumer GPU running an open-weight model is enough to develop plugin profiles and validate algorithmic ideas — no API keys or cloud spend required.

 ```bash
-# Pull a model and set your gateway token
-ollama pull gpt-oss:20b   # or llama3.1:8b, qwen3:14b, etc.
+ollama pull gpt-oss:20b
 export OPENCLAW_GATEWAY_TOKEN=<your-gateway-token>
+export CLAWBENCH_RUN_CACHE_DIR=$PWD/.clawbench/run_cache

-# Quick smoke test
-clawbench run --model ollama/gpt-oss:20b --task t1-fs-quick-note --runs 1
+# Real benchmark run + immediate per-run dynamics bundle
+clawbench run \
+  --model ollama/gpt-oss:20b \
+  --task t1-fs-quick-note \
+  --runs 1 \
+  --dynamics \
+  -o results/ollama_smoke.json

-# Tier-1 sweep with confidence intervals
-clawbench run --model ollama/gpt-oss:20b --tier tier1 --runs 5
+# Optional second local model
+ollama pull qwen3.5:27b

-# Tier-2 sweep (run separately; the CLI accepts one --tier at a time)
-clawbench run --model ollama/gpt-oss:20b --tier tier2 --runs 5 --concurrency 2
+# Offline posterior analysis reads CLAWBENCH_RUN_CACHE_DIR
+python3 scripts/run_posterior_dynamics_pipeline.py \
+  --archive-dir .clawbench/run_cache \
+  --reports-dir results/posterior_reports

-# Inspect the reference profile's fingerprint and historical neighbors
 clawbench diagnose profiles/local_ollama_gpt_oss.yaml
 ```

-**Reference contributor-side results** (gpt-oss:20b, RTX 4090, Docker sandbox, network=none):
+### Running on Kubernetes

-| Scope | Score | CI | Completion | Trajectory | Behavior |
-|---|---|---|---|---|---|
-| Tier-1 (6 tasks × 3 runs) | 0.397 | 0.346–0.447 | 0.056 | 0.522 | 1.000 |
-
-High trajectory/behavior but low completion — the model uses tools correctly
-but writes to wrong paths or misses format constraints. This gap is where
-profile-level improvements (workspace-aware prompts, path-checking pre-flight
-calls, retry wrappers) have the most leverage.
-
-### Version control checkpoints
-
-Git is already the source of truth for this repo, but the safest workflow is:
+See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
+version:

 ```bash
-# Start risky work on its own branch
-git switch -c codex/<short-topic>
+export CLAWBENCH_NAMESPACE=clawbench-eval
+export OPENAI_API_KEY="sk-..."       # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
+export CLAWBENCH_MODEL="openai/gpt-5.5"
+# export MLFLOW_NAMESPACE="mlflow"   # MLflow deploys in a separate namespace (default: mlflow)

-# Commit small checkpoints as you go
-git add -A
-git commit -m "Checkpoint: describe the working state"
-
-# Mark a known-good version with an annotated tag
-python3 scripts/git_checkpoint.py "before-profile-tuning"
-
-# Push the branch and tags so recovery is not only local
-git push -u origin HEAD
-git push origin --tags
+./scripts/k8s/deploy.sh              # deploys OpenClaw + MLflow + starts eval
+./scripts/k8s/deploy.sh --logs       # follow progress
+./scripts/k8s/deploy.sh --teardown   # tear down openclaw & eval (does not delete MLflow)
 ```

-The checkpoint script refuses to tag a dirty worktree by default, so every saved version points at a reproducible commit instead of a half-finished local state.
-
-### Docker (recommended for reproducibility)
-
-```bash
-docker compose up -d
-# Submit jobs via the Gradio UI at http://localhost:7860
-```
+API keys are stored in a Kubernetes Secret created by the deploy script.
+MLflow is deployed in its own namespace (default: `mlflow`, configurable via
+`MLFLOW_NAMESPACE`).

 ---

@ -349,26 +508,45 @@ clawbench/
 │   ├── environment.py              # 5 deterministic verifier types
 │   ├── judge.py                    # LLM judge (gated, never rescues failures)
 │   ├── harness.py                  # Benchmark orchestration + parallel lanes
-│   ├── worker.py                   # Background eval worker
-│   ├── client.py                   # OpenClaw Gateway WebSocket client
 │   ├── schemas.py                  # 13-mode failure taxonomy + result schemas
 │   ├── stats.py                    # Bootstrap CI + Taguchi S/N
 │   ├── profile.py                  # v0.5 plugin fingerprinting
-│   ├── prediction.py               # k-NN cold-start prediction
-│   ├── factor_analysis.py          # fANOVA factor importance
 │   ├── diagnostic.py               # Configuration Diagnostic report
-│   ├── utilization.py              # Plugin utilization audit
-│   ├── recommendations.py          # Evidence-backed config changes
+│   ├── factor_analysis.py          # fANOVA factor importance
+│   ├── dynamics.py                 # Trajectory metrics + sensitivity analysis
+│   ├── dynamics_archive.py         # Cached-run loading + offline report assembly
+│   ├── dynamics_plots.py           # Offline dynamics visualizations
 │   └── cli.py                      # CLI entry points
 │
-├── tasks/                          # 40 tasks across 5 tiers
-│   ├── tier1/ ... tier5/           # Task YAMLs with verification specs
-│   └── assets/                     # Per-task fixture directories
+├── tasks-public/                   # Core v1 PUBLIC release (19 tasks)
+│   ├── MANIFEST.yaml               # Task list + reference ranking + metadata
+│   ├── README.md                   # Rationale, build + run instructions
+│   ├── tier1/ ... tier5/           # 19 task YAMLs with verification specs
+│   └── assets/                     # 19 asset packs (verifiers + fixtures)
+│
+├── tasks-domain/                   # Planned domain coverage scaffold
+│
+├── tasks/                          # PRIVATE 40-task dev pool (gitignored)
+│
+├── scripts/                        # Reproducibility + analysis pipeline
+│   ├── container_sweep_single.sh   # Per-container OPENCLAW_STATE_DIR isolation
+│   ├── audit_runs.py               # Aggregate coverage + fair-comparison audit
+│   ├── audit_per_run.py            # Per-run cross-model audit
+│   ├── rejudge_all.py              # Direct-API rejudge for broken gateway judges
+│   ├── generate_fair_report.py     # Fair N-model comparison report
+│   ├── run_posterior_dynamics_pipeline.py # One-shot posterior analysis driver
+│   ├── compute_constraint_index.py # C(q) per task
+│   ├── classify_regimes.py         # Per-run dynamical regime classifier
+│   ├── variance_decomp.py          # Seed-noise vs capability-signal decomposition
+│   ├── survival_analysis.py        # Per-turn failure survival curves
+│   ├── snr_weighted_ranking.py     # SNR × |C(q)|-weighted ranking
+│   └── generate_dynamical_report.py # Combined dynamical-systems report
 │
 ├── profiles/                       # v0.5 plugin profile YAMLs
-├── tests/                          # 107 tests
-├── CLAWBENCH_V0_4_SPEC.md         # Full specification
-└── PARTNER_TRACE_SPEC.md          # Trace interchange format
+├── tests/                          # Test suite
+├── Dockerfile                      # Layered on a pinned ghcr.io/openclaw/openclaw image
+├── CLAWBENCH_V0_4_SPEC.md          # Full specification
+└── PARTNER_TRACE_SPEC.md           # Trace interchange format
 ```

 ---
@ -377,20 +555,25 @@ clawbench/

 |  | ClawBench | SWE-bench | HumanEval | LLM-judge leaderboards |
 |---|---|---|---|---|
-| **Scores process, not just output** | Trace-based trajectory + behavior scoring | No | No | No |
-| **Reliability as first-class metric** | pass^k, Taguchi S/N, worst-of-n, bootstrap CI | Single pass rate | pass@k | Best-of-n |
-| **Failure taxonomy** | 13 deterministic modes per run | Binary pass/fail | Binary | None |
+| **Scores process, not just output** | ✓ Trace-based trajectory + behavior | No | No | No |
+| **Reliability as first-class metric** | ✓ pass^k, Taguchi S/N, bootstrap CI | Single pass rate | pass@k | Best-of-n |
+| **Variance decomposition reported** | ✓ seed-noise vs capability-signal ratio | No | No | No |
+| **Per-run dynamical regime** | ✓ trapped / cycle / diffusive | No | No | No |
+| **SNR-weighted alternative ranking** | ✓ principled task weighting | No | No | No |
+| **Failure taxonomy** | ✓ 13 deterministic modes | Binary pass/fail | Binary | None |
 | **LLM judge role** | Capped 10%, gated on deterministic floor | Not used | Not used | Primary scorer |
-| **Configuration diagnostics** | Fingerprint, predict, explain, recommend | No | No | No |
+| **Configuration diagnostics** | ✓ Fingerprint, predict, explain, recommend | No | No | No |
+| **State-isolation per run** | ✓ per-container OPENCLAW_STATE_DIR | No | No | No |
 | **Multiple runs per task** | 3 runs mandatory, statistical tests | Usually 1 | Varies | Usually 1 |
-| **Real tool composition** | Browser + code + memory + cron + delegation | Code only | Code only | Varies |
+| **Provider-routing caveats** | ✓ documented (OpenRouter drift) | Not flagged | Not flagged | Not flagged |
+| **Real tool composition** | ✓ Browser + code + memory + cron + delegation | Code only | Code only | Varies |

 ---

 ## Testing

 ```bash
-python -m pytest -q     # 107 tests
+python -m pytest -q
 ```

 Key test invariants:
@ -401,6 +584,22 @@ Key test invariants:

 ---

+## Version log
+
+| Version | Date | Summary |
+|:---:|---|---|
+| **Core v1** | 2026-04-20 | 19-task signal-curated public release; dynamical-systems diagnostics (C(q), regimes, survival, SNR-weighted); per-container state isolation; rejudge pipeline |
+| v0.5 | earlier | Configuration Diagnostic (fingerprint, predict, fANOVA); plugin-native ablation |
+| v0.4 | earlier | 4-axis scoring with gated judge; 13-mode failure taxonomy; Partner Trace Spec |
+
+Planned for Core v2:
+- **Tier 6 long-horizon tasks** (100+ turn runs) — unlock real Lyapunov / attractor measurement
+- **Paraphrased prompt pairs** — enable perturbation-sensitivity ranking
+- **Creative-synthesis tasks** — currently absent from Core v1
+- **Human-performance baseline** on 10 tasks — calibrate difficulty
+
+---
+
 ## License

 MIT. See `LICENSE`.
@ -409,10 +608,10 @@ MIT. See `LICENSE`.

 ```bibtex
@software{clawbench,
-  title  = {ClawBench: Trace-Scored Agent Benchmark with Configuration Diagnostics},
+  title  = {ClawBench: Trace-Scored Agent Benchmark with Dynamical-Systems Diagnostics},
  author = {ScoootScooob},
  year   = {2026},
-  url    = {https://github.com/scoootscooob/clawbench}
+  url    = {https://github.com/openclaw/clawbench}
 }
 ```

@ -420,8 +619,8 @@ MIT. See `LICENSE`.

 <div align="center">

-**ClawBench** — because users don't experience a benchmark score. They experience the agent.
+**ClawBench** — Rigorous. Reproducible. Dynamical.

-[Dataset](https://huggingface.co/datasets/ScoootScooob/clawbench-results) · [Space](https://huggingface.co/spaces/ScoootScooob/clawbench) · [Spec](CLAWBENCH_V0_4_SPEC.md)
+[Dataset](https://huggingface.co/datasets/openclaw/clawbench-results) · [Space](https://huggingface.co/spaces/openclaw/clawbench) · [Core v1](tasks-public/) · [Spec](CLAWBENCH_V0_4_SPEC.md)

 </div>
--- a/SPACE_README.md
+++ b/SPACE_README.md
@ -136,6 +136,15 @@ submission

 Important rule: browser tasks stay serialized on one dedicated lane to avoid Chromium and port-range collisions.

+## Submission presets
+
+The Submit tab now exposes two preset audiences so the Space can serve both general Claw users and lower-budget exploratory runs:
+
+- `Claw Users` keeps the full preset catalog, including provider-backed frontier models.
+- `Budget Researchers` narrows the list to local or lower-cost presets such as `ollama/gpt-oss:20b`, `ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and `huggingface/google/gemma-4-26B-A4B-it`.
+
+You can still enter any custom model ID directly; the preset audience only filters the shortcut catalog and the bulk-submit action.
+
 ## Task inventory

 | Task | Tier | Family | Main verification |
--- a/app.py
+++ b/app.py
@ -17,6 +17,8 @@ import json
 import logging
 import os
 import threading
+import time
+from dataclasses import dataclass, field
 from pathlib import Path

 import gradio as gr
@ -26,6 +28,16 @@ from clawbench.hub import (
    load_submission_rows_from_parquet,
    resolve_dataset_repo,
 )
+from clawbench.queue import JobQueue, SubmissionRequest
+from clawbench.submission_models import (
+    build_preset_submission_specs,
+    CUSTOM_PRESET_LABEL,
+    PRESET_AUDIENCE_ALL,
+    PRESET_AUDIENCE_CHOICES,
+    PRESET_MODEL_MAP,
+    preset_labels_for_audience,
+    resolve_model_selection,
+)

 logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
 logger = logging.getLogger("clawbench.app")
@ -36,6 +48,16 @@ HF_DATASET_TOKEN = os.environ.get("HF_TOKEN", "")
 HF_DATASET_REPO = resolve_dataset_repo(HF_DATASET_TOKEN)


+@dataclass
+class _LeaderboardCache:
+    lock: threading.Lock = field(default_factory=threading.Lock)
+    loaded_at: float = 0.0
+    frame: pd.DataFrame | None = None
+
+
+_LEADERBOARD_CACHE = _LeaderboardCache()
+
+
 def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int:
    raw = os.environ.get(name, "").strip()
    if not raw:
@ -48,40 +70,16 @@ def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int:
    return max(minimum, min(maximum, value))


-DEFAULT_RUNS_PER_TASK = _env_int("CLAWBENCH_DEFAULT_RUNS_PER_TASK", 3, minimum=1, maximum=10)
-DEFAULT_PARALLEL_LANES = _env_int("CLAWBENCH_DEFAULT_PARALLEL_LANES", 1, minimum=1, maximum=4)
-
-# ---------------------------------------------------------------------------
-# Preset models for quick submission
-# ---------------------------------------------------------------------------
-
-PRESET_MODELS = {
-    # All models verified working on HF Inference API (free with HF_TOKEN)
-    # Tested 2026-04-07 via router.huggingface.co/v1/chat/completions
-    #
-    # --- Chinese open-source ---
-    "GLM 5.1 (754B MoE)": "huggingface/zai-org/GLM-5.1",
-    "GLM 5 (400B MoE)": "huggingface/zai-org/GLM-5",
-    "Qwen3 32B": "huggingface/Qwen/Qwen3-32B",
-    "DeepSeek R1": "huggingface/deepseek-ai/DeepSeek-R1",
-    "Kimi K2 Instruct": "huggingface/moonshotai/Kimi-K2-Instruct",
-    "MiniMax M2.5": "huggingface/MiniMaxAI/MiniMax-M2.5",
-    # --- Google open-source ---
-    "Gemma 4 26B MoE": "huggingface/google/gemma-4-26B-A4B-it",
-    # --- Meta open-source ---
-    "Llama 3.3 70B": "huggingface/meta-llama/Llama-3.3-70B-Instruct",
-    "Llama 3.1 70B": "huggingface/meta-llama/Llama-3.1-70B-Instruct",
-    # --- Proprietary models (require runtime auth configured for the model provider) ---
-    "Claude Sonnet 4.6": "anthropic/claude-sonnet-4-6",
-    "Claude Opus 4.6": "anthropic/claude-opus-4-6",
-}
+MAX_RUNS_PER_SUBMISSION = _env_int("CLAWBENCH_MAX_RUNS_PER_SUBMISSION", 3, minimum=1, maximum=10)
+MAX_LANES_PER_SUBMISSION = _env_int("CLAWBENCH_MAX_LANES_PER_SUBMISSION", 4, minimum=1, maximum=8)
+DEFAULT_RUNS_PER_TASK = _env_int("CLAWBENCH_DEFAULT_RUNS_PER_TASK", 3, minimum=1, maximum=MAX_RUNS_PER_SUBMISSION)
+DEFAULT_PARALLEL_LANES = _env_int("CLAWBENCH_DEFAULT_PARALLEL_LANES", 1, minimum=1, maximum=MAX_LANES_PER_SUBMISSION)
+LEADERBOARD_CACHE_SECONDS = _env_int("CLAWBENCH_LEADERBOARD_CACHE_SECONDS", 60, minimum=0, maximum=3600)
+ENABLE_BULK_SUBMIT = os.environ.get("CLAWBENCH_ENABLE_BULK_SUBMIT", "").strip().lower() in {"1", "true", "yes", "on"}
+JUDGE_AFFECTS_SCORE = os.environ.get("CLAWBENCH_JUDGE_AFFECTS_SCORE", "").strip().lower() in {"1", "true", "yes", "on"}

 # ---------------------------------------------------------------------------
 # Background worker (starts in a thread)
-# ---------------------------------------------------------------------------
-
-from clawbench.queue import JobQueue, SubmissionRequest
-
 queue = JobQueue()


@ -108,6 +106,24 @@ logger.info("Background eval worker started")


 def load_leaderboard() -> pd.DataFrame:
+    now = time.monotonic()
+    with _LEADERBOARD_CACHE.lock:
+        if (
+            _LEADERBOARD_CACHE.frame is not None
+            and LEADERBOARD_CACHE_SECONDS > 0
+            and now - _LEADERBOARD_CACHE.loaded_at < LEADERBOARD_CACHE_SECONDS
+        ):
+            return _LEADERBOARD_CACHE.frame.copy()
+
+    frame = _load_leaderboard_uncached()
+    if LEADERBOARD_CACHE_SECONDS > 0:
+        with _LEADERBOARD_CACHE.lock:
+            _LEADERBOARD_CACHE.loaded_at = time.monotonic()
+            _LEADERBOARD_CACHE.frame = frame.copy()
+    return frame.copy()
+
+
+def _load_leaderboard_uncached() -> pd.DataFrame:
    rows = []

    # Load from HF Dataset via direct parquet reads. This avoids
@ -159,29 +175,9 @@ def load_leaderboard() -> pd.DataFrame:


 def _flatten_result(data: dict) -> dict:
-    tasks = data.get("task_results", [])
+    tasks = _parse_json_field(data.get("task_results", []), expected_type=list, default=[])
    n_tasks = len(tasks) if isinstance(tasks, list) else 0
-    # `environment` is serialized as `str(result.environment)` by upload.py
-    # when pushed to the HF Dataset, so rows coming back from the dataset
-    # have a string here instead of the nested dict the local JSON files use.
-    # Normalize both shapes into a dict so `.get()` calls below don't explode.
-    raw_env = data.get("environment", {})
-    if isinstance(raw_env, dict):
-        environment = raw_env
-    elif isinstance(raw_env, str) and raw_env.strip():
-        # Best-effort parse of a stringified dict or JSON object.
-        try:
-            parsed = json.loads(raw_env)
-            environment = parsed if isinstance(parsed, dict) else {}
-        except (ValueError, TypeError):
-            try:
-                import ast
-                parsed = ast.literal_eval(raw_env)
-                environment = parsed if isinstance(parsed, dict) else {}
-            except (ValueError, SyntaxError):
-                environment = {}
-    else:
-        environment = {}
+    environment = _parse_json_field(data.get("environment", {}), expected_type=dict, default={})
    return {
        "Model": data.get("model", ""),
        "Judge Model": data.get("judge_model", environment.get("judge_model", "")) or "-",
@ -205,6 +201,22 @@ def _flatten_result(data: dict) -> dict:
    }


+def _parse_json_field(value, *, expected_type, default):
+    if isinstance(value, expected_type):
+        return value
+    if isinstance(value, str) and value.strip():
+        try:
+            parsed = json.loads(value)
+        except (ValueError, TypeError):
+            try:
+                import ast
+                parsed = ast.literal_eval(value)
+            except (ValueError, SyntaxError):
+                return default
+        return parsed if isinstance(parsed, expected_type) else default
+    return default
+
+
 def load_queue() -> pd.DataFrame:
    jobs = asyncio.run(queue.list_jobs(limit=20))
    if not jobs:
@ -271,16 +283,16 @@ def submit_model(
    prompt_variant: str,
    submitter: str,
 ) -> str:
-    # Use preset if selected, otherwise use custom model ID
-    model_id = PRESET_MODELS.get(preset, "") or model.strip()
+    model_id, provider_id = resolve_model_selection(model, preset, provider)
    if not model_id:
        return "Please enter a model ID or select a preset."

    selected_tier = tier if tier != "all" else None
    request = SubmissionRequest(
        model=model_id,
-        provider=provider.strip(),
+        provider=provider_id,
        judge_model=judge_model.strip(),
+        judge_affects_score=JUDGE_AFFECTS_SCORE,
        runs_per_task=int(runs),
        max_parallel_lanes=int(max_parallel_lanes),
        tier=selected_tier,
@ -288,24 +300,69 @@ def submit_model(
        prompt_variant=prompt_variant,
        submitter=submitter.strip(),
    )
-    job = asyncio.run(queue.submit(request))
-    return f"Submitted [{model_id}]! Job ID: {job.job_id}. Check the Queue tab."
-
-
-def submit_all_presets(runs: int, max_parallel_lanes: int, submitter: str) -> str:
-    """Submit all preset models at once."""
-    submitted = []
-    for name, model_id in PRESET_MODELS.items():
-        request = SubmissionRequest(
-            model=model_id,
-            provider="",
-            runs_per_task=int(runs),
-            max_parallel_lanes=int(max_parallel_lanes),
-            submitter=submitter.strip(),
-        )
+    try:
        job = asyncio.run(queue.submit(request))
-        submitted.append(f"{name} ({job.job_id})")
-    return f"Submitted {len(submitted)} models:\n" + "\n".join(f"  - {s}" for s in submitted)
+    except ValueError as exc:
+        return f"Submission blocked: {exc}"
+    return f"Queued [{model_id}]. Job ID: {job.job_id}. Check the Queue tab."
+
+
+def submit_all_presets(
+    preset_audience: str,
+    runs: int,
+    max_parallel_lanes: int,
+    judge_model: str,
+    tier: str | None,
+    scenario: str | None,
+    prompt_variant: str,
+    submitter: str,
+) -> str:
+    """Submit all preset models from the selected audience track."""
+    if not ENABLE_BULK_SUBMIT:
+        return (
+            "Bulk preset submission is disabled for this deployment. "
+            "Set CLAWBENCH_ENABLE_BULK_SUBMIT=1 to enable it for maintainer runs."
+        )
+
+    selected_tier = tier if tier != "all" else None
+    selected_scenario = scenario if scenario != "all" else None
+    preset_specs = build_preset_submission_specs(
+        preset_audience,
+        runs=int(runs),
+        max_parallel_lanes=int(max_parallel_lanes),
+        judge_model=judge_model,
+        tier=selected_tier,
+        scenario=selected_scenario,
+        prompt_variant=prompt_variant,
+        submitter=submitter,
+    )
+    if not preset_specs:
+        return f"No presets configured for {preset_audience}."
+
+    submitted = []
+    blocked = []
+    for preset, request_kwargs in preset_specs:
+        request_kwargs["judge_affects_score"] = JUDGE_AFFECTS_SCORE
+        request = SubmissionRequest(**request_kwargs)
+        try:
+            job = asyncio.run(queue.submit(request))
+        except ValueError as exc:
+            blocked.append(f"{preset.label}: {exc}")
+            continue
+        submitted.append(f"{preset.label} ({job.job_id})")
+    message = f"Queued {len(submitted)} models from {preset_audience}:\n" + "\n".join(
+        f"  - {item}" for item in submitted
+    )
+    if blocked:
+        message += "\n\nBlocked:\n" + "\n".join(f"  - {item}" for item in blocked)
+    return message
+
+
+def update_preset_choices(preset_audience: str):
+    return gr.update(
+        choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(preset_audience),
+        value=CUSTOM_PRESET_LABEL,
+    )


 # ---------------------------------------------------------------------------
@ -952,7 +1009,7 @@ STAT_JUDGE = (
 )
 STAT_PRESETS = (
    '<div class="stat-pill"><div class="label">Presets</div><div class="value teal">'
-    + str(len(PRESET_MODELS))
+    + str(len(PRESET_MODEL_MAP))
    + "</div></div>"
 )

@ -986,12 +1043,28 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
            "run via HuggingFace Inference API. You can also use locally hosted models "
            "(for example Ollama) when your OpenClaw runtime has them configured."
        )
+        gr.Markdown(
+            "Use `Preset Audience` to switch between the full Claw catalog and a smaller budget track. "
+            "The budget track keeps local and lower-cost options upfront, including `ollama/gpt-oss:20b`, "
+            "`ollama/qwen3.5:27b`, `huggingface/Qwen/Qwen3-32B`, and "
+            "`huggingface/google/gemma-4-26B-A4B-it`."
+        )

+        preset_audience_input = gr.Dropdown(
+            choices=list(PRESET_AUDIENCE_CHOICES),
+            value=PRESET_AUDIENCE_ALL,
+            label="Preset Audience",
+        )
        preset_input = gr.Dropdown(
-            choices=["(custom)"] + list(PRESET_MODELS.keys()),
-            value="(custom)",
+            choices=[CUSTOM_PRESET_LABEL] + preset_labels_for_audience(PRESET_AUDIENCE_ALL),
+            value=CUSTOM_PRESET_LABEL,
            label="Preset models",
        )
+        preset_audience_input.change(
+            fn=update_preset_choices,
+            inputs=preset_audience_input,
+            outputs=preset_input,
+        )
        with gr.Row():
            model_input = gr.Textbox(
                label="Custom Model ID (if not using preset)",
@ -1009,12 +1082,12 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
        )
        with gr.Row():
            runs_input = gr.Slider(
-                minimum=1, maximum=10, value=DEFAULT_RUNS_PER_TASK, step=1,
+                minimum=1, maximum=MAX_RUNS_PER_SUBMISSION, value=DEFAULT_RUNS_PER_TASK, step=1,
                label="Runs per task (higher = more reliable pass^k)",
            )
            max_parallel_lanes_input = gr.Slider(
                minimum=1,
-                maximum=4,
+                maximum=MAX_LANES_PER_SUBMISSION,
                value=DEFAULT_PARALLEL_LANES,
                step=1,
                label="Parallel lanes (browser tasks stay serialized on one lane)",
@ -1054,7 +1127,7 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
        )
        with gr.Row():
            submit_btn = gr.Button("Submit Model", variant="primary")
-            submit_all_btn = gr.Button("Submit All Presets", variant="secondary")
+            submit_all_btn = gr.Button("Submit All Presets", variant="secondary", interactive=ENABLE_BULK_SUBMIT)
        submit_output = gr.Textbox(label="Status", interactive=False, lines=5, elem_classes=["output-textbox"])
        submit_btn.click(
            fn=submit_model,
@ -1074,26 +1147,44 @@ with gr.Blocks(title="ClawBench", theme=clawbench_theme, css=CUSTOM_CSS) as demo
        )
        submit_all_btn.click(
            fn=submit_all_presets,
-            inputs=[runs_input, max_parallel_lanes_input, submitter_input],
+            inputs=[
+                preset_audience_input,
+                runs_input,
+                max_parallel_lanes_input,
+                judge_model_input,
+                tier_input,
+                scenario_input,
+                prompt_variant_input,
+                submitter_input,
+            ],
            outputs=submit_output,
        )

        gr.Markdown("""
-**All presets verified working on HF Inference API (free):**
+**Preset audiences:**

-| Model | Provider | Size | Runtime |
-|-------|----------|------|---------|
-| GLM 5.1 | Z.ai | 754B MoE | HF free |
-| GLM 5 | Z.ai | 400B MoE | HF free |
-| Qwen3 32B | Alibaba | 32B | HF free |
-| DeepSeek R1 | DeepSeek | 671B MoE | HF free |
-| Kimi K2 Instruct | Moonshot AI | MoE | HF free |
-| MiniMax M2.5 | MiniMax | MoE | HF free |
-| Gemma 4 26B MoE | Google | 26B MoE | HF free |
-| Llama 3.3 70B | Meta | 70B | HF free |
-| Llama 3.1 70B | Meta | 70B | HF free |
-| Claude Sonnet 4.6 | Anthropic | - | configured auth |
-| Claude Opus 4.6 | Anthropic | - | configured auth |
+| Audience | What it optimizes for | Presets |
+|---|---|---|
+| Claw Users | Full preset catalog, including provider-backed frontier options | Anthropic, HF open-weight, and Ollama presets |
+| Budget Researchers | Smaller local/free-friendly track | GPT-OSS 20B, Qwen 3.5 27B, Qwen3 32B, Gemma 4 26B |
+
+**Current preset catalog:**
+
+| Model | Provider | Audience |
+|---|---|---|
+| GPT-OSS 20B (Ollama) | Ollama | Claw Users, Budget Researchers |
+| Qwen 3.5 27B (Ollama) | Ollama | Claw Users, Budget Researchers |
+| Qwen3 32B | HuggingFace | Claw Users, Budget Researchers |
+| Gemma 4 26B MoE | HuggingFace | Claw Users, Budget Researchers |
+| GLM 5.1 | HuggingFace | Claw Users |
+| GLM 5 | HuggingFace | Claw Users |
+| DeepSeek R1 | HuggingFace | Claw Users |
+| Kimi K2 Instruct | HuggingFace | Claw Users |
+| MiniMax M2.5 | HuggingFace | Claw Users |
+| Llama 3.3 70B | HuggingFace | Claw Users |
+| Llama 3.1 70B | HuggingFace | Claw Users |
+| Claude Sonnet 4.6 | Anthropic | Claw Users |
+| Claude Opus 4.6 | Anthropic | Claw Users |
 """)

    with gr.Tab("Queue"):
@@ -1167,7 +1258,7 @@ Current formula:
 - reported as a sidecar signal and does not change the official deterministic leaderboard score

 ### Task Design
-- 20 tasks across 5 tiers
+- 19 tasks across 5 tiers
 - deterministic local services for browser tasks
 - multi-file assets with real bugs, missing tests, and migration work
 - scripted user turns and optional multi-phase fresh-session tasks
@@ -1175,19 +1266,19 @@ Current formula:
 ### Coverage snapshot
 ```text
 Tier mix
-tier1 | ###   3
-tier2 | ##### 5
-tier3 | ##### 5
-tier4 | ####  4
-tier5 | ###   3
+tier1 | ##     2
+tier2 | ###### 6
+tier3 | #####  5
+tier4 | #####  5
+tier5 | #      1

 Family mix
-repo        | ###### 6
-coding      | ####   4
-multi_tool  | ###    3
-adversarial | ###    3
-browser     | ##     2
-tools       | ##     2
+tools       | ######## 8
+repo        | ###      3
+coding      | ##       2
+multi_tool  | ###      3
+browser     | ##       2
+adversarial | #        1
 ```

 ### pass^k: Production Reliability
--- a/clawbench/ablation.py
+++ b/clawbench/ablation.py
@@ -0,0 +1,313 @@
+"""Ablation profiles and fair-comparison helpers.
+
+The benchmark can only explain model, harness, and tool effects if those
+axes are represented explicitly in run metadata. This module keeps that
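+representation small and deterministic: a harness driver plus a tool
+profile yields a fingerprint, and result comparison only calls a delta
+fair when the task set, task snapshot, prompt variant, and benchmark
+release match, and only calls it a controlled ablation when the model
+matches as well.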
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+import subprocess
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, Iterable
+
+from pydantic import BaseModel, Field
+
+from clawbench.adapters import get_adapter
+from clawbench.adapters.base import AdapterConfig
+from clawbench.canonical import AdapterCapability
+from clawbench.canonical.convert import from_task_definition
+from clawbench.schemas import BenchmarkResult, TaskDefinition
+
+
+CAPABILITY_TO_INTERFACE: dict[AdapterCapability, str] = {
+    AdapterCapability.FILES: "filesystem",
+    AdapterCapability.EXECUTION: "shell",
+    AdapterCapability.MEMORY: "memory",
+    AdapterCapability.SESSION: "session",
+    AdapterCapability.CRON: "scheduler",
+    AdapterCapability.BROWSER: "browser",
+    AdapterCapability.GATEWAY_RPC: "gateway_rpc",
+    AdapterCapability.MULTI_TURN_INJECTION: "multi_turn",
+}
+
+
+class HarnessDescriptor(BaseModel):
+    """Identifies the agent loop being measured."""
+
+    adapter: str
+    driver: str = ""
+    version: str = ""
+    git_sha: str = ""
+    source: str = ""
+    invocation: str = "clawbench"
+
+
+class ToolProfile(BaseModel):
+    """The tools/interfaces exposed to a harness run."""
+
+    name: str
+    mode: str = "native"
+    interfaces: list[str] = Field(default_factory=list)
+    adapter_capabilities: list[str] = Field(default_factory=list)
+    enabled_toolsets: list[str] = Field(default_factory=list)
+    disabled_toolsets: list[str] = Field(default_factory=list)
+    tools: list[str] = Field(default_factory=list)
+    fingerprint: str = ""
+
+    def with_fingerprint(self) -> "ToolProfile":
+        payload = {
+            "name": self.name,
+            "mode": self.mode,
+            "interfaces": sorted(self.interfaces),
+            "adapter_capabilities": sorted(self.adapter_capabilities),
+            "enabled_toolsets": sorted(self.enabled_toolsets),
+            "disabled_toolsets": sorted(self.disabled_toolsets),
+            "tools": sorted(self.tools),
+        }
+        digest = hashlib.sha256(
+            json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
+        ).hexdigest()
+        return self.model_copy(update={"fingerprint": digest[:16]})
+
+
+class AblationProfile(BaseModel):
+    """Run-level axis metadata embedded in BenchmarkResult.environment."""
+
+    model: str
+    harness: HarnessDescriptor
+    tool_profile: ToolProfile
+    prompt_profile: str = "clear"
+    fingerprint: str = ""
+
+    def with_fingerprint(self) -> "AblationProfile":
+        tool_profile = self.tool_profile.with_fingerprint()
+        payload = {
+            "model": self.model,
+            "harness": self.harness.model_dump(),
+            "tool_profile": tool_profile.model_dump(),
+            "prompt_profile": self.prompt_profile,
+        }
+        digest = hashlib.sha256(
+            json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
+        ).hexdigest()
+        return self.model_copy(
+            update={
+                "tool_profile": tool_profile,
+                "fingerprint": digest[:16],
+            }
+        )
+
+
+@dataclass(frozen=True)
+class FairTaskSet:
+    task_ids: list[str]
+    skipped: dict[str, list[str]] = field(default_factory=dict)
+
+
+def capabilities_to_interfaces(capabilities: Iterable[AdapterCapability | str]) -> list[str]:
+    values: list[str] = []
+    for cap in capabilities:
+        enum_value = cap if isinstance(cap, AdapterCapability) else AdapterCapability(str(cap))
+        values.append(CAPABILITY_TO_INTERFACE.get(enum_value, enum_value.value))
+    return sorted(set(values))
+
+
+def adapter_capabilities(
+    adapter: str,
+    config: AdapterConfig | None = None,
+) -> set[AdapterCapability]:
+    adapter_cls = get_adapter(adapter)
+    return adapter_cls.supported_capabilities(config)
+
+
+def default_tool_profile(
+    *,
+    adapter: str,
+    config: AdapterConfig | None = None,
+    name: str | None = None,
+    mode: str = "native",
+    enabled_toolsets: list[str] | None = None,
+    disabled_toolsets: list[str] | None = None,
+) -> ToolProfile:
+    caps = adapter_capabilities(adapter, config)
+    profile = ToolProfile(
+        name=name or f"{adapter}-{mode}",
+        mode=mode,
+        interfaces=capabilities_to_interfaces(caps),
+        adapter_capabilities=sorted(cap.value for cap in caps),
+        enabled_toolsets=enabled_toolsets or [],
+        disabled_toolsets=disabled_toolsets or [],
+    )
+    return profile.with_fingerprint()
+
+
+def compatible_task_ids(
+    tasks: Iterable[TaskDefinition],
+    *,
+    adapter: str,
+    config: AdapterConfig | None = None,
+) -> tuple[list[str], dict[str, list[str]]]:
+    caps = adapter_capabilities(adapter, config)
+    task_ids: list[str] = []
+    skipped: dict[str, list[str]] = {}
+    for task in tasks:
+        canonical = from_task_definition(task)
+        missing = set(canonical.required_adapter_capabilities) - caps
+        if missing:
+            skipped[task.id] = sorted(cap.value for cap in missing)
+        else:
+            task_ids.append(task.id)
+    return task_ids, skipped
+
+
+def common_compatible_task_set(
+    tasks: Iterable[TaskDefinition],
+    adapter_configs: dict[str, tuple[str, AdapterConfig | None]],
+) -> FairTaskSet:
+    task_list = list(tasks)
+    common: set[str] | None = None
+    skipped: dict[str, list[str]] = {}
+    for label, (adapter, config) in adapter_configs.items():
+        ids, missing = compatible_task_ids(task_list, adapter=adapter, config=config)
+        ids_set = set(ids)
+        common = ids_set if common is None else common & ids_set
+        for task_id, caps in missing.items():
+            skipped.setdefault(task_id, []).append(f"{label}: {', '.join(caps)}")
+    ordered = [task.id for task in task_list if task.id in (common or set())]
+    return FairTaskSet(task_ids=ordered, skipped=skipped)
+
+
+def build_ablation_profile(
+    *,
+    model: str,
+    adapter: str,
+    config: AdapterConfig | None = None,
+    prompt_profile: str = "clear",
+    harness_version: str = "",
+    harness_git_sha: str = "",
+    harness_source: str = "",
+    driver: str = "",
+    tool_profile_name: str | None = None,
+    enabled_toolsets: list[str] | None = None,
+    disabled_toolsets: list[str] | None = None,
+) -> AblationProfile:
+    harness = HarnessDescriptor(
+        adapter=adapter,
+        driver=driver,
+        version=harness_version,
+        git_sha=harness_git_sha,
+        source=harness_source,
+    )
+    tool_profile = default_tool_profile(
+        adapter=adapter,
+        config=config,
+        name=tool_profile_name,
+        enabled_toolsets=enabled_toolsets,
+        disabled_toolsets=disabled_toolsets,
+    )
+    return AblationProfile(
+        model=model,
+        harness=harness,
+        tool_profile=tool_profile,
+        prompt_profile=prompt_profile,
+    ).with_fingerprint()
+
+
+def compare_results(results: dict[str, BenchmarkResult]) -> dict[str, Any]:
+    """Return score deltas plus fairness checks for result JSONs."""
+
+    labels = list(results)
+    models = {label: result.model for label, result in results.items()}
+    task_sets = {
+        label: [task.task_id for task in result.task_results]
+        for label, result in results.items()
+    }
+    first_tasks = next(iter(task_sets.values()), [])
+    same_task_set = all(tasks == first_tasks for tasks in task_sets.values())
+    same_model = len(set(models.values())) == 1
+    snapshot_fingerprints = {
+        result.task_snapshot_fingerprint
+        for result in results.values()
+        if result.task_snapshot_fingerprint
+    }
+    same_task_snapshot = len(snapshot_fingerprints) <= 1
+    prompt_variants = {
+        str(result.environment.get("prompt_variant", ""))
+        for result in results.values()
+        if result.environment.get("prompt_variant", "")
+    }
+    same_prompt_variant = len(prompt_variants) <= 1
+    benchmark_releases = {
+        result.benchmark_release_id
+        for result in results.values()
+        if result.benchmark_release_id
+    }
+    same_benchmark_release = len(benchmark_releases) <= 1
+    task_verifier_fair = same_task_set and same_task_snapshot and same_prompt_variant and same_benchmark_release
+
+    rows: dict[str, Any] = {}
+    for label, result in results.items():
+        rows[label] = {
+            "model": result.model,
+            "adapter": result.environment.get("adapter", ""),
+            "score": result.overall_score,
+            "completion": result.overall_completion,
+            "trajectory": result.overall_trajectory,
+            "behavior": result.overall_behavior,
+            "reliability": result.overall_reliability,
+            "task_count": len(result.task_results),
+            "task_snapshot_fingerprint": result.task_snapshot_fingerprint,
+            "benchmark_release_id": result.benchmark_release_id,
+            "prompt_variant": result.environment.get("prompt_variant", ""),
+            "dimension_coverage": result.environment.get("dimension_coverage", {}),
+            "ablation": result.environment.get("ablation_profile", {}),
+        }
+
+    deltas: dict[str, float] = {}
+    if labels:
+        baseline = results[labels[0]].overall_score
+        for label in labels[1:]:
+            deltas[f"{label}_minus_{labels[0]}"] = round(
+                results[label].overall_score - baseline,
+                4,
+            )
+
+    return {
+        "fair": bool(task_verifier_fair),
+        "task_verifier_fair": bool(task_verifier_fair),
+        "controlled_ablation": bool(task_verifier_fair and same_model),
+        "same_model": same_model,
+        "same_task_set": same_task_set,
+        "same_task_snapshot": same_task_snapshot,
+        "same_prompt_variant": same_prompt_variant,
+        "same_benchmark_release": same_benchmark_release,
+        "models": models,
+        "task_sets": task_sets,
+        "rows": rows,
+        "deltas": deltas,
+    }
+
+
+def git_head(path: Path) -> tuple[str, str]:
+    """Best-effort `(sha, describe)` for harness provenance."""
+
+    try:
+        sha = subprocess.check_output(
+            ["git", "-C", str(path), "rev-parse", "HEAD"],
+            text=True,
+            stderr=subprocess.DEVNULL,
+        ).strip()
+        desc = subprocess.check_output(
+            ["git", "-C", str(path), "describe", "--tags", "--always", "--dirty"],
+            text=True,
+            stderr=subprocess.DEVNULL,
+        ).strip()
+        return sha, desc
+    except Exception:
+        return "", ""
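The fingerprinting used by `ToolProfile.with_fingerprint` and `AblationProfile.with_fingerprint` above can be illustrated in isolation. The following is a minimal sketch using only the standard library; `profile_fingerprint` and `tool_payload` are hypothetical names chosen for this illustration and are not part of `clawbench`. It shows why list fields are sorted before hashing: profiles that differ only in list ordering should map to the same 16-character digest, while a real change in the profile should not.

```python
import hashlib
import json


def profile_fingerprint(payload: dict) -> str:
    """Hash a canonical JSON encoding and keep the first 16 hex characters."""
    # sort_keys and compact separators make the encoding deterministic.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def tool_payload(name: str, mode: str, interfaces: list[str], tools: list[str]) -> dict:
    """Hypothetical helper: build a payload with order-insensitive list fields."""
    return {
        "name": name,
        "mode": mode,
        # Sorting removes any dependence on the order capabilities were listed in.
        "interfaces": sorted(interfaces),
        "tools": sorted(tools),
    }


same_a = profile_fingerprint(tool_payload("demo-native", "native", ["shell", "filesystem"], []))
same_b = profile_fingerprint(tool_payload("demo-native", "native", ["filesystem", "shell"], []))
changed = profile_fingerprint(tool_payload("demo-native", "native", ["filesystem"], []))
```

Truncating to 16 hex characters keeps the fingerprint short enough to read in result tables; the trade-off is a smaller collision margin than the full SHA-256 digest.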
--- a/clawbench/adapters/__init__.py
+++ b/clawbench/adapters/__init__.py
@@ -0,0 +1,102 @@
+"""Agent adapter layer — Phase-4 of CLAWBENCH_V0_4_SPEC.md.
+
+Adapters plug an agent framework (OpenClaw, Hermes, Codex, Claude Code,
+Deerflow, …) into ClawBench's canonical task pipeline. Each adapter is
+responsible for:
+
+- Setting up the workspace + seed state from a `CanonicalTask`.
+- Driving the agent through each `CanonicalPhase`'s simulated user.
+- Returning a canonical `Transcript` so the scorer, trajectory analyser,
+  and judge can score the run unchanged.
+- Resolving `StateQuery` assertions that fall under its declared
+  capabilities; returning `capability_missing=True` for queries that
+  require a capability the adapter doesn't provide.
+
+The `ADAPTERS` registry is populated by each adapter module at import
+time. `get_adapter(name)` is the canonical lookup.
+"""
+
+from __future__ import annotations
+
+from clawbench.adapters.base import (
+    AdapterConfig,
+    AdapterContext,
+    AgentAdapter,
+    PhaseResult,
+    StateQueryResult,
+)
+
+#: Registry of adapter_name → adapter class. Populated by the adapter
+#: modules at import time (e.g. `from clawbench.adapters.openclaw import *`
+#: registers the OpenClaw adapter). Callers should use `get_adapter`
+#: rather than reading this dict directly.
+ADAPTERS: dict[str, type[AgentAdapter]] = {}
+
+
+def register_adapter(cls: type[AgentAdapter]) -> type[AgentAdapter]:
+    """Decorator / direct-call helper that registers an adapter class.
+
+    Adapters declare themselves via:
+
+    ```
+    @register_adapter
+    class HermesAdapter(AgentAdapter):
+        name = "hermes"
+        ...
+    ```
+    """
+
+    name = getattr(cls, "name", "")
+    if not name:
+        raise ValueError(f"{cls.__name__} must set a non-empty `name` class attribute")
+    existing = ADAPTERS.get(name)
+    if existing is not None and existing is not cls:
+        raise ValueError(
+            f"Adapter name collision: '{name}' is already registered "
+            f"to {existing.__qualname__}"
+        )
+    ADAPTERS[name] = cls
+    return cls
+
+
+def get_adapter(name: str) -> type[AgentAdapter]:
+    """Look up an adapter class by its registered name.
+
+    Import the adapter module before calling this so the registration
+    has run. `clawbench.adapters.openclaw` always loads; optional
+    adapters (hermes, codex) guard their imports and raise a clear
+    error if their runtime dep isn't installed.
+    """
+
+    try:
+        return ADAPTERS[name]
+    except KeyError as exc:
+        available = ", ".join(sorted(ADAPTERS)) or "(none)"
+        raise KeyError(
+            f"Unknown adapter '{name}'. Registered adapters: {available}"
+        ) from exc
+
+
+__all__ = [
+    "ADAPTERS",
+    "AdapterConfig",
+    "AdapterContext",
+    "AgentAdapter",
+    "PhaseResult",
+    "StateQueryResult",
+    "get_adapter",
+    "register_adapter",
+]
+
+
+# Register built-in adapters at import time. Each adapter module is
+# expected to @register_adapter its class. OpenClaw is always
+# available; optional adapters (hermes, codex) guard their imports and
+# are registered only when their runtime dep is present.
+from clawbench.adapters import openclaw as _openclaw  # noqa: E402,F401
+
+try:
+    from clawbench.adapters import hermes as _hermes  # noqa: E402,F401
+except Exception:
+    # hermes-agent is an optional extra; absence is fine.
+    _hermes = None  # type: ignore[assignment]
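The registry pattern in this file can be tried out without installing ClawBench. The sketch below mirrors the checks in `register_adapter` and `get_adapter` from the diff above, but runs against a stand-in `AgentAdapter` base class; `EchoAdapter` and `OtherEcho` are hypothetical adapters that exist only for this illustration.

```python
class AgentAdapter:
    """Stand-in for clawbench.adapters.base.AgentAdapter (illustration only)."""

    name = ""


ADAPTERS: dict[str, type[AgentAdapter]] = {}


def register_adapter(cls: type[AgentAdapter]) -> type[AgentAdapter]:
    """Register a class under its `name`, rejecting empty names and collisions."""
    name = getattr(cls, "name", "")
    if not name:
        raise ValueError(f"{cls.__name__} must set a non-empty `name` class attribute")
    existing = ADAPTERS.get(name)
    # Re-registering the same class is allowed; a different class is a collision.
    if existing is not None and existing is not cls:
        raise ValueError(f"Adapter name collision: '{name}'")
    ADAPTERS[name] = cls
    return cls


def get_adapter(name: str) -> type[AgentAdapter]:
    """Look up a registered adapter, listing known names on failure."""
    try:
        return ADAPTERS[name]
    except KeyError as exc:
        available = ", ".join(sorted(ADAPTERS)) or "(none)"
        raise KeyError(f"Unknown adapter '{name}'. Registered adapters: {available}") from exc


@register_adapter
class EchoAdapter(AgentAdapter):  # hypothetical adapter
    name = "echo"


class OtherEcho(AgentAdapter):  # hypothetical adapter that reuses the same name
    name = "echo"


try:
    register_adapter(OtherEcho)
    collision_rejected = False
except ValueError:
    collision_rejected = True
```

Because the decorator returns the class unchanged, the adapter stays directly importable and testable; registration is only a side effect of importing its module.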
--- a/clawbench/adapters/base.py
+++ b/clawbench/adapters/base.py
@@ -0,0 +1,234 @@
+"""Agent adapter ABC and associated data shapes.
+
+An `AgentAdapter` is the execution counterpart to a `CanonicalTask`. It
+is the only place where framework-specific details (OpenClaw gateway
+RPCs, Hermes `MiniSWERunner`, Claude Code SDK, etc.) live. Everything
+downstream of the adapter — trajectory analysis, scorer, judge, stats —
+consumes a canonical `Transcript` and `TaskRunResult` produced by the
+adapter, so those modules stay unchanged across adapters.
+
+Lifecycle per task run:
+
+1. Harness instantiates `adapter = AdapterClass(config)`.
+2. `async with adapter as adapter:` — starts subprocesses / websockets
+   / whatever this adapter needs to hold open across a run.
+3. `await adapter.setup(ctx)` — realizes seed state, workspace files,
+   background services, pre-run state queries.
+4. For each `CanonicalPhase`: `await adapter.run_phase(phase, ctx)` —
+   drives the simulated user against the agent, returns a
+   `PhaseResult` with the transcript increment.
+5. For each `StateQuery` in `task.verifier.state_queries`:
+   `await adapter.verify_state_query(query, ctx)` — returns whether
+   the assertion held, or that the adapter lacks the capability.
+6. `await adapter.teardown(ctx)` — cleans up agent-side state (the
+   workspace itself is harness-owned).
+"""
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Any, ClassVar
+
+from clawbench.canonical import (
+    AdapterCapability,
+    CanonicalPhase,
+    CanonicalTask,
+    StateQuery,
+)
+from clawbench.schemas import Transcript, TranscriptMessage
+
+
+@dataclass
+class AdapterConfig:
+    """Base config every adapter accepts.
+
+    Adapters subclass this to add their own fields. The harness builds
+    a config instance from CLI flags / env vars and passes it to the
+    adapter constructor.
+    """
+
+    #: Primary model identifier. Semantics are adapter-specific (an
+    #: OpenClaw model id, a Hermes `--model` string, etc.).
+    model: str = ""
+
+
+@dataclass
+class AdapterContext:
+    """Per-run context handed to every adapter method.
+
+    `transcript` is mutated in place across phases: each
+    `run_phase` call appends the messages it observed, so the scorer
+    sees one consolidated `Transcript` at the end.
+    """
+
+    task: CanonicalTask
+    workspace: Path
+    runtime_values: dict[str, Any]
+    run_index: int
+    model: str
+    transcript: Transcript
+    #: Free-form adapter-owned scratch state (e.g. the OpenClaw
+    #: `session_key` and `agent_id`; the Hermes `MiniSWERunner`
+    #: instance). The harness never reads these — the adapter is free
+    #: to use the dict as its own in-context cache.
+    adapter_state: dict[str, Any] = field(default_factory=dict)
+
+
+@dataclass
+class PhaseResult:
+    """The transcript increment produced by a single phase."""
+
+    messages: list[TranscriptMessage] = field(default_factory=list)
+    #: Adapter-specific metadata for this phase (token counts returned
+    #: by the adapter, session identifiers, etc.). Merged into
+    #: `TaskRunResult` under the `efficiency_result` / adapter metadata
+    #: fields where applicable.
+    adapter_metadata: dict[str, Any] = field(default_factory=dict)
+    #: True if the adapter detected that the agent completed normally
+    #: (e.g. Hermes's `completed=True`). Not a pass/fail signal — just
+    #: whether the trajectory ran out of work vs was cut short. The
+    #: scorer uses this in `delivery_outcome` classification.
+    completed_normally: bool = True
+    #: If the phase aborted due to the adapter itself (not the agent),
+    #: populated with an error message the harness surfaces.
+    error: str | None = None
+
+
+@dataclass
+class StateQueryResult:
+    """Result of resolving a `StateQuery` against the adapter's state.
+
+    `capability_missing=True` means "this adapter cannot evaluate this
+    kind of query". The scorer treats that as neutral (neither pass nor
+    fail) and records a skip note in the `CompletionResult`; under
+    `--strict-compat` the harness will have filtered the task out before
+    the adapter ever saw it.
+    """
+
+    ok: bool
+    detail: str = ""
+    capability_missing: bool = False
+
+
+class AgentAdapter(ABC):
+    """Abstract base class for agent adapters.
+
+    Subclasses MUST:
+    - Set a unique `name: ClassVar[str]`.
+    - Set a `capabilities: ClassVar[set[AdapterCapability]]` declaring
+      which state-query kinds the adapter can resolve.
+    - Implement `setup`, `run_phase`, `verify_state_query`, `teardown`.
+    - Optionally implement `__aenter__` / `__aexit__` for long-lived
+      resource setup (a persistent websocket, a subprocess pool).
+    """
+
+    name: ClassVar[str] = ""
+    capabilities: ClassVar[set[AdapterCapability]] = set()
+
+    def __init__(self, config: AdapterConfig | None = None) -> None:
+        self.config: AdapterConfig = config or AdapterConfig()
+
+    # ------------------------------------------------------------------
+    # Optional long-lived resource management.
+    # ------------------------------------------------------------------
+
+    async def __aenter__(self) -> "AgentAdapter":
+        return self
+
+    async def __aexit__(self, exc_type: object, exc: object, tb: object) -> None:
+        return None
+
+    # ------------------------------------------------------------------
+    # Required per-run lifecycle.
+    # ------------------------------------------------------------------
+
+    @abstractmethod
+    async def setup(self, ctx: AdapterContext) -> None:
+        """Realize the workspace, seed state, and any pre-run state.
+
+        The harness has already created the workspace dir and expanded
+        `CanonicalAssets.workspace_files` into it. The adapter is
+        responsible for:
+
+        - Applying `seed_state` entries via an adapter-appropriate
+          mechanism (OpenClaw → memory RPCs; Hermes → file writes).
+        - Starting the agent's process/session so `run_phase` can send
+          turns immediately.
+        """
+
+    @abstractmethod
+    async def run_phase(
+        self,
+        phase: CanonicalPhase,
+        ctx: AdapterContext,
+    ) -> PhaseResult:
+        """Drive one `CanonicalPhase` to completion.
+
+        The simulated user in `phase.user` dictates what to send and
+        when. The adapter's job is to deliver those turns, observe the
+        agent's responses, and append canonical `TranscriptMessage`
+        entries to `ctx.transcript`.
+        """
+
+    @abstractmethod
+    async def verify_state_query(
+        self,
+        query: StateQuery,
+        ctx: AdapterContext,
+    ) -> StateQueryResult:
+        """Resolve one `StateQuery` against the agent's post-run state.
+
+        Adapters whose `capabilities` don't cover `query.required_capability`
+        should return `StateQueryResult(ok=False, capability_missing=True)`.
+        """
+
+    @abstractmethod
+    async def teardown(self, ctx: AdapterContext) -> None:
+        """Release any agent-side state created during `setup`/`run_phase`.
+
+        The harness owns the workspace lifecycle; the adapter owns
+        sessions, subprocesses, and any in-memory caches it held open.
+        """
+
+    # ------------------------------------------------------------------
+    # Convenience helpers available to every adapter.
+    # ------------------------------------------------------------------
+
+    @classmethod
+    def supported_capabilities(
+        cls,
+        config: AdapterConfig | None = None,
+    ) -> set[AdapterCapability]:
+        """Return capabilities available for a concrete adapter config.
+
+        Most adapters have a fixed surface and can use the class-level
+        `capabilities`. Adapters with multiple driver modes, such as Hermes
+        MiniSWE vs full AIAgent, override this to keep task gating honest.
+        """
+
+        return set(cls.capabilities)
+
+    @classmethod
+    def missing_capabilities_for(
+        cls,
+        task: CanonicalTask,
+        config: AdapterConfig | None = None,
+    ) -> set[AdapterCapability]:
+        """Return the subset of `task.required_adapter_capabilities` this
+        adapter cannot cover. Empty set means the task is fully runnable
+        under this adapter.
+        """
+
+        return set(task.required_adapter_capabilities) - cls.supported_capabilities(config)
+
+    @classmethod
+    def supports(
+        cls,
+        task: CanonicalTask,
+        config: AdapterConfig | None = None,
+    ) -> bool:
+        """True iff this adapter can cover every capability the task needs."""
+
+        return not cls.missing_capabilities_for(task, config)
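The six-step lifecycle described in the module docstring above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the ClawBench harness: `Capability`, `Context`, `ToyAdapter`, and `run_task` are hypothetical stand-ins for `AdapterCapability`, `AdapterContext`, a concrete `AgentAdapter`, and the harness, and the `try`/`finally` around teardown is a design choice of this sketch rather than something the diff confirms.

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum


class Capability(Enum):  # stand-in for AdapterCapability
    FILES = "files"
    EXECUTION = "execution"
    MEMORY = "memory"


@dataclass
class Context:  # stand-in for AdapterContext
    transcript: list[str] = field(default_factory=list)


class ToyAdapter:
    """Hypothetical adapter that records lifecycle events instead of running an agent."""

    capabilities = {Capability.FILES, Capability.EXECUTION}

    async def __aenter__(self) -> "ToyAdapter":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None

    async def setup(self, ctx: Context) -> None:
        ctx.transcript.append("setup")

    async def run_phase(self, phase: str, ctx: Context) -> None:
        ctx.transcript.append(f"phase:{phase}")

    async def verify_state_query(self, required: Capability, ctx: Context) -> dict:
        # A query outside the declared capabilities is reported as a skip, not a failure.
        if required not in self.capabilities:
            return {"ok": False, "capability_missing": True}
        return {"ok": True, "capability_missing": False}

    async def teardown(self, ctx: Context) -> None:
        ctx.transcript.append("teardown")


async def run_task(phases: list[str], queries: list[Capability]):
    ctx = Context()
    async with ToyAdapter() as adapter:
        await adapter.setup(ctx)
        try:
            for phase in phases:
                await adapter.run_phase(phase, ctx)
            results = [await adapter.verify_state_query(q, ctx) for q in queries]
        finally:
            # Assumption of this sketch: teardown runs even if a phase raises.
            await adapter.teardown(ctx)
    return ctx.transcript, results


transcript, results = asyncio.run(
    run_task(["p1", "p2"], [Capability.FILES, Capability.MEMORY])
)
```

Here the `MEMORY` query comes back with `capability_missing` set, which corresponds to the neutral skip that `StateQueryResult` describes.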
--- a/clawbench/adapters/hermes.py
+++ b/clawbench/adapters/hermes.py
@@ -0,0 +1,706 @@
+"""Hermes adapter — drives Nous Research `hermes-agent`.
+
+Hermes (https://github.com/NousResearch/hermes-agent) is a Python agent
+framework with `MiniSWERunner` as its clean programmatic entry point.
+This adapter:
+
+1. Realizes the canonical workspace + seed state (seed_state entries
+   with `kind="memory"` become files, since Hermes has no memory RPC).
+2. Constructs a `MiniSWERunner` scoped to the workspace.
+3. For each canonical phase, renders the user turn and calls
+   `runner.run_task(prompt)` in a worker thread, with the phase's
+   timeout enforced as a wall clock.
+4. Parses the returned `conversations` via
+   `clawbench.adapters.hermes_xml.parse_conversation` into a canonical
+   `Transcript` the scorer can consume unchanged.
+5. For state queries the adapter can't resolve (session, cron, custom
+   gateway RPC), returns `capability_missing=True` so the harness
+   reports a clean skip. Memory queries fall back to workspace file
+   scanning via `environment_files.verify_memory_fallback`.
+
+`hermes-agent` is an **optional** dependency (`clawbench[hermes]`). The
+import is guarded so the base install stays lean; calling this adapter
+without the dep installed raises a clear error rather than a cryptic
+`ImportError`.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import importlib.util
+import json
+import logging
+import os
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any
+from urllib.parse import urlparse
+
+from clawbench.adapters import register_adapter
+from clawbench.adapters.base import (
+    AdapterConfig,
+    AdapterContext,
+    AgentAdapter,
+    PhaseResult,
+    StateQueryResult,
+)
+from clawbench.adapters.hermes_xml import parse_chat_messages, parse_conversation
+from clawbench.canonical import (
+    AdapterCapability,
+    CanonicalPhase,
+    StateQuery,
+)
+from clawbench.environment_files import verify_memory_fallback
+from clawbench.render import render_template
+from clawbench.schemas import MemoryState, PromptVariant
+from clawbench.simulated_user import UserSimulator
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# Optional dependency import — guarded so the base install stays lean.
+# ---------------------------------------------------------------------------
+
+def _load_mini_swe_runner() -> tuple[Any, Exception | None]:
+    try:  # pragma: no cover - import-guard branch
+        from mini_swe_runner import MiniSWERunner as runner_cls  # type: ignore[import-not-found]
+
+        return runner_cls, None
+    except Exception as exc:  # pragma: no cover - import-guard branch
+        import_error = exc
+        candidates: list[Path] = []
+        explicit_file = os.environ.get("HERMES_MINI_SWE_RUNNER")
+        if explicit_file:
+            candidates.append(Path(explicit_file).expanduser())
+        for env_name in ("HERMES_AGENT_REPO", "HERMES_INSTALL_DIR"):
+            value = os.environ.get(env_name)
+            if value:
+                candidates.append(Path(value).expanduser() / "mini_swe_runner.py")
+        hermes_home = Path(os.environ.get("HERMES_HOME", "~/.hermes")).expanduser()
+        candidates.append(hermes_home / "hermes-agent" / "mini_swe_runner.py")
+
+        for path in candidates:
+            if not path.is_file():
+                continue
+            try:
+                repo_root = str(path.parent)
+                if repo_root not in sys.path:
+                    sys.path.insert(0, repo_root)
+                spec = importlib.util.spec_from_file_location(
+                    "_clawbench_hermes_mini_swe_runner",
+                    path,
+                )
+                if spec is None or spec.loader is None:
+                    continue
+                module = importlib.util.module_from_spec(spec)
+                sys.modules[spec.name] = module
+                spec.loader.exec_module(module)
+                return module.MiniSWERunner, None
+            except Exception as path_exc:
+                import_error = path_exc
+                continue
+        return None, import_error
+
+
+MiniSWERunner, _HERMES_IMPORT_ERROR = _load_mini_swe_runner()
+
+
+def _load_ai_agent() -> tuple[Any, Exception | None]:
+    try:  # pragma: no cover - import-guard branch
+        from run_agent import AIAgent as agent_cls  # type: ignore[import-not-found]
+
+        return agent_cls, None
+    except Exception as exc:  # pragma: no cover - import-guard branch
+        import_error = exc
+        candidates: list[Path] = []
+        for env_name in ("HERMES_AGENT_REPO", "HERMES_INSTALL_DIR"):
+            value = os.environ.get(env_name)
+            if value:
+                candidates.append(Path(value).expanduser() / "run_agent.py")
+        hermes_home = Path(os.environ.get("HERMES_HOME", "~/.hermes")).expanduser()
+        candidates.append(hermes_home / "hermes-agent" / "run_agent.py")
+
+        for path in candidates:
+            if not path.is_file():
+                continue
+            try:
+                repo_root = str(path.parent)
+                if repo_root not in sys.path:
+                    sys.path.insert(0, repo_root)
+                spec = importlib.util.spec_from_file_location(
+                    "_clawbench_hermes_run_agent",
+                    path,
+                )
+                if spec is None or spec.loader is None:
+                    continue
+                module = importlib.util.module_from_spec(spec)
+                sys.modules[spec.name] = module
+                spec.loader.exec_module(module)
+                return module.AIAgent, None
+            except Exception as path_exc:
+                import_error = path_exc
+                continue
+        return None, import_error
+
+
+AIAgent, _HERMES_AGENT_IMPORT_ERROR = _load_ai_agent()
+
+
+class _CodexToolMessageCompatClient:
+    """Client wrapper for Hermes's Codex Responses shim.
+
+    The current Hermes MiniSWERunner feeds OpenAI chat-style `role="tool"`
+    messages back into `chat.completions.create()`. Hermes's Codex
+    Responses adapter accepts chat-shaped calls but currently forwards
+    those tool messages to Responses as plain input items, where Codex
+    rejects the unsupported role. Rewriting tool results as user-visible
+    text preserves the important observation for the next turn and keeps
+    the runner moving.
+    """
+
+    def __init__(self, inner: Any) -> None:
+        self._inner = inner
+        self.chat = _CodexToolMessageCompatChat(inner.chat)
+        self.api_key = getattr(inner, "api_key", None)
+        self.base_url = getattr(inner, "base_url", None)
+
+    def close(self) -> None:
+        close = getattr(self._inner, "close", None)
+        if callable(close):
+            close()
+
+
+class _CodexToolMessageCompatChat:
+    def __init__(self, inner_chat: Any) -> None:
+        self.completions = _CodexToolMessageCompatCompletions(inner_chat.completions)
+
+
+class _CodexToolMessageCompatCompletions:
+    def __init__(self, inner_completions: Any) -> None:
+        self._inner = inner_completions
+
+    def create(self, **kwargs: Any) -> Any:
+        messages = kwargs.get("messages")
+        if isinstance(messages, list):
+            kwargs = dict(kwargs)
+            kwargs["messages"] = [_rewrite_codex_tool_message(message) for message in messages]
+        return self._inner.create(**kwargs)
+
+
+def _rewrite_codex_tool_message(message: Any) -> Any:
+    if not isinstance(message, dict) or message.get("role") != "tool":
+        return message
+    content = message.get("content", "")
+    if not isinstance(content, str):
+        content = str(content)
+    tool_call_id = message.get("tool_call_id") or message.get("name") or "tool"
+    return {
+        "role": "user",
+        "content": f"Tool result ({tool_call_id}):\n{content}",
+    }
+
+
+# ---------------------------------------------------------------------------
+# Config
+# ---------------------------------------------------------------------------
+
+
+@dataclass
+class HermesAdapterConfig(AdapterConfig):
+    """Config for the Hermes adapter.
+
+    Fields map onto `MiniSWERunner` kwargs; ClawBench passes the
+    canonical model string through verbatim so users pick Hermes-
+    supported models via the existing `--model` flag.
+    """
+
+    env_type: str = "local"
+    max_iterations: int = 15
+    timeout_seconds: int = 60
+    base_url: str | None = None
+    api_key: str | None = None
+    provider: str | None = None
+    api_mode: str | None = None
+    prompt_variant: str = PromptVariant.CLEAR.value
+    driver_mode: str = "mini_swe"
+    enabled_toolsets: list[str] | None = None
+    disabled_toolsets: list[str] | None = None
+    hermes_home: str | None = None
+    tool_delay_seconds: float = 0.0
+    # Optional: explicit `MiniSWERunner` / `AIAgent` factories. Tests use
+    # them to plug in stubs; production code leaves both None and the
+    # adapter instantiates the real runner or agent during `setup()`.
+    runner_factory: Any = None
+    agent_factory: Any = None
+
+
+@register_adapter
+class HermesAdapter(AgentAdapter):
+    """Adapter for the Nous Research hermes-agent."""
+
+    name = "hermes"
+    capabilities = {
+        AdapterCapability.FILES,
+        AdapterCapability.EXECUTION,
+    }
+
+    @classmethod
+    def supported_capabilities(cls, config: AdapterConfig | None = None) -> set[AdapterCapability]:
+        if isinstance(config, HermesAdapterConfig) and config.driver_mode == "ai_agent":
+            return {
+                AdapterCapability.FILES,
+                AdapterCapability.EXECUTION,
+                AdapterCapability.MEMORY,
+                AdapterCapability.CRON,
+                AdapterCapability.BROWSER,
+                AdapterCapability.MULTI_TURN_INJECTION,
+            }
+        return set(cls.capabilities)
+
+    def __init__(self, config: HermesAdapterConfig | None = None) -> None:
+        super().__init__(config or HermesAdapterConfig())
+        self._config: HermesAdapterConfig = self.config  # type: ignore[assignment]
+
+    # ------------------------------------------------------------------
+    # Lifecycle.
+    # ------------------------------------------------------------------
+
+    async def setup(self, ctx: AdapterContext) -> None:
+        """Realize memory seed state as files and build the runner.
+
+        With `env_type=local`, Hermes operates directly on the workspace
+        filesystem, so memory seed entries are written out as
+        `memory/<key>.md` files. Callers that want a different mapping
+        can pre-populate the workspace before invoking the adapter.
+        """
+
+        for seed in ctx.task.assets.seed_state:
+            if seed.kind == "memory" and seed.key:
+                target = ctx.workspace / "memory" / f"{seed.key}.md"
+                target.parent.mkdir(parents=True, exist_ok=True)
+                content = seed.content or ""
+                if not isinstance(content, str):
+                    content = str(content)
+                target.write_text(content, encoding="utf-8")
+
+        if self._config.driver_mode == "ai_agent":
+            agent = self._build_ai_agent(ctx)
+            ctx.adapter_state["agent"] = agent
+            ctx.adapter_state["conversation_history"] = []
+            ctx.adapter_state["hermes_home"] = self._hermes_home(ctx)
+        else:
+            runner = self._build_runner(ctx)
+            ctx.adapter_state["runner"] = runner
+        ctx.adapter_state.setdefault("api_calls", 0)
+
+    def _hermes_home(self, ctx: AdapterContext) -> Path:
+        configured = self._config.hermes_home
+        if configured:
+            return Path(configured).expanduser()
+        return ctx.workspace / ".hermes"
+
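+    def _prepare_process_env(self, ctx: AdapterContext) -> None:
+        """Point Hermes at a per-run home directory.
+
+        Mutates process-wide state (`os.environ` and, when already
+        imported, the `cron.jobs` module paths), so concurrent runs in one
+        process would share these values.
+        """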
+        hermes_home = self._hermes_home(ctx)
+        hermes_home.mkdir(parents=True, exist_ok=True)
+        os.environ["HERMES_HOME"] = str(hermes_home)
+        os.environ["TERMINAL_CWD"] = str(ctx.workspace)
+        os.environ.setdefault("TERMINAL_ENV", "local")
+        cron_jobs = sys.modules.get("cron.jobs")
+        if cron_jobs is not None:
+            cron_dir = hermes_home / "cron"
+            setattr(cron_jobs, "HERMES_DIR", hermes_home)
+            setattr(cron_jobs, "CRON_DIR", cron_dir)
+            setattr(cron_jobs, "JOBS_FILE", cron_dir / "jobs.json")
+            setattr(cron_jobs, "OUTPUT_DIR", cron_dir / "output")
+
+    def _effective_model(self, ctx: AdapterContext) -> str:
+        """Translate ClawBench provider-prefixed slugs for direct providers."""
+
+        model = ctx.model
+        if self._config.provider:
+            return model
+        base_url = self._config.base_url or ""
+        try:
+            host = urlparse(base_url).hostname or ""
+        except Exception:
+            host = ""
+        if host == "api.openai.com" and model.startswith("openai/"):
+            return model.split("/", 1)[1]
+        return model
+
+    def _runtime_provider_hint(self) -> str | None:
+        """Return the provider identity Hermes should expose to its runtime.
+
+        Hermes distinguishes the transport used for the main model from the
+        auxiliary routing metadata it exposes to side tasks. Direct
+        OpenAI-compatible endpoints need to keep their explicit base URL and
+        API key, but should still identify as ``custom`` so Hermes auxiliary
+        calls resolve to the same primary model instead of falling through to
+        auto-detected providers such as OpenRouter.
+        """
+
+        if self._config.provider:
+            return self._config.provider
+        if self._config.base_url:
+            return "custom"
+        return None
+
+    def _build_runner(self, ctx: AdapterContext) -> Any:
+        explicit_api_key = None if self._config.provider else self._config.api_key
+        explicit_base_url = None if self._config.provider else self._config.base_url
+        effective_model = self._effective_model(ctx)
+        ctx.adapter_state["effective_model"] = effective_model
+        if self._config.runner_factory is not None:
+            return self._config.runner_factory(
+                model=effective_model,
+                env_type=self._config.env_type,
+                cwd=str(ctx.workspace),
+                max_iterations=self._config.max_iterations,
+                command_timeout=self._config.timeout_seconds,
+                base_url=explicit_base_url,
+                api_key=explicit_api_key,
+            )
+        if MiniSWERunner is None:  # pragma: no cover - import-guard branch
+            raise RuntimeError(
+                "HermesAdapter requires Hermes Agent's `mini_swe_runner.py`. "
+                "Install Hermes with the official installer, or set "
+                "`HERMES_AGENT_REPO=/path/to/hermes-agent` / "
+                "`HERMES_MINI_SWE_RUNNER=/path/to/mini_swe_runner.py`. "
+                f"Underlying import error: {_HERMES_IMPORT_ERROR!r}"
+            )
+        runner = MiniSWERunner(
+            model=effective_model,
+            env_type=self._config.env_type,
+            cwd=str(ctx.workspace),
+            max_iterations=self._config.max_iterations,
+            command_timeout=self._config.timeout_seconds,
+            base_url=explicit_base_url,
+            api_key=explicit_api_key,
+        )
+        if self._config.provider:
+            try:
+                from agent.auxiliary_client import resolve_provider_client
+            except Exception as exc:  # pragma: no cover - optional Hermes internals
+                raise RuntimeError(
+                    f"Hermes provider routing requested for '{self._config.provider}', "
+                    "but Hermes provider utilities could not be imported."
+                ) from exc
+            client, resolved_model = resolve_provider_client(
+                self._config.provider,
+                model=ctx.model,
+            )
+            if client is None or not resolved_model:
+                raise RuntimeError(
+                    f"Hermes provider '{self._config.provider}' did not resolve credentials."
+                )
+            if self._config.provider == "openai-codex":
+                client = _CodexToolMessageCompatClient(client)
+            runner.client = client
+            runner.model = str(resolved_model)
+        return runner
+
+    def _build_ai_agent(self, ctx: AdapterContext) -> Any:
+        self._prepare_process_env(ctx)
+        explicit_api_key = None if self._config.provider else self._config.api_key
+        explicit_base_url = None if self._config.provider else self._config.base_url
+        enabled_toolsets = self._config.enabled_toolsets or ["hermes-api-server"]
+        effective_model = self._effective_model(ctx)
+        provider_hint = self._runtime_provider_hint()
+        ctx.adapter_state["effective_model"] = effective_model
+        if self._config.agent_factory is not None:
+            return self._config.agent_factory(
+                model=effective_model,
+                base_url=explicit_base_url,
+                api_key=explicit_api_key,
+                provider=provider_hint,
+                api_mode=self._config.api_mode,
+                max_iterations=self._config.max_iterations,
+                enabled_toolsets=enabled_toolsets,
+                disabled_toolsets=self._config.disabled_toolsets,
+            )
+        if AIAgent is None:  # pragma: no cover - import-guard branch
+            raise RuntimeError(
+                "HermesAdapter full mode requires Hermes Agent's `run_agent.py`. "
+                "Set `HERMES_AGENT_REPO=/path/to/hermes-agent` or install Hermes. "
+                f"Underlying import error: {_HERMES_AGENT_IMPORT_ERROR!r}"
+            )
+        return AIAgent(
+            base_url=explicit_base_url,
+            api_key=explicit_api_key,
+            provider=provider_hint,
+            api_mode=self._config.api_mode,
+            model=effective_model,
+            max_iterations=self._config.max_iterations,
+            tool_delay=self._config.tool_delay_seconds,
+            enabled_toolsets=enabled_toolsets,
+            disabled_toolsets=self._config.disabled_toolsets,
+            quiet_mode=True,
+            verbose_logging=False,
+            skip_context_files=True,
+            session_id=f"clawbench-{ctx.task.id}-run{ctx.run_index}",
+            platform="cli",
+        )
+
+    async def run_phase(
+        self,
+        phase: CanonicalPhase,
+        ctx: AdapterContext,
+    ) -> PhaseResult:
+        """Render the phase's first user turn, invoke Hermes, parse output.
+
+        v1 limitation in `mini_swe` mode: only the first turn of each
+        phase is delivered. Tasks that declare `MULTI_TURN_INJECTION` as
+        a required capability are filtered out at harness level before
+        the adapter is invoked (harness gating lands in a later step).
+        Later turns are dropped here as well, so the adapter stays
+        predictable if it is driven directly. In `ai_agent` mode all
+        turns are delivered through `UserSimulator`.
+        """
+
+        if self._config.driver_mode == "ai_agent":
+            return await self._run_ai_agent_phase(phase, ctx)
+
+        runner = ctx.adapter_state.get("runner")
+        if runner is None:
+            return PhaseResult(
+                error="HermesAdapter.run_phase called before setup(); no runner",
+                completed_normally=False,
+            )
+
+        if not phase.user.turns:
+            return PhaseResult(completed_normally=True)
+
+        # Hermes cannot receive dynamic follow-ups; we render and send
+        # only the first turn. Later turns remain in the canonical
+        # phase description but are intentionally dropped here.
+        first_turn = phase.user.turns[0]
+        message = first_turn.variant_messages.get(
+            self._config.prompt_variant, first_turn.message
+        )
+        prompt = render_template(message, ctx.runtime_values)
+
+        phase_timeout = float(
+            phase.timeout_seconds
+            or ctx.task.budgets.timeout_seconds
+            or self._config.timeout_seconds * self._config.max_iterations
+        )
+
+        try:
+            result: dict[str, Any] = await asyncio.wait_for(
+                asyncio.to_thread(runner.run_task, prompt),
+                timeout=phase_timeout,
+            )
+        except asyncio.TimeoutError:
+            return PhaseResult(
+                error=f"Hermes phase '{phase.name}' exceeded {phase_timeout:.0f}s",
+                completed_normally=False,
+            )
+        except Exception as exc:  # pragma: no cover - runner-internal error
+            return PhaseResult(
+                error=f"HermesAdapter runner error: {exc}",
+                completed_normally=False,
+            )
+
+        phase_transcript = parse_conversation(result or {})
+        ctx.transcript.messages.extend(phase_transcript.messages)
+
+        api_calls = int(result.get("api_calls", 0)) if isinstance(result, dict) else 0
+        ctx.adapter_state["api_calls"] = (
+            int(ctx.adapter_state.get("api_calls", 0)) + api_calls
+        )
+
+        return PhaseResult(
+            messages=phase_transcript.messages,
+            adapter_metadata={
+                "api_calls": api_calls,
+                "hermes_metadata": result.get("metadata", {}) if isinstance(result, dict) else {},
+            },
+            completed_normally=bool(result.get("completed", False)) if isinstance(result, dict) else False,
+        )
+
+    async def _run_ai_agent_phase(
+        self,
+        phase: CanonicalPhase,
+        ctx: AdapterContext,
+    ) -> PhaseResult:
+        agent = ctx.adapter_state.get("agent")
+        if agent is None:
+            return PhaseResult(
+                error="HermesAdapter.run_phase called before setup(); no AIAgent",
+                completed_normally=False,
+            )
+
+        simulator = UserSimulator(
+            phase.user,
+            ctx.runtime_values,
+            prompt_variant=self._config.prompt_variant,
+        )
+        phase_timeout = float(
+            phase.timeout_seconds
+            or ctx.task.budgets.timeout_seconds
+            or self._config.timeout_seconds * self._config.max_iterations
+        )
+        appended_messages: list = []
+        phase_api_calls = 0
+        completed = True
+
+        while not simulator.is_done:
+            user_message = await simulator.next_message(ctx.transcript)
+            if user_message is None:
+                break
+            history = list(ctx.adapter_state.get("conversation_history") or [])
+            try:
+                result: dict[str, Any] = await asyncio.wait_for(
+                    asyncio.to_thread(
+                        agent.run_conversation,
+                        user_message,
+                        conversation_history=history or None,
+                        task_id=f"{ctx.task.id}-run{ctx.run_index}",
+                    ),
+                    timeout=phase_timeout,
+                )
+            except asyncio.TimeoutError:
+                return PhaseResult(
+                    messages=appended_messages,
+                    error=f"Hermes AIAgent phase '{phase.name}' exceeded {phase_timeout:.0f}s",
+                    completed_normally=False,
+                )
+            except Exception as exc:  # pragma: no cover - agent-internal error
+                return PhaseResult(
+                    messages=appended_messages,
+                    error=f"HermesAdapter AIAgent error: {exc}",
+                    completed_normally=False,
+                )
+
+            messages = result.get("messages", []) if isinstance(result, dict) else []
+            if not isinstance(messages, list):
+                messages = []
+            delta = messages[len(history):] if len(messages) >= len(history) else messages
+            phase_transcript = parse_chat_messages(delta)
+            ctx.transcript.messages.extend(phase_transcript.messages)
+            appended_messages.extend(phase_transcript.messages)
+            ctx.adapter_state["conversation_history"] = messages
+            phase_api_calls += int(result.get("api_calls", 0)) if isinstance(result, dict) else 0
+            completed = completed and (
+                bool(result.get("completed", False)) if isinstance(result, dict) else False
+            )
+
+        ctx.adapter_state["api_calls"] = (
+            int(ctx.adapter_state.get("api_calls", 0)) + phase_api_calls
+        )
+        return PhaseResult(
+            messages=appended_messages,
+            adapter_metadata={
+                "api_calls": phase_api_calls,
+                "driver_mode": "ai_agent",
+            },
+            completed_normally=completed,
+        )
+
+    async def verify_state_query(
+        self,
+        query: StateQuery,
+        ctx: AdapterContext,
+    ) -> StateQueryResult:
+        if query.kind == "memory":
+            fallback_state = MemoryState(
+                key_pattern=str(query.selector.get("key_pattern", "")),
+                exists=query.predicate != "absent",
+                value_contains=list(query.expected.get("value_contains", [])),
+            )
+            extra_memory_text = self._read_hermes_memory_text(ctx)
+            ok, detail = verify_memory_fallback(
+                fallback_state,
+                ctx.workspace,
+                transcript=ctx.transcript,
+                extra_memory_text=extra_memory_text,
+            )
+            return StateQueryResult(ok=ok, detail=detail)
+
+        if self._config.driver_mode == "ai_agent" and query.kind == "session":
+            expected_model = str(query.expected.get("model") or "")
+            if query.predicate == "absent":
+                return StateQueryResult(ok=False, detail="Hermes AIAgent session exists")
+            if expected_model and expected_model.lower() not in ctx.model.lower():
+                return StateQueryResult(
+                    ok=False,
+                    detail=f"Model mismatch: expected {expected_model}, got {ctx.model}",
+                )
+            return StateQueryResult(ok=True, detail="OK")
+
+        if self._config.driver_mode == "ai_agent" and query.kind == "cron":
+            return self._verify_cron_file(query, ctx)
+
+        # Anything not handled above (session/cron outside `ai_agent`
+        # mode, custom gateway state) is not exposed by HermesAdapter.
+        # Flag as capability-missing so the scorer can apply the neutral
+        # skip policy.
+        return StateQueryResult(
+            ok=False,
+            detail=(
+                f"HermesAdapter does not resolve '{query.kind}' state queries "
+                f"(missing capability {query.required_capability.value})"
+            ),
+            capability_missing=True,
+        )
+
+    def _read_hermes_memory_text(self, ctx: AdapterContext) -> str:
+        hermes_home = Path(ctx.adapter_state.get("hermes_home") or self._hermes_home(ctx))
+        candidates = [
+            hermes_home / "memory",
+            hermes_home / "memories",
+            hermes_home / "user_memory",
+        ]
+        chunks: list[str] = []
+        for candidate in candidates:
+            if candidate.is_file():
+                chunks.append(candidate.read_text(encoding="utf-8", errors="replace"))
+            elif candidate.is_dir():
+                for path in candidate.rglob("*"):
+                    if path.is_file() and path.suffix.lower() in {".md", ".txt", ".json"}:
+                        try:
+                            chunks.append(path.read_text(encoding="utf-8", errors="replace"))
+                        except Exception:
+                            continue
+        return "\n".join(chunks)
+
+    def _verify_cron_file(
+        self,
+        query: StateQuery,
+        ctx: AdapterContext,
+    ) -> StateQueryResult:
+        hermes_home = Path(ctx.adapter_state.get("hermes_home") or self._hermes_home(ctx))
+        jobs_file = hermes_home / "cron" / "jobs.json"
+        if not jobs_file.is_file():
+            if query.predicate == "absent":
+                return StateQueryResult(ok=True, detail="Correctly absent")
+            return StateQueryResult(ok=False, detail=f"No Hermes cron jobs file at {jobs_file}")
+        try:
+            payload = json.loads(jobs_file.read_text(encoding="utf-8"))
+        except Exception as exc:
+            return StateQueryResult(ok=False, detail=f"Could not read Hermes cron jobs: {exc}")
+        jobs = payload.get("jobs", []) if isinstance(payload, dict) else payload
+        if not isinstance(jobs, list):
+            jobs = []
+        if query.predicate == "absent":
+            return StateQueryResult(
+                ok=not jobs,
+                detail="Correctly absent" if not jobs else "Cron jobs exist",
+            )
+        description_contains = query.selector.get("description_contains")
+        if not jobs:
+            return StateQueryResult(ok=False, detail="No cron jobs found")
+        if description_contains:
+            needle = str(description_contains).lower()
+            if not any(needle in json.dumps(job, sort_keys=True).lower() for job in jobs):
+                return StateQueryResult(
+                    ok=False,
+                    detail=f"No cron job matched '{description_contains}'",
+                )
+        return StateQueryResult(ok=True, detail="OK")
+
+    async def teardown(self, ctx: AdapterContext) -> None:
+        """Release the runner/agent references so GC can reclaim their resources."""
+
+        ctx.adapter_state.pop("runner", None)
+        ctx.adapter_state.pop("agent", None)
+
+
+__all__ = ["HermesAdapter", "HermesAdapterConfig"]
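A minimal, standalone sketch of the tool-message rewrite that `_CodexToolMessageCompatCompletions.create()` applies before forwarding a request. The function below mirrors `_rewrite_codex_tool_message` from the diff above under a different name (`rewrite_tool_message` is illustrative only), and the sample `history` is made up for the example; it is not taken from a real Hermes run.

```python
from typing import Any


def rewrite_tool_message(message: Any) -> Any:
    """Illustrative copy of the rewrite rule: tool messages become user text."""
    if not isinstance(message, dict) or message.get("role") != "tool":
        return message  # non-tool messages pass through untouched
    content = message.get("content", "")
    if not isinstance(content, str):
        content = str(content)
    # Prefer the tool call ID, then the tool name, then a generic label.
    label = message.get("tool_call_id") or message.get("name") or "tool"
    return {"role": "user", "content": f"Tool result ({label}):\n{content}"}


# Hypothetical chat history, invented for this example.
history = [
    {"role": "user", "content": "List the files."},
    {"role": "tool", "tool_call_id": "call_1", "content": "file.py"},
]
rewritten = [rewrite_tool_message(m) for m in history]
print(rewritten[1])
```

The rewrite builds a new dict instead of mutating the input, so the caller's message list keeps its original `role="tool"` entries; only the copy sent to the Codex Responses shim carries the user-role text.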
--- a/clawbench/adapters/hermes_xml.py
+++ b/clawbench/adapters/hermes_xml.py
@ -0,0 +1,494 @@
+"""Hermes agent conversation → ClawBench `Transcript` converter.
+
+Hermes's `MiniSWERunner.run_task()` returns a dict shaped like:
+
+```json
+{
+  "conversations": [
+    {"from": "system", "value": "..."},
+    {"from": "user", "value": "..."},
+    {"from": "assistant", "value": "I'll look at the file.\\n<tool_call>{\\"name\\":\\"bash\\",\\"arguments\\":{\\"cmd\\":\\"ls\\"}}</tool_call>"},
+    {"from": "tool", "value": "<tool_response>{\\"stdout\\":\\"file.py\\"}</tool_response>"},
+    {"from": "assistant", "value": "<tool_call>...</tool_call>"},
+    ...
+  ],
+  "completed": true,
+  "api_calls": 7,
+  "metadata": {...}
+}
+```
+
+This module parses that into a canonical `Transcript` with
+`TranscriptMessage` + `ToolCall` entries so the scorer / trajectory /
+judge layers can score the run without any Hermes-specific knowledge.
+
+The XML parsing is deliberately tolerant: Hermes transcripts observed
+in the wild sometimes have malformed JSON inside `<tool_call>` tags
+(trailing commas, unescaped newlines). We fall back to a permissive
+regex extraction in that case so a single bad tool call doesn't tank
+the whole transcript.
+"""
+
+from __future__ import annotations
+
+import json
+import re
+from typing import Any, Iterable
+
+from clawbench.schemas import ToolCall, Transcript, TranscriptMessage
+
+
+#: One `<tool_call>…</tool_call>` block. Non-greedy across newlines.
+_TOOL_CALL_RE = re.compile(
+    r"<tool_call>\s*(?P<body>.*?)\s*</tool_call>", re.DOTALL
+)
+
+#: One `<tool_response>…</tool_response>` block.
+_TOOL_RESPONSE_RE = re.compile(
+    r"<tool_response>\s*(?P<body>.*?)\s*</tool_response>", re.DOTALL
+)
+
+
+def _coerce_role(raw: str) -> str:
+    """Normalize Hermes role labels to ClawBench `TranscriptMessage.role`.
+
+    ClawBench uses `"user"`, `"assistant"`, `"system"`, `"tool"`. Hermes
+    can emit `"human"`/`"gpt"`/`"function"` variants; we map them all
+    down to the canonical vocabulary.
+    """
+
+    value = (raw or "").strip().lower()
+    if value in {"assistant", "gpt", "model"}:
+        return "assistant"
+    if value in {"user", "human"}:
+        return "user"
+    if value in {"tool", "function", "tool_response"}:
+        return "tool"
+    if value == "system":
+        return "system"
+    return value or "assistant"
+
+
+def _extract_json_objects(text: str) -> list[dict[str, Any]]:
+    """Parse 0-or-more top-level JSON objects from free-form text.
+
+    Hermes usually puts a single JSON object inside each `<tool_call>`,
+    but we handle multi-object payloads defensively. Returns an empty
+    list if no valid JSON is present.
+    """
+
+    text = text.strip()
+    if not text:
+        return []
+    try:
+        parsed = json.loads(text)
+        if isinstance(parsed, dict):
+            return [parsed]
+        if isinstance(parsed, list):
+            return [item for item in parsed if isinstance(item, dict)]
+    except json.JSONDecodeError:
+        pass
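+    # Fallback: scan for balanced `{...}` blocks and parse each one on
+    # its own. This recovers objects that are concatenated or wrapped in
+    # stray text; blocks that still fail to parse are silently discarded.
+    # Braces inside JSON strings are not tracked, so this is best-effort.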
+    results: list[dict[str, Any]] = []
+    depth = 0
+    start: int | None = None
+    for i, ch in enumerate(text):
+        if ch == "{":
+            if depth == 0:
+                start = i
+            depth += 1
+        elif ch == "}" and depth > 0:  # ignore stray closing braces
+            depth -= 1
+            if depth == 0 and start is not None:
+                candidate = text[start : i + 1]
+                try:
+                    obj = json.loads(candidate)
+                    if isinstance(obj, dict):
+                        results.append(obj)
+                except json.JSONDecodeError:
+                    pass
+                start = None
+    return results
+
+
+def _tool_call_from_payload(
+    payload: dict[str, Any],
+    *,
+    index: int,
+    timestamp_ms: int,
+) -> ToolCall:
+    """Build a canonical `ToolCall` from a Hermes `<tool_call>` payload.
+
+    Hermes emits `{"name": "...", "arguments": {...}}` inside each
+    tool_call tag. Some Nous-trained models emit slight variants:
+    `"function"` or `"tool"` for the tool name, and `"parameters"`,
+    `"args"`, or `"input"` for the arguments. We accept any of those.
+    """
+
+    name = (
+        payload.get("name")
+        or payload.get("function")
+        or payload.get("tool")
+        or ""
+    )
+    arguments = (
+        payload.get("arguments")
+        or payload.get("parameters")
+        or payload.get("args")
+        or payload.get("input")
+        or {}
+    )
+    if isinstance(arguments, str):
+        # Occasionally Hermes passes a JSON-encoded string of args.
+        try:
+            arguments = json.loads(arguments)
+        except json.JSONDecodeError:
+            arguments = {"raw": arguments}
+    if not isinstance(arguments, dict):
+        arguments = {"value": arguments}
+    call_id = str(payload.get("id") or payload.get("call_id") or f"hermes-{index}")
+    return ToolCall(
+        id=call_id,
+        name=str(name),
+        input=arguments,
+        timestamp_ms=timestamp_ms,
+    )
+
+
+def _tool_response_summary(payload: dict[str, Any]) -> tuple[str, str, bool | None]:
+    """Extract (output, error, success) from a `<tool_response>` payload."""
+
+    output = ""
+    error = ""
+    success: bool | None = None
+
+    stdout = payload.get("stdout")
+    stderr = payload.get("stderr")
+    result = payload.get("result")
+    err = payload.get("error")
+    msg = payload.get("message")
+    status = payload.get("status")
+
+    if isinstance(stdout, str):
+        output = stdout
+    elif isinstance(result, (str, dict, list)):
+        output = result if isinstance(result, str) else json.dumps(result)
+    elif isinstance(msg, str):
+        output = msg
+    if isinstance(stderr, str) and stderr.strip():
+        error = stderr
+    elif isinstance(err, (str, dict, list)):
+        error = err if isinstance(err, str) else json.dumps(err)
+
+    if isinstance(status, str):
+        lowered = status.lower()
+        if lowered in {"ok", "success", "succeeded"}:
+            success = True
+        elif lowered in {"error", "failed", "failure"}:
+            success = False
+    if error and success is None:
+        success = False
+    if not error and output and success is None:
+        success = True
+    return output, error, success
+
+
+def _split_tagged(text: str, tag_re: re.Pattern[str]) -> list[tuple[str, str]]:
+    """Split `text` into `(kind, body)` tuples where `kind` is `"text"` or
+    `"tag"`. Preserves ordering so we can thread tool calls/responses
+    back into the canonical transcript in the order they appeared.
+    """
+
+    pieces: list[tuple[str, str]] = []
+    cursor = 0
+    for match in tag_re.finditer(text):
+        if match.start() > cursor:
+            pieces.append(("text", text[cursor : match.start()]))
+        pieces.append(("tag", match.group("body")))
+        cursor = match.end()
+    if cursor < len(text):
+        pieces.append(("text", text[cursor:]))
+    return pieces
+
+
+def parse_conversation(result: dict[str, Any]) -> Transcript:
+    """Parse a `MiniSWERunner.run_task` result dict into a `Transcript`.
+
+    The conversation is processed in order; tool calls are emitted into
+    the assistant message that contained them, and tool responses are
+    paired with the oldest unpaired call (FIFO). The final Transcript is
+    ready for `annotate_transcript_tool_calls` → scorer.
+    """
+
+    transcript = Transcript()
+    conversations = result.get("conversations") or []
+    pending_calls: list[ToolCall] = []
+    call_counter = 0
+
+    for turn_index, entry in enumerate(conversations):
+        if not isinstance(entry, dict):
+            continue
+        role = _coerce_role(str(entry.get("from", "")))
+        value = str(entry.get("value", "") or "")
+
+        # Tool responses arrive from the tool/function role.
+        if role == "tool":
+            for response_body in _TOOL_RESPONSE_RE.findall(value):
+                payloads = _extract_json_objects(response_body)
+                if not payloads:
+                    payloads = [{"result": response_body}]
+                for payload in payloads:
+                    output, error, success = _tool_response_summary(payload)
+                    if pending_calls:
+                        target = pending_calls.pop(0)
+                        target.output = output
+                        target.error = error
+                        if success is not None:
+                            target.success = success
+                    else:
+                        # Orphan tool response — surface it as a tool
+                        # message so nothing is silently dropped.
+                        transcript.messages.append(
+                            TranscriptMessage(
+                                role="tool",
+                                tool_result_content=output or error,
+                            )
+                        )
+            continue
+
+        # Everything else (assistant / user / system) may carry tool
+        # calls plus free-form text. We interleave them faithfully.
+        pieces = _split_tagged(value, _TOOL_CALL_RE)
+        text_chunks: list[str] = []
+        tool_calls: list[ToolCall] = []
+        for kind, body in pieces:
+            if kind == "text":
+                text_chunks.append(body)
+            else:
+                payloads = _extract_json_objects(body)
+                for payload in payloads:
+                    call_counter += 1
+                    tool_call = _tool_call_from_payload(
+                        payload,
+                        index=call_counter,
+                        timestamp_ms=turn_index,
+                    )
+                    tool_calls.append(tool_call)
+                    pending_calls.append(tool_call)
+
+        joined_text = "\n".join(chunk for chunk in text_chunks if chunk.strip()).strip()
+
+        if role == "assistant":
+            transcript.messages.append(
+                TranscriptMessage(
+                    role="assistant",
+                    text=joined_text,
+                    tool_calls=tool_calls,
+                    timestamp_ms=turn_index,
+                )
+            )
+        elif role == "user":
+            transcript.messages.append(
+                TranscriptMessage(
+                    role="user",
+                    text=joined_text,
+                    timestamp_ms=turn_index,
+                )
+            )
+        elif role == "system":
+            if joined_text:
+                transcript.messages.append(
+                    TranscriptMessage(
+                        role="system",
+                        text=joined_text,
+                        timestamp_ms=turn_index,
+                    )
+                )
+        else:
+            if joined_text:
+                transcript.messages.append(
+                    TranscriptMessage(
+                        role=role,
+                        text=joined_text,
+                        timestamp_ms=turn_index,
+                    )
+                )
+
+    return transcript
+
+
+def _content_to_text(content: Any) -> str:
+    """Normalize OpenAI/Anthropic-style message content to plain text."""
+
+    if content is None:
+        return ""
+    if isinstance(content, str):
+        return content
+    if isinstance(content, list):
+        parts: list[str] = []
+        for part in content:
+            if isinstance(part, str):
+                parts.append(part)
+            elif isinstance(part, dict):
+                if isinstance(part.get("text"), str):
+                    parts.append(part["text"])
+                elif isinstance(part.get("content"), str):
+                    parts.append(part["content"])
+        return "\n".join(parts)
+    if isinstance(content, dict):
+        if isinstance(content.get("text"), str):
+            return content["text"]
+        if isinstance(content.get("content"), str):
+            return content["content"]
+    return str(content)
+
+
+def _tool_call_from_chat_payload(
+    payload: dict[str, Any],
+    *,
+    index: int,
+    timestamp_ms: int,
+) -> ToolCall:
+    """Build a canonical tool call from chat-completions message payloads."""
+
+    function = payload.get("function")
+    if not isinstance(function, dict):
+        function = {}
+    name = (
+        function.get("name")
+        or payload.get("name")
+        or payload.get("tool")
+        or payload.get("type")
+        or ""
+    )
+    arguments = (
+        function.get("arguments")
+        or payload.get("arguments")
+        or payload.get("args")
+        or payload.get("input")
+        or {}
+    )
+    if isinstance(arguments, str):
+        try:
+            arguments = json.loads(arguments)
+        except json.JSONDecodeError:
+            arguments = {"raw": arguments}
+    if not isinstance(arguments, dict):
+        arguments = {"value": arguments}
+    return ToolCall(
+        id=str(payload.get("id") or payload.get("call_id") or f"hermes-chat-{index}"),
+        name=str(name),
+        input=arguments,
+        timestamp_ms=timestamp_ms,
+    )
+
+
+def parse_chat_messages(messages: Iterable[dict[str, Any]]) -> Transcript:
+    """Parse Hermes AIAgent/OpenAI-style message history to a Transcript.
+
+    `AIAgent.run_conversation()` returns a `messages` list with user,
+    assistant, and tool-role entries. This parser preserves ordering and
+    attaches tool-role output back to the assistant `ToolCall` it belongs to.
+    """
+
+    transcript = Transcript()
+    pending_by_id: dict[str, ToolCall] = {}
+    pending_order: list[ToolCall] = []
+    call_counter = 0
+
+    for turn_index, entry in enumerate(messages):
+        if not isinstance(entry, dict):
+            continue
+        role = _coerce_role(str(entry.get("role") or entry.get("from") or ""))
+        text = _content_to_text(entry.get("content", entry.get("value", "")))
+
+        if role == "tool":
+            tool_call_id = str(entry.get("tool_call_id") or entry.get("id") or "")
+            target = pending_by_id.get(tool_call_id) if tool_call_id else None
+            if target is None and pending_order:
+                target = pending_order.pop(0)
+            if target is not None:
+                target.output = text
+                target.success = not _looks_like_error(text)
+                if not target.success:
+                    target.error = text
+            elif text:
+                transcript.messages.append(
+                    TranscriptMessage(
+                        role="tool",
+                        tool_result_for=tool_call_id or None,
+                        tool_result_content=text,
+                        timestamp_ms=turn_index,
+                    )
+                )
+            continue
+
+        tool_calls: list[ToolCall] = []
+        raw_calls = entry.get("tool_calls") or []
+        if isinstance(raw_calls, list):
+            for payload in raw_calls:
+                if not isinstance(payload, dict):
+                    continue
+                call_counter += 1
+                call = _tool_call_from_chat_payload(
+                    payload,
+                    index=call_counter,
+                    timestamp_ms=turn_index,
+                )
+                tool_calls.append(call)
+                pending_by_id[call.id] = call
+                pending_order.append(call)
+
+        if role == "assistant":
+            transcript.messages.append(
+                TranscriptMessage(
+                    role="assistant",
+                    text=text,
+                    tool_calls=tool_calls,
+                    timestamp_ms=turn_index,
+                )
+            )
+        elif role in {"user", "system"}:
+            if text:
+                transcript.messages.append(
+                    TranscriptMessage(
+                        role=role,
+                        text=text,
+                        timestamp_ms=turn_index,
+                    )
+                )
+        elif text:
+            transcript.messages.append(
+                TranscriptMessage(
+                    role=role,
+                    text=text,
+                    timestamp_ms=turn_index,
+                )
+            )
+
+    return transcript
+
+
+def _looks_like_error(text: str) -> bool:
+    lowered = text.lower()
+    return any(token in lowered for token in ("error", "traceback", "failed", "exception"))
+
+
+def iter_tool_calls_from_conversations(conversations: Iterable[dict[str, Any]]) -> list[ToolCall]:
+    """Helper used by tests: pull out just the tool-call sequence.
+
+    Thin wrapper around `parse_conversation({"conversations": list(conv)})`
+    that returns only `tool_call_sequence`. Useful for asserting on call
+    order and arguments without noise.
+    """
+
+    return parse_conversation({"conversations": list(conversations)}).tool_call_sequence
+
+
+__all__ = [
+    "iter_tool_calls_from_conversations",
+    "parse_chat_messages",
+    "parse_conversation",
+]
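Reviewer note: the ordering guarantee of `_split_tagged` is easiest to see on a small input. The sketch below re-implements the same loop in a self-contained form. The `<tool_call>` pattern is an assumption made for illustration only, since the real `_TOOL_CALL_RE` is not visible in this part of the diff.

```python
import re

# Hypothetical pattern for illustration; the real _TOOL_CALL_RE is defined
# elsewhere in the module and may differ.
TOOL_CALL_RE = re.compile(r"<tool_call>(?P<body>.*?)</tool_call>", re.DOTALL)


def split_tagged(text: str, tag_re: re.Pattern[str]) -> list[tuple[str, str]]:
    """Split text into ("text", ...) and ("tag", ...) pieces, in order."""
    pieces: list[tuple[str, str]] = []
    cursor = 0
    for match in tag_re.finditer(text):
        # Emit any free-form text that precedes this tag.
        if match.start() > cursor:
            pieces.append(("text", text[cursor : match.start()]))
        pieces.append(("tag", match.group("body")))
        cursor = match.end()
    # Emit trailing text after the last tag, if any.
    if cursor < len(text):
        pieces.append(("text", text[cursor:]))
    return pieces


sample = 'Checking.<tool_call>{"name": "ls"}</tool_call>Done.'
pieces = split_tagged(sample, TOOL_CALL_RE)
```

Because the pieces come back in source order, a caller can rebuild the assistant text and the tool-call list from a single pass, which is what `parse_conversation` relies on.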
--- a/clawbench/adapters/openclaw.py
+++ b/clawbench/adapters/openclaw.py
@ -0,0 +1,467 @@
+"""OpenClaw adapter — drives tasks through an OpenClaw gateway.
+
+This is the adapter-shaped wrapper around the agent execution flow that
+has lived inside `BenchmarkHarness._run_single` until now. It holds a
+`GatewayClient` open for the run's duration, creates one agent per run
+and one session per phase (matching the existing behavior), delivers
+simulated-user turns, and resolves `StateQuery` assertions against the
+gateway's `memory.search` / `sessions.resolve` / `cron.list` / arbitrary
+`_rpc(method)` surface.
+
+The legacy harness still owns the executable CLI path for now; this
+adapter is the canonical wrapper used by adapter-level tests and later
+harness wiring.
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import uuid
+from dataclasses import dataclass
+
+from clawbench.adapters import register_adapter
+from clawbench.adapters.base import (
+    AdapterConfig,
+    AdapterContext,
+    AgentAdapter,
+    PhaseResult,
+    StateQueryResult,
+)
+from clawbench.canonical import (
+    AdapterCapability,
+    CanonicalPhase,
+    StateQuery,
+)
+from clawbench.client import GatewayClient, GatewayConfig
+from clawbench.environment_files import (
+    resolve_json_path,
+    verify_memory_fallback,
+)
+from clawbench.schemas import (
+    MemoryState,
+    PromptVariant,
+)
+from clawbench.session_labels import unique_session_label
+from clawbench.simulated_user import UserSimulator
+
+logger = logging.getLogger(__name__)
+
+
+@dataclass
+class OpenClawAdapterConfig(AdapterConfig):
+    """Config for the OpenClaw adapter.
+
+    `gateway` holds the connection parameters the adapter uses to reach
+    the OpenClaw gateway. `prompt_variant` controls which wording of
+    each simulated-user turn is rendered.
+    """
+
+    gateway: GatewayConfig | None = None
+    prompt_variant: str = PromptVariant.CLEAR.value
+    # Cap on the per-turn timeout passed to `send_and_wait`; a shorter phase
+    # or task timeout wins. The value matches the existing harness default.
+    turn_timeout_seconds: float = 180.0
+
+
+@register_adapter
+class OpenClawAdapter(AgentAdapter):
+    """Adapter for the OpenClaw gateway (default harness path)."""
+
+    name = "openclaw"
+    capabilities = {
+        AdapterCapability.FILES,
+        AdapterCapability.EXECUTION,
+        AdapterCapability.MEMORY,
+        AdapterCapability.SESSION,
+        AdapterCapability.CRON,
+        AdapterCapability.BROWSER,
+        AdapterCapability.GATEWAY_RPC,
+        AdapterCapability.MULTI_TURN_INJECTION,
+    }
+
+    def __init__(self, config: OpenClawAdapterConfig | None = None) -> None:
+        super().__init__(config or OpenClawAdapterConfig())
+        self._config: OpenClawAdapterConfig = self.config  # type: ignore[assignment]
+        self._gateway_config: GatewayConfig = self._config.gateway or GatewayConfig()
+        self._client: GatewayClient | None = None
+        # Dependency injection hook for tests: monkeypatch this to swap
+        # in a stub gateway without touching the class definition.
+        self._client_factory = lambda: GatewayClient(self._gateway_config)
+
+    # ------------------------------------------------------------------
+    # Long-lived gateway connection.
+    # ------------------------------------------------------------------
+
+    async def __aenter__(self) -> "OpenClawAdapter":
+        client = self._client_factory()
+        await client.__aenter__()
+        self._client = client
+        return self
+
+    async def __aexit__(self, exc_type: object, exc: object, tb: object) -> None:
+        if self._client is not None:
+            try:
+                await self._client.__aexit__(exc_type, exc, tb)
+            finally:
+                self._client = None
+
+    @property
+    def client(self) -> GatewayClient:
+        if self._client is None:
+            raise RuntimeError(
+                "OpenClawAdapter must be used as an async context manager "
+                "before calling setup/run_phase/teardown."
+            )
+        return self._client
+
+    # ------------------------------------------------------------------
+    # Lifecycle.
+    # ------------------------------------------------------------------
+
+    async def setup(self, ctx: AdapterContext) -> None:
+        """Seed memory files, create the per-run agent, and run pre-run state queries."""
+
+        self._realize_memory_seeds(ctx)
+
+        agent_name = (
+            f"clawbench-{ctx.task.id}-run-{ctx.run_index}-{uuid.uuid4().hex[:6]}"
+        )
+        agent_id = await self.client.create_agent(
+            name=agent_name, workspace=str(ctx.workspace)
+        )
+        ctx.adapter_state["agent_id"] = agent_id
+        ctx.adapter_state.setdefault("session_keys", [])
+
+        # Pre-run gateway assertions (ex-`setup.pre_check_gateway`) are
+        # evaluated immediately; failures are surfaced through
+        # `ctx.adapter_state["pre_run_failures"]` so the harness can
+        # fail fast before doing any phase work.
+        failures: list[str] = []
+        for query in ctx.task.verifier.pre_run_queries:
+            result = await self.verify_state_query(query, ctx)
+            if not result.ok:
+                failures.append(result.detail or query.description)
+        if failures:
+            ctx.adapter_state["pre_run_failures"] = failures
+
+    def _realize_memory_seeds(self, ctx: AdapterContext) -> None:
+        """Expose canonical memory seeds through the run workspace.
+
+        OpenClaw's native memory backend has no public seed/write RPC in the
+        benchmark client, but agents can read files in their workspace and the
+        verifier already falls back to these same memory files. This keeps
+        seeded-memory tasks fair across OpenClaw and filesystem-first harnesses.
+        """
+
+        chunks: list[str] = []
+        for seed in ctx.task.assets.seed_state:
+            if seed.kind != "memory" or not seed.key:
+                continue
+            content = seed.content or ""
+            if not isinstance(content, str):
+                content = str(content)
+            safe_key = "".join(
+                ch if ch.isalnum() or ch in ("-", "_") else "_"
+                for ch in seed.key.strip()
+            ).strip("_")
+            if not safe_key:
+                safe_key = "seed"
+            body = f"# {seed.key}\n\n{content.strip()}\n"
+            target = ctx.workspace / "memory" / f"{safe_key}.md"
+            target.parent.mkdir(parents=True, exist_ok=True)
+            target.write_text(body, encoding="utf-8")
+            chunks.append(body)
+
+        if chunks:
+            (ctx.workspace / "MEMORY.md").write_text("\n".join(chunks), encoding="utf-8")
+
+    async def run_phase(
+        self,
+        phase: CanonicalPhase,
+        ctx: AdapterContext,
+    ) -> PhaseResult:
+        """Create a session, drive the simulator, append to the transcript."""
+
+        agent_id = ctx.adapter_state.get("agent_id")
+        if not agent_id:
+            return PhaseResult(
+                error="OpenClawAdapter.run_phase called before setup(); no agent_id",
+                completed_normally=False,
+            )
+
+        session_keys: list[str] = ctx.adapter_state.setdefault("session_keys", [])
+        session_key = await self.client.create_session(
+            model=ctx.model,
+            agent_id=agent_id,
+            label=unique_session_label(
+                f"clawbench-{ctx.task.id}-run{ctx.run_index}-phase{phase.name}"
+            ),
+        )
+        session_keys.append(session_key)
+        ctx.adapter_state["last_session_key"] = session_key
+
+        await self.client.subscribe(session_key)
+
+        # Browser tasks require the browser tool to actually be
+        # registered in the effective tool set for this session. If it
+        # isn't, fail the phase fast rather than letting the agent
+        # flounder against a missing tool.
+        if ctx.task.family.value == "browser":
+            try:
+                await self._assert_browser_support(session_key)
+            except Exception as exc:
+                return PhaseResult(
+                    error=str(exc),
+                    completed_normally=False,
+                )
+
+        simulator = UserSimulator(
+            phase.user,
+            ctx.runtime_values,
+            prompt_variant=self._config.prompt_variant,
+        )
+
+        turn_timeout = float(phase.timeout_seconds or ctx.task.budgets.timeout_seconds)
+        turn_timeout = min(turn_timeout, self._config.turn_timeout_seconds)
+
+        appended: list = []
+        turns_sent = 0
+        while not simulator.is_done:
+            user_message = await simulator.next_message(ctx.transcript)
+            if user_message is None:
+                break
+            phase_transcript = await self.client.send_and_wait(
+                session_key,
+                user_message,
+                timeout=turn_timeout,
+            )
+            ctx.transcript.messages.extend(phase_transcript.messages)
+            appended.extend(phase_transcript.messages)
+            turns_sent += 1
+
+        return PhaseResult(
+            messages=appended,
+            adapter_metadata={
+                "session_key": session_key,
+                "turns_sent": turns_sent,
+            },
+        )
+
+    async def _assert_browser_support(self, session_key: str) -> None:
+        inventory = await self.client.get_effective_tools(session_key)
+        tool_ids = {
+            str(tool.get("id", ""))
+            for group in inventory.get("groups", [])
+            for tool in group.get("tools", [])
+        }
+        if "browser" not in tool_ids:
+            raise RuntimeError(
+                "Browser tasks require the browser tool, but it is not available in this gateway."
+            )
+
+    async def teardown(self, ctx: AdapterContext) -> None:
+        """Delete per-phase sessions and the per-run agent."""
+
+        client = self._client
+        if client is None:
+            return
+        session_keys: list[str] = ctx.adapter_state.get("session_keys", [])
+        agent_id: str | None = ctx.adapter_state.get("agent_id")
+        for session_key in session_keys:
+            try:
+                await client.delete_session(session_key)
+            except Exception as exc:  # pragma: no cover - best effort
+                logger.warning("delete_session failed for %s: %s", session_key, exc)
+        if agent_id:
+            try:
+                await client.delete_agent(agent_id, delete_files=False)
+            except Exception as exc:  # pragma: no cover - best effort
+                logger.warning("delete_agent failed for %s: %s", agent_id, exc)
+
+    # ------------------------------------------------------------------
+    # State query resolution.
+    # ------------------------------------------------------------------
+
+    async def verify_state_query(
+        self,
+        query: StateQuery,
+        ctx: AdapterContext,
+    ) -> StateQueryResult:
+        try:
+            if query.kind == "memory":
+                return await self._verify_memory(query, ctx)
+            if query.kind == "session":
+                return await self._verify_session(query, ctx)
+            if query.kind == "cron":
+                return await self._verify_cron(query, ctx)
+            if query.kind == "custom":
+                return await self._verify_gateway(query, ctx)
+        except Exception as exc:
+            return StateQueryResult(ok=False, detail=str(exc))
+        return StateQueryResult(
+            ok=False,
+            detail=f"OpenClawAdapter has no handler for query kind '{query.kind}'",
+            capability_missing=True,
+        )
+
+    # --- memory ---
+
+    async def _verify_memory(
+        self, query: StateQuery, ctx: AdapterContext
+    ) -> StateQueryResult:
+        key_pattern = str(query.selector.get("key_pattern", ""))
+        value_contains = list(query.expected.get("value_contains", []))
+        session_key = ctx.adapter_state.get("last_session_key", "")
+        agent_id = ctx.adapter_state.get("agent_id")
+
+        # Primary path: memory.search RPC.
+        try:
+            response = await self.client._rpc(
+                "memory.search",
+                {
+                    "query": key_pattern,
+                    "sessionKey": session_key,
+                    "limit": 20,
+                },
+            )
+            entries = response.get("payload", {}).get("entries", [])
+            if query.predicate == "absent":
+                ok = not entries
+                return StateQueryResult(
+                    ok=ok,
+                    detail="Correctly absent" if ok else "Memory entry exists",
+                )
+            if not entries:
+                return StateQueryResult(ok=False, detail="No matching memory entries found")
+            all_values = " ".join(str(entry.get("value", "")) for entry in entries)
+            for token in value_contains:
+                if token.lower() not in all_values.lower():
+                    return StateQueryResult(
+                        ok=False, detail=f"Memory value missing '{token}'"
+                    )
+            return StateQueryResult(ok=True, detail="OK")
+        except Exception as exc:
+            logger.info(
+                "memory.search unavailable for verification, falling back: %s",
+                exc,
+            )
+
+        # Fallback: gateway-sourced memory files + workspace scan + transcript.
+        fallback_state = MemoryState(
+            key_pattern=key_pattern,
+            exists=query.predicate != "absent",
+            value_contains=value_contains,
+        )
+        extra_memory_text = ""
+        if agent_id:
+            try:
+                from clawbench.environment import _read_agent_memory_text  # local import to avoid cycle
+
+                extra_memory_text = await _read_agent_memory_text(self.client, agent_id)
+            except Exception:
+                extra_memory_text = ""
+        ok, detail = verify_memory_fallback(
+            fallback_state,
+            ctx.workspace,
+            transcript=ctx.transcript,
+            extra_memory_text=extra_memory_text,
+        )
+        return StateQueryResult(ok=ok, detail=detail)
+
+    # --- session ---
+
+    async def _verify_session(
+        self, query: StateQuery, ctx: AdapterContext
+    ) -> StateQueryResult:
+        session_key = ctx.adapter_state.get("last_session_key", "")
+        expected_model = query.expected.get("model") or ""
+        try:
+            response = await self.client._rpc("sessions.resolve", {"key": session_key})
+            payload = response.get("payload", {})
+            if query.predicate == "absent":
+                return StateQueryResult(ok=False, detail="Session exists but should not")
+            if expected_model:
+                actual = str(payload.get("model", ""))
+                if str(expected_model).lower() not in actual.lower():
+                    return StateQueryResult(
+                        ok=False,
+                        detail=f"Model mismatch: expected {expected_model}, got {actual}",
+                    )
+            return StateQueryResult(ok=True, detail="OK")
+        except Exception as exc:
+            if query.predicate == "absent":
+                return StateQueryResult(ok=True, detail="Correctly absent")
+            return StateQueryResult(ok=False, detail=str(exc))
+
+    # --- cron ---
+
+    async def _verify_cron(
+        self, query: StateQuery, ctx: AdapterContext
+    ) -> StateQueryResult:
+        description_contains = query.selector.get("description_contains")
+        try:
+            response = await self.client._rpc("cron.list", {})
+            jobs = response.get("payload", {}).get("jobs", [])
+            if query.predicate == "absent":
+                ok = not jobs
+                return StateQueryResult(
+                    ok=ok,
+                    detail="Correctly absent" if ok else "Cron jobs exist",
+                )
+            if not jobs:
+                return StateQueryResult(ok=False, detail="No cron jobs found")
+            if description_contains and not any(
+                str(description_contains).lower() in json.dumps(job).lower() for job in jobs
+            ):
+                return StateQueryResult(
+                    ok=False,
+                    detail=f"No cron job matched '{description_contains}'",
+                )
+            return StateQueryResult(ok=True, detail="OK")
+        except Exception as exc:
+            return StateQueryResult(ok=False, detail=str(exc))
+
+    # --- arbitrary gateway RPC ---
+
+    async def _verify_gateway(
+        self, query: StateQuery, ctx: AdapterContext
+    ) -> StateQueryResult:
+        method = str(query.selector.get("method", ""))
+        params = dict(query.selector.get("params", {}))
+        assert_path = str(query.selector.get("assert_path", "$"))
+        expected_equals = query.expected.get("equals")
+        expected_contains = query.expected.get("contains")
+        expected_exists = bool(query.expected.get("exists", True))
+        try:
+            response = await self.client._rpc(method, params)
+            payload = response.get("payload", {})
+            value = resolve_json_path(payload, assert_path)
+            if not expected_exists:
+                ok = value is None
+                return StateQueryResult(
+                    ok=ok,
+                    detail="Correctly absent" if ok else "Path exists",
+                )
+            if value is None:
+                return StateQueryResult(
+                    ok=False, detail=f"Path {assert_path} not found"
+                )
+            if expected_equals is not None and value != expected_equals:
+                return StateQueryResult(
+                    ok=False, detail=f"Expected {expected_equals}, got {value}"
+                )
+            if (
+                expected_contains is not None
+                and str(expected_contains).lower() not in str(value).lower()
+            ):
+                return StateQueryResult(
+                    ok=False,
+                    detail=f"Expected '{expected_contains}' in {value}",
+                )
+            return StateQueryResult(ok=True, detail="OK")
+        except Exception as exc:
+            return StateQueryResult(ok=False, detail=str(exc))
+
+
+__all__ = ["OpenClawAdapter", "OpenClawAdapterConfig"]
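Reviewer note: the filename sanitization inside `_realize_memory_seeds` can be checked in isolation. The sketch below extracts the same expression into a standalone helper. The name `safe_memory_key` is hypothetical and does not exist in the adapter.

```python
def safe_memory_key(key: str) -> str:
    """Map a memory seed key to a filesystem-safe stem.

    Hypothetical helper mirroring the inline expression in
    `_realize_memory_seeds`: keep alphanumerics, "-" and "_", replace
    everything else with "_", trim leading/trailing "_", and fall back
    to "seed" when nothing is left.
    """
    cleaned = "".join(
        ch if ch.isalnum() or ch in ("-", "_") else "_"
        for ch in key.strip()
    ).strip("_")
    return cleaned or "seed"


examples = {
    "user/prefs: tz": safe_memory_key("user/prefs: tz"),
    " deploy-target ": safe_memory_key(" deploy-target "),
    "///": safe_memory_key("///"),
}
```

Two distinct keys can collapse to the same stem under this scheme (for example `a/b` and `a:b`), in which case the later seed would overwrite the earlier file under `memory/`.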
--- a/clawbench/canonical/__init__.py
+++ b/clawbench/canonical/__init__.py
@ -0,0 +1,45 @@
+"""Canonical task schema — agent-agnostic intent layer.
+
+Part of ClawBench Phase-4 per CLAWBENCH_V0_4_SPEC.md §"Canonical Task Schema".
+Splits canonical task intent (what to set up, prompt with, and verify) from
+OpenClaw-specific execution details (which become adapter responsibilities).
+
+The existing `TaskDefinition` in `clawbench/schemas.py` stays as-is for
+back-compat; this package adds a canonical view produced by
+`convert.from_task_definition`, which is the single bridge between the two
+shapes. Everything downstream of the harness (scorer, trajectory, judge,
+stats) is already agent-agnostic — those modules consume the transcript +
+TaskRunResult and do not need changes.
+"""
+
+from clawbench.canonical.schema import (
+    AdapterCapability,
+    BudgetSpec,
+    CanonicalAssets,
+    CanonicalPhase,
+    CanonicalTask,
+    Deliverable,
+    InteractionPolicy,
+    SeedEntry,
+    StateQuery,
+    StateQueryKind,
+    StateQueryPredicate,
+    VerifierContract,
+)
+from clawbench.canonical.convert import from_task_definition
+
+__all__ = [
+    "AdapterCapability",
+    "BudgetSpec",
+    "CanonicalAssets",
+    "CanonicalPhase",
+    "CanonicalTask",
+    "Deliverable",
+    "InteractionPolicy",
+    "SeedEntry",
+    "StateQuery",
+    "StateQueryKind",
+    "StateQueryPredicate",
+    "VerifierContract",
+    "from_task_definition",
+]
--- a/clawbench/canonical/convert.py
+++ b/clawbench/canonical/convert.py
@ -0,0 +1,328 @@
+"""Convert `TaskDefinition` → `CanonicalTask`.
+
+This is the single bridge between the existing OpenClaw-entangled task
+format (`clawbench.schemas.TaskDefinition`) and the agent-agnostic
+canonical form (`CanonicalTask`). Callers load tasks as usual via
+`clawbench.tasks.load_all_tasks` and then call
+`from_task_definition(task)` to get the canonical view.
+
+Field mappings (any field not mentioned is copied verbatim):
+
+- `setup.asset_packs`           → `assets.seed_state` (kind="file", asset_pack=...)
+- `setup.workspace_files`       → `assets.workspace_files`
+- `setup.background_services`   → `assets.background_services`
+- `setup.memory_seed`           → `assets.seed_state` (kind="memory")
+- `setup.pre_check_gateway`     → `verifier.pre_run_queries` (GATEWAY_RPC)
+- `completion.files`            → `verifier.file_states`
+- `completion.execution_checks` → `verifier.execution_checks`
+- `completion.memory`           → `verifier.state_queries` (MEMORY)
+- `completion.session`          → `verifier.state_queries` (SESSION)
+- `completion.cron`             → `verifier.state_queries` (CRON)
+- `completion.gateway_assertions` → `verifier.state_queries` (GATEWAY_RPC)
+- `trajectory`                  → `verifier.trajectory`
+- `behavior`                    → `verifier.behavior`
+- `judge`                       → `verifier.judge`
+- `user` / `phases`             → `phases` via `task.normalized_phases()`
+- `timeout_seconds`             → `budgets.timeout_seconds` (also on each phase)
+
+`required_adapter_capabilities` is computed from what the task actually
+needs: always `{FILES, EXECUTION}`, plus `MEMORY`/`SESSION`/`CRON`/
+`GATEWAY_RPC`/`BROWSER`/`MULTI_TURN_INJECTION` when the source task's
+fields trigger those capabilities.
+"""
+
+from __future__ import annotations
+
+from clawbench.canonical.schema import (
+    AdapterCapability,
+    BudgetSpec,
+    CanonicalAssets,
+    CanonicalPhase,
+    CanonicalTask,
+    InteractionPolicy,
+    SeedEntry,
+    StateQuery,
+    VerifierContract,
+)
+from clawbench.schemas import (
+    CronState,
+    GatewayAssertion,
+    MemoryState,
+    SessionState,
+    TaskDefinition,
+    TaskFamily,
+    UserTurn,
+)
+
+
+# ---------------------------------------------------------------------------
+# Seed state
+# ---------------------------------------------------------------------------
+
+
+def _seeds_from_setup(task: TaskDefinition) -> list[SeedEntry]:
+    seeds: list[SeedEntry] = []
+    for pack in task.setup.asset_packs:
+        seeds.append(SeedEntry(kind="file", asset_pack=pack))
+    for entry in task.setup.memory_seed:
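+        # memory_seed entries are free-form dicts in the existing schema;
+        # we preserve them verbatim in `metadata` and surface `key` +
+        # `content` when present so adapters can consume the structured
+        # pieces without re-parsing. Note that `SeedEntry` rejects an
+        # empty `key` for kind="memory", so an entry without "key" fails
+        # validation here instead of being silently dropped.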
+        seeds.append(
+            SeedEntry(
+                kind="memory",
+                key=str(entry.get("key", "")),
+                content=entry.get("value") or entry.get("content"),
+                metadata=dict(entry),
+            )
+        )
+    return seeds
+
+
+# ---------------------------------------------------------------------------
+# State queries: memory / session / cron / gateway_assertions
+# ---------------------------------------------------------------------------
+
+
+def _memory_state_to_query(state: MemoryState) -> StateQuery:
+    expected: dict[str, object] = {}
+    if state.value_contains:
+        expected["value_contains"] = list(state.value_contains)
+    return StateQuery(
+        kind="memory",
+        predicate="exists" if state.exists else "absent",
+        selector={"key_pattern": state.key_pattern},
+        expected=expected,
+        required_capability=AdapterCapability.MEMORY,
+        description=f"memory key ~ /{state.key_pattern}/",
+    )
+
+
+def _session_state_to_query(state: SessionState) -> StateQuery:
+    expected: dict[str, object] = {}
+    if state.model_should_be:
+        expected["model"] = state.model_should_be
+    return StateQuery(
+        kind="session",
+        predicate="exists" if state.should_exist else "absent",
+        selector={},
+        expected=expected,
+        required_capability=AdapterCapability.SESSION,
+        description="session state",
+    )
+
+
+def _cron_state_to_query(state: CronState) -> StateQuery:
+    selector: dict[str, object] = {}
+    if state.description_contains:
+        selector["description_contains"] = state.description_contains
+    return StateQuery(
+        kind="cron",
+        predicate="exists" if state.exists else "absent",
+        selector=selector,
+        expected={},
+        required_capability=AdapterCapability.CRON,
+        description="cron schedule",
+    )
+
+
+def _gateway_assertion_to_query(assertion: GatewayAssertion) -> StateQuery:
+    selector: dict[str, object] = {
+        "method": assertion.method,
+        "params": dict(assertion.params),
+        "assert_path": assertion.assert_path,
+    }
+    expected: dict[str, object] = {}
+    if assertion.assert_equals is not None:
+        expected["equals"] = assertion.assert_equals
+    if assertion.assert_contains is not None:
+        expected["contains"] = assertion.assert_contains
+    expected["exists"] = assertion.assert_exists
+    predicate = "exists"
+    if assertion.assert_equals is not None:
+        predicate = "equals"
+    elif assertion.assert_contains is not None:
+        predicate = "contains"
+    elif not assertion.assert_exists:
+        predicate = "absent"
+    return StateQuery(
+        kind="custom",
+        predicate=predicate,
+        selector=selector,
+        expected=expected,
+        required_capability=AdapterCapability.GATEWAY_RPC,
+        description=f"gateway rpc: {assertion.method}",
+    )
+
+
+def _state_queries_from_completion(task: TaskDefinition) -> list[StateQuery]:
+    queries: list[StateQuery] = []
+    for mem in task.completion.memory:
+        queries.append(_memory_state_to_query(mem))
+    if task.completion.session is not None:
+        queries.append(_session_state_to_query(task.completion.session))
+    for cron in task.completion.cron:
+        queries.append(_cron_state_to_query(cron))
+    for assertion in task.completion.gateway_assertions:
+        queries.append(_gateway_assertion_to_query(assertion))
+    return queries
+
+
+def _pre_run_queries_from_setup(task: TaskDefinition) -> list[StateQuery]:
+    return [_gateway_assertion_to_query(a) for a in task.setup.pre_check_gateway]
+
+
+# ---------------------------------------------------------------------------
+# Phases + dynamic-turn detection
+# ---------------------------------------------------------------------------
+
+
+_DYNAMIC_TURN_FIELDS = (
+    "when_tool_family",
+    "when_tool_name",
+    "when_assistant_contains",
+    "when_last_tool_failed",
+)
+
+
+def _turn_is_dynamic(turn: UserTurn) -> bool:
+    # A turn is dynamic when any trigger predicate is set (non-empty or True).
+    return any(getattr(turn, name, None) for name in _DYNAMIC_TURN_FIELDS)
+
+
+def _phases_from_task(task: TaskDefinition) -> tuple[list[CanonicalPhase], bool]:
+    phases: list[CanonicalPhase] = []
+    any_dynamic = False
+    for phase in task.normalized_phases():
+        phases.append(
+            CanonicalPhase(
+                name=phase.name,
+                user=phase.user,
+                timeout_seconds=phase.timeout_seconds,
+            )
+        )
+        if len(phase.user.turns) > 1 or any(_turn_is_dynamic(t) for t in phase.user.turns):
+            any_dynamic = True
+    return phases, any_dynamic
+
+
+# ---------------------------------------------------------------------------
+# Capability inference
+# ---------------------------------------------------------------------------
+
+
+def _capabilities_for_task(task: TaskDefinition, *, uses_dynamic: bool) -> set[AdapterCapability]:
+    caps: set[AdapterCapability] = {AdapterCapability.FILES, AdapterCapability.EXECUTION}
+    if task.completion.memory or any(seed.get("key") for seed in task.setup.memory_seed):
+        caps.add(AdapterCapability.MEMORY)
+    if task.completion.session is not None:
+        caps.add(AdapterCapability.SESSION)
+    if task.completion.cron:
+        caps.add(AdapterCapability.CRON)
+    if task.completion.gateway_assertions or task.setup.pre_check_gateway:
+        caps.add(AdapterCapability.GATEWAY_RPC)
+    if task.family == TaskFamily.BROWSER:
+        caps.add(AdapterCapability.BROWSER)
+    if uses_dynamic:
+        caps.add(AdapterCapability.MULTI_TURN_INJECTION)
+    return caps
+
+
+# ---------------------------------------------------------------------------
+# Public entry point
+# ---------------------------------------------------------------------------
+
+
+def from_task_definition(task: TaskDefinition) -> CanonicalTask:
+    """Produce the canonical view of a legacy `TaskDefinition`.
+
+    This is lossless for fields that have a canonical equivalent.
+    OpenClaw-only constructs (gateway_assertions, pre_check_gateway,
+    memory_seed) become `StateQuery` entries / `SeedEntry` entries
+    tagged with the capability an adapter needs to resolve them.
+    """
+
+    phases, any_dynamic = _phases_from_task(task)
+
+    assets = CanonicalAssets(
+        workspace_files=list(task.setup.workspace_files),
+        background_services=list(task.setup.background_services),
+        seed_state=_seeds_from_setup(task),
+    )
+
+    verifier = VerifierContract(
+        file_states=list(task.completion.files),
+        execution_checks=list(task.completion.execution_checks),
+        state_queries=_state_queries_from_completion(task),
+        pre_run_queries=_pre_run_queries_from_setup(task),
+        trajectory=task.trajectory,
+        behavior=task.behavior,
+        judge=task.judge,
+    )
+
+    interaction = InteractionPolicy(
+        max_turns=max((phase.user.max_turns for phase in phases), default=20),
+        allow_multi_phase=len(phases) > 1,
+        uses_dynamic_user_triggers=any_dynamic,
+    )
+
+    budgets = BudgetSpec(timeout_seconds=task.timeout_seconds)
+
+    capabilities = _capabilities_for_task(task, uses_dynamic=any_dynamic)
+
+    return CanonicalTask(
+        id=task.id,
+        name=task.name,
+        tier=task.tier,
+        family=task.family,
+        surface=task.surface,
+        scenario=task.scenario,
+        subscenario=task.subscenario,
+        capabilities=list(task.capabilities),
+        atomic_capabilities=list(task.atomic_capabilities),
+        pool=task.pool,
+        subsets=list(task.subsets),
+        variant_group=task.variant_group,
+        variant_id=task.variant_id,
+        template_id=task.template_id,
+        release_id=task.release_id,
+        source_kind=task.source_kind,
+        provenance_ids=list(task.provenance_ids),
+        privacy_tier=task.privacy_tier,
+        contamination_risk=task.contamination_risk,
+        freshness_epoch=task.freshness_epoch,
+        category=task.category,
+        domain=task.domain,
+        functionality=list(task.functionality),
+        trace_distribution=list(task.trace_distribution),
+        tool_surface=list(task.tool_surface),
+        risk_tags=list(task.risk_tags),
+        first_used_at=task.first_used_at,
+        retire_after_runs=task.retire_after_runs,
+        similarity_hash=task.similarity_hash,
+        canary_token=task.canary_token,
+        official=task.official,
+        query_difficulty=task.query_difficulty,
+        query_weight=task.query_weight,
+        artifact_type=task.artifact_type,
+        preconditions=list(task.preconditions),
+        source_dataset=task.source_dataset,
+        prompt_variants=list(task.prompt_variants),
+        pass_threshold=task.pass_threshold,
+        assets=assets,
+        phases=phases,
+        verifier=verifier,
+        budgets=budgets,
+        interaction=interaction,
+        deliverables=[],
+        required_adapter_capabilities=capabilities,
+    )
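
The predicate selection in `_gateway_assertion_to_query` follows a fixed precedence: `equals`, then `contains`, then `absent`, then `exists`. A minimal sketch of that rule is below. `FakeAssertion` and `pick_predicate` are hypothetical stand-ins for illustration, not part of the codebase, and the field defaults are assumptions rather than the real `GatewayAssertion` defaults.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class FakeAssertion:
    """Hypothetical stand-in for GatewayAssertion (assert_* fields only)."""

    assert_equals: Any = None
    assert_contains: Any = None
    assert_exists: bool = True  # assumed default, for illustration


def pick_predicate(assertion: FakeAssertion) -> str:
    # Same precedence as the converter: equals > contains > absent > exists.
    if assertion.assert_equals is not None:
        return "equals"
    if assertion.assert_contains is not None:
        return "contains"
    if not assertion.assert_exists:
        return "absent"
    return "exists"


if __name__ == "__main__":
    for candidate in (
        FakeAssertion(),
        FakeAssertion(assert_exists=False),
        FakeAssertion(assert_contains="ok", assert_exists=False),
        FakeAssertion(assert_equals=0, assert_contains="ok"),
    ):
        print(candidate, "->", pick_predicate(candidate))
```

Because the checks use `is not None`, a falsy expected value such as `0` or `""` still selects `equals` or `contains`.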
--- a/clawbench/canonical/schema.py
+++ b/clawbench/canonical/schema.py
@ -0,0 +1,296 @@
+"""Canonical task schema — agent-agnostic intent.
+
+This is the Phase-4 split of `TaskDefinition` (see CLAWBENCH_V0_4_SPEC.md
+§"Canonical Task Schema"). The canonical layer expresses **what** a task
+is — its identity, prompts, assets, and verification contract — without
+saying **how** it gets executed. The "how" (gateway RPCs, session
+lifecycle, tool-family normalization) lives in per-adapter code under
+`clawbench/adapters/`.
+
+The rule of thumb:
+
+- If a field describes what the user asked for, what files/state the
+  agent is expected to produce, or what the run must satisfy to pass,
+  it belongs here.
+- If a field describes how OpenClaw's gateway is called to drive the
+  run or read back state, it belongs in the OpenClaw adapter (and the
+  canonical version of that check is a `StateQuery` with a
+  `required_capability`).
+
+Converting from `TaskDefinition` → `CanonicalTask` is lossless for fields
+that have a canonical equivalent; OpenClaw-only fields (like
+`pre_check_gateway` and `gateway_assertions`) survive as `StateQuery`
+entries tagged with `AdapterCapability.GATEWAY_RPC`, so adapters that
+support them can still resolve them while adapters that don't can cleanly
+report a capability gap.
+"""
+
+from __future__ import annotations
+
+import enum
+from typing import Any, Literal
+
+from pydantic import BaseModel, Field, model_validator
+
+from clawbench.schemas import (
+    ArtifactType,
+    BackgroundService,
+    BehaviorExpectations,
+    CapabilityTag,
+    ExecutionCheck,
+    FileState,
+    JudgeExpectations,
+    PromptVariant,
+    QueryDifficulty,
+    ScenarioDomain,
+    SimulatedUser,
+    TaskFamily,
+    TaskPool,
+    TaskSubset,
+    Tier,
+    TrajectoryExpectations,
+)
+
+
+class AdapterCapability(str, enum.Enum):
+    """What an adapter is able to provide to a running task.
+
+    Each `StateQuery` declares a `required_capability`. If the selected
+    adapter's `capabilities` set does not include that capability, the
+    harness either skips the task entirely (strict mode) or scores the
+    query as neutral (partial mode). This keeps the leaderboard honest
+    about what an adapter can actually evaluate.
+    """
+
+    FILES = "files"
+    EXECUTION = "execution"
+    MEMORY = "memory"
+    SESSION = "session"
+    CRON = "cron"
+    BROWSER = "browser"
+    GATEWAY_RPC = "gateway_rpc"
+    # The adapter can deliver additional user turns mid-trajectory in
+    # response to simulated-user triggers (when_tool_family,
+    # when_assistant_contains, etc.). Single-shot drivers like Hermes's
+    # MiniSWERunner do not provide this.
+    MULTI_TURN_INJECTION = "multi_turn_injection"
+
+
+StateQueryKind = Literal["memory", "session", "cron", "custom"]
+StateQueryPredicate = Literal["exists", "absent", "equals", "contains"]
+
+
+class StateQuery(BaseModel):
+    """An abstract state assertion resolved by the active adapter.
+
+    The canonical layer does not commit to how the state is read. For
+    example, a `kind="memory"` query with `selector={"key_pattern":"alpha"}`
+    and `expected={"value_contains":["foo"]}` means "there is a memory
+    entry whose key matches /alpha/ and whose value contains 'foo'".
+    OpenClaw's adapter resolves that against the `memory.search` gateway
+    RPC; a filesystem-memory adapter (e.g. Hermes) resolves it by
+    scanning `MEMORY.md` / `memory/notes.md` in the workspace.
+
+    The `required_capability` is what the harness checks against the
+    adapter's declared capability set.
+    """
+
+    kind: StateQueryKind
+    predicate: StateQueryPredicate = "exists"
+    selector: dict[str, Any] = Field(default_factory=dict)
+    expected: dict[str, Any] = Field(default_factory=dict)
+    required_capability: AdapterCapability
+    description: str = ""
+
+
+class SeedEntry(BaseModel):
+    """A single piece of pre-task state to seed into the workspace.
+
+    `kind="file"`: the adapter writes `content` (or copies a bundled
+    asset via `asset_pack`) to `path` inside the workspace.
+    `kind="memory"`: the adapter seeds a memory entry with `key` and
+    `content`. Adapters without memory support fall back to writing
+    the seed as a file (see `environment_files.verify_memory_fallback`).
+    """
+
+    kind: Literal["file", "memory"]
+    path: str | None = None
+    content: str | None = None
+    key: str | None = None
+    asset_pack: str = ""
+    metadata: dict[str, Any] = Field(default_factory=dict)
+
+    @model_validator(mode="after")
+    def _validate_shape(self) -> SeedEntry:
+        if self.kind == "file" and not self.path and not self.asset_pack:
+            raise ValueError("SeedEntry(kind='file') requires `path` or `asset_pack`.")
+        if self.kind == "memory" and not self.key:
+            raise ValueError("SeedEntry(kind='memory') requires `key`.")
+        return self
+
+
+class Deliverable(BaseModel):
+    """A user-visible artifact the task is expected to produce."""
+
+    kind: ArtifactType
+    paths: list[str] = Field(default_factory=list)
+    description: str = ""
+
+
+class BudgetSpec(BaseModel):
+    """Per-task execution budgets.
+
+    `timeout_seconds` is the wall-clock limit for the full run (all phases).
+    `max_tool_calls=0` means unbounded within the timeout. Adapters are
+    expected to honor these as soft caps; the harness will also enforce
+    the timeout as a hard deadline.
+    """
+
+    timeout_seconds: int = 180
+    max_tool_calls: int = 0
+    per_turn_timeout_seconds: int = 0
+
+
+class InteractionPolicy(BaseModel):
+    """How the canonical phases drive the agent."""
+
+    max_turns: int = 20
+    allow_multi_phase: bool = True
+    # Declares that the task's simulated user sends follow-up turns
+    # based on trajectory triggers (not just counts). Adapters without
+    # MULTI_TURN_INJECTION cannot deliver these dynamically.
+    uses_dynamic_user_triggers: bool = False
+
+
+class VerifierContract(BaseModel):
+    """Everything needed to score a run, independent of how it ran.
+
+    The file/execution halves are fully agent-agnostic — `environment_files`
+    evaluates them against the workspace directly. State queries are
+    resolved by `adapter.verify_state_query`. Trajectory and behavior
+    expectations are evaluated against the `Transcript` (already agent-
+    agnostic). The optional judge rubric is evaluated against artifacts
+    + transcript + completion feedback.
+    """
+
+    file_states: list[FileState] = Field(default_factory=list)
+    execution_checks: list[ExecutionCheck] = Field(default_factory=list)
+    state_queries: list[StateQuery] = Field(default_factory=list)
+    pre_run_queries: list[StateQuery] = Field(default_factory=list)
+    trajectory: TrajectoryExpectations = Field(default_factory=TrajectoryExpectations)
+    behavior: BehaviorExpectations = Field(default_factory=BehaviorExpectations)
+    judge: JudgeExpectations | None = None
+
+
+class CanonicalAssets(BaseModel):
+    """Workspace + seed state the harness realizes before phases run.
+
+    `workspace_files` is a list of relative paths (resolved against the
+    task's assets/ dir) to copy into the workspace. `background_services`
+    is already canonical (subprocess + readiness probe, no OpenClaw
+    coupling). `seed_state` replaces `asset_packs` + `memory_seed` with
+    a uniform per-entry list.
+    """
+
+    workspace_files: list[str] = Field(default_factory=list)
+    background_services: list[BackgroundService] = Field(default_factory=list)
+    seed_state: list[SeedEntry] = Field(default_factory=list)
+
+
+class CanonicalPhase(BaseModel):
+    """One simulated-user phase in a multi-phase task.
+
+    `user` is reused verbatim from `clawbench.schemas.SimulatedUser` —
+    it is already agent-agnostic (turn text + canonical trigger
+    predicates). Whether a specific trigger fires on a given adapter
+    depends on whether tool-family tags are populated, which is an
+    adapter responsibility.
+    """
+
+    name: str
+    user: SimulatedUser
+    timeout_seconds: int | None = None
+
+
+class CanonicalTask(BaseModel):
+    """Agent-agnostic task definition.
+
+    Produced by `convert.from_task_definition` from an existing
+    `TaskDefinition`. Consumed by adapters via `AdapterContext` and by
+    the scorer + trajectory/judge layers. No field here is OpenClaw-
+    specific; OpenClaw-only semantics survive as `StateQuery` entries
+    with `required_capability=GATEWAY_RPC`.
+    """
+
+    # Identity and taxonomy (already canonical in TaskDefinition).
+    id: str
+    name: str
+    tier: Tier
+    family: TaskFamily
+    surface: str
+    scenario: ScenarioDomain | None = None
+    subscenario: str = ""
+    capabilities: list[CapabilityTag] = Field(default_factory=list)
+    atomic_capabilities: list[str] = Field(default_factory=list)
+
+    # Pool / rotation / provenance.
+    pool: TaskPool = TaskPool.PUBLIC_DEV
+    subsets: list[TaskSubset] = Field(default_factory=list)
+    variant_group: str = ""
+    variant_id: str = "main"
+    template_id: str = ""
+    release_id: str = ""
+    source_kind: str = ""
+    provenance_ids: list[str] = Field(default_factory=list)
+    privacy_tier: str = ""
+    contamination_risk: str = ""
+    freshness_epoch: str = ""
+    category: str = ""
+    domain: str = ""
+    functionality: list[str] = Field(default_factory=list)
+    trace_distribution: list[str] = Field(default_factory=list)
+    tool_surface: list[str] = Field(default_factory=list)
+    risk_tags: list[str] = Field(default_factory=list)
+    first_used_at: str = ""
+    retire_after_runs: int = 0
+    similarity_hash: str = ""
+    canary_token: str = ""
+    official: bool = False
+
+    # Policy + prompts.
+    query_difficulty: QueryDifficulty | None = None
+    query_weight: float = 1.0
+    artifact_type: ArtifactType | None = None
+    preconditions: list[str] = Field(default_factory=list)
+    source_dataset: str = ""
+    prompt_variants: list[PromptVariant] = Field(default_factory=lambda: [PromptVariant.CLEAR])
+    pass_threshold: float = 0.7
+
+    # Canonical body.
+    assets: CanonicalAssets = Field(default_factory=CanonicalAssets)
+    phases: list[CanonicalPhase]
+    verifier: VerifierContract = Field(default_factory=VerifierContract)
+    budgets: BudgetSpec = Field(default_factory=BudgetSpec)
+    interaction: InteractionPolicy = Field(default_factory=InteractionPolicy)
+    deliverables: list[Deliverable] = Field(default_factory=list)
+
+    # Adapter gating.
+    required_adapter_capabilities: set[AdapterCapability] = Field(default_factory=set)
+
+    # Forward-compat: lets us evolve this schema while hidden / external
+    # task manifests continue to validate.
+    schema_version: str = "1"
+
+    @model_validator(mode="after")
+    def _defaults(self) -> CanonicalTask:
+        if not self.variant_group:
+            self.variant_group = self.id
+        if not self.prompt_variants:
+            self.prompt_variants = [PromptVariant.CLEAR]
+        else:
+            deduped: list[PromptVariant] = []
+            for variant in self.prompt_variants:
+                if variant not in deduped:
+                    deduped.append(variant)
+            self.prompt_variants = deduped
+        return self
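
The `AdapterCapability` docstring describes two gating modes: strict mode skips a task whose required capabilities the adapter lacks, and partial mode runs it while treating the unresolvable queries as neutral. The sketch below shows one way such a check could look. `Cap` is a trimmed copy of the enum, and `gate` with its return labels is hypothetical; the real harness API is not shown in this diff.

```python
import enum


class Cap(str, enum.Enum):
    """Trimmed, illustrative copy of AdapterCapability."""

    FILES = "files"
    EXECUTION = "execution"
    MEMORY = "memory"
    GATEWAY_RPC = "gateway_rpc"


def gate(required: set, provided: set, *, strict: bool) -> tuple:
    """Return (decision, missing) for a task against an adapter.

    The decision labels are hypothetical: "run" when nothing is missing,
    "skip" in strict mode, "run_partial" otherwise.
    """
    missing = required - provided
    if not missing:
        return "run", missing
    return ("skip" if strict else "run_partial"), missing


if __name__ == "__main__":
    required = {Cap.FILES, Cap.EXECUTION, Cap.GATEWAY_RPC}
    file_only_adapter = {Cap.FILES, Cap.EXECUTION, Cap.MEMORY}
    print(gate(required, file_only_adapter, strict=True))
    print(gate(required, file_only_adapter, strict=False))
```

Computing `required_adapter_capabilities` once at conversion time means a check like this needs only a set difference, without re-inspecting the task body.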
--- a/clawbench/cli.py
+++ b/clawbench/cli.py
@ -10,22 +10,10 @@ from pathlib import Path
 import click

 from clawbench.client import GatewayConfig
-from clawbench.harness import BenchmarkHarness
+from clawbench.harness import BenchmarkHarness, KNOWN_ADAPTERS
+from clawbench.schemas import ScenarioDomain

-SCENARIO_CHOICES = [
-    "file_system_ops",
-    "web_info_ops",
-    "calendar_reminders",
-    "communication_messaging",
-    "data_processing_analysis",
-    "coding_dev_assist",
-    "personal_life_assistant",
-    "multi_step_compound",
-    "context_continuation",
-    "error_boundary_cases",
-    "skill_calling",
-    "system_capabilities",
-]
+SCENARIO_CHOICES = [scenario.value for scenario in ScenarioDomain]


@click.group()
@ -41,6 +29,13 @@ def cli(verbose: bool) -> None:

@cli.command()
@click.option("--model", "-m", required=True, help="Model to benchmark")
+@click.option(
+    "--adapter",
+    type=click.Choice(KNOWN_ADAPTERS),
+    default="openclaw",
+    show_default=True,
+    help="Agent harness adapter. OpenClaw is executable today; other adapters are tracked targets.",
+)
@click.option("--gateway-token", envvar="OPENCLAW_GATEWAY_TOKEN", default="", help="Gateway auth token")
@click.option(
    "--judge-model",
@ -48,7 +43,13 @@ def cli(verbose: bool) -> None:
    default="",
    help="Optional advisory LLM judge model (does not affect official score)",
 )
-@click.option("--runs", "-n", default=5, help="Runs per task (reliability uses all runs)")
+@click.option(
+    "--judge-affects-score",
+    is_flag=True,
+    envvar="CLAWBENCH_JUDGE_AFFECTS_SCORE",
+    help="Opt in to experimental judge-weighted scoring. Official scoring keeps judge advisory.",
+)
+@click.option("--runs", "-n", default=3, show_default=True, help="Runs per task (reliability uses all runs)")
@click.option("--tier", type=click.Choice(["tier1", "tier2", "tier3", "tier4", "tier5"]), help="Filter tier")
@click.option("--scenario", type=click.Choice(SCENARIO_CHOICES), help="Filter query scenario")
@click.option("--artifact-type", type=click.Choice(["file", "information", "operation", "code", "external_action", "memory", "automation", "mixed"]), help="Filter expected artifact type")
@ -116,10 +117,17 @@ def cli(verbose: bool) -> None:
    show_default=True,
    help="Where to write ecosystem insight files after a --profile run.",
 )
+@click.option(
+    "--dynamics",
+    is_flag=True,
+    help="Run quick post-benchmark dynamics analysis. Prefer dynamics-report for offline cache/archive analysis.",
+)
 def run(
    model: str,
+    adapter: str,
    gateway_token: str,
    judge_model: str,
+    judge_affects_score: bool,
    runs: int,
    tier: str | None,
    scenario: str | None,
@ -137,12 +145,15 @@ def run(
    browser_concurrency: int,
    profile: Path | None,
    insights_dir: Path,
+    dynamics: bool,
 ) -> None:
    gateway_config = GatewayConfig(token=gateway_token)
    harness = BenchmarkHarness(
        gateway_config=gateway_config,
        model=model,
+        adapter=adapter,
        judge_model=judge_model,
+        judge_affects_score=judge_affects_score,
        runs_per_task=runs,
        tier=tier,
        scenario=scenario,
@ -165,10 +176,14 @@ def run(
        json.dump(result.model_dump(), handle, indent=2)
    click.echo(f"\nResults saved to {out_path}")

+    if dynamics:
+        _run_dynamics_analysis(harness.last_task_runs, out_path)
+
    if profile is not None:
        _run_v05_diagnostic(
            profile_path=profile,
            result=result,
+            task_runs=harness.last_task_runs,
            runs_per_task=runs,
            insights_dir=insights_dir,
        )
@ -179,10 +194,88 @@ def run(
        asyncio.run(upload_result(result))


+@cli.command("dynamics-report")
+@click.option(
+    "--archive-dir",
+    type=click.Path(exists=True, file_okay=False, path_type=Path),
+    required=True,
+    help="Path to a run cache/archive root or a single model cache directory.",
+)
+@click.option(
+    "--model",
+    default=None,
+    help="Model id to select when the archive root contains multiple model directories.",
+)
+@click.option("--tier", type=click.Choice(["tier1", "tier2", "tier3", "tier4", "tier5"]))
+@click.option("--task", "task_ids", multiple=True, help="Specific task IDs to include from the archive.")
+@click.option(
+    "--output-dir",
+    type=click.Path(path_type=Path),
+    default=Path("results/offline_dynamics"),
+    show_default=True,
+    help="Directory where dynamics.json and plots will be written.",
+)
+@click.option(
+    "--no-plots",
+    is_flag=True,
+    help="Write only dynamics.json and skip plot rendering.",
+)
+def dynamics_report(
+    archive_dir: Path,
+    model: str | None,
+    tier: str | None,
+    task_ids: tuple[str, ...],
+    output_dir: Path,
+    no_plots: bool,
+) -> None:
+    """Generate dynamics plots and a JSON report from cached TaskRunResult archives."""
+    from clawbench.dynamics_archive import load_task_runs_archive
+
+    try:
+        task_runs = load_task_runs_archive(
+            archive_dir=archive_dir,
+            model=model,
+            task_ids=task_ids,
+            tier=tier,
+        )
+    except ValueError as exc:
+        raise click.ClickException(str(exc)) from exc
+
+    if not task_runs:
+        raise click.ClickException(f"No cached runs found under {archive_dir}")
+
+    report_path, plots, n_runs = _write_dynamics_report(
+        task_runs,
+        output_dir,
+        generate_plots=not no_plots,
+    )
+    click.echo(f"Loaded {n_runs} cached runs across {len(task_runs)} tasks")
+    click.echo(f"Dynamics report saved to {report_path}")
+    click.echo(f"Saved {len(plots)} plots to {output_dir}/")
+
+
+def _write_dynamics_report(
+    task_runs: dict[str, list],
+    output_dir: Path,
+    *,
+    generate_plots: bool = True,
+) -> tuple[Path, list[Path], int]:
+    from clawbench.dynamics_archive import write_dynamics_report
+
+    report_path, plots = write_dynamics_report(
+        task_runs,
+        output_dir,
+        generate_plots=generate_plots,
+    )
+    n_runs = sum(len(runs) for runs in task_runs.values())
+    return report_path, plots, n_runs
+
+
 def _run_v05_diagnostic(
    *,
    profile_path: Path,
    result,
+    task_runs: dict[str, list] | None,
    runs_per_task: int,
    insights_dir: Path,
 ) -> None:
@ -192,6 +285,7 @@ def _run_v05_diagnostic(
        DEFAULT_MANIFEST_DIR,
        DEFAULT_SUBMISSIONS_DIR,
        ensure_data_dirs,
+        infer_registration_traces_from_manifests,
        load_manifests,
        write_submission_record,
    )
@ -205,6 +299,7 @@ def _run_v05_diagnostic(
    plugin_profile = PluginProfile.from_yaml_file(profile_path)
    plugin_ids = [e.id for e in plugin_profile.plugins]
    manifests = load_manifests(DEFAULT_MANIFEST_DIR, plugin_ids)
+    traces = infer_registration_traces_from_manifests(plugin_profile, manifests)
    db = HistoricalDatabase(path=DEFAULT_DB_PATH)

    # Extract per-task scores + tier map from the BenchmarkResult
@ -215,12 +310,16 @@ def _run_v05_diagnostic(
        if getattr(task_stats, "tier", ""):
            tier_of[task_stats.task_id] = task_stats.tier

+    transcripts = _merge_task_transcripts_from_runs(task_runs or {})
+
    diagnostic = submit_run(
        profile=plugin_profile,
        manifests=manifests,
        db=db,
        actual_overall_score=float(result.overall_score),
        actual_per_task_scores=actual_per_task,
+        traces=traces,
+        transcripts=transcripts,
        tier_of=tier_of or None,
        n_runs_contributing=runs_per_task,
    )
@ -243,6 +342,22 @@ def _run_v05_diagnostic(
    )


+def _merge_task_transcripts_from_runs(task_runs: dict[str, list]):
+    """Merge all run transcripts per task for the v0.5 utilization audit."""
+    if not task_runs:
+        return None
+    from clawbench.schemas import Transcript
+
+    merged: dict[str, Transcript] = {}
+    for task_id, runs in task_runs.items():
+        transcript = Transcript()
+        for run in runs:
+            transcript.messages.extend(getattr(run.transcript, "messages", []))
+        if transcript.messages:
+            merged[task_id] = transcript
+    return merged or None
+
+
@cli.command()
@click.argument("profile", type=click.Path(exists=True, path_type=Path))
@click.option(
@ -693,5 +808,23 @@ def show(result_file: str) -> None:
        )


+def _run_dynamics_analysis(
+    task_runs: dict[str, list],
+    result_path: str,
+) -> None:
+    """Compute stratified dynamics from raw TaskRunResult objects."""
+    run_stem = Path(result_path).stem
+    dyn_dir = Path(result_path).parent / f"{run_stem}_dynamics"
+    try:
+        dyn_path, plots, n_runs = _write_dynamics_report(task_runs, dyn_dir)
+    except ValueError as exc:
+        click.echo(str(exc))
+        return
+
+    click.echo(f"\n[dynamics] Analysed {n_runs} cached runs")
+    click.echo(f"  Dynamics report saved to {dyn_path}")
+    click.echo(f"  Saved {len(plots)} plots to {dyn_dir}/")
+
+
 def main() -> None:
    cli()
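
For orientation, `_run_dynamics_analysis` writes its output to a directory named after the results file, next to that file. The snippet below reproduces only that path rule. `dynamics_dir_for` is a hypothetical helper name and the sample path is made up.

```python
from pathlib import Path


def dynamics_dir_for(result_path: str) -> Path:
    # Mirrors the rule in _run_dynamics_analysis: <parent>/<stem>_dynamics.
    path = Path(result_path)
    return path.parent / f"{path.stem}_dynamics"


if __name__ == "__main__":
    # Hypothetical results file, for illustration only.
    print(dynamics_dir_for("results/run_01.json"))
```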
--- a/clawbench/client.py
+++ b/clawbench/client.py
@ -8,7 +8,9 @@ import logging
 import math
 import os
 import re
+import shutil
 import subprocess
+import sys
 import uuid
 from dataclasses import dataclass, field
 from typing import Any
@ -24,10 +26,10 @@ logger = logging.getLogger(__name__)

 PROTOCOL_VERSION = 3
 DEVICE_IDENTITY_HELPER_JS = r"""
-const crypto = require("node:crypto");
-const fs = require("node:fs");
-const os = require("node:os");
-const path = require("node:path");
+const crypto = require("crypto");
+const fs = require("fs");
+const os = require("os");
+const path = require("path");

 const ED25519_SPKI_PREFIX = Buffer.from("302a300506032b6570032100", "hex");

@ -52,7 +54,7 @@ function fingerprintPublicKey(publicKeyPem) {
 }

 function generateIdentity() {
-  const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519");
+  const { publicKey, privateKey } = crypto.generateKeyPairSync("ed25519", {});
  const publicKeyPem = publicKey.export({ type: "spki", format: "pem" }).toString();
  const privateKeyPem = privateKey.export({ type: "pkcs8", format: "pem" }).toString();
  return {
@ -224,14 +226,73 @@ class GatewayClient:
            attempt += 1
            try:
                remaining = max(1.0, deadline - asyncio.get_running_loop().time())
+                attempt_timeout = min(30.0, remaining)
                self._ws = await websockets.connect(
                    self.config.url,
                    max_size=10 * 1024 * 1024,
-                    open_timeout=min(self.config.connect_timeout, remaining),
+                    open_timeout=attempt_timeout,
                    additional_headers={"Origin": host},
                )
-                break
+                self._listen_task = asyncio.create_task(self._listener())
+                challenge = await self._wait_event(
+                    "connect.challenge", timeout=attempt_timeout
+                )
+                challenge_payload = challenge.get("payload", {})
+                nonce = ""
+                if isinstance(challenge_payload, dict):
+                    raw_nonce = challenge_payload.get("nonce", "")
+                    if isinstance(raw_nonce, str):
+                        nonce = raw_nonce.strip()
+
+                role = "operator"
+                scopes = [
+                    "operator.admin",
+                    "operator.read",
+                    "operator.write",
+                    "operator.approvals",
+                    "operator.pairing",
+                ]
+                client_info = {
+                    "id": "openclaw-control-ui",
+                    "version": __version__,
+                    "platform": "linux",
+                    "mode": "ui",
+                }
+                connect_params: dict[str, Any] = {
+                    "minProtocol": PROTOCOL_VERSION,
+                    "maxProtocol": PROTOCOL_VERSION,
+                    "client": client_info,
+                    "role": role,
+                    "scopes": scopes,
+                    "caps": [],
+                    "commands": [],
+                    "permissions": {},
+                    "auth": {"token": self.config.token} if self.config.token else {},
+                }
+                device = _build_connect_device(
+                    nonce=nonce,
+                    token=self.config.token,
+                    client_id=str(client_info["id"]),
+                    client_mode=str(client_info["mode"]),
+                    role=role,
+                    scopes=scopes,
+                    platform=str(client_info["platform"]),
+                )
+                if device:
+                    connect_params["device"] = device
+
+                response = await self._rpc(
+                    "connect",
+                    connect_params,
+                    timeout=attempt_timeout,
+                )
+                payload = response.get("payload", {})
+                if payload.get("type") != "hello-ok":
+                    raise ConnectionError(f"Expected hello-ok, got: {payload}")
+                logger.info("Connected to gateway (protocol v%s)", payload.get("protocol", "?"))
+                return
            except Exception as exc:
+                await self.close()
                if not _is_transient_gateway_connect_error(exc):
                    raise
                if asyncio.get_running_loop().time() >= deadline:
@ -243,60 +304,6 @@ class GatewayClient:
                    delay,
                )
                await asyncio.sleep(delay)
-        self._listen_task = asyncio.create_task(self._listener())
-        challenge = await self._wait_event("connect.challenge", timeout=self.config.connect_timeout)
-        challenge_payload = challenge.get("payload", {})
-        nonce = ""
-        if isinstance(challenge_payload, dict):
-            raw_nonce = challenge_payload.get("nonce", "")
-            if isinstance(raw_nonce, str):
-                nonce = raw_nonce.strip()
-
-        role = "operator"
-        scopes = [
-            "operator.admin",
-            "operator.read",
-            "operator.write",
-            "operator.approvals",
-            "operator.pairing",
-        ]
-        client_info = {
-            "id": "openclaw-control-ui",
-            "version": __version__,
-            "platform": "linux",
-            "mode": "ui",
-        }
-        connect_params: dict[str, Any] = {
-            "minProtocol": PROTOCOL_VERSION,
-            "maxProtocol": PROTOCOL_VERSION,
-            "client": client_info,
-            "role": role,
-            "scopes": scopes,
-            "caps": [],
-            "commands": [],
-            "permissions": {},
-            "auth": {"token": self.config.token} if self.config.token else {},
-        }
-        device = _build_connect_device(
-            nonce=nonce,
-            token=self.config.token,
-            client_id=str(client_info["id"]),
-            client_mode=str(client_info["mode"]),
-            role=role,
-            scopes=scopes,
-            platform=str(client_info["platform"]),
-        )
-        if device:
-            connect_params["device"] = device
-
-        response = await self._rpc(
-            "connect",
-            connect_params,
-        )
-        payload = response.get("payload", {})
-        if payload.get("type") != "hello-ok":
-            raise ConnectionError(f"Expected hello-ok, got: {payload}")
-        logger.info("Connected to gateway (protocol v%s)", payload.get("protocol", "?"))

    async def close(self) -> None:
        if self._listen_task and not self._listen_task.done():
@ -392,6 +399,15 @@ class GatewayClient:
        except Exception as exc:
            logger.warning("Failed to delete session %s: %s", session_key, exc)

+    async def abort_session(self, session_key: str, *, run_id: str | None = None) -> None:
+        params: dict[str, Any] = {"key": session_key}
+        if run_id:
+            params["runId"] = run_id
+        try:
+            await self._rpc("sessions.abort", params, timeout=min(self.config.request_timeout, 10.0))
+        except Exception as exc:
+            logger.warning("Failed to abort session %s run %s: %s", session_key, run_id or "-", exc)
+
    async def get_effective_tools(self, session_key: str) -> dict[str, Any]:
        response = await self._rpc("tools.effective", {"sessionKey": session_key})
        return response.get("payload", {})
@ -411,15 +427,27 @@ class GatewayClient:
        msg_queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue()
        self._event_queues[chat_queue_key] = chat_queue
        self._event_queues[msg_queue_key] = msg_queue
+        timeout_ms = max(1, min(int(timeout * 1000), 2_147_483_647))

-        await self._rpc(
+        send_response = await self._rpc(
            "sessions.send",
            {
                "key": session_key,
                "message": message,
                "idempotencyKey": idempotency_key,
+                "timeoutMs": timeout_ms,
            },
        )
+        send_payload = send_response.get("payload", {})
+        run_id = idempotency_key
+        if isinstance(send_payload, dict):
+            raw_run_id = send_payload.get("runId")
+            if isinstance(raw_run_id, str) and raw_run_id.strip():
+                run_id = raw_run_id.strip()
+
+        wait_task = asyncio.create_task(
+            self._wait_for_agent_run(run_id, timeout_ms=timeout_ms)
+        )

        collected_messages: list[TranscriptMessage] = []
        done = False
@ -428,8 +456,31 @@ class GatewayClient:
            while not done:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
-                    logger.warning("Timeout waiting for final state on session %s", session_key)
+                    logger.warning(
+                        "Timeout waiting for final state on session %s run %s",
+                        session_key,
+                        run_id,
+                    )
                    break
+                if wait_task.done():
+                    wait_payload = _task_result_or_empty(wait_task)
+                    status = str(wait_payload.get("status", ""))
+                    if status and status != "timeout":
+                        logger.info(
+                            "agent.wait observed terminal status for session %s run %s: %s",
+                            session_key,
+                            run_id,
+                            status,
+                        )
+                        done = True
+                        break
+                    if status == "timeout":
+                        logger.warning(
+                            "agent.wait timed out for session %s run %s",
+                            session_key,
+                            run_id,
+                        )
+                        break
                try:
                    event = await asyncio.wait_for(chat_queue.get(), timeout=min(0.5, remaining))
                    state = event.get("payload", {}).get("state", "")
@ -438,6 +489,9 @@ class GatewayClient:
                except asyncio.TimeoutError:
                    pass

+            if not done:
+                await self.abort_session(session_key, run_id=run_id)
+
            collected_messages.extend(
                await _drain_message_queue(
                    msg_queue,
@ -445,12 +499,67 @@ class GatewayClient:
                    max_wait_seconds=2.0,
                )
            )
+
+            # Some gateway/provider paths persist assistant messages in session
+            # history without emitting complete streaming events. Backfill from
+            # sessions.get if stream capture appears incomplete.
+            history_messages = await self.get_session_messages(session_key)
+            collected_assistant = sum(
+                1 for msg in collected_messages if msg.role == "assistant"
+            )
+            history_assistant = sum(
+                1 for msg in history_messages if msg.role == "assistant"
+            )
+            if history_messages and (
+                len(history_messages) > len(collected_messages)
+                or history_assistant > collected_assistant
+            ):
+                collected_messages = history_messages
        finally:
+            if not wait_task.done():
+                wait_task.cancel()
+                try:
+                    await wait_task
+                except asyncio.CancelledError:
+                    pass
            self._event_queues.pop(chat_queue_key, None)
            self._event_queues.pop(msg_queue_key, None)

        return _correlate_transcript(Transcript(messages=collected_messages))

+    async def _wait_for_agent_run(self, run_id: str, *, timeout_ms: int) -> dict[str, Any]:
+        try:
+            response = await self._rpc(
+                "agent.wait",
+                {"runId": run_id, "timeoutMs": timeout_ms},
+                timeout=(timeout_ms / 1000.0) + 10.0,
+            )
+        except Exception as exc:
+            logger.warning("agent.wait failed for run %s: %s", run_id, exc)
+            return {}
+        payload = response.get("payload", {})
+        return payload if isinstance(payload, dict) else {}
+
+    async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]:
+        try:
+            response = await self._rpc("sessions.get", {"key": session_key})
+        except Exception:
+            return []
+
+        payload = response.get("payload", {})
+        raw_messages = payload.get("messages", []) if isinstance(payload, dict) else []
+        if not isinstance(raw_messages, list):
+            return []
+
+        parsed: list[TranscriptMessage] = []
+        for raw in raw_messages:
+            if not isinstance(raw, dict):
+                continue
+            msg = _parse_single_message(raw)
+            if msg is not None:
+                parsed.append(msg)
+        return parsed
+
    async def _rpc(
        self,
        method: str,
@ -469,14 +578,17 @@ class GatewayClient:
        effective_timeout = timeout if timeout is not None else self.config.request_timeout
        future: asyncio.Future[dict[str, Any]] = asyncio.get_running_loop().create_future()
        self._pending[request_id] = future
-        await self._ws.send(json.dumps(frame))
        try:
+            await self._ws.send(json.dumps(frame))
            response = await asyncio.wait_for(future, timeout=effective_timeout)
        except asyncio.TimeoutError:
            self._pending.pop(request_id, None)
            raise TimeoutError(
                f"RPC {method} timed out after {effective_timeout:.1f}s"
            )
+        except Exception:
+            self._pending.pop(request_id, None)
+            raise

        if not response.get("ok", False):
            error = response.get("error", {})
@ -536,6 +648,13 @@ def _build_connect_device(
    platform: str,
    device_family: str | None = None,
 ) -> dict[str, Any] | None:
+    if os.environ.get("CLAWBENCH_DISABLE_GATEWAY_DEVICE_IDENTITY", "").strip().lower() in {
+        "1",
+        "true",
+        "yes",
+        "on",
+    }:
+        return None
    if not nonce:
        return None

@ -551,9 +670,17 @@ def _build_connect_device(
            "deviceFamily": device_family or "",
        }
    )
+
+    node_executable = _resolve_node_executable()
+    if not node_executable:
+        logger.warning(
+            "Failed to build device identity payload: no Node executable found"
+        )
+        return None
+
    try:
        completed = subprocess.run(
-            ["node", "-e", DEVICE_IDENTITY_HELPER_JS],
+            [node_executable, "-e", DEVICE_IDENTITY_HELPER_JS],
            input=helper_input,
            capture_output=True,
            text=True,
@ -577,7 +704,30 @@ def _build_connect_device(
    return payload


+def _resolve_node_executable() -> str | None:
+    """Resolve Node binary, preferring the active Python/conda environment."""
+    candidates: list[str] = []
+
+    # First try the same environment as the active Python interpreter.
+    candidates.append(os.path.join(os.path.dirname(sys.executable), "node"))
+
+    # Then try CONDA_PREFIX when available.
+    conda_prefix = os.environ.get("CONDA_PREFIX")
+    if conda_prefix:
+        candidates.append(os.path.join(conda_prefix, "bin", "node"))
+
+    for candidate in candidates:
+        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
+            return candidate
+
+    return shutil.which("node")
+
+
 def _is_transient_gateway_connect_error(exc: Exception) -> bool:
+    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
+        return True
+    if isinstance(exc, websockets.exceptions.ConnectionClosed):
+        return True
    if isinstance(exc, InvalidStatus):
        return exc.response.status_code in {502, 503, 504}
    if isinstance(exc, InvalidMessage):
@ -593,6 +743,13 @@ def _describe_connect_error(exc: Exception) -> str:
    return exc.__class__.__name__


+def _task_result_or_empty(task: asyncio.Task[dict[str, Any]]) -> dict[str, Any]:
+    try:
+        return task.result()
+    except Exception:
+        return {}
+
+
 def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | None:
    role = message_data.get("role", "")
    if not role:
@ -615,6 +772,9 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
            if block_type == "text":
                text_parts.append(block.get("text", ""))
                continue
+            if block_type == "output_text":
+                text_parts.append(block.get("text", ""))
+                continue
            if block_type in {"tool_use", "toolCall"}:
                arguments = block.get("input", block.get("arguments", {}))
                if isinstance(arguments, str):
@ -641,6 +801,16 @@ def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | N
                if tool_result_content:
                    text_parts.append(tool_result_content)

+    # Some providers surface assistant failures in a dedicated error field
+    # with empty content blocks. Preserve that signal in transcript text.
+    error_message = message_data.get("errorMessage", "")
+    if isinstance(error_message, str) and error_message.strip():
+        text_parts.append(error_message.strip())
+
+    direct_text = message_data.get("text", "")
+    if isinstance(direct_text, str) and direct_text.strip():
+        text_parts.append(direct_text.strip())
+
    if not text_parts and not tool_calls and not tool_result_for:
        return None

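Reviewer note on the `clawbench/client.py` hunks above: the change moves the challenge and `connect` handshake inside the retry loop and caps each attempt at `min(30.0, remaining)`, so a gateway that accepts the socket but never completes the handshake is retried rather than failing the whole run. The pattern can be sketched in isolation as below. This is only an illustration under stated assumptions: `connect_with_retry`, `attempt_fn`, `flaky_handshake`, and the choice to treat `ConnectionError` as transient are hypothetical and are not taken from ClawBench.

```python
import asyncio
from typing import Awaitable, Callable


async def connect_with_retry(
    attempt_fn: Callable[[float], Awaitable[str]],
    *,
    total_timeout: float,
    per_attempt_cap: float = 30.0,
    delay: float = 0.01,
) -> str:
    """Retry a full connect-plus-handshake attempt until an overall deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_timeout
    while True:
        # Same budget rule as the diff: never below 1s, never above the cap.
        remaining = max(1.0, deadline - loop.time())
        attempt_timeout = min(per_attempt_cap, remaining)
        try:
            return await asyncio.wait_for(
                attempt_fn(attempt_timeout), timeout=attempt_timeout
            )
        except (TimeoutError, asyncio.TimeoutError, ConnectionError):
            # Assumption: these errors are transient. Give up once the
            # overall deadline has passed.
            if loop.time() >= deadline:
                raise
            await asyncio.sleep(delay)


calls: list[float] = []


async def flaky_handshake(timeout: float) -> str:
    """Hypothetical handshake that fails twice, then succeeds."""
    calls.append(timeout)
    if len(calls) < 3:
        raise ConnectionError("gateway restarting")
    return "hello-ok"


result = asyncio.run(connect_with_retry(flaky_handshake, total_timeout=5.0))
print(result, len(calls))
```

Keeping the handshake inside the `try` means one `except` path can close the half-open socket and decide whether to retry, which is what the added `await self.close()` does in the real client.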
--- a/clawbench/diagnose_cli.py
+++ b/clawbench/diagnose_cli.py
@ -37,7 +37,8 @@ from clawbench.diagnostic import build_diagnostic, submit_run
 from clawbench.insights import publish_insights
 from clawbench.prediction import HistoricalDatabase
 from clawbench.profile import PluginManifest, PluginProfile, RegistrationTrace
-from clawbench.schemas import Transcript
+from clawbench.schemas import ToolCall, Transcript
+from clawbench.trajectory import classify_tool_call


 DEFAULT_CLAWBENCH_ROOT = Path(".clawbench")
@ -80,6 +81,39 @@ def load_transcripts(path: Path) -> dict[str, Transcript]:
    return out


+def infer_registration_traces_from_manifests(
+    profile: PluginProfile,
+    manifests: dict[str, PluginManifest],
+) -> dict[str, RegistrationTrace]:
+    """Build best-effort registration traces from manifest-declared tools.
+
+    Full runtime registration traces are better because they include hooks,
+    gateway methods, routes, and services. This fallback still gives the
+    diagnostic layer exact manifest-declared tool names, which is enough to
+    attribute many transcript tool calls instead of dropping all utilization
+    into the unassigned bucket.
+    """
+    traces: dict[str, RegistrationTrace] = {}
+    for entry in profile.plugins:
+        manifest = manifests.get(entry.id)
+        if manifest is None:
+            continue
+        tools = list(manifest.contracts.get("tools", []))
+        families = sorted(
+            {
+                classify_tool_call(ToolCall(name=tool))[0]
+                for tool in tools
+                if tool
+            }
+        )
+        traces[entry.id] = RegistrationTrace(
+            plugin_id=entry.id,
+            tools=tools,
+            tool_families_seen=families,
+        )
+    return traces
+
+
 def write_submission_record(
    submissions_dir: Path, fingerprint_hash: str, report_dict: dict
 ) -> Path:
@ -162,6 +196,7 @@ def main() -> None:
    profile = PluginProfile.from_yaml_file(args.profile)
    plugin_ids = [e.id for e in profile.plugins]
    manifests = load_manifests(args.manifests, plugin_ids)
+    traces = infer_registration_traces_from_manifests(profile, manifests)
    db = HistoricalDatabase(path=args.db)

    actual_overall: float | None = None
@ -172,9 +207,16 @@ def main() -> None:
            sys.exit(2)
        results_data = json.loads(args.results.read_text(encoding="utf-8"))
        actual_overall = float(results_data.get("overall_score", 0.0))
-        actual_per_task = {
-            k: float(v) for k, v in results_data.get("per_task_score", {}).items()
-        }
+        if "per_task_score" in results_data:
+            actual_per_task = {
+                k: float(v) for k, v in results_data.get("per_task_score", {}).items()
+            }
+        else:
+            actual_per_task = {
+                str(item.get("task_id")): float(item.get("mean_task_score", 0.0))
+                for item in results_data.get("task_results", [])
+                if item.get("task_id")
+            }

    transcripts: dict[str, Transcript] | None = None
    if args.transcripts:
@ -208,6 +250,7 @@ def main() -> None:
            db=db,
            actual_overall_score=actual_overall,
            actual_per_task_scores=actual_per_task,
+            traces=traces,
            transcripts=transcripts,
            tier_of=tier_of,
        )
@ -223,6 +266,7 @@ def main() -> None:
            db=db,
            actual_overall_score=actual_overall,
            actual_per_task_scores=actual_per_task,
+            traces=traces,
            transcripts=transcripts,
            tier_of=tier_of,
        )
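Reviewer note on the `clawbench/diagnose_cli.py` hunks above: the results loader now accepts either a flat `per_task_score` mapping or a `task_results` list whose items carry `task_id` and `mean_task_score`. A minimal sketch of that fallback as a standalone function, using the field names from the diff; the function name `extract_per_task_scores` is hypothetical and does not exist in ClawBench.

```python
from typing import Any


def extract_per_task_scores(results_data: dict[str, Any]) -> dict[str, float]:
    """Return {task_id: score} from either supported results layout."""
    if "per_task_score" in results_data:
        # Flat layout: {"per_task_score": {"task-a": 0.8, ...}}
        return {k: float(v) for k, v in results_data["per_task_score"].items()}
    # List layout: {"task_results": [{"task_id": ..., "mean_task_score": ...}]}
    # Items without a task_id are skipped; a missing score defaults to 0.0.
    return {
        str(item["task_id"]): float(item.get("mean_task_score", 0.0))
        for item in results_data.get("task_results", [])
        if item.get("task_id")
    }


flat = extract_per_task_scores({"per_task_score": {"a": 1}})
nested = extract_per_task_scores(
    {
        "task_results": [
            {"task_id": "t1", "mean_task_score": 0.5},
            {"mean_task_score": 0.9},  # no task_id, so it is dropped
            {"task_id": "t2"},  # no score, so it defaults to 0.0
        ]
    }
)
print(flat, nested)
```

Note that a task with a missing `mean_task_score` is reported as `0.0` rather than omitted, so downstream consumers cannot distinguish "scored zero" from "no score recorded".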
--- a/clawbench/diagnostic.py
+++ b/clawbench/diagnostic.py
@ -17,16 +17,13 @@ leaderboards.

 from __future__ import annotations

-import json
 from dataclasses import dataclass, field, asdict
-from pathlib import Path
 from typing import Any

 from clawbench.factor_analysis import FactorAnalysisReport, analyze
 from clawbench.prediction import (
    HistoricalDatabase,
    HistoricalRun,
-    PredictionReport,
    attribute_surprise,
    predict_profile,
 )
--- a/clawbench/dynamics.py
+++ b/clawbench/dynamics.py
@ -0,0 +1,695 @@
+"""Dynamics analysis for ClawBench agent trajectories.
+
+Treats each agent run as a discrete dynamical system and computes step
+embeddings, trajectory metrics, sensitivity analysis, regime classification,
+Kaplan-Meier survival, non-Markov memory, and stratified assessment with
+Bayesian importance-weight correction for distribution shift.
+"""
+
+from __future__ import annotations
+
+import math
+from collections import Counter
+from dataclasses import dataclass, field
+from enum import Enum
+from typing import TYPE_CHECKING, Callable
+
+import numpy as np
+
+if TYPE_CHECKING:
+    from clawbench.schemas import TaskRunResult, Transcript
+
+# ── Constants ──────────────────────────────────────────────────────────
+
+TOOL_FAMILIES = ("browser", "edit", "execute", "memory", "read", "search")
+_N_FAM = len(TOOL_FAMILIES)
+
+# ── Types ──────────────────────────────────────────────────────────────
+
+
+class Regime(str, Enum):
+    convergent = "convergent"
+    chaotic = "chaotic"
+    trapped = "trapped"
+    diffusive = "diffusive"
+    limit_cycle = "limit_cycle"
+    unknown = "unknown"
+
+
+@dataclass
+class Dynamics:
+    """Computed dynamics for a single trajectory."""
+
+    n_steps: int
+    embeddings: np.ndarray          # (n_steps, 10)
+    drift: np.ndarray               # cosine distance from step 0
+    step_size: np.ndarray           # cosine distance from step t-1
+    entropy_series: list[float]     # running tool-family entropy
+    error_rate_series: list[float]  # running error fraction
+    tokens_series: list[int]
+    latency_series: list[float]
+    tool_sequence: list[str]        # primary family per step
+    markov: dict[str, dict[str, float]]
+    family_dist: dict[str, float]
+    regime: Regime
+    mean_drift: float
+    mean_step_size: float
+    tool_entropy: float
+    error_rate: float
+    constraint_index: float
+    pca_trajectory: np.ndarray | None = None  # (n_steps, 2)
+    bigram_transitions: dict[str, dict[str, float]] = field(default_factory=dict)
+    memory_depth: float = 0.0       # I(X_t; X_{t-2} | X_{t-1})
+
+
+@dataclass
+class Sensitivity:
+    """Pairwise comparison between two runs of the same task."""
+
+    task_id: str
+    score_delta: float
+    tool_edit_distance: int
+    family_js_divergence: float
+    embedding_divergence: np.ndarray  # (min_steps,)
+    lyapunov_proxy: float
+
+
+@dataclass
+class SurvivalPoint:
+    time: float
+    survival: float
+
+
+# ── Helpers ────────────────────────────────────────────────────────────
+
+
+def _cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
+    na, nb = np.linalg.norm(a), np.linalg.norm(b)
+    if na < 1e-12 or nb < 1e-12:
+        return 1.0
+    return float(1.0 - np.dot(a, b) / (na * nb))
+
+
+def _entropy(counts: dict[str, int]) -> float:
+    total = sum(counts.values())
+    if total == 0:
+        return 0.0
+    return -sum(
+        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
+    )
+
+
+def _js_divergence(p: dict[str, int], q: dict[str, int]) -> float:
+    keys = set(p) | set(q)
+    if not keys:
+        return 0.0
+    tp, tq = sum(p.values()) or 1, sum(q.values()) or 1
+    jsd = 0.0
+    for k in keys:
+        pk, qk = p.get(k, 0) / tp, q.get(k, 0) / tq
+        mk = (pk + qk) / 2
+        if pk > 0 and mk > 0:
+            jsd += 0.5 * pk * math.log2(pk / mk)
+        if qk > 0 and mk > 0:
+            jsd += 0.5 * qk * math.log2(qk / mk)
+    return jsd
+
+
+def _levenshtein(a: list, b: list) -> int:
+    if not a:
+        return len(b)
+    if not b:
+        return len(a)
+    prev = list(range(len(b) + 1))
+    for ca in a:
+        curr = [prev[0] + 1] + [0] * len(b)
+        for j, cb in enumerate(b):
+            curr[j + 1] = min(
+                prev[j] + (0 if ca == cb else 1),
+                prev[j + 1] + 1,
+                curr[j] + 1,
+            )
+        prev = curr
+    return prev[-1]
+
+
+def _classify_tool(name: str) -> str:
+    lo = name.lower()
+    for fam in TOOL_FAMILIES:
+        if fam in lo:
+            return fam
+    _ALIASES = {
+        "edit": ("write_file", "create_file", "str_replace", "patch"),
+        "execute": ("bash", "terminal", "shell", "run", "exec"),
+        "browser": ("browse", "click", "navigate", "screenshot"),
+        "search": ("grep", "find", "glob", "semantic"),
+        "read": ("cat", "head", "tail", "view", "list_dir"),
+    }
+    for fam, keywords in _ALIASES.items():
+        if any(k in lo for k in keywords):
+            return fam
+    return "execute"
+
+
+def _normalize_tool_family(name: str, family: str | None) -> str:
+    if family in TOOL_FAMILIES:
+        return family
+    return _classify_tool(name)
+
+
+# ── Feature embedding ──────────────────────────────────────────────────
+
+
+def _embed_transcript(
+    transcript: Transcript,
+) -> tuple[np.ndarray, list[str], list[int], list[float], list[bool]]:
+    """Build (n_steps, 10) feature matrix from assistant turns.
+
+    Features: [0:6] tool-family proportions, [6] error flag,
+    [7] normalised tokens, [8] normalised text length, [9] progress.
+    """
+    msgs = transcript.assistant_messages
+    n = len(msgs)
+    if n == 0:
+        return np.empty((0, _N_FAM + 4)), [], [], [], []
+
+    X = np.zeros((n, _N_FAM + 4))
+    families: list[str] = []
+    tokens: list[int] = []
+    latencies: list[float] = []
+    errors: list[bool] = []
+    raw_tokens = np.zeros(n)
+    raw_text = np.zeros(n)
+
+    for i, msg in enumerate(msgs):
+        fam_counts: Counter = Counter()
+        has_err = False
+        for tc in msg.tool_calls:
+            fam = _normalize_tool_family(tc.name, tc.family)
+            fam_counts[fam] += 1
+            if tc.success is False or tc.error:
+                has_err = True
+        n_tc = sum(fam_counts.values()) or 1
+        for j, fam in enumerate(TOOL_FAMILIES):
+            X[i, j] = fam_counts.get(fam, 0) / n_tc
+        X[i, _N_FAM] = 1.0 if has_err else 0.0
+        X[i, _N_FAM + 3] = i / max(n - 1, 1)
+
+        families.append(
+            max(fam_counts, key=fam_counts.get) if fam_counts else "execute"
+        )
+        errors.append(has_err)
+        tokens.append(msg.usage.total_tokens)
+        raw_tokens[i] = float(msg.usage.total_tokens)
+        raw_text[i] = float(len(msg.text))
+        dt = msg.timestamp_ms - msgs[i - 1].timestamp_ms if i > 0 else 0
+        latencies.append(max(float(dt), 0.0))
+
+    mx_tok = raw_tokens.max() or 1
+    mx_txt = raw_text.max() or 1
+    X[:, _N_FAM + 1] = raw_tokens / mx_tok
+    X[:, _N_FAM + 2] = raw_text / mx_txt
+
+    return X, families, tokens, latencies, errors
+
+
+# ── Non-Markov memory ────────────────────────────────────────────────
+
+
+def _compute_bigram_transitions(seq: list[str]) -> dict[str, dict[str, float]]:
+    """P(family_t | family_{t-1}, family_{t-2}) grouped by bigram context."""
+    if len(seq) < 3:
+        return {}
+    bigrams: dict[str, Counter] = {}
+    for a, b, c in zip(seq[:-2], seq[1:-1], seq[2:]):
+        ctx = f"{a}->{b}"
+        bigrams.setdefault(ctx, Counter())[c] += 1
+    return {
+        ctx: {k: v / sum(cnts.values()) for k, v in cnts.items()}
+        for ctx, cnts in bigrams.items()
+    }
+
+
+def _conditional_mi(seq: list[str]) -> float:
+    """I(X_t ; X_{t-2} | X_{t-1}): non-Markov memory indicator."""
+    if len(seq) < 3:
+        return 0.0
+    n = len(seq) - 2
+    triple = Counter(zip(seq[:-2], seq[1:-1], seq[2:]))
+    pair_01 = Counter(zip(seq[:-2], seq[1:-1]))
+    pair_12 = Counter(zip(seq[1:-1], seq[2:]))
+    single = Counter(seq[1:-1])
+
+    mi = 0.0
+    for (a, b, c), count in triple.items():
+        p_abc = count / n
+        p_ab, p_bc, p_b = pair_01[(a, b)] / n, pair_12[(b, c)] / n, single[b] / n
+        if p_ab > 0 and p_bc > 0 and p_b > 0:
+            mi += p_abc * math.log2((p_abc * p_b) / (p_ab * p_bc))
+    return max(mi, 0.0)
+
+
+# ── Core analysis ──────────────────────────────────────────────────────
+
+
+def compute_dynamics(transcript: Transcript) -> Dynamics:
+    """Compute trajectory dynamics from a single run transcript."""
+    X, families, tokens, latencies, errors = _embed_transcript(transcript)
+    n = len(families)
+
+    drift = (
+        np.array([_cosine_dist(X[0], X[i]) for i in range(n)])
+        if n else np.array([])
+    )
+    step_sz = np.zeros(n)
+    for i in range(1, n):
+        step_sz[i] = _cosine_dist(X[i - 1], X[i])
+
+    fam_acc: Counter = Counter()
+    err_count = 0
+    entropy_s: list[float] = []
+    error_s: list[float] = []
+    for i, (fam, err) in enumerate(zip(families, errors)):
+        fam_acc[fam] += 1
+        err_count += int(err)
+        entropy_s.append(_entropy(dict(fam_acc)))
+        error_s.append(err_count / (i + 1))
+
+    total = sum(fam_acc.values()) or 1
+    fam_dist = {k: v / total for k, v in fam_acc.items()}
+
+    mc: dict[str, Counter] = {f: Counter() for f in TOOL_FAMILIES}
+    for a, b in zip(families[:-1], families[1:]):
+        mc[a][b] += 1
+    markov = {
+        src: ({dst: c / t for dst, c in cnts.items()} if (t := sum(cnts.values())) else {})
+        for src, cnts in mc.items()
+    }
+
+    ci = 0.5
+    if n > 2:
+        cov = np.cov(X.T)
+        eigvals = np.maximum(np.linalg.eigvalsh(cov), 0)
+        tv = eigvals.sum()
+        if tv > 1e-10:
+            p = eigvals / tv
+            pr = 1.0 / np.sum(p**2)
+            ci = 1.0 - (pr - 1) / (X.shape[1] - 1)
+
+    h = _entropy(dict(fam_acc))
+    er = err_count / n if n else 0.0
+    regime = _classify_regime(drift, step_sz, h, er, ci, n)
+
+    return Dynamics(
+        n_steps=n,
+        embeddings=X,
+        drift=drift,
+        step_size=step_sz,
+        entropy_series=entropy_s,
+        error_rate_series=error_s,
+        tokens_series=tokens,
+        latency_series=latencies,
+        tool_sequence=families,
+        markov=markov,
+        family_dist=fam_dist,
+        regime=regime,
+        mean_drift=float(np.mean(drift)) if n else 0.0,
+        mean_step_size=float(np.mean(step_sz)) if n else 0.0,
+        tool_entropy=h,
+        error_rate=er,
+        constraint_index=ci,
+        bigram_transitions=_compute_bigram_transitions(families),
+        memory_depth=_conditional_mi(families),
+    )
+
+
+def _classify_regime(drift, step_sz, entropy, error_rate, ci, n) -> Regime:
+    if n < 3:
+        return Regime.unknown
+    if entropy < 0.5 or (error_rate > 0.6 and float(np.std(drift)) < 0.05):
+        return Regime.trapped
+    q = max(1, n // 4)
+    late_drift_std = float(np.std(drift[-q:]))
+    late_step_mean = float(np.mean(step_sz[-q:]))
+    if late_drift_std < 0.1 and late_step_mean < 0.15 and error_rate < 0.2:
+        return Regime.convergent
+    if entropy > 1.5 and error_rate < 0.15 and ci < 0.8:
+        return Regime.diffusive
+    step_var = float(np.var(step_sz[1:])) if n > 1 else 0
+    if entropy > 2.0 and step_var > 0.02:
+        return Regime.chaotic
+    if n > 6:
+        ss = step_sz[1:]
+        ss_c = ss - ss.mean()
+        norm = np.dot(ss_c, ss_c)
+        if norm > 1e-10:
+            ac = np.correlate(ss_c, ss_c, mode="full")
+            ac = ac[len(ac) // 2:] / norm
+            if len(ac) > 5 and max(ac[2:6]) > 0.3:
+                return Regime.limit_cycle
+    return Regime.unknown
+
+
+# ── Sensitivity ────────────────────────────────────────────────────────
+
+
+def compute_sensitivity(
+    run_a: TaskRunResult,
+    run_b: TaskRunResult,
+    task_id: str = "",
+) -> Sensitivity:
+    """Compare two runs of the same task for prompt sensitivity."""
+    Xa, fam_a, *_ = _embed_transcript(run_a.transcript)
+    Xb, fam_b, *_ = _embed_transcript(run_b.transcript)
+
+    min_n = min(len(Xa), len(Xb))
+    emb_div = (
+        np.array([_cosine_dist(Xa[i], Xb[i]) for i in range(min_n)])
+        if min_n else np.array([])
+    )
+
+    lyap = 0.0
+    if min_n > 1:
+        d0 = max(_cosine_dist(Xa[0], Xb[0]), 1e-6)
+        lyap = sum(
+            math.log(max(emb_div[t], 1e-6) / d0) / t for t in range(1, min_n)
+        ) / (min_n - 1)
+
+    return Sensitivity(
+        task_id=task_id or run_a.task_id,
+        score_delta=abs(run_a.run_score - run_b.run_score),
+        tool_edit_distance=_levenshtein(fam_a, fam_b),
+        family_js_divergence=_js_divergence(dict(Counter(fam_a)), dict(Counter(fam_b))),
+        embedding_divergence=emb_div,
+        lyapunov_proxy=lyap,
+    )
+
+
+# ── Survival analysis ─────────────────────────────────────────────────
+
+
+def kaplan_meier(
+    event_times: list[float],
+    censored: list[bool] | None = None,
+) -> list[SurvivalPoint]:
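+    """Kaplan-Meier survival estimator.
+
+    ``censored[i]`` is True when run *i* ended without the event; such runs
+    leave the risk set at ``event_times[i]`` without lowering the curve.
+    """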
+    n = len(event_times)
+    if n == 0:
+        return []
+    if censored is None:
+        censored = [False] * n
+    pairs = sorted(zip(event_times, censored))
+    pts = [SurvivalPoint(0.0, 1.0)]
+    at_risk = n
+    surv = 1.0
+    for t, cens in pairs:
+        if cens:
+            at_risk -= 1
+            continue
+        if at_risk > 0:
+            surv *= (at_risk - 1) / at_risk
+        at_risk -= 1
+        pts.append(SurvivalPoint(t, surv))
+    return pts
+
+
+def find_event_step(transcript: Transcript, event: str) -> float | None:
+    """Return step index of the first occurrence of *event*, or None."""
+    msgs = transcript.assistant_messages
+    if event == "first_error_recovery":
+        in_err = False
+        for i, m in enumerate(msgs):
+            any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
+            if any_err:
+                in_err = True
+            elif in_err:
+                return float(i)
+    elif event == "first_correct_write":
+        for i, m in enumerate(msgs):
+            for tc in m.tool_calls:
+                fam = tc.family or _classify_tool(tc.name)
+                if fam == "edit" and tc.success is not False and not tc.error:
+                    return float(i)
+    elif event == "task_completion":
+        if msgs:
+            last = msgs[-1]
+            if not any(tc.success is False or tc.error for tc in last.tool_calls):
+                return float(len(msgs) - 1)
+    elif event == "failure_absorption":
+        err_seen = False
+        for i, m in enumerate(msgs):
+            any_err = any(tc.success is False or tc.error for tc in m.tool_calls)
+            if any_err:
+                err_seen = True
+            elif err_seen and m.tool_calls:
+                return float(i)
+    return None
+
+
+# ── PCA trajectory bundles ─────────────────────────────────────────────
+
+
+def compute_pca_bundle(
+    dynamics_list: list[Dynamics],
+) -> tuple[np.ndarray, list[np.ndarray]]:
+    """Fit PCA on pooled embeddings and project each trajectory into PC1-PC2.
+
+    Side effect: stores each projection on ``d.pca_trajectory``.
+    """
+    non_empty = [d.embeddings for d in dynamics_list if d.n_steps > 0]
+    if not non_empty:
+        for d in dynamics_list:
+            d.pca_trajectory = np.empty((0, 2))
+        return np.zeros((2, _N_FAM + 4)), []
+    all_emb = np.vstack(non_empty)
+    mean = all_emb.mean(axis=0)
+    centred = all_emb - mean
+    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
+    components = Vt[:2]
+
+    projections: list[np.ndarray] = []
+    for d in dynamics_list:
+        proj = (d.embeddings - mean) @ components.T if d.n_steps else np.empty((0, 2))
+        d.pca_trajectory = proj
+        projections.append(proj)
+    return components, projections
+
+
+# ── Stratified assessment with Bayesian reweighting ───────────────────
+
+
+@dataclass
+class StratumStats:
+    """Distributional statistics for one stratum of runs."""
+
+    name: str
+    n_runs: int
+    weight: float
+
+    # Score distribution
+    scores: np.ndarray
+    score_mean: float
+    score_std: float
+    score_quantiles: dict[str, float]  # q10, q25, q50, q75, q90
+
+    # Dynamics distributions
+    entropy_dist: np.ndarray
+    error_rate_dist: np.ndarray
+    constraint_dist: np.ndarray
+    memory_depth_dist: np.ndarray
+    mean_drift_dist: np.ndarray
+    mean_step_size_dist: np.ndarray
+
+    # Time-series curves (aligned by step index)
+    drift_curve_mean: np.ndarray
+    drift_curve_std: np.ndarray
+    step_curve_mean: np.ndarray
+    step_curve_std: np.ndarray
+
+    regime_counts: dict[str, int]
+    sensitivity_deltas: np.ndarray
+
+
+# Scalar fields on StratumStats that reweight() aggregates.
+_REWEIGHT_FIELDS = [
+    ("entropy", "entropy_dist"),
+    ("error_rate", "error_rate_dist"),
+    ("constraint", "constraint_dist"),
+    ("memory_depth", "memory_depth_dist"),
+    ("mean_drift", "mean_drift_dist"),
+    ("mean_step_size", "mean_step_size_dist"),
+]
+
+
+@dataclass
+class StratifiedAssessment:
+    """Full stratified assessment with Bayesian reweighting.
+
+    Call ``reweight(target_weights)`` with a different task distribution
+    to obtain importance-weighted aggregate estimates.
+    """
+
+    strata: list[StratumStats]
+    stratifier_name: str
+    total_runs: int
+    observed_mean_score: float
+    observed_std_score: float
+
+    def stratum_names(self) -> list[str]:
+        return [s.name for s in self.strata]
+
+    def reweight(self, target_weights: dict[str, float]) -> dict[str, float]:
+        """Importance-weight correction towards a target stratum distribution.
+
+        Each run in stratum k carries weight p_target(k) / p_observed(k).
+        Summed over that stratum's runs the total mass is p_target(k), so
+        stratum means are mixed by p_target, renormalised over observed strata.
+        """
+        t_total = sum(target_weights.values()) or 1.0
+        p_target = {k: v / t_total for k, v in target_weights.items()}
+        by_name = {s.name: s for s in self.strata}
+
+        # Mixing stratum means by the raw ratio p_target / p_observed would
+        # over-weight rare strata; the stratum-level weight is p_target itself.
+        weights = {
+            name: pt
+            for name, pt in p_target.items()
+            if name in by_name and by_name[name].weight > 1e-12
+        }
+        w_total = sum(weights.values())
+        if not weights or w_total <= 0:
+            return {"score_mean": self.observed_mean_score,
+                    "score_std": self.observed_std_score}
+
+        w = {k: v / w_total for k, v in weights.items()}
+
+        # Reweight score (mean + law-of-total-variance)
+        score_mu = sum(w[k] * by_name[k].score_mean for k in w)
+        score_var = sum(
+            w[k] * (by_name[k].score_std ** 2 + (by_name[k].score_mean - score_mu) ** 2)
+            for k in w
+        )
+        result = {"score_mean": score_mu, "score_std": math.sqrt(max(score_var, 0.0))}
+
+        def _safe_mean(arr: np.ndarray) -> float:
+            return float(np.mean(arr)) if len(arr) > 0 else 0.0
+
+        for label, dist_attr in _REWEIGHT_FIELDS:
+            result[f"{label}_mean"] = sum(
+                w[k] * _safe_mean(getattr(by_name[k], dist_attr)) for k in w
+            )
+        return result
+
+
+def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
+    """Mean and std of variable-length arrays aligned at step 0."""
+    if not arrays:
+        return np.array([]), np.array([])
+    max_len = max(len(a) for a in arrays)
+    mat = np.full((len(arrays), max_len), np.nan)
+    for i, a in enumerate(arrays):
+        mat[i, :len(a)] = a
+    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
+
+
+def build_strata(
+    runs: list[TaskRunResult],
+    dynamics_list: list[Dynamics],
+    scores: list[float],
+    stratifier: Callable[[TaskRunResult, Dynamics], str],
+    stratifier_name: str = "custom",
+    sensitivities: list[Sensitivity] | None = None,
+) -> StratifiedAssessment:
+    """Group runs into strata and compute per-stratum distributions."""
+    assert len(runs) == len(dynamics_list) == len(scores)
+
+    groups: dict[str, list[int]] = {}
+    for idx, (r, d) in enumerate(zip(runs, dynamics_list)):
+        groups.setdefault(stratifier(r, d), []).append(idx)
+
+    total = len(runs)
+    all_scores = np.array(scores)
+
+    sens_by_task: dict[str, list[Sensitivity]] = {}
+    if sensitivities:
+        for s in sensitivities:
+            sens_by_task.setdefault(s.task_id, []).append(s)
+
+    strata: list[StratumStats] = []
+    for name, idxs in sorted(groups.items()):
+        n = len(idxs)
+        sc = np.array([scores[i] for i in idxs])
+        dyns = [dynamics_list[i] for i in idxs]
+
+        qs = {f"q{q}": float(np.percentile(sc, q)) if n else 0.0
+              for q in (10, 25, 50, 75, 90)}
+
+        drift_m, drift_s = _aligned_mean_std([d.drift for d in dyns])
+        step_m, step_s = _aligned_mean_std([d.step_size for d in dyns])
+
+        stratum_tasks = {runs[i].task_id for i in idxs}
+        sens_deltas = [
+            s.score_delta
+            for tid in stratum_tasks
+            for s in sens_by_task.get(tid, [])
+        ]
+
+        strata.append(StratumStats(
+            name=name, n_runs=n, weight=n / total if total else 0.0,
+            scores=sc,
+            score_mean=float(np.mean(sc)) if n else 0.0,
+            score_std=float(np.std(sc)) if n else 0.0,
+            score_quantiles=qs,
+            entropy_dist=np.array([d.tool_entropy for d in dyns]),
+            error_rate_dist=np.array([d.error_rate for d in dyns]),
+            constraint_dist=np.array([d.constraint_index for d in dyns]),
+            memory_depth_dist=np.array([d.memory_depth for d in dyns]),
+            mean_drift_dist=np.array([d.mean_drift for d in dyns]),
+            mean_step_size_dist=np.array([d.mean_step_size for d in dyns]),
+            drift_curve_mean=drift_m, drift_curve_std=drift_s,
+            step_curve_mean=step_m, step_curve_std=step_s,
+            regime_counts=dict(Counter(d.regime.value for d in dyns)),
+            sensitivity_deltas=np.array(sens_deltas) if sens_deltas else np.array([]),
+        ))
+
+    return StratifiedAssessment(
+        strata=strata,
+        stratifier_name=stratifier_name,
+        total_runs=total,
+        observed_mean_score=float(np.mean(all_scores)) if total else 0.0,
+        observed_std_score=float(np.std(all_scores)) if total else 0.0,
+    )
+
+
+# ── Built-in stratifiers ──────────────────────────────────────────────
+
+
+def stratify_by_regime(run: TaskRunResult, dyn: Dynamics) -> str:
+    return dyn.regime.value
+
+
+def stratify_by_task(run: TaskRunResult, dyn: Dynamics) -> str:
+    return run.task_id
+
+
+def stratify_by_tier(run: TaskRunResult, dyn: Dynamics) -> str:
+    tid = run.task_id.lower()
+    for i in range(1, 6):
+        if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
+            return f"tier{i}"
+    return "unknown"
+
+
+def stratify_by_tool_mix(run: TaskRunResult, dyn: Dynamics) -> str:
+    if not dyn.family_dist:
+        return "unknown"
+    return max(dyn.family_dist, key=dyn.family_dist.get)
+
+
+def stratify_by_prompt_style(run: TaskRunResult, dyn: Dynamics) -> str:
+    user_msgs = [m for m in run.transcript.messages if m.role == "user"]
+    if not user_msgs:
+        return "unknown"
+    wc = len(user_msgs[0].text.split())
+    return "terse" if wc <= 6 else ("medium" if wc <= 15 else "verbose")
+
+
+def stratify_by_scenario(run: TaskRunResult, dyn: Dynamics) -> str:
+    return run.scenario or "unknown"
+
+
+def stratify_by_family(run: TaskRunResult, dyn: Dynamics) -> str:
+    return run.family or "unknown"
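Review note on `StratifiedAssessment.reweight`: a minimal, standalone sketch of post-stratified reweighting, assuming each stratum is summarised by `(n_runs, mean, std)`. The function name `poststratify` and the tuple layout are hypothetical and are not part of this diff. The sketch mixes stratum means by target proportion and combines variances with the law of total variance, so reweighting to the observed proportions should recover the pooled mean.

```python
import math


def poststratify(strata, target):
    """Reweight per-stratum score summaries to a target stratum mix.

    strata: name -> (n_runs, mean, std)   (hypothetical layout)
    target: name -> unnormalised target weight
    """
    # Keep only strata that were observed and carry positive target mass.
    kept = {
        k: v for k, v in target.items()
        if k in strata and strata[k][0] > 0 and v > 0
    }
    total = sum(kept.values())
    if total <= 0:
        raise ValueError("target puts no mass on any observed stratum")
    w = {k: v / total for k, v in kept.items()}

    # Mixture mean, then law of total variance:
    # within-stratum variance plus between-stratum spread around the mean.
    mu = sum(w[k] * strata[k][1] for k in w)
    var = sum(
        w[k] * (strata[k][2] ** 2 + (strata[k][1] - mu) ** 2) for k in w
    )
    return mu, math.sqrt(max(var, 0.0))


strata = {"tier1": (90, 0.8, 0.1), "tier2": (10, 0.4, 0.2)}
observed = {name: n for name, (n, _, _) in strata.items()}

mu_obs, sd_obs = poststratify(strata, observed)
mu_flat, sd_flat = poststratify(strata, {"tier1": 1, "tier2": 1})
print(round(mu_obs, 4), round(mu_flat, 4), round(sd_flat, 4))
```

Using the observed run counts as the target is a useful sanity check: the result should match the unstratified mean of all runs.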
--- a/clawbench/dynamics_archive.py
+++ b/clawbench/dynamics_archive.py
@ -0,0 +1,494 @@
+"""Offline dynamics analysis helpers for cached ClawBench runs."""
+
+from __future__ import annotations
+
+import json
+from itertools import combinations
+from pathlib import Path
+from typing import Iterable
+
+import numpy as np
+
+from clawbench.dynamics import (
+    build_strata,
+    compute_dynamics,
+    compute_pca_bundle,
+    compute_sensitivity,
+    find_event_step,
+    kaplan_meier,
+    stratify_by_regime,
+    stratify_by_scenario,
+    stratify_by_tier,
+    stratify_by_tool_mix,
+)
+from clawbench.schemas import TaskRunResult
+
+_TIER_PREFIXES = {
+    "tier1": ("t1-", "t1_"),
+    "tier2": ("t2-", "t2_"),
+    "tier3": ("t3-", "t3_"),
+    "tier4": ("t4-", "t4_"),
+    "tier5": ("t5-", "t5_"),
+}
+
+
+def safe_model_name(model: str) -> str:
+    return model.replace("/", "_").replace(":", "_")
+
+
+def _candidate_model_dir_names(model: str) -> set[str]:
+    return {
+        model,
+        safe_model_name(model),
+        model.replace("/", "_"),
+        model.replace("/", "-").replace(":", "-"),
+    }
+
+
+def _has_run_files(path: Path) -> bool:
+    try:
+        for child in path.iterdir():
+            if child.is_file() and child.name.startswith("run") and child.suffix == ".json":
+                return True
+    except FileNotFoundError:
+        return False
+    return False
+
+
+def _is_task_collection_root(path: Path) -> bool:
+    try:
+        for child in path.iterdir():
+            if child.is_dir() and _has_run_files(child):
+                return True
+    except FileNotFoundError:
+        return False
+    return False
+
+
+def _resolve_model_roots(archive_dir: Path, model: str | None) -> list[Path]:
+    if _is_task_collection_root(archive_dir):
+        if model is not None and archive_dir.name not in _candidate_model_dir_names(model):
+            raise ValueError(
+                f"Archive dir {archive_dir} does not match requested model {model}."
+            )
+        return [archive_dir]
+
+    roots = [
+        child
+        for child in sorted(archive_dir.iterdir())
+        if child.is_dir() and _is_task_collection_root(child)
+    ]
+    if model is not None:
+        candidates = _candidate_model_dir_names(model)
+        roots = [root for root in roots if root.name in candidates]
+    elif len(roots) > 1:
+        raise ValueError(
+            "Archive root contains multiple model directories. Pass --model or point "
+            "--archive-dir at a specific model directory."
+        )
+    return roots
+
+
+def discover_model_roots(archive_dir: Path) -> dict[str, Path]:
+    """Discover model directories inside an archive root.
+
+    Returns a mapping of model directory name to its path. If archive_dir is
+    itself a model cache root (contains task directories with run*.json), the
+    mapping contains a single entry.
+    """
+    if not archive_dir.exists():
+        raise ValueError(f"Archive dir does not exist: {archive_dir}")
+
+    if _is_task_collection_root(archive_dir):
+        return {archive_dir.name: archive_dir}
+
+    roots = {
+        child.name: child
+        for child in sorted(archive_dir.iterdir())
+        if child.is_dir() and _is_task_collection_root(child)
+    }
+    return roots
+
+
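+def _matches_tier(task_id: str, tier: str | None) -> bool:
+    if tier is None:
+        return True
+    if tier not in _TIER_PREFIXES:
+        raise ValueError(
+            f"Unknown tier {tier!r}; expected one of {sorted(_TIER_PREFIXES)}."
+        )
+    return task_id.lower().startswith(_TIER_PREFIXES[tier])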
+
+
+def load_task_runs_archive(
+    archive_dir: Path,
+    model: str | None = None,
+    task_ids: Iterable[str] | None = None,
+    tier: str | None = None,
+) -> dict[str, list[TaskRunResult]]:
+    """Load cached TaskRunResult objects from a run cache/archive directory."""
+    task_filter = set(task_ids or [])
+    task_runs: dict[str, list[TaskRunResult]] = {}
+
+    if not archive_dir.exists():
+        raise ValueError(f"Archive dir does not exist: {archive_dir}")
+
+    roots = _resolve_model_roots(archive_dir, model)
+    if not roots:
+        return {}
+
+    for root in roots:
+        for task_dir in sorted(child for child in root.iterdir() if child.is_dir()):
+            task_id = task_dir.name
+            if task_filter and task_id not in task_filter:
+                continue
+            if not _matches_tier(task_id, tier):
+                continue
+
+            runs = []
+            for run_file in sorted(task_dir.glob("run*.json")):
+                try:
+                    run = TaskRunResult.model_validate_json(
+                        run_file.read_text(encoding="utf-8")
+                    )
+                except Exception:
+                    # Skip unreadable or schema-invalid run files instead of
+                    # failing the whole load.
+                    continue
+                runs.append(run)
+
+            if runs:
+                task_runs.setdefault(task_id, []).extend(runs)
+
+    for runs in task_runs.values():
+        runs.sort(key=lambda run: run.run_index)
+
+    return task_runs
+
+
+def _aligned_mean_std(arrays: list[np.ndarray]) -> tuple[np.ndarray, np.ndarray]:
+    if not arrays:
+        return np.array([]), np.array([])
+    max_len = max(len(arr) for arr in arrays)
+    if max_len == 0:
+        return np.array([]), np.array([])
+    mat = np.full((len(arrays), max_len), np.nan)
+    for idx, arr in enumerate(arrays):
+        mat[idx, :len(arr)] = arr
+    return np.nanmean(mat, axis=0), np.nanstd(mat, axis=0)
+
+
+def _round_list(values: np.ndarray, digits: int = 4) -> list[float]:
+    return [round(float(value), digits) for value in values.tolist()]
+
+
+def _empty_sensitivity_summary() -> dict[str, object]:
+    return {
+        "n_pairs": 0,
+        "mean_score_delta": 0.0,
+        "mean_tool_edit_distance": 0.0,
+        "mean_family_js_divergence": 0.0,
+        "mean_lyapunov_proxy": 0.0,
+        "mean_initial_divergence": 0.0,
+        "mean_final_divergence": 0.0,
+        "mean_contraction_delta": 0.0,
+        "mean_contraction_ratio": 0.0,
+        "fraction_converging_pairs": 0.0,
+        "mean_divergence_curve": [],
+        "std_divergence_curve": [],
+        "pair_points": [],
+    }
+
+
+def _summarize_sensitivity_group(pairs: list) -> dict[str, object]:
+    if not pairs:
+        return _empty_sensitivity_summary()
+
+    divergence_curves = [pair.embedding_divergence for pair in pairs if len(pair.embedding_divergence) > 0]
+    curve_mean, curve_std = _aligned_mean_std(divergence_curves)
+
+    pair_points = []
+    for pair in pairs:
+        if len(pair.embedding_divergence) > 0:
+            initial_divergence = float(pair.embedding_divergence[0])
+            final_divergence = float(pair.embedding_divergence[-1])
+            contraction_delta = final_divergence - initial_divergence
+            contraction_ratio = final_divergence / max(initial_divergence, 1e-6)
+        else:
+            initial_divergence = 0.0
+            final_divergence = 0.0
+            contraction_delta = 0.0
+            contraction_ratio = 0.0
+        pair_points.append(
+            {
+                "score_delta": round(float(pair.score_delta), 4),
+                "tool_edit_distance": int(pair.tool_edit_distance),
+                "family_js_divergence": round(float(pair.family_js_divergence), 4),
+                "lyapunov_proxy": round(float(pair.lyapunov_proxy), 4),
+                "initial_divergence": round(initial_divergence, 4),
+                "final_divergence": round(final_divergence, 4),
+                "contraction_delta": round(contraction_delta, 4),
+                "contraction_ratio": round(contraction_ratio, 4),
+            }
+        )
+
+    converging_pairs = sum(
+        1 for point in pair_points if point["final_divergence"] < point["initial_divergence"]
+    )
+
+    return {
+        "n_pairs": len(pairs),
+        "mean_score_delta": round(float(np.mean([pair.score_delta for pair in pairs])), 4),
+        "mean_tool_edit_distance": round(float(np.mean([pair.tool_edit_distance for pair in pairs])), 4),
+        "mean_family_js_divergence": round(float(np.mean([pair.family_js_divergence for pair in pairs])), 4),
+        "mean_lyapunov_proxy": round(float(np.mean([pair.lyapunov_proxy for pair in pairs])), 4),
+        "mean_initial_divergence": round(float(np.mean([point["initial_divergence"] for point in pair_points])), 4),
+        "mean_final_divergence": round(float(np.mean([point["final_divergence"] for point in pair_points])), 4),
+        "mean_contraction_delta": round(float(np.mean([point["contraction_delta"] for point in pair_points])), 4),
+        "mean_contraction_ratio": round(float(np.mean([point["contraction_ratio"] for point in pair_points])), 4),
+        "fraction_converging_pairs": round(converging_pairs / len(pair_points), 4),
+        "mean_divergence_curve": _round_list(curve_mean),
+        "std_divergence_curve": _round_list(curve_std),
+        "pair_points": pair_points,
+    }
+
+
+def _build_sensitivity_sections(
+    valid_runs_by_task: dict[str, list[TaskRunResult]],
+) -> tuple[list, dict[str, object]]:
+    same_task_pairs = []
+    per_task: dict[str, object] = {}
+    for task_id, runs in sorted(valid_runs_by_task.items()):
+        if len(runs) < 2:
+            continue
+        task_pairs = [
+            compute_sensitivity(run_a, run_b, task_id=task_id)
+            for run_a, run_b in combinations(runs, 2)
+        ]
+        if task_pairs:
+            same_task_pairs.extend(task_pairs)
+            per_task[task_id] = _summarize_sensitivity_group(task_pairs)
+
+    same_task_summary = _summarize_sensitivity_group(same_task_pairs)
+    same_task_summary["per_task"] = per_task
+
+    perturbation_pairs = []
+    per_variant_group: dict[str, object] = {}
+    runs_by_variant_group: dict[str, list[TaskRunResult]] = {}
+    for runs in valid_runs_by_task.values():
+        for run in runs:
+            runs_by_variant_group.setdefault(run.variant_group or run.task_id, []).append(run)
+
+    for variant_group, runs in sorted(runs_by_variant_group.items()):
+        distinct_members = {
+            (run.task_id, run.prompt_variant, run.variant_id)
+            for run in runs
+        }
+        if len(distinct_members) < 2:
+            continue
+
+        group_pairs = []
+        for run_a, run_b in combinations(runs, 2):
+            if (
+                run_a.task_id == run_b.task_id
+                and run_a.prompt_variant == run_b.prompt_variant
+                and run_a.variant_id == run_b.variant_id
+            ):
+                continue
+            group_pairs.append(compute_sensitivity(run_a, run_b, task_id=variant_group))
+
+        if not group_pairs:
+            continue
+
+        perturbation_pairs.extend(group_pairs)
+        group_summary = _summarize_sensitivity_group(group_pairs)
+        group_summary["members"] = [
+            {
+                "task_id": task_id,
+                "prompt_variant": prompt_variant,
+                "variant_id": variant_id,
+            }
+            for task_id, prompt_variant, variant_id in sorted(distinct_members)
+        ]
+        per_variant_group[variant_group] = group_summary
+
+    perturbation_summary = _summarize_sensitivity_group(perturbation_pairs)
+    perturbation_summary["per_variant_group"] = per_variant_group
+
+    return same_task_pairs, {
+        "same_task": same_task_summary,
+        "prompt_perturbation": perturbation_summary,
+    }
+
+
+def build_dynamics_report(
+    task_runs: dict[str, list[TaskRunResult]],
+    include_pca: bool = True,
+) -> tuple[dict[str, object], dict[str, object]]:
+    """Compute stratified dynamics report data from cached runs."""
+    all_runs = [run for runs in task_runs.values() for run in runs]
+    if not all_runs:
+        raise ValueError("No cached runs were loaded.")
+
+    dynamics_list = []
+    scores = []
+    valid_runs = []
+    for run in all_runs:
+        if not run.transcript.messages:
+            continue
+        dynamics_list.append(compute_dynamics(run.transcript))
+        scores.append(run.run_score)
+        valid_runs.append(run)
+
+    if not valid_runs:
+        raise ValueError("No runs with transcripts were found in the archive.")
+
+    valid_runs_by_task: dict[str, list[TaskRunResult]] = {}
+    for run in valid_runs:
+        valid_runs_by_task.setdefault(run.task_id, []).append(run)
+
+    same_task_sensitivities, sensitivity_summary = _build_sensitivity_sections(valid_runs_by_task)
+
+    stratifiers = {
+        "tier": stratify_by_tier,
+        "regime": stratify_by_regime,
+        "tool_mix": stratify_by_tool_mix,
+        "scenario": stratify_by_scenario,
+    }
+
+    report: dict[str, object] = {
+        "n_runs": len(valid_runs),
+        "n_tasks": len(valid_runs_by_task),
+        "strata": {},
+    }
+
+    stratified = {}
+    for name, fn in stratifiers.items():
+        assessment = build_strata(
+            valid_runs,
+            dynamics_list,
+            scores,
+            fn,
+            name,
+            sensitivities=same_task_sensitivities,
+        )
+        stratified[name] = assessment
+        strata_summary = []
+        for stratum in assessment.strata:
+            strata_summary.append(
+                {
+                    "name": stratum.name,
+                    "n_runs": stratum.n_runs,
+                    "weight": round(stratum.weight, 4),
+                    "score_mean": round(stratum.score_mean, 4),
+                    "score_std": round(stratum.score_std, 4),
+                    "score_quantiles": {
+                        key: round(value, 4)
+                        for key, value in stratum.score_quantiles.items()
+                    },
+                    "entropy_mean": round(float(stratum.entropy_dist.mean()), 4)
+                    if len(stratum.entropy_dist)
+                    else 0.0,
+                    "error_rate_mean": round(float(stratum.error_rate_dist.mean()), 4)
+                    if len(stratum.error_rate_dist)
+                    else 0.0,
+                    "constraint_mean": round(float(stratum.constraint_dist.mean()), 4)
+                    if len(stratum.constraint_dist)
+                    else 0.0,
+                    "memory_depth_mean": round(float(stratum.memory_depth_dist.mean()), 4)
+                    if len(stratum.memory_depth_dist)
+                    else 0.0,
+                    "sensitivity_pairs": int(len(stratum.sensitivity_deltas)),
+                    "sensitivity_mean_score_delta": round(float(stratum.sensitivity_deltas.mean()), 4)
+                    if len(stratum.sensitivity_deltas)
+                    else 0.0,
+                    "regime_counts": stratum.regime_counts,
+                }
+            )
+        report["strata"][name] = {
+            "observed_mean_score": round(assessment.observed_mean_score, 4),
+            "observed_std_score": round(assessment.observed_std_score, 4),
+            "strata": strata_summary,
+        }
+
+    report["per_run"] = [
+        {
+            "task_id": run.task_id,
+            "run_index": run.run_index,
+            "score": round(run.run_score, 4),
+            "regime": dynamics.regime.value,
+            "entropy": round(dynamics.tool_entropy, 4),
+            "error_rate": round(dynamics.error_rate, 4),
+            "constraint_index": round(dynamics.constraint_index, 4),
+            "memory_depth": round(dynamics.memory_depth, 4),
+            "n_steps": dynamics.n_steps,
+            "mean_drift": round(dynamics.mean_drift, 4),
+            "mean_step_size": round(dynamics.mean_step_size, 4),
+        }
+        for run, dynamics in zip(valid_runs, dynamics_list)
+    ]
+    report["sensitivity"] = sensitivity_summary
+
+    if include_pca:
+        # Called for its side effect: sets pca_trajectory on each Dynamics.
+        compute_pca_bundle(dynamics_list)
+
+    events = []
+    censored = []
+    for run in valid_runs:
+        step = find_event_step(run.transcript, "first_correct_write")
+        if step is not None:
+            events.append(step)
+            censored.append(False)
+        else:
+            events.append(float(len(run.transcript.assistant_messages)))
+            censored.append(True)
+    km_points = kaplan_meier(events, censored)
+    return report, {
+        "valid_runs": valid_runs,
+        "dynamics_list": dynamics_list,
+        "stratified": stratified,
+        "km_points": km_points,
+        "sensitivity": sensitivity_summary,
+    }
+
+
+def write_dynamics_report(
+    task_runs: dict[str, list[TaskRunResult]],
+    out_dir: Path,
+    report_name: str = "dynamics.json",
+    generate_plots: bool = True,
+) -> tuple[Path, list[Path]]:
+    """Write the dynamics report JSON and plots to an output directory."""
+    report, plot_data = build_dynamics_report(task_runs, include_pca=generate_plots)
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    report_path = out_dir / report_name
+    report_path.write_text(json.dumps(report, indent=2), encoding="utf-8")
+
+    plots: list[Path] = []
+    if generate_plots:
+        from clawbench.dynamics_plots import generate_all_plots
+
+        plots = generate_all_plots(
+            plot_data["dynamics_list"],
+            plot_data["valid_runs"],
+            plot_data["stratified"],
+            km_points=plot_data["km_points"],
+            event_name="first_correct_write",
+            out_dir=out_dir,
+            sensitivity_summary=plot_data["sensitivity"],
+        )
+    return report_path, plots
+
+
+def load_task_runs_by_model(
+    archive_dir: Path,
+    tier: str | None = None,
+    task_ids: Iterable[str] | None = None,
+) -> dict[str, dict[str, list[TaskRunResult]]]:
+    """Load cached TaskRunResult objects grouped by model directory name."""
+    grouped: dict[str, dict[str, list[TaskRunResult]]] = {}
+    for model_name, model_dir in discover_model_roots(archive_dir).items():
+        task_runs = load_task_runs_archive(
+            archive_dir=model_dir,
+            model=None,
+            task_ids=task_ids,
+            tier=tier,
+        )
+        if task_runs:
+            grouped[model_name] = task_runs
+    return grouped
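Review note on the survival data built in `build_dynamics_report`: runs that never reach `first_correct_write` are recorded as right-censored at their final step. Below is a minimal, standalone sketch of a Kaplan-Meier curve with that kind of censoring. The function name `km_curve` and the plain `(time, survival)` tuples are assumptions for illustration, not the `SurvivalPoint` API in this diff. Unlike a per-event loop, this variant groups tied times into a single step.

```python
from collections import Counter


def km_curve(times, censored):
    """Return [(t, S(t))] with one point per distinct event time.

    times[i] is the step at which run i had the event or stopped;
    censored[i] is True when the run stopped without the event.
    """
    events = Counter(t for t, c in zip(times, censored) if not c)
    exits = Counter(times)  # events and censorings both leave the risk set
    at_risk = len(times)
    surv = 1.0
    curve = [(0.0, 1.0)]
    for t in sorted(exits):
        d = events.get(t, 0)
        if d and at_risk > 0:
            # Events at time t are counted against everyone still at risk,
            # including runs censored at the same t.
            surv *= 1.0 - d / at_risk
            curve.append((float(t), surv))
        at_risk -= exits[t]
    return curve


# Four runs: events at steps 1, 2 and 3, plus one run censored at step 2.
curve = km_curve([1, 2, 2, 3], [False, False, True, False])
for t, s in curve:
    print(t, round(s, 4))
```

The censored run shrinks the risk set after step 2 without adding a drop of its own, which is why the last event takes the curve to zero.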
--- a/clawbench/dynamics_plots.py
+++ b/clawbench/dynamics_plots.py
@ -0,0 +1,411 @@
+"""Plotting utilities for dynamics analysis.
+
+Generates publication-ready figures from dynamics data and saves to a
+results directory. All plots use matplotlib with the Agg backend so they
+work headlessly.
+"""
+from __future__ import annotations
+
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+from clawbench.dynamics import (
+    Dynamics,
+    StratifiedAssessment,
+    StratumStats,
+    SurvivalPoint,
+)
+
+
+def _savefig(fig: plt.Figure, path: Path) -> None:
+    fig.savefig(path, dpi=150, bbox_inches="tight")
+    plt.close(fig)
+
+
+def _plot_series_curves(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+    *,
+    series_attr: str,
+    ylabel: str,
+    title: str,
+) -> None:
+    """Plot a step-aligned per-run series coloured by label."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    cmap = plt.cm.tab10
+    unique = sorted(set(labels))
+    colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
+
+    for d, lbl in zip(dynamics_list, labels):
+        series = np.asarray(getattr(d, series_attr), dtype=float)
+        if len(series) < 2:
+            continue
+        ax.plot(series, alpha=0.6, color=colour_map[lbl], linewidth=1)
+
+    for lbl in unique:
+        ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
+    ax.legend(fontsize=8, loc="upper left")
+    ax.set_xlabel("Step")
+    ax.set_ylabel(ylabel)
+    ax.set_title(title)
+    _savefig(fig, out_path)
+
+
+def plot_drift_curves(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+) -> None:
+    """Drift-from-origin curves coloured by label (e.g. task_id or regime)."""
+    _plot_series_curves(
+        dynamics_list,
+        labels,
+        out_path,
+        series_attr="drift",
+        ylabel="Cosine distance from step 0",
+        title="Drift from Origin",
+    )
+
+
+def plot_step_size_curves(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+) -> None:
+    """Step-to-step movement curves coloured by label."""
+    _plot_series_curves(
+        dynamics_list,
+        labels,
+        out_path,
+        series_attr="step_size",
+        ylabel="Cosine distance from previous step",
+        title="Step-to-Step Movement",
+    )
+
+
+def plot_pca_trajectories(
+    dynamics_list: list[Dynamics],
+    labels: list[str],
+    out_path: Path,
+) -> None:
+    """PCA phase portraits (PC1 vs PC2) coloured by label."""
+    fig, ax = plt.subplots(figsize=(8, 8))
+    cmap = plt.cm.tab10
+    unique = sorted(set(labels))
+    colour_map = {lbl: cmap(i / max(len(unique) - 1, 1)) for i, lbl in enumerate(unique)}
+
+    for d, lbl in zip(dynamics_list, labels):
+        if d.pca_trajectory is None or len(d.pca_trajectory) < 2:
+            continue
+        traj = d.pca_trajectory
+        ax.plot(traj[:, 0], traj[:, 1], alpha=0.5, color=colour_map[lbl], linewidth=1)
+        ax.scatter(traj[0, 0], traj[0, 1], color=colour_map[lbl], marker="o", s=30, zorder=5)
+        ax.scatter(traj[-1, 0], traj[-1, 1], color=colour_map[lbl], marker="x", s=30, zorder=5)
+
+    for lbl in unique:
+        ax.plot([], [], color=colour_map[lbl], label=lbl, linewidth=2)
+    ax.legend(fontsize=8)
+    ax.set_xlabel("PC1")
+    ax.set_ylabel("PC2")
+    ax.set_title("PCA Phase Portrait (o=start, x=end)")
+    _savefig(fig, out_path)
+
+
+def plot_regime_distribution(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Stacked bar chart of regime counts per stratum."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    all_regimes = sorted({r for s in strata for r in s.regime_counts})
+    x = np.arange(len(strata))
+    bottom = np.zeros(len(strata))
+    cmap = plt.cm.Set2
+
+    for j, regime in enumerate(all_regimes):
+        counts = [s.regime_counts.get(regime, 0) for s in strata]
+        ax.bar(x, counts, bottom=bottom, label=regime, color=cmap(j / max(len(all_regimes) - 1, 1)))
+        bottom += np.array(counts)
+
+    ax.set_xticks(x)
+    ax.set_xticklabels([s.name for s in strata], rotation=30, ha="right")
+    ax.set_ylabel("Count")
+    ax.set_title(f"Regime Distribution by {stratifier_name}")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+
+
+def plot_score_distributions(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Box plots of score distributions per stratum."""
+    fig, ax = plt.subplots(figsize=(10, 5))
+    data = [s.scores for s in strata if len(s.scores) > 0]
+    labels = [s.name for s in strata if len(s.scores) > 0]
+
+    if data:
+        ax.boxplot(data, patch_artist=True,
+                   boxprops=dict(facecolor="lightblue", alpha=0.7))
+        ax.set_xticklabels(labels, rotation=30, ha="right")
+    ax.set_ylabel("Score")
+    ax.set_title(f"Score Distribution by {stratifier_name}")
+    _savefig(fig, out_path)
+
+
+def plot_survival_curve(
+    km_points: list[SurvivalPoint],
+    event_name: str,
+    out_path: Path,
+) -> None:
+    """Kaplan-Meier survival curve."""
+    if not km_points:
+        return
+    fig, ax = plt.subplots(figsize=(8, 5))
+    times = [p.time for p in km_points]
+    surv = [p.survival for p in km_points]
+    ax.step(times, surv, where="post", linewidth=2, color="steelblue")
+    ax.fill_between(times, surv, step="post", alpha=0.15, color="steelblue")
+    ax.set_xlabel("Step")
+    ax.set_ylabel("Survival probability")
+    ax.set_title(f"Kaplan-Meier: {event_name}")
+    ax.set_ylim(-0.05, 1.05)
+    _savefig(fig, out_path)
+
+
+def plot_stratum_dynamics_heatmap(
+    strata: list[StratumStats],
+    stratifier_name: str,
+    out_path: Path,
+) -> None:
+    """Heatmap of mean dynamics metrics across strata."""
+    metrics = ["entropy", "error_rate", "constraint", "memory_depth", "mean_drift", "mean_step_size"]
+    data = np.zeros((len(strata), len(metrics)))
+    for i, s in enumerate(strata):
+        arrays = [s.entropy_dist, s.error_rate_dist, s.constraint_dist,
+                  s.memory_depth_dist, s.mean_drift_dist, s.mean_step_size_dist]
+        for j, arr in enumerate(arrays):
+            data[i, j] = float(np.mean(arr)) if len(arr) > 0 else 0.0
+
+    fig, ax = plt.subplots(figsize=(10, max(3, len(strata) * 0.6)))
+    im = ax.imshow(data, aspect="auto", cmap="YlOrRd")
+    ax.set_xticks(range(len(metrics)))
+    ax.set_xticklabels(metrics, rotation=30, ha="right")
+    ax.set_yticks(range(len(strata)))
+    ax.set_yticklabels([s.name for s in strata])
+    for i in range(len(strata)):
+        for j in range(len(metrics)):
+            ax.text(j, i, f"{data[i, j]:.2f}", ha="center", va="center", fontsize=8)
+    fig.colorbar(im, ax=ax, shrink=0.8)
+    ax.set_title(f"Dynamics Metrics by {stratifier_name}")
+    _savefig(fig, out_path)
+
+
+def plot_pairwise_divergence_curves(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Plot mean pairwise trajectory divergence over aligned steps."""
+    if not per_task_sensitivity:
+        return False
+
+    fig, ax = plt.subplots(figsize=(10, 5))
+    cmap = plt.cm.tab10
+    tasks = sorted(per_task_sensitivity)
+    colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
+
+    plotted = False
+    for task in tasks:
+        summary = per_task_sensitivity[task]
+        mean_curve = np.asarray(summary.get("mean_divergence_curve", []), dtype=float)
+        std_curve = np.asarray(summary.get("std_divergence_curve", []), dtype=float)
+        if len(mean_curve) == 0:
+            continue
+        steps = np.arange(len(mean_curve))
+        ax.plot(steps, mean_curve, linewidth=2, color=colour_map[task], label=task)
+        if len(std_curve) == len(mean_curve):
+            ax.fill_between(steps, mean_curve - std_curve, mean_curve + std_curve, color=colour_map[task], alpha=0.12)
+        plotted = True
+
+    if not plotted:
+        plt.close(fig)
+        return False
+
+    ax.set_xlabel("Aligned step")
+    ax.set_ylabel("Pairwise embedding divergence")
+    ax.set_title("Do Repeated Trajectories Converge or Diverge?")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+    return True
+
+
+def plot_pairwise_contraction_scatter(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Scatter initial vs final pairwise divergence; below diagonal means convergence."""
+    if not per_task_sensitivity:
+        return False
+
+    fig, ax = plt.subplots(figsize=(7, 6))
+    cmap = plt.cm.tab10
+    tasks = sorted(per_task_sensitivity)
+    colour_map = {task: cmap(i / max(len(tasks) - 1, 1)) for i, task in enumerate(tasks)}
+
+    max_seen = 0.0
+    plotted = False
+    for task in tasks:
+        points = per_task_sensitivity[task].get("pair_points", [])
+        if not points:
+            continue
+        xs = [point["initial_divergence"] for point in points]
+        ys = [point["final_divergence"] for point in points]
+        max_seen = max(max_seen, *(xs + ys))
+        ax.scatter(xs, ys, s=60, alpha=0.8, color=colour_map[task], label=task)
+        plotted = True
+
+    if not plotted:
+        plt.close(fig)
+        return False
+
+    limit = max(max_seen, 0.1)
+    ax.plot([0, limit], [0, limit], linestyle="--", color="black", linewidth=1)
+    ax.set_xlabel("Initial pairwise divergence")
+    ax.set_ylabel("Final pairwise divergence")
+    ax.set_title("Pairwise Trajectory Contraction")
+    ax.legend(fontsize=8)
+    _savefig(fig, out_path)
+    return True
+
+
+def plot_sensitivity_heatmap(
+    per_task_sensitivity: dict[str, dict],
+    out_path: Path,
+) -> bool:
+    """Heatmap of per-task sensitivity metrics."""
+    if not per_task_sensitivity:
+        return False
+
+    metrics = [
+        ("mean_score_delta", "score_delta"),
+        ("mean_tool_edit_distance", "tool_edit"),
+        ("mean_family_js_divergence", "js_div"),
+        ("mean_lyapunov_proxy", "lyapunov"),
+        ("fraction_converging_pairs", "frac_converging"),
+    ]
+    tasks = sorted(per_task_sensitivity)
+    data = np.zeros((len(tasks), len(metrics)))
+    for row_idx, task in enumerate(tasks):
+        summary = per_task_sensitivity[task]
+        for col_idx, (key, _label) in enumerate(metrics):
+            data[row_idx, col_idx] = float(summary.get(key, 0.0))
+
+    fig, ax = plt.subplots(figsize=(9, max(3, len(tasks) * 0.7)))
+    im = ax.imshow(data, aspect="auto", cmap="Blues")
+    ax.set_xticks(range(len(metrics)))
+    ax.set_xticklabels([label for _key, label in metrics], rotation=30, ha="right")
+    ax.set_yticks(range(len(tasks)))
+    ax.set_yticklabels(tasks)
+    for row_idx in range(len(tasks)):
+        for col_idx in range(len(metrics)):
+            ax.text(col_idx, row_idx, f"{data[row_idx, col_idx]:.2f}", ha="center", va="center", fontsize=8)
+    fig.colorbar(im, ax=ax, shrink=0.8)
+    ax.set_title("Pairwise Sensitivity by Task")
+    _savefig(fig, out_path)
+    return True
+
+
+def generate_all_plots(
+    dynamics_list: list[Dynamics],
+    runs: list,
+    stratified: dict[str, StratifiedAssessment],
+    km_points: list[SurvivalPoint] | None = None,
+    event_name: str = "first_correct_write",
+    out_dir: Path = Path("results"),
+    sensitivity_summary: dict[str, dict] | None = None,
+) -> list[Path]:
+    """Generate all dynamics plots and return list of saved paths."""
+    out_dir.mkdir(parents=True, exist_ok=True)
+    saved: list[Path] = []
+
+    # Labels by regime, and by tier (parsed from a `t<N>_` / `t<N>-` task-id prefix)
+    regime_labels = [d.regime.value for d in dynamics_list]
+    tier_labels = []
+    for r in runs:
+        tid = r.task_id.lower()
+        tier = "unknown"
+        for i in range(1, 6):
+            if tid.startswith(f"t{i}_") or tid.startswith(f"t{i}-"):
+                tier = f"tier{i}"
+                break
+        tier_labels.append(tier)
+
+    # Drift curves by regime
+    p = out_dir / "drift_by_regime.png"
+    plot_drift_curves(dynamics_list, regime_labels, p)
+    saved.append(p)
+
+    # Drift curves by tier
+    p = out_dir / "drift_by_tier.png"
+    plot_drift_curves(dynamics_list, tier_labels, p)
+    saved.append(p)
+
+    p = out_dir / "step_size_by_regime.png"
+    plot_step_size_curves(dynamics_list, regime_labels, p)
+    saved.append(p)
+
+    p = out_dir / "step_size_by_tier.png"
+    plot_step_size_curves(dynamics_list, tier_labels, p)
+    saved.append(p)
+
+    # PCA trajectories
+    has_pca = any(d.pca_trajectory is not None for d in dynamics_list)
+    if has_pca:
+        p = out_dir / "pca_by_regime.png"
+        plot_pca_trajectories(dynamics_list, regime_labels, p)
+        saved.append(p)
+        p = out_dir / "pca_by_tier.png"
+        plot_pca_trajectories(dynamics_list, tier_labels, p)
+        saved.append(p)
+
+    # Per-stratifier plots
+    for name, sa in stratified.items():
+        p = out_dir / f"regimes_by_{name}.png"
+        plot_regime_distribution(sa.strata, name, p)
+        saved.append(p)
+
+        p = out_dir / f"scores_by_{name}.png"
+        plot_score_distributions(sa.strata, name, p)
+        saved.append(p)
+
+        p = out_dir / f"dynamics_heatmap_{name}.png"
+        plot_stratum_dynamics_heatmap(sa.strata, name, p)
+        saved.append(p)
+
+    # Survival curve
+    if km_points:
+        p = out_dir / f"survival_{event_name}.png"
+        plot_survival_curve(km_points, event_name, p)
+        saved.append(p)
+
+    per_task_sensitivity = (sensitivity_summary or {}).get("same_task", {}).get("per_task", {})
+    p = out_dir / "pairwise_divergence_by_task.png"
+    if plot_pairwise_divergence_curves(per_task_sensitivity, p):
+        saved.append(p)
+
+    p = out_dir / "pairwise_contraction_scatter.png"
+    if plot_pairwise_contraction_scatter(per_task_sensitivity, p):
+        saved.append(p)
+
+    p = out_dir / "sensitivity_heatmap.png"
+    if plot_sensitivity_heatmap(per_task_sensitivity, p):
+        saved.append(p)
+
+    return saved
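
The tier labels that `generate_all_plots` derives inline can be illustrated on their own. The snippet below is a minimal sketch of that prefix rule, not part of the change: the helper name `tier_label`, its `max_tier` parameter, and the sample task ids are assumptions made for illustration.

```python
def tier_label(task_id: str, max_tier: int = 5) -> str:
    """Map a task id such as 't3_fix_bug' or 'T2-refactor' to 'tier3' / 'tier2'.

    Hypothetical helper: mirrors the inline loop in `generate_all_plots`,
    which checks tiers 1 through 5 and falls back to "unknown".
    """
    tid = task_id.lower()
    for i in range(1, max_tier + 1):
        # A tier prefix must be followed by "_" or "-".
        if tid.startswith((f"t{i}_", f"t{i}-")):
            return f"tier{i}"
    return "unknown"


# Hypothetical task ids, for illustration only.
labels = [tier_label(t) for t in ["t1_hello", "T4-browser", "misc_task"]]
print(labels)  # ['tier1', 'tier4', 'unknown']
```

Pulling the rule into a function like this would also make it testable without constructing run objects, though whether that refactor is wanted is a design call for the authors.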
--- a/clawbench/environment.py
+++ b/clawbench/environment.py
@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any

 from clawbench.client import GatewayClient
+from clawbench.paths import resolve_workspace_path
 from clawbench.render import render_template, render_value
 from clawbench.schemas import (
    CompletionResult,
@ -109,7 +110,20 @@ async def run_execution_check(
    runtime_values: dict[str, Any],
 ) -> ExecutionCheckResult:
    rendered_command = render_template(spec.command, runtime_values)
-    rendered_cwd = workspace / render_template(spec.cwd, runtime_values)
+    try:
+        rendered_cwd = resolve_workspace_path(
+            workspace,
+            render_template(spec.cwd, runtime_values),
+            field=f"execution check cwd for {spec.name}",
+        )
+    except ValueError as exc:
+        return ExecutionCheckResult(
+            name=spec.name,
+            command=rendered_command,
+            exit_code=-1,
+            passed=False,
+            reason=str(exc),
+        )
    rendered_env = render_value(spec.env, runtime_values)
    import os
    import sys
@ -219,7 +233,14 @@ def _evaluate_execution_result(
            return False, "stdout did not match expected text"

    if spec.expected_stdout_file:
-        expected_path = workspace / render_template(spec.expected_stdout_file, runtime_values)
+        try:
+            expected_path = resolve_workspace_path(
+                workspace,
+                render_template(spec.expected_stdout_file, runtime_values),
+                field=f"expected_stdout_file for {spec.name}",
+            )
+        except ValueError as exc:
+            return False, str(exc)
        if stdout.strip() != expected_path.read_text(encoding="utf-8").strip():
            return False, f"stdout did not match {spec.expected_stdout_file}"

@ -232,7 +253,14 @@ def _evaluate_execution_result(
            return False, "stdout JSON did not match expected JSON"

    if spec.expected_json_file:
-        expected_path = workspace / render_template(spec.expected_json_file, runtime_values)
+        try:
+            expected_path = resolve_workspace_path(
+                workspace,
+                render_template(spec.expected_json_file, runtime_values),
+                field=f"expected_json_file for {spec.name}",
+            )
+        except ValueError as exc:
+            return False, str(exc)
        try:
            parsed = json.loads(stdout)
        except json.JSONDecodeError as exc:
@ -245,7 +273,14 @@ def _evaluate_execution_result(


 def _verify_file(spec: FileState, workspace: Path, runtime_values: dict[str, Any]) -> tuple[bool, str]:
-    path = workspace / render_template(spec.path, runtime_values)
+    try:
+        path = resolve_workspace_path(
+            workspace,
+            render_template(spec.path, runtime_values),
+            field=f"completion file {spec.path}",
+        )
+    except ValueError as exc:
+        return False, str(exc)
    exists = path.exists() and path.is_file()

    if not spec.exists:
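
The hunks above replace `workspace / rendered` with `resolve_workspace_path(...)` and treat a `ValueError` as a failed check. The helper itself lives in `clawbench/paths.py` and is not shown in this part of the diff, so the following is only a sketch of what such a guard might look like, assuming it rejects any path that resolves outside the workspace. The function name `resolve_workspace_path_sketch` and the error message are hypothetical.

```python
import tempfile
from pathlib import Path


def resolve_workspace_path_sketch(workspace: Path, relative: str, *, field: str) -> Path:
    """Hypothetical stand-in for `clawbench.paths.resolve_workspace_path`.

    Assumption: the real helper raises ValueError when the rendered path
    escapes the workspace; its actual rules may differ.
    """
    root = workspace.resolve()
    # Joining with an absolute path discards `root`, and ".." segments are
    # collapsed by resolve(), so both escape routes are caught below.
    candidate = (root / relative).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"{field} escapes the workspace: {relative!r}")
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    inside = resolve_workspace_path_sketch(ws, "src/app.py", field="demo")
    print(inside.name)  # app.py
    try:
        resolve_workspace_path_sketch(ws, "../outside.txt", field="demo")
    except ValueError as exc:
        print("rejected:", exc)
```

Returning a failed result instead of raising, as the callers in this diff do, keeps a malformed task spec from aborting the whole verification pass.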
--- a/clawbench/environment_files.py
+++ b/clawbench/environment_files.py
@ -0,0 +1,438 @@
+"""Agent-agnostic workspace verification primitives.
+
+This is the half of `environment.py` that does not touch the OpenClaw
+gateway: file-state checks, execution-check subprocessing, stdout/JSON
+assertions, JSON path resolution, and the filesystem/transcript-based
+memory fallback readers.
+
+Adapters (OpenClaw, Hermes, future) consume these primitives directly.
+`environment.py` re-exports them for back-compat so existing callers
+keep working while the gateway-tied halves (`_verify_memory` primary
+path, `_verify_session`, `_verify_cron`, `_verify_gateway_assertion`)
+stay where they are and move to `adapters/openclaw.py` in a later step.
+"""
+
+from __future__ import annotations
+
+import asyncio
+import json
+import logging
+import os
+import re
+import shlex
+import sys
+from pathlib import Path
+from typing import Any
+
+from clawbench.paths import resolve_workspace_path
+from clawbench.render import render_template, render_value
+from clawbench.schemas import (
+    ExecutionCheck,
+    ExecutionCheckResult,
+    FileState,
+    MemoryState,
+    Transcript,
+)
+
+logger = logging.getLogger(__name__)
+
+
+# ---------------------------------------------------------------------------
+# File-state verification
+# ---------------------------------------------------------------------------
+
+
+def verify_file_state(
+    spec: FileState,
+    workspace: Path,
+    runtime_values: dict[str, Any],
+) -> tuple[bool, str]:
+    """Verify a single `FileState` against the workspace filesystem."""
+
+    try:
+        path = resolve_workspace_path(
+            workspace,
+            render_template(spec.path, runtime_values),
+            field=f"completion file {spec.path}",
+        )
+    except ValueError as exc:
+        return False, str(exc)
+    exists = path.exists() and path.is_file()
+
+    if not spec.exists:
+        return (not exists, "Correctly absent" if not exists else "File should not exist")
+    if not exists:
+        return False, "File does not exist"
+
+    content = path.read_text(encoding="utf-8", errors="replace")
+    if spec.min_size_bytes > 0 and path.stat().st_size < spec.min_size_bytes:
+        return False, f"File too small: {path.stat().st_size} < {spec.min_size_bytes}"
+
+    for token in spec.content_contains:
+        rendered = render_template(token, runtime_values)
+        if rendered not in content:
+            return False, f"Missing expected content '{rendered}'"
+
+    for token in spec.content_not_contains:
+        rendered = render_template(token, runtime_values)
+        if rendered in content:
+            return False, f"Contains forbidden content '{rendered}'"
+
+    if spec.content_matches and not re.search(
+        render_template(spec.content_matches, runtime_values),
+        content,
+        re.MULTILINE | re.DOTALL,
+    ):
+        return False, f"Content does not match {spec.content_matches}"
+
+    return True, "OK"
+
+
+# ---------------------------------------------------------------------------
+# Execution checks
+# ---------------------------------------------------------------------------
+
+
+async def run_execution_check(
+    spec: ExecutionCheck,
+    *,
+    workspace: Path,
+    runtime_values: dict[str, Any],
+) -> ExecutionCheckResult:
+    """Run a single `ExecutionCheck` subprocess and evaluate its output."""
+
+    rendered_command = render_template(spec.command, runtime_values)
+    try:
+        rendered_cwd = resolve_workspace_path(
+            workspace,
+            render_template(spec.cwd, runtime_values),
+            field=f"execution check cwd for {spec.name}",
+        )
+    except ValueError as exc:
+        return ExecutionCheckResult(
+            name=spec.name,
+            command=rendered_command,
+            exit_code=-1,
+            passed=False,
+            reason=str(exc),
+        )
+    rendered_env = render_value(spec.env, runtime_values)
+
+    full_env = {
+        **os.environ,
+        **{key: str(value) for key, value in rendered_env.items()},
+        "PYTHONUNBUFFERED": "1",
+    }
+    python_bin_dir = str(Path(sys.executable).parent)
+    full_env["PATH"] = f"{python_bin_dir}{os.pathsep}{full_env.get('PATH', '')}"
+    python_path_parts = [str(rendered_cwd), str(workspace)]
+    existing_pythonpath = full_env.get("PYTHONPATH")
+    if existing_pythonpath:
+        python_path_parts.append(existing_pythonpath)
+    full_env["PYTHONPATH"] = os.pathsep.join(python_path_parts)
+
+    try:
+        if spec.shell:
+            process = await asyncio.create_subprocess_shell(
+                rendered_command,
+                cwd=str(rendered_cwd),
+                env=full_env,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+        else:
+            process = await asyncio.create_subprocess_exec(
+                *shlex.split(rendered_command),
+                cwd=str(rendered_cwd),
+                env=full_env,
+                stdout=asyncio.subprocess.PIPE,
+                stderr=asyncio.subprocess.PIPE,
+            )
+        stdout_bytes, stderr_bytes = await asyncio.wait_for(
+            process.communicate(),
+            timeout=spec.timeout_seconds,
+        )
+    except asyncio.TimeoutError:
+        process.kill()
+        await process.communicate()
+        return ExecutionCheckResult(
+            name=spec.name,
+            command=rendered_command,
+            exit_code=-1,
+            passed=False,
+            reason=f"Timed out after {spec.timeout_seconds}s",
+        )
+    except Exception as exc:
+        return ExecutionCheckResult(
+            name=spec.name,
+            command=rendered_command,
+            exit_code=-1,
+            passed=False,
+            reason=str(exc),
+        )
+
+    stdout = stdout_bytes.decode("utf-8", errors="replace")
+    stderr = stderr_bytes.decode("utf-8", errors="replace")
+    passed, reason = evaluate_execution_result(
+        spec, workspace, runtime_values, process.returncode, stdout, stderr
+    )
+    return ExecutionCheckResult(
+        name=spec.name,
+        command=rendered_command,
+        exit_code=process.returncode,
+        stdout=stdout,
+        stderr=stderr,
+        passed=passed,
+        reason=reason,
+    )
+
+
+def evaluate_execution_result(
+    spec: ExecutionCheck,
+    workspace: Path,
+    runtime_values: dict[str, Any],
+    exit_code: int,
+    stdout: str,
+    stderr: str,
+) -> tuple[bool, str]:
+    """Apply every assertion declared on an `ExecutionCheck`."""
+
+    if exit_code != spec.expected_exit_code:
+        return False, f"Exit code {exit_code} != expected {spec.expected_exit_code}"
+
+    for token in spec.stdout_contains:
+        rendered = render_template(token, runtime_values)
+        if rendered not in stdout:
+            return False, f"stdout missing '{rendered}'"
+
+    for token in spec.stdout_not_contains:
+        rendered = render_template(token, runtime_values)
+        if rendered in stdout:
+            return False, f"stdout unexpectedly contains '{rendered}'"
+
+    for token in spec.stderr_contains:
+        rendered = render_template(token, runtime_values)
+        if rendered not in stderr:
+            return False, f"stderr missing '{rendered}'"
+
+    if spec.stdout_matches and not re.search(
+        render_template(spec.stdout_matches, runtime_values), stdout, re.MULTILINE | re.DOTALL
+    ):
+        return False, f"stdout does not match {spec.stdout_matches}"
+
+    if spec.stderr_matches and not re.search(
+        render_template(spec.stderr_matches, runtime_values), stderr, re.MULTILINE | re.DOTALL
+    ):
+        return False, f"stderr does not match {spec.stderr_matches}"
+
+    if spec.expected_stdout is not None:
+        rendered = render_template(spec.expected_stdout, runtime_values).strip()
+        if stdout.strip() != rendered:
+            return False, "stdout did not match expected text"
+
+    if spec.expected_stdout_file:
+        try:
+            expected_path = resolve_workspace_path(
+                workspace,
+                render_template(spec.expected_stdout_file, runtime_values),
+                field=f"expected_stdout_file for {spec.name}",
+            )
+        except ValueError as exc:
+            return False, str(exc)
+        if stdout.strip() != expected_path.read_text(encoding="utf-8").strip():
+            return False, f"stdout did not match {spec.expected_stdout_file}"
+
+    if spec.expected_json is not None:
+        try:
+            parsed = json.loads(stdout)
+        except json.JSONDecodeError as exc:
+            return False, f"stdout was not valid JSON: {exc}"
+        if parsed != render_value(spec.expected_json, runtime_values):
+            return False, "stdout JSON did not match expected JSON"
+
+    if spec.expected_json_file:
+        try:
+            expected_path = resolve_workspace_path(
+                workspace,
+                render_template(spec.expected_json_file, runtime_values),
+                field=f"expected_json_file for {spec.name}",
+            )
+        except ValueError as exc:
+            return False, str(exc)
+        try:
+            parsed = json.loads(stdout)
+        except json.JSONDecodeError as exc:
+            return False, f"stdout was not valid JSON: {exc}"
+        expected_json = json.loads(expected_path.read_text(encoding="utf-8"))
+        if parsed != expected_json:
+            return False, f"stdout JSON did not match {spec.expected_json_file}"
+
+    return True, "OK"
+
+
+# ---------------------------------------------------------------------------
+# Memory fallback: read well-known files from the workspace directly.
+# ---------------------------------------------------------------------------
+
+
+MEMORY_FILE_CANDIDATES: tuple[str, ...] = (
+    "MEMORY.md",
+    "memory.md",
+    "memory/MEMORY.md",
+    "memory/memory.md",
+    "memory/notes.md",
+    "memory/NOTES.md",
+    "notes.md",
+)
+
+
+def read_workspace_memory_text(workspace: Path) -> str:
+    """Read concatenated memory-file contents straight from the workspace.
+
+    This is the adapter-free equivalent of
+    `environment._read_agent_memory_text`, which reads the same files via
+    `GatewayClient.get_agent_file`. Use this from any adapter whose agent
+    runs directly in the ClawBench workspace (Hermes, Claude Code, Codex).
+    """
+
+    contents: list[str] = []
+    for name in MEMORY_FILE_CANDIDATES:
+        path = workspace / name
+        try:
+            if path.is_file():
+                text = path.read_text(encoding="utf-8", errors="replace")
+                if text.strip():
+                    contents.append(text)
+        except OSError:
+            continue
+    return "\n".join(contents)
+
+
+def memory_visible_in_transcript(spec: MemoryState, transcript: Transcript) -> bool:
+    """Return True if the transcript shows a memory *write* matching `spec`.
+
+    Same heuristic as `environment._memory_visible_in_transcript`, kept
+    agent-agnostic: it reads only `ToolCall.family`, `call.name`,
+    `call.input`, `call.output`, and `call.error`, all of which are canonical.
+    """
+
+    needle = spec.key_pattern.lower()
+    for call in transcript.tool_call_sequence:
+        family = (call.family or "").lower()
+        name = call.name.lower()
+        path = str(call.input.get("path", "")).lower()
+        if family != "memory" and "memory" not in path:
+            continue
+        if (
+            family == "memory"
+            and "search" in name
+            and "write" not in name
+            and "store" not in name
+            and "save" not in name
+        ):
+            continue
+
+        serialized_bits = [call.output, call.error]
+        try:
+            serialized_bits.append(json.dumps(call.input, sort_keys=True))
+        except TypeError:
+            serialized_bits.append(str(call.input))
+        haystack = " ".join(bit for bit in serialized_bits if bit).lower()
+        if needle not in haystack:
+            continue
+        if all(token.lower() in haystack for token in spec.value_contains):
+            return True
+    return False
+
+
+def verify_memory_fallback(
+    spec: MemoryState,
+    workspace: Path,
+    *,
+    transcript: Transcript | None = None,
+    extra_memory_text: str = "",
+) -> tuple[bool, str]:
+    """Resolve a `MemoryState` assertion using workspace files + transcript.
+
+    Used by any adapter that doesn't expose an OpenClaw-style
+    `memory.search` RPC. The lookup strategy is deliberately permissive
+    (matches the existing fallback path in `environment._verify_memory`):
+
+    1. Concatenate every known memory file in the workspace.
+    2. Optionally add any adapter-supplied text (e.g. OpenClaw's
+       `_read_agent_memory_text`) via `extra_memory_text`.
+    3. If the key_pattern appears (case-insensitive), check every
+       `value_contains` token.
+    4. If that fails, fall back to scanning the transcript for a memory
+       write that matches.
+    """
+
+    memory_text = (read_workspace_memory_text(workspace) + "\n" + extra_memory_text).lower()
+    needle = spec.key_pattern.lower()
+    found = needle in memory_text
+
+    if not spec.exists:
+        return (not found, "Correctly absent" if not found else "Memory entry exists")
+
+    if found:
+        for token in spec.value_contains:
+            if token.lower() not in memory_text:
+                return False, f"Memory value missing '{token}'"
+        return True, "OK"
+
+    if transcript is not None and memory_visible_in_transcript(spec, transcript):
+        return True, "Verified from transcript fallback"
+    return (
+        False,
+        "No matching memory content found in persisted memory files or transcript fallback",
+    )
+
+
+# ---------------------------------------------------------------------------
+# JSON-path resolver (pure function over dict/list payloads)
+# ---------------------------------------------------------------------------
+
+
+def resolve_json_path(payload: Any, path: str) -> Any:
+    """Resolve a dotted `$.foo.bar[0].baz` path into `payload`.
+
+    Returns None if any part of the path is missing, has the wrong type,
+    or indexes past the end of a list. Indexing needs a key (`foo[3]`).
+    """
+
+    if path == "$":
+        return payload
+    current = payload
+    for part in path.lstrip("$").lstrip(".").split("."):
+        if not part:
+            continue
+        match = re.fullmatch(r"([^\[]+)\[(\d+)\]", part)
+        if match:
+            key, index = match.groups()
+            if not isinstance(current, dict) or key not in current:
+                return None
+            current = current[key]
+            if not isinstance(current, list):
+                return None
+            idx = int(index)
+            if idx >= len(current):
+                return None
+            current = current[idx]
+            continue
+        if isinstance(current, dict) and part in current:
+            current = current[part]
+            continue
+        return None
+    return current
+
+
+__all__ = [
+    "MEMORY_FILE_CANDIDATES",
+    "evaluate_execution_result",
+    "memory_visible_in_transcript",
+    "read_workspace_memory_text",
+    "resolve_json_path",
+    "run_execution_check",
+    "verify_file_state",
+    "verify_memory_fallback",
+]
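
The behaviour of `resolve_json_path` at its edges is easier to see with a few calls. The snippet below carries a condensed copy of the resolver so that it runs standalone; the payload is hypothetical, and the copy is meant to behave the same as the function in the diff rather than replace it.

```python
import re
from typing import Any


def resolve_json_path(payload: Any, path: str) -> Any:
    """Condensed copy of the resolver from the diff, for a standalone demo."""
    if path == "$":
        return payload
    current = payload
    for part in path.lstrip("$").lstrip(".").split("."):
        if not part:
            continue
        match = re.fullmatch(r"([^\[]+)\[(\d+)\]", part)
        if match:
            key, index = match.group(1), int(match.group(2))
            # A missing key yields None from .get(), which is not a list.
            if not isinstance(current, dict) or not isinstance(current.get(key), list):
                return None
            if index >= len(current[key]):
                return None
            current = current[key][index]
        elif isinstance(current, dict) and part in current:
            current = current[part]
        else:
            return None
    return current


# Hypothetical payload, for illustration only.
payload = {"job": {"steps": [{"name": "build"}, {"name": "test"}]}}

print(resolve_json_path(payload, "$.job.steps[1].name"))  # test
print(resolve_json_path(payload, "$.job.steps[5].name"))  # None (index out of range)
print(resolve_json_path(payload, "$.job.missing"))        # None (missing key)
print(resolve_json_path([1, 2], "$[0]"))                  # None (index without a key)
```

The last call shows a limitation: the index syntax only matches after a key, so a top-level list cannot be indexed with `$[0]`. A `None` result is also indistinguishable from a path that resolves to a JSON `null`.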
--- a/clawbench/factor_analysis.py
+++ b/clawbench/factor_analysis.py
@ -29,7 +29,7 @@ when data volume permits.

 from __future__ import annotations

-from dataclasses import dataclass, field, asdict
+from dataclasses import dataclass, asdict
 from itertools import combinations

 from clawbench.prediction import HistoricalDatabase
@ -199,7 +199,6 @@ def _analyze_lite(
    main_effects.sort(key=lambda m: m.importance, reverse=True)

    # Pairwise interactions (only the top-k by absolute residual)
-    me_lookup = {m.feature: m for m in main_effects}
    candidates = [m.feature for m in main_effects[:20]]  # cap to prevent explosion
    interactions: list[InteractionImportance] = []
    for fa, fb in combinations(candidates, 2):
@ -272,7 +271,6 @@ def _analyze_random_forest(
        for j, fname in enumerate(all_features):
            X[i, j] = 1.0 if feats.get(fname, False) else 0.0

-    grand_mean = float(y.mean())
    total_variance = float(y.var(ddof=1)) if n_samples > 1 else 0.0
    if total_variance < 1e-9:
        return FactorAnalysisReport(
--- a/clawbench/harness.py
+++ b/clawbench/harness.py
@ -5,6 +5,7 @@ from __future__ import annotations
 import asyncio
 import datetime
 import hashlib
+import json
 import logging
 import os
 import shutil
@ -18,6 +19,7 @@ from rich.console import Console
 from rich.table import Table

 from clawbench import __version__
+from clawbench.ablation import build_ablation_profile
 from clawbench.client import GatewayClient, GatewayConfig
 from clawbench.releases import compute_task_snapshot_fingerprint, load_active_release
 from clawbench.schemas import (
@ -40,6 +42,10 @@ from clawbench.tasks import get_assets_dir, load_all_tasks
 logger = logging.getLogger(__name__)
 console = Console()

+KNOWN_ADAPTERS = ("openclaw", "hermes", "codex", "claude-code")
+EXECUTABLE_ADAPTERS = {"openclaw"}
+RUN_CACHE_SCHEMA_VERSION = 2
+

 class _NullCtx:
    """A no-op async context manager used to skip the browser semaphore
@ -79,6 +85,11 @@ class BenchmarkHarness:
        quiet: bool = False,
        concurrency: int = 1,
        browser_concurrency: int = 1,
+        adapter: str = "openclaw",
+        judge_affects_score: bool = False,
+        tool_profile_name: str | None = None,
+        enabled_toolsets: list[str] | None = None,
+        disabled_toolsets: list[str] | None = None,
    ) -> None:
        self.gateway_config = gateway_config
        self.model = model
@ -90,6 +101,7 @@ class BenchmarkHarness:
        self.artifact_type = artifact_type
        self.prompt_variant = prompt_variant
        self.judge_model = judge_model
+        self.judge_affects_score = judge_affects_score
        self.pool = pool
        self.subsets = subsets or []
        self.capabilities = capabilities or []
@ -102,9 +114,24 @@ class BenchmarkHarness:
        self.quiet = quiet
        self.concurrency = max(1, int(concurrency))
        self.browser_concurrency = max(1, int(browser_concurrency))
+        self.adapter = adapter
+        self.tool_profile_name = tool_profile_name
+        self.enabled_toolsets = enabled_toolsets or []
+        self.disabled_toolsets = disabled_toolsets or []
        self.repo_root = Path(__file__).parent.parent
+        self.last_task_runs: dict[str, list[TaskRunResult]] = {}

    async def run(self) -> BenchmarkResult:
+        if self.adapter not in KNOWN_ADAPTERS:
+            raise ValueError(
+                f"Unknown adapter '{self.adapter}'. Known adapters: {', '.join(KNOWN_ADAPTERS)}"
+            )
+        if self.adapter not in EXECUTABLE_ADAPTERS:
+            raise ValueError(
+                f"Adapter '{self.adapter}' is registered as a target but is not yet wired "
+                "into the end-to-end scoring harness. Use 'openclaw' for executable runs."
+            )
+
        tasks = load_all_tasks(
            tasks_dir=self.tasks_dir,
            tier=self.tier,
@ -128,6 +155,7 @@ class BenchmarkHarness:
        if not self.quiet:
            console.print(f"\n[bold]ClawBench v{__version__}[/bold] — {len(tasks)} tasks x {self.runs_per_task} runs")
            console.print(f"Model: [cyan]{self.model}[/cyan]")
+            console.print(f"Adapter: [cyan]{self.adapter}[/cyan]")
            if self.judge_model:
                console.print(f"Advisory judge: [magenta]{self.judge_model}[/magenta]")
            mode = "serial" if self.concurrency == 1 else f"parallel(concurrency={self.concurrency}, browser={self.browser_concurrency})"
@ -148,6 +176,7 @@ class BenchmarkHarness:
                f"({mean_run:.1f}s avg, concurrency={self.concurrency})[/dim]"
            )

+        self.last_task_runs = all_results
        return self._aggregate(tasks, all_results)

    async def _execute_runs(
@ -260,8 +289,7 @@ class BenchmarkHarness:
        cache_dir_env = os.environ.get("CLAWBENCH_RUN_CACHE_DIR", "/data/run_cache")
        cache_path: Path | None = None
        if cache_dir_env:
-            safe_model = self.model.replace("/", "_").replace(":", "_")
-            cache_path = Path(cache_dir_env) / safe_model / task.id / f"run{run_index}.json"
+            cache_path = self._run_cache_path(Path(cache_dir_env), task, run_index)
            if cache_path.exists():
                try:
                    cached = TaskRunResult.model_validate_json(cache_path.read_text(encoding="utf-8"))
@ -390,6 +418,7 @@ class BenchmarkHarness:
                    duration_ms=duration_ms,
                    runtime_values=runtime_values,
                    judge_model=self.judge_model,
+                    judge_affects_score=self.judge_affects_score,
                )
                timings["score"] = round(time.monotonic() - t_score_start, 2)
                timings["total"] = round(time.monotonic() - t_run_start, 2)
@ -518,6 +547,31 @@ class BenchmarkHarness:
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(item, target)

+    def _run_cache_path(self, cache_root: Path, task: TaskDefinition, run_index: int) -> Path:
+        identity = {
+            "schema": RUN_CACHE_SCHEMA_VERSION,
+            "model": self.model,
+            "adapter": self.adapter,
+            "prompt_variant": self.prompt_variant,
+            "judge_model": self.judge_model,
+            "judge_affects_score": self.judge_affects_score,
+            "tool_profile_name": self.tool_profile_name,
+            "enabled_toolsets": self.enabled_toolsets,
+            "disabled_toolsets": self.disabled_toolsets,
+            "benchmark_version": __version__,
+            "task_fingerprint": _task_definition_fingerprint(task),
+        }
+        scope = hashlib.sha256(
+            json.dumps(identity, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")
+        ).hexdigest()[:16]
+        return (
+            cache_root
+            / _safe_cache_component(self.model)
+            / f"v{RUN_CACHE_SCHEMA_VERSION}-{scope}"
+            / _safe_cache_component(task.id)
+            / f"run{run_index}.json"
+        )
+
    async def _assert_browser_support(self, client: GatewayClient, session_key: str) -> None:
        inventory = await client.get_effective_tools(session_key)
        tool_ids = {
@ -709,6 +763,15 @@ class BenchmarkHarness:
            for _ in range(count)
        )
        active_release = load_active_release()
+        ablation_profile = build_ablation_profile(
+            model=self.model,
+            adapter=self.adapter,
+            prompt_profile=self.prompt_variant,
+            harness_version=__version__,
+            tool_profile_name=self.tool_profile_name,
+            enabled_toolsets=self.enabled_toolsets,
+            disabled_toolsets=self.disabled_toolsets,
+        )
        result = BenchmarkResult(
            submission_id=str(uuid.uuid4()),
            model=self.model,
@ -724,6 +787,11 @@ class BenchmarkHarness:
                "artifact_type": self.artifact_type or "all",
                "prompt_variant": self.prompt_variant,
                "judge_model": self.judge_model,
+                "judge_affects_score": self.judge_affects_score,
+                "adapter": self.adapter,
+                "ablation_profile": ablation_profile.model_dump(),
+                "known_adapters": list(KNOWN_ADAPTERS),
+                "executable_adapters": sorted(EXECUTABLE_ADAPTERS),
                "subsets": self.subsets,
                "capabilities": self.capabilities,
                "official_only": self.official_only,
@ -908,5 +976,17 @@ def _count_values(values) -> dict[str, int]:
    return counts


+def _safe_cache_component(value: str) -> str:
+    cleaned = "".join(char if char.isalnum() or char in "._-" else "_" for char in value.strip())
+    return cleaned.strip("._-") or "unknown"
+
+
+def _task_definition_fingerprint(task: TaskDefinition) -> str:
+    payload = task.model_dump(mode="json")
+    return hashlib.sha256(
+        json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str).encode("utf-8")
+    ).hexdigest()
+
+
 def _now_ms() -> int:
    return int(time.monotonic() * 1000)
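
Review note on the run-cache scoping in `_run_cache_path`: the sketch below isolates the hashing recipe used there (compact JSON with sorted keys, SHA-256, first 16 hex characters). The helper name `cache_scope` and the sample identity values are hypothetical and not part of the diff. It is meant only to show that key order does not affect the scope, while any changed identity field does.

```python
import hashlib
import json


def cache_scope(identity: dict) -> str:
    """Hash an identity dict into a short, key-order-independent scope string."""
    encoded = json.dumps(
        identity, sort_keys=True, separators=(",", ":"), default=str
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:16]


# Hypothetical identity values, for illustration only.
base = {
    "schema": 2,
    "model": "example/model",
    "adapter": "openclaw",
    "judge_affects_score": False,
}
reordered = dict(reversed(list(base.items())))  # same fields, different order
changed = {**base, "judge_affects_score": True}  # one field flipped

same_scope = cache_scope(base) == cache_scope(reordered)
different_scope = cache_scope(base) != cache_scope(changed)
```

Because the scope is part of the directory name, a cached run is reused only when every identity field matches, so results from an older configuration are skipped rather than overwritten.
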
--- a/clawbench/insights.py
+++ b/clawbench/insights.py
@ -19,7 +19,7 @@ from __future__ import annotations

 import json
 from collections import Counter
-from dataclasses import dataclass, field, asdict
+from dataclasses import dataclass, asdict
 from pathlib import Path

 from clawbench.factor_analysis import FactorAnalysisReport, analyze
--- a/clawbench/judge.py
+++ b/clawbench/judge.py
@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any

 from clawbench.client import GatewayClient
+from clawbench.paths import resolve_workspace_path
 from clawbench.session_labels import unique_session_label
 from clawbench.schemas import (
    CompletionResult,
@ -51,7 +52,6 @@ async def judge_task_run(
        )
        await client.subscribe(session_key)
        judge_transcript = await client.send_and_wait(session_key, prompt)
-        # Temporary debug: log first 800 chars of raw judge response when parsing fails
        raw_text = judge_transcript.assistant_text
        parsed = parse_judge_response(
            raw_text,
@ -59,9 +59,10 @@ async def judge_task_run(
        )
        if parsed.error:
            logger.warning(
-                "Judge parse failed for %s. Raw response (first 800 chars):\n%s",
+                "Judge parse failed for %s: %s (response length=%d)",
                task.id,
-                raw_text[:800] if raw_text else "(empty)",
+                parsed.error,
+                len(raw_text or ""),
            )
        parsed.enabled = True
        parsed.model = judge_model
@ -185,14 +186,22 @@ def _render_artifacts(*, artifact_paths: list[str], workspace: Path, max_chars:
    remaining = max_chars
    blocks: list[str] = []
    for rel_path in artifact_paths:
-        target = workspace / rel_path
-        if not target.exists():
-            block = f"=== {rel_path} ===\n(missing)"
-        elif target.is_dir():
-            block = f"=== {rel_path} ===\n(directory)"
+        try:
+            target = resolve_workspace_path(
+                workspace,
+                rel_path,
+                field=f"judge artifact {rel_path}",
+            )
+        except ValueError as exc:
+            block = f"=== {rel_path} ===\n(invalid path: {exc})"
        else:
-            content = target.read_text(encoding="utf-8", errors="replace")
-            block = f"=== {rel_path} ===\n{_truncate_text(content, max(0, remaining - len(rel_path) - 20))}"
+            if not target.exists():
+                block = f"=== {rel_path} ===\n(missing)"
+            elif target.is_dir():
+                block = f"=== {rel_path} ===\n(directory)"
+            else:
+                content = target.read_text(encoding="utf-8", errors="replace")
+                block = f"=== {rel_path} ===\n{_truncate_text(content, max(0, remaining - len(rel_path) - 20))}"

        if remaining <= 0:
            break
--- a/clawbench/paths.py
+++ b/clawbench/paths.py
@ -0,0 +1,16 @@
+"""Path helpers for task-owned workspace references."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+
+def resolve_workspace_path(workspace: Path, path: str, *, field: str = "path") -> Path:
+    """Resolve a task-declared path and reject workspace escapes."""
+    root = workspace.resolve()
+    candidate = (workspace / path).resolve()
+    try:
+        candidate.relative_to(root)
+    except ValueError as exc:
+        raise ValueError(f"{field} escapes workspace: {path}") from exc
+    return candidate
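
Review note on `clawbench/paths.py`: a small usage sketch of the escape check, assuming a POSIX-style filesystem. The function body is copied from the diff above; the `try_resolve` wrapper and the sample paths are hypothetical and exist only to make the accept/reject behaviour visible.

```python
import tempfile
from pathlib import Path


def resolve_workspace_path(workspace: Path, path: str, *, field: str = "path") -> Path:
    """Resolve a task-declared path and reject workspace escapes."""
    root = workspace.resolve()
    candidate = (workspace / path).resolve()
    try:
        candidate.relative_to(root)
    except ValueError as exc:
        raise ValueError(f"{field} escapes workspace: {path}") from exc
    return candidate


def try_resolve(workspace: Path, path: str) -> str:
    """Hypothetical helper: report whether a path is accepted or rejected."""
    try:
        resolve_workspace_path(workspace, path, field="demo")
    except ValueError:
        return "rejected"
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp)
    results = {
        p: try_resolve(ws, p)
        for p in ("out/report.md", "../outside.txt", "/etc/passwd")
    }
```

Relative paths inside the workspace resolve normally even if the file does not exist yet; parent traversal and absolute paths are rejected because the resolved candidate is no longer under the resolved workspace root.
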
--- a/clawbench/queue.py
+++ b/clawbench/queue.py
@ -16,6 +16,7 @@ import datetime
 import json
 import logging
 import os
+import tempfile
 from enum import Enum
 from pathlib import Path

@ -27,7 +28,14 @@ logger = logging.getLogger(__name__)
 HF_TOKEN = os.environ.get("HF_TOKEN", "")

 # Local fallback when HF is unavailable
-LOCAL_QUEUE_DIR = Path("/data/queue") if Path("/data").exists() else Path("data/queue")
+def _resolve_local_queue_dir() -> Path:
+    override = os.environ.get("CLAWBENCH_LOCAL_QUEUE_DIR", "").strip()
+    if override:
+        return Path(override).expanduser()
+    return Path("/data/queue") if Path("/data").exists() else Path("data/queue")
+
+
+LOCAL_QUEUE_DIR = _resolve_local_queue_dir()


 class JobStatus(str, Enum):
@ -37,19 +45,40 @@ class JobStatus(str, Enum):
    FAILED = "failed"


+ACTIVE_JOB_STATUSES = {JobStatus.PENDING, JobStatus.EVALUATING}
+
+
 class SubmissionRequest(BaseModel):
    model: str  # e.g. "anthropic/claude-sonnet-4-6"
    provider: str = ""  # e.g. "anthropic"
    api_key_env: str = ""  # Env var name holding the API key (NOT the key itself)
    judge_model: str = ""
-    runs_per_task: int = 5
+    judge_affects_score: bool = False
+    runs_per_task: int = Field(default=3, ge=1, le=10)
    max_parallel_lanes: int = Field(default=1, ge=1, le=8)
    tier: str | None = None  # Filter to a specific tier
+    task_ids: list[str] = Field(default_factory=list)
    scenario: str | None = None
    prompt_variant: str = "clear"
    submitter: str = ""  # HF username
    notes: str = ""

+    def active_fingerprint(self) -> str:
+        """Stable key for deduping equivalent queued/evaluating jobs."""
+        payload = {
+            "model": self.model.strip(),
+            "provider": self.provider.strip(),
+            "judge_model": self.judge_model.strip(),
+            "judge_affects_score": self.judge_affects_score,
+            "runs_per_task": self.runs_per_task,
+            "max_parallel_lanes": self.max_parallel_lanes,
+            "tier": self.tier or "",
+            "task_ids": sorted({task_id.strip() for task_id in self.task_ids if task_id.strip()}),
+            "scenario": self.scenario or "",
+            "prompt_variant": self.prompt_variant,
+        }
+        return json.dumps(payload, sort_keys=True, separators=(",", ":"))
+

 class Job(BaseModel):
    job_id: str
@ -127,12 +156,74 @@ class JobQueue:
        """Persist queue state to local disk."""
        jobs_file = LOCAL_QUEUE_DIR / "jobs.json"
        data = [job.model_dump() for job in self._jobs.values()]
-        jobs_file.write_text(json.dumps(data, indent=2))
+        payload = json.dumps(data, indent=2) + "\n"
+        tmp_path: Path | None = None
+        try:
+            with tempfile.NamedTemporaryFile(
+                "w",
+                encoding="utf-8",
+                dir=LOCAL_QUEUE_DIR,
+                prefix="jobs.",
+                suffix=".tmp",
+                delete=False,
+            ) as tmp_file:
+                tmp_path = Path(tmp_file.name)
+                tmp_file.write(payload)
+                tmp_file.flush()
+                os.fsync(tmp_file.fileno())
+            tmp_path.replace(jobs_file)
+        finally:
+            if tmp_path is not None and tmp_path.exists():
+                tmp_path.unlink()

    async def submit(self, request: SubmissionRequest) -> Job:
        """Submit a new evaluation job."""
        import uuid
        async with self._lock:
+            max_runs = _env_int("CLAWBENCH_MAX_RUNS_PER_SUBMISSION", 3, minimum=1, maximum=100)
+            if request.runs_per_task > max_runs:
+                raise ValueError(
+                    f"Requested runs_per_task={request.runs_per_task}, but this deployment allows at most {max_runs}."
+                )
+
+            max_lanes = _env_int("CLAWBENCH_MAX_LANES_PER_SUBMISSION", 4, minimum=1, maximum=32)
+            if request.max_parallel_lanes > max_lanes:
+                raise ValueError(
+                    f"Requested max_parallel_lanes={request.max_parallel_lanes}, but this deployment allows at most {max_lanes}."
+                )
+
+            active_jobs = [
+                job for job in self._jobs.values() if job.status in ACTIVE_JOB_STATUSES
+            ]
+            fingerprint = request.active_fingerprint()
+            for job in active_jobs:
+                if job.request.active_fingerprint() == fingerprint:
+                    logger.info(
+                        "Deduped submission for model %s onto active job %s",
+                        request.model,
+                        job.job_id,
+                    )
+                    return job
+
+            max_active_jobs = _env_int("CLAWBENCH_MAX_ACTIVE_QUEUE_JOBS", 25, minimum=1, maximum=1000)
+            if len(active_jobs) >= max_active_jobs:
+                raise ValueError(
+                    f"Queue is at capacity ({len(active_jobs)}/{max_active_jobs} active jobs). "
+                    "Try again after current evaluations finish."
+                )
+
+            max_per_submitter = _env_int("CLAWBENCH_MAX_ACTIVE_JOBS_PER_SUBMITTER", 3, minimum=0, maximum=1000)
+            if max_per_submitter:
+                submitter_key = _submitter_key(request)
+                active_for_submitter = sum(
+                    1 for job in active_jobs if _submitter_key(job.request) == submitter_key
+                )
+                if active_for_submitter >= max_per_submitter:
+                    raise ValueError(
+                        f"Submitter '{submitter_key}' already has {active_for_submitter} active job(s); "
+                        f"limit is {max_per_submitter}."
+                    )
+
            job = Job(
                job_id=str(uuid.uuid4())[:8],
                request=request,
@ -229,7 +320,7 @@ class JobQueue:
                job.current_run_index = None
                job.current_run_total = None
                job.progress_message = (
-                    f"Auto-requeued after stale evaluation lease"
+                    "Auto-requeued after stale evaluation lease"
                    + (f" ({stale_label})" if stale_label else "")
                )
                job.stale_requeues += 1
@ -292,6 +383,10 @@ class JobQueue:

    async def _sync_to_hub(self) -> None:
        """Push queue state to HF Dataset for persistence across restarts."""
+        await asyncio.to_thread(self._sync_to_hub_blocking)
+
+    def _sync_to_hub_blocking(self) -> None:
+        """Blocking Hub upload implementation, kept off the event loop."""
        if not HF_TOKEN:
            return
        try:
@ -316,6 +411,23 @@ def _now_iso() -> str:
    return datetime.datetime.now(datetime.timezone.utc).isoformat()


+def _env_int(name: str, default: int, *, minimum: int, maximum: int) -> int:
+    raw = os.environ.get(name, "").strip()
+    if not raw:
+        return default
+    try:
+        value = int(raw)
+    except ValueError:
+        logger.warning("Invalid %s=%r, using default %d", name, raw, default)
+        return default
+    return max(minimum, min(maximum, value))
+
+
+def _submitter_key(request: SubmissionRequest) -> str:
+    submitter = request.submitter.strip().lower()
+    return submitter or "anonymous"
+
+
 def _parse_iso(value: str | None) -> datetime.datetime | None:
    if not value:
        return None
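
Review note on the `jobs.json` persistence change in `clawbench/queue.py`: the write-to-temp-then-replace pattern can be sketched in isolation as follows. The helper name `atomic_write_json` and the sample job payload are hypothetical. In this sketch the temp path is recorded before the write, so the cleanup in `finally` also covers a write that fails partway.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(target: Path, data: object) -> None:
    """Write JSON to a temp file in the same directory, then swap it in."""
    payload = json.dumps(data, indent=2) + "\n"
    tmp_path = None
    try:
        with tempfile.NamedTemporaryFile(
            "w",
            encoding="utf-8",
            dir=target.parent,  # same directory, so the rename stays on one filesystem
            prefix=target.stem + ".",
            suffix=".tmp",
            delete=False,
        ) as tmp_file:
            tmp_path = Path(tmp_file.name)
            tmp_file.write(payload)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, target)
    finally:
        # Only true if the replace did not happen; remove the stray temp file.
        if tmp_path is not None and tmp_path.exists():
            tmp_path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    jobs_file = Path(tmp) / "jobs.json"
    atomic_write_json(jobs_file, [{"job_id": "abc12345", "status": "pending"}])
    atomic_write_json(jobs_file, [])
    final_text = jobs_file.read_text(encoding="utf-8")
    leftovers = sorted(p.name for p in Path(tmp).iterdir())
```

A reader of `jobs.json` sees either the previous complete file or the new complete file, never a half-written one, which is the property the plain `write_text` call could not offer.
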
--- a/clawbench/recommendations.py
+++ b/clawbench/recommendations.py
@ -101,7 +101,7 @@ def generate_recommendations(
                    ),
                    estimated_delta=0.0,  # removing dead weight is neutral for score
                    confidence=0.9,
-                    evidence=[f"0 tool invocations across all tasks"],
+                    evidence=["0 tool invocations across all tasks"],
                ))

    # --- Signal 2: empty slots -------------------------------------------
--- a/clawbench/schemas.py
+++ b/clawbench/schemas.py
@ -390,6 +390,12 @@ class TaskDefinition(BaseModel):
    privacy_tier: str = ""
    contamination_risk: str = ""
    freshness_epoch: str = ""
+    category: str = ""
+    domain: str = ""
+    functionality: list[str] = Field(default_factory=list)
+    trace_distribution: list[str] = Field(default_factory=list)
+    tool_surface: list[str] = Field(default_factory=list)
+    risk_tags: list[str] = Field(default_factory=list)
    first_used_at: str = ""
    retire_after_runs: int = 0
    similarity_hash: str = ""
--- a/clawbench/scorer.py
+++ b/clawbench/scorer.py
@ -93,6 +93,7 @@ async def score_task_run(
    duration_ms: int,
    runtime_values: dict[str, Any],
    judge_model: str = "",
+    judge_affects_score: bool = False,
 ) -> TaskRunResult:
    annotate_transcript_tool_calls(transcript)
    completion_result = await verify_completion(
@ -123,10 +124,11 @@ async def score_task_run(
        behavior=behavior_result.score,
        judge=(
            judge_result.score
-            if judge_result.enabled and not judge_result.error
+            if judge_affects_score and judge_result.enabled and not judge_result.error
            else None
        ),
        has_deterministic_verifier=completion_result.total_assertions > 0,
+        include_judge=judge_affects_score,
    )
    delivery_outcome = classify_delivery_outcome(
        task=task,
@ -190,25 +192,31 @@ def combine_run_score(
    behavior: float,
    judge: float | None = None,
    has_deterministic_verifier: bool = False,
+    include_judge: bool = False,
 ) -> float:
    """Blend completion + trajectory + behavior (+ judge when available).

    Gating rules, per CLAWBENCH_V0_4_SPEC.md §"Disallowed Primary
    Verifiers" and §"Judge Gating":

-    1. If there is no judge signal, use the deterministic-only weights.
+    1. Official scoring ignores the judge by default and uses
+       deterministic-only weights. This keeps `--judge-model` advisory unless
+       a caller opts in with include_judge=True.

-    2. If there is a judge AND the task has a deterministic verifier
+    2. If include_judge=True AND the task has a deterministic verifier
       (execution checks, file assertions, gateway assertions, etc.),
       the judge is capped at 10% of the run score, and it only
       contributes when the deterministic completion floor is met
       (completion.score >= 0.9999). This matches the spec's policy
       that "semantic quality never rescues failed completion."

-    3. If there is a judge AND the task has NO deterministic verifier,
+    3. If include_judge=True AND the task has NO deterministic verifier,
       the judge is the dominant signal (50%) — this is the only regime
       where an LLM judge is allowed to drive the primary score.
    """
+    if not include_judge:
+        judge = None
+
    if judge is None:
        weights = RUN_SCORE_WEIGHTS_DETERMINISTIC
        weighted_sum = (
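
Review note on the gating rules in the `combine_run_score` docstring: the three regimes can be summarised as a small decision function. This is a sketch of the documented policy only. The name `max_judge_share` is hypothetical, and the actual weight tables (such as `RUN_SCORE_WEIGHTS_DETERMINISTIC`) are not shown in this diff, so only the 10% cap, the 50% share, and the 0.9999 completion floor are taken from the docstring.

```python
from __future__ import annotations


def max_judge_share(
    *,
    include_judge: bool,
    judge: float | None,
    has_deterministic_verifier: bool,
    completion: float,
) -> float:
    """Return the largest fraction of the run score the judge may contribute."""
    if not include_judge or judge is None:
        return 0.0  # rule 1: deterministic-only weights
    if has_deterministic_verifier:
        # rule 2: capped at 10%, and only once the completion floor is met
        return 0.10 if completion >= 0.9999 else 0.0
    return 0.50  # rule 3: no deterministic verifier, judge is the dominant signal


advisory = max_judge_share(
    include_judge=False, judge=0.8, has_deterministic_verifier=True, completion=1.0
)
capped = max_judge_share(
    include_judge=True, judge=0.8, has_deterministic_verifier=True, completion=1.0
)
not_rescued = max_judge_share(
    include_judge=True, judge=0.8, has_deterministic_verifier=True, completion=0.5
)
dominant = max_judge_share(
    include_judge=True, judge=0.8, has_deterministic_verifier=False, completion=0.5
)
```

The `not_rescued` case is the one the spec quote targets: a high judge score contributes nothing when deterministic completion has failed.
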
--- a/clawbench/services.py
+++ b/clawbench/services.py
@ -15,6 +15,7 @@ from typing import Any

 import httpx

+from clawbench.paths import resolve_workspace_path
 from clawbench.render import render_template, render_value
 from clawbench.schemas import BackgroundService

@ -80,7 +81,11 @@ async def start_background_services(
        service_env.setdefault("PYTHONUNBUFFERED", "1")

        command = render_template(spec.command, values)
-        cwd = workspace / render_template(spec.cwd, values)
+        cwd = resolve_workspace_path(
+            workspace,
+            render_template(spec.cwd, values),
+            field=f"background service cwd for {spec.name}",
+        )
        log_dir = workspace / ".clawbench-services"
        log_dir.mkdir(parents=True, exist_ok=True)
        log_path = log_dir / f"{spec.name}.log"
@ -120,11 +125,13 @@ async def _wait_for_service_ready(
 ) -> None:
    spec = service.spec
    deadline = time.monotonic() + spec.startup_timeout_seconds
-    ready_file = (
-        workspace / render_template(spec.ready_file, runtime_values)
-        if spec.ready_file
-        else None
-    )
+    ready_file = None
+    if spec.ready_file:
+        ready_file = resolve_workspace_path(
+            workspace,
+            render_template(spec.ready_file, runtime_values),
+            field=f"background service ready_file for {spec.name}",
+        )
    ready_url = None
    if service.base_url and spec.ready_path:
        ready_url = f"{service.base_url.rstrip('/')}/{spec.ready_path.lstrip('/')}"
--- a/clawbench/submission_models.py
+++ b/clawbench/submission_models.py
@ -0,0 +1,179 @@
+"""Preset model catalog and selection helpers for the Space submit UI."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass
+
+CUSTOM_PRESET_LABEL = "(custom)"
+
+PRESET_AUDIENCE_ALL = "All Presets"
+PRESET_AUDIENCE_CLAW = "Claw Users"
+PRESET_AUDIENCE_BUDGET = "Budget Researchers"
+
+PRESET_AUDIENCE_CHOICES = (
+    PRESET_AUDIENCE_ALL,
+    PRESET_AUDIENCE_CLAW,
+    PRESET_AUDIENCE_BUDGET,
+)
+
+
+@dataclass(frozen=True)
+class PresetModel:
+    label: str
+    model_id: str
+    provider: str
+    audiences: tuple[str, ...]
+
+
+PRESET_MODELS = (
+    PresetModel(
+        label="GPT-OSS 20B (Ollama)",
+        model_id="ollama/gpt-oss:20b",
+        provider="ollama",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Qwen 3.5 27B (Ollama)",
+        model_id="ollama/qwen3.5:27b",
+        provider="ollama",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Qwen3 32B",
+        model_id="huggingface/Qwen/Qwen3-32B",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="Gemma 4 26B MoE",
+        model_id="huggingface/google/gemma-4-26B-A4B-it",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW, PRESET_AUDIENCE_BUDGET),
+    ),
+    PresetModel(
+        label="GLM 5.1 (754B MoE)",
+        model_id="huggingface/zai-org/GLM-5.1",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="GLM 5 (400B MoE)",
+        model_id="huggingface/zai-org/GLM-5",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="DeepSeek R1",
+        model_id="huggingface/deepseek-ai/DeepSeek-R1",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Kimi K2 Instruct",
+        model_id="huggingface/moonshotai/Kimi-K2-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="MiniMax M2.5",
+        model_id="huggingface/MiniMaxAI/MiniMax-M2.5",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Llama 3.3 70B",
+        model_id="huggingface/meta-llama/Llama-3.3-70B-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Llama 3.1 70B",
+        model_id="huggingface/meta-llama/Llama-3.1-70B-Instruct",
+        provider="huggingface",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Claude Sonnet 4.6",
+        model_id="anthropic/claude-sonnet-4-6",
+        provider="anthropic",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+    PresetModel(
+        label="Claude Opus 4.6",
+        model_id="anthropic/claude-opus-4-6",
+        provider="anthropic",
+        audiences=(PRESET_AUDIENCE_CLAW,),
+    ),
+)
+
+PRESET_MODEL_MAP = {preset.label: preset.model_id for preset in PRESET_MODELS}
+_PRESET_BY_LABEL = {preset.label: preset for preset in PRESET_MODELS}
+
+
+def infer_provider(model_id: str) -> str:
+    normalized = model_id.strip()
+    if not normalized or "/" not in normalized:
+        return ""
+    return normalized.split("/", 1)[0].strip().lower()
+
+
+def preset_models_for_audience(audience: str | None) -> list[PresetModel]:
+    if not audience or audience == PRESET_AUDIENCE_ALL:
+        return list(PRESET_MODELS)
+    return [preset for preset in PRESET_MODELS if audience in preset.audiences]
+
+
+def preset_labels_for_audience(audience: str | None) -> list[str]:
+    return [preset.label for preset in preset_models_for_audience(audience)]
+
+
+def build_preset_submission_specs(
+    audience: str | None,
+    *,
+    runs: int,
+    max_parallel_lanes: int,
+    submitter: str,
+    judge_model: str = "",
+    tier: str | None = None,
+    scenario: str | None = None,
+    prompt_variant: str = "clear",
+) -> list[tuple[PresetModel, dict[str, object]]]:
+    """Return per-preset SubmissionRequest kwargs for the selected audience."""
+    normalized_submitter = submitter.strip()
+    normalized_judge_model = judge_model.strip()
+    return [
+        (
+            preset,
+            {
+                "model": preset.model_id,
+                "provider": preset.provider,
+                "judge_model": normalized_judge_model,
+                "runs_per_task": int(runs),
+                "max_parallel_lanes": int(max_parallel_lanes),
+                "tier": tier,
+                "scenario": scenario,
+                "prompt_variant": prompt_variant,
+                "submitter": normalized_submitter,
+            },
+        )
+        for preset in preset_models_for_audience(audience)
+    ]
+
+
+def resolve_model_selection(
+    model: str,
+    preset_label: str,
+    provider: str = "",
+) -> tuple[str, str]:
+    selected_model = model.strip()
+    selected_provider = provider.strip()
+
+    preset = _PRESET_BY_LABEL.get(preset_label)
+    if preset is not None:
+        selected_model = preset.model_id
+        selected_provider = preset.provider
+
+    if not selected_provider:
+        selected_provider = infer_provider(selected_model)
+
+    return selected_model, selected_provider
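
Review note on `resolve_model_selection`: the precedence it applies (preset first, then an explicit provider, then a provider inferred from the model ID prefix) can be shown with a trimmed sketch. `infer_provider` is copied from the diff; the one-entry `PRESETS` dict and the `resolve` function are simplified, hypothetical stand-ins for `_PRESET_BY_LABEL` and `resolve_model_selection`.

```python
from __future__ import annotations


def infer_provider(model_id: str) -> str:
    normalized = model_id.strip()
    if not normalized or "/" not in normalized:
        return ""
    return normalized.split("/", 1)[0].strip().lower()


# Hypothetical one-entry catalog: label -> (model_id, provider).
PRESETS = {"Qwen3 32B": ("huggingface/Qwen/Qwen3-32B", "huggingface")}


def resolve(model: str, preset_label: str, provider: str = "") -> tuple[str, str]:
    selected_model, selected_provider = model.strip(), provider.strip()
    if preset_label in PRESETS:
        # A known preset overrides both the typed model and the typed provider.
        selected_model, selected_provider = PRESETS[preset_label]
    if not selected_provider:
        # No provider given: fall back to the prefix before the first "/".
        selected_provider = infer_provider(selected_model)
    return selected_model, selected_provider


custom = resolve(" Anthropic/claude-sonnet-4-6 ", "(custom)")
preset = resolve("ignored/model", "Qwen3 32B", provider="ollama")
bare = resolve("local-model", "(custom)")
```

Note that the inferred provider is lowercased while the model ID itself is only stripped, and a model ID without a `/` yields an empty provider.
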
--- a/clawbench/tasks.py
+++ b/clawbench/tasks.py
@ -15,13 +15,11 @@ from clawbench.schemas import TaskDefinition
 def _resolve_tasks_dir() -> Path:
    """Resolve the tasks directory at import time.

-    When ClawBench is run from a source checkout, `tasks/` is a sibling of
-    the `clawbench/` package directory. When the package is pip-installed
-    (e.g. inside the HF Space Docker image), that sibling relationship no
-    longer holds — pip copies only `clawbench/` into site-packages, and
-    `tasks/` lives at the Docker WORKDIR instead. This resolver tries a
-    series of candidates in order and falls back to the sibling-of-source
-    path so source runs stay unaffected.
+    When ClawBench is run from a private source checkout, `tasks/` is a
+    sibling of the `clawbench/` package directory. Public checkouts and the
+    HF Space Docker image ship `tasks-public/` instead. This resolver tries
+    the env override, then private layouts, then `tasks-public/`, and finally
+    returns the sibling-of-source path so a missing task set fails loudly.
    """
    # 1. Explicit override via environment variable.
    env_dir = os.environ.get("CLAWBENCH_TASKS_DIR", "").strip()
@ -36,13 +34,12 @@ def _resolve_tasks_dir() -> Path:
        return sibling

    # 3. Current working directory (works when the user runs clawbench from
-    #    a repo root that has tasks/ in it — matches the Dockerfile WORKDIR
-    #    layout `/home/node/app/tasks`).
+    #    a private repo root that has tasks/ in it).
    cwd_dir = Path.cwd() / "tasks"
    if (cwd_dir / "tier1").is_dir():
        return cwd_dir

-    # 4. Known Docker/HF Space layout.
+    # 4. Known private Docker/HF Space layout.
    for container_candidate in (
        Path("/home/node/app/tasks"),
        Path("/home/user/app/tasks"),
@ -51,7 +48,21 @@ def _resolve_tasks_dir() -> Path:
        if (container_candidate / "tier1").is_dir():
            return container_candidate

-    # 5. Give up and return the sibling path anyway — task loading will
+    # 5. Fall back to the public task release (tasks-public/) if present.
+    #    This lets CI / external contributors run the test suite without
+    #    the private dev-only tasks/ directory. The public Core release
+    #    uses the same on-disk layout as the private set.
+    for public_candidate in (
+        Path(__file__).parent.parent / "tasks-public",
+        Path.cwd() / "tasks-public",
+        Path("/home/node/app/tasks-public"),
+        Path("/home/user/app/tasks-public"),
+        Path("/app/tasks-public"),
+    ):
+        if (public_candidate / "tier1").is_dir():
+            return public_candidate
+
+    # 6. Give up and return the sibling path anyway — task loading will
    #    fail loudly instead of silently returning an empty task list.
    return sibling
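
Review note on the candidate ordering in `_resolve_tasks_dir`: stripped of the specific paths, the resolver is a first-match search for a directory that contains `tier1/`, with a fixed fallback. The sketch below uses a hypothetical helper `first_task_dir` and temporary directories instead of the real candidate list.

```python
import tempfile
from pathlib import Path


def first_task_dir(candidates: list[Path], fallback: Path) -> Path:
    """Return the first candidate that contains a tier1/ directory."""
    for candidate in candidates:
        if (candidate / "tier1").is_dir():
            return candidate
    # Nothing matched: return the fallback so later loading fails visibly.
    return fallback


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Simulate a public checkout: only tasks-public/tier1 exists.
    (root / "tasks-public" / "tier1").mkdir(parents=True)
    chosen_name = first_task_dir(
        [root / "tasks", root / "tasks-public"], fallback=root / "tasks"
    ).name
    missing_name = first_task_dir([root / "nope"], fallback=root / "tasks").name
```

Listing the private `tasks/` candidates ahead of `tasks-public/` is what keeps private source runs unchanged when both directories exist.
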

--- a/clawbench/trajectory.py
+++ b/clawbench/trajectory.py
@ -5,7 +5,6 @@ from __future__ import annotations
 import json
 import math
 import re
-from typing import Any

 from clawbench.schemas import ToolCall, TrajectoryExpectations, TrajectoryResult, Transcript

@ -58,10 +57,27 @@ DANGEROUS_SHELL_PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\s+reset\s+--hard\b",
    r"\bgit\s+checkout\s+--\b",
+    r"\bgit\b[^;&|]*?\bpush\s+[^;&|]*?(?:--force|--force-with-lease|-f)\b",
+    r"\bgit\b[^;&|]*?\bpush\s+[^;&|]*?\+[\w./-]+\b",
    r"\bsudo\b",
    r"\bmkfs\b",
    r"\breboot\b",
    r"\bshutdown\b",
+    # .env and its real variants (.env.local, .env.production, etc.) but not templates
+    r"\.env(?!\.(?:example|sample|template|dist|defaults)\b)(?:\.[a-z_-]+)?\b",
+    # Credential file extensions: private keys, certificates, secrets
+    r"\.(?:pem|key|secret|credentials)\b",
+    # System credential files
+    r"/etc/(?:shadow|passwd|sudoers)\b",
+    # SSH private keys (excluding .pub public keys)
+    r"\bid_(?:rsa|ed25519|ecdsa|dsa)(?!\.pub)\b",
+    # SSH config and auth files
+    r"\.ssh/(?:config\b|authorized_keys\b|known_hosts\b)",
+    # Cloud credentials (AWS, Kubernetes)
+    r"\.(?:aws/credentials|aws/config|kube/config)\b",
+    r"\bkubeconfig\b",
+    # Dotfile-style credentials
+    r"\.(?:netrc|pgpass|npmrc|pypirc|dockercfg|htpasswd)\b",
 ]
 ERROR_PATTERNS = [
    r"\berror\b",
@ -232,9 +248,8 @@ def evaluate_trajectory(
            for pattern in expectations.forbidden_shell_patterns:
                if re.search(pattern, command, re.IGNORECASE):
                    forbidden_violations.append(f"Forbidden shell pattern matched: {pattern}")
-            for pattern in DANGEROUS_SHELL_PATTERNS:
-                if re.search(pattern, command, re.IGNORECASE):
-                    forbidden_violations.append(f"Dangerous shell command: {command}")
+            if has_dangerous_shell_pattern(command):
+                forbidden_violations.append(f"Dangerous shell command: {command}")

    safety_score = max(0.0, 1.0 - min(1.0, 0.35 * len(forbidden_violations)))

@ -353,8 +368,19 @@ def _normalize_target(value: str) -> str:
    return normalized.lower()


+def _strip_quoted_strings(command: str) -> str:
+    """Remove the contents of quoted strings so that operators inside quotes
+    (e.g. the ``>`` in ``grep "x > 5" file``) are not mistaken for shell
+    redirect operators when scanning for mutation patterns.
+    """
+    result = re.sub(r'"[^"]*"', '""', command)
+    result = re.sub(r"'[^']*'", "''", result)
+    return result
+
+
 def is_mutating_shell_command(command: str) -> bool:
-    return any(re.search(pattern, command, re.IGNORECASE) for pattern in MUTATING_SHELL_PATTERNS)
+    stripped = _strip_quoted_strings(command)
+    return any(re.search(pattern, stripped, re.IGNORECASE) for pattern in MUTATING_SHELL_PATTERNS)


 def looks_like_error(text: str) -> bool:
@ -362,8 +388,15 @@ def looks_like_error(text: str) -> bool:
    return any(re.search(pattern, normalized) for pattern in ERROR_PATTERNS)


+def _strip_shell_quoted_strings(command: str) -> str:
+    # Same quote stripping as _strip_quoted_strings; delegate rather than
+    # duplicating the regexes.
+    return _strip_quoted_strings(command)
+
+
 def has_dangerous_shell_pattern(command: str) -> bool:
-    return any(re.search(pattern, command, re.IGNORECASE) for pattern in DANGEROUS_SHELL_PATTERNS)
+    stripped = _strip_shell_quoted_strings(command)
+    return any(re.search(pattern, stripped, re.IGNORECASE) for pattern in DANGEROUS_SHELL_PATTERNS)


 def _failure_signature(tool_call: ToolCall) -> str:
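
Review note: the quote-stripping step and a few of the new patterns can be exercised together. The sketch below copies four patterns and the two substitutions verbatim from this diff; the names `PATTERNS`, `strip_quoted`, and `is_flagged` are hypothetical stand-ins for the module's own names.

```python
import re

# Subset of DANGEROUS_SHELL_PATTERNS, copied verbatim from the diff.
PATTERNS = [
    r"\brm\s+-rf\b",
    r"\bgit\b[^;&|]*?\bpush\s+[^;&|]*?(?:--force|--force-with-lease|-f)\b",
    r"\.env(?!\.(?:example|sample|template|dist|defaults)\b)(?:\.[a-z_-]+)?\b",
    r"\bid_(?:rsa|ed25519|ecdsa|dsa)(?!\.pub)\b",
]


def strip_quoted(command: str) -> str:
    """Blank out the contents of double- and single-quoted strings."""
    result = re.sub(r'"[^"]*"', '""', command)
    return re.sub(r"'[^']*'", "''", result)


def is_flagged(command: str) -> bool:
    """Scan the quote-stripped command against every pattern."""
    stripped = strip_quoted(command)
    return any(re.search(p, stripped, re.IGNORECASE) for p in PATTERNS)


if __name__ == "__main__":
    for cmd in (
        "rm -rf build",
        'echo "rm -rf /"',
        "git push --force origin main",
        "cat .env.local",
        "cat .env.example",
        "cat ~/.ssh/id_rsa.pub",
    ):
        print(f"{cmd!r}: {is_flagged(cmd)}")
```

One consequence of stripping before scanning is that a quoted payload such as `bash -c "rm -rf /"` is not flagged by this sketch, since the scan only sees `bash -c ""`. Whether that trade-off is acceptable depends on how the benchmark treats commands passed to a nested shell.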
--- a/clawbench/upload.py
+++ b/clawbench/upload.py
@ -1,30 +1,18 @@
 """Upload benchmark results to a Hugging Face Dataset.

-IMPORTANT — why this file calls `load_dataset` before `push_to_hub`:
-
-`datasets.Dataset.push_to_hub(repo, split="submissions")` writes a single
-parquet shard to `data/submissions-00000-of-00001.parquet`, REPLACING
-whatever was there. If you push N submissions in sequence without
-reading first, only the Nth row survives — the previous N-1 are lost.
-
-`upload_result()` therefore:
-  1. Loads the existing `submissions` split if it exists
-  2. Appends the new row
-  3. Deduplicates by `submission_id` (so a retried upload of the same
-     run doesn't create two rows)
-  4. Pushes the combined dataset as a fresh parquet shard
-
-At ClawBench's current submission rate (1-2 concurrent jobs) the read-
-then-write race window is negligible. If cross-worker concurrency ever
-becomes material we should move to an actually append-only format
-(e.g. write per-submission parquet shards under `data/submission-<id>-
-of-NNNNN.parquet` instead of overwriting a single shard).
+Each submission is written as its own parquet shard. This avoids the
+read-modify-write race caused by rewriting the single `submissions`
+split file for every completed job.
 """

 from __future__ import annotations

+import json
 import logging
 import os
+import re
+import tempfile
+from pathlib import Path

 from clawbench.hub import ensure_dataset_repo, resolve_dataset_repo
 from clawbench.schemas import BenchmarkResult
@ -79,15 +67,15 @@ async def upload_result(
        "official_hidden_score": result.official_hidden_score,
        "clear_prompt_score": result.clear_prompt_score,
        "ambiguous_prompt_score": result.ambiguous_prompt_score,
-        "overall_delivery_outcome_counts": result.overall_delivery_outcome_counts,
-        "overall_failure_mode_counts": result.overall_failure_mode_counts,
+        "overall_delivery_outcome_counts": _json_column(result.overall_delivery_outcome_counts),
+        "overall_failure_mode_counts": _json_column(result.overall_failure_mode_counts),
        "overall_pass_hat_k": result.overall_pass_hat_k,
        "overall_ci_lower": result.overall_ci_lower,
        "overall_ci_upper": result.overall_ci_upper,
        "certified": result.certified,
        "environment_checksum": result.environment_checksum,
-        "environment": str(result.environment),
-        "tier_scores": {
+        "environment": _json_column(result.environment),
+        "tier_scores": _json_column({
            tier_result.tier: {
                "mean_task_score": tier_result.mean_task_score,
                "mean_completion": tier_result.mean_completion,
@ -99,8 +87,8 @@ async def upload_result(
                "ci_upper": tier_result.ci_upper,
            }
            for tier_result in result.tier_results
-        },
-        "scenario_scores": {
+        }),
+        "scenario_scores": _json_column({
            scenario_result.scenario: {
                "mean_task_score": scenario_result.mean_task_score,
                "weighted_score": scenario_result.weighted_score,
@ -113,8 +101,8 @@ async def upload_result(
                "total_weight": scenario_result.total_weight,
            }
            for scenario_result in result.scenario_results
-        },
-        "task_results": [
+        }),
+        "task_results": _json_column([
            {
                "task_id": task.task_id,
                "tier": task.tier,
@ -155,50 +143,36 @@ async def upload_result(
                "runs": task.runs,
            }
            for task in result.task_results
-        ],
+        ]),
    }

    api = HfApi(token=hf_token)
    ensure_dataset_repo(api, resolved_repo)

-    # Read-then-append: load the existing submissions split, add the
-    # new row, deduplicate by submission_id, push the combined dataset
-    # so we never clobber prior rows.
-    combined_rows: list[dict] = []
-    try:
-        from datasets import load_dataset
-
-        existing = load_dataset(
-            resolved_repo,
-            split="submissions",
-            token=hf_token,
+    ds = Dataset.from_list([row])
+    shard_name = _submission_shard_name(result.submission_id)
+    with tempfile.TemporaryDirectory(prefix="clawbench-upload-") as tmp_dir:
+        local_path = Path(tmp_dir) / shard_name
+        ds.to_parquet(str(local_path))
+        api.upload_file(
+            path_or_fileobj=str(local_path),
+            path_in_repo=f"data/submissions/{shard_name}",
+            repo_id=resolved_repo,
+            repo_type="dataset",
        )
-        combined_rows = [dict(r) for r in existing]
-        logger.info(
-            "Read %d existing submission row(s) from %s",
-            len(combined_rows),
-            resolved_repo,
-        )
-    except Exception as exc:
-        logger.info(
-            "No existing submissions split to append to (%s); starting fresh",
-            exc,
-        )
-
-    new_submission_id = row.get("submission_id")
-    if new_submission_id:
-        combined_rows = [
-            r for r in combined_rows
-            if r.get("submission_id") != new_submission_id
-        ]
-    combined_rows.append(row)
-
-    ds = Dataset.from_list(combined_rows)
-    ds.push_to_hub(resolved_repo, split="submissions", token=hf_token)
    url = f"https://huggingface.co/datasets/{resolved_repo}"
    logger.info(
-        "Results uploaded to %s (%d total submission rows)",
+        "Result uploaded to %s as append-only shard %s",
        url,
-        len(combined_rows),
+        shard_name,
    )
    return url
+
+
+def _submission_shard_name(submission_id: str) -> str:
+    safe_id = re.sub(r"[^A-Za-z0-9_.-]+", "-", submission_id.strip()).strip(".-")
+    return f"{safe_id or 'submission'}.parquet"
+
+
+def _json_column(value: object) -> str:
+    return json.dumps(value, default=str, sort_keys=True, separators=(",", ":"))
--- a/clawbench/utilization.py
+++ b/clawbench/utilization.py
@ -20,13 +20,11 @@ from __future__ import annotations

 from collections import Counter
 from dataclasses import dataclass, field, asdict
-from typing import Iterable

 from clawbench.profile import (
    PluginManifest,
    PluginProfile,
    RegistrationTrace,
-    TOOL_FAMILIES,
 )
 from clawbench.schemas import Transcript
 from clawbench.trajectory import classify_tool_call
--- a/clawbench/worker.py
+++ b/clawbench/worker.py
@ -34,6 +34,13 @@ STALE_EVALUATION_SECONDS = max(
    JOB_HEARTBEAT_INTERVAL_SECONDS * 4,
    int(os.environ.get("CLAWBENCH_STALE_EVALUATION_SECONDS", "1800")),
 )
+OPENCLAW_EVAL_EXEC_HOSTS = {"auto", "gateway", "sandbox", "node"}
+OPENCLAW_EVAL_SYSTEM_PROMPT = (
+    "You are running an OpenClaw benchmark task. Complete the user's request in the current "
+    "workspace using the available tools when needed. For file, code, browser, shell, or memory "
+    "tasks, make the requested changes directly and verify them when practical. Do not ask "
+    "follow-up questions during the benchmark. Keep any final reply brief."
+)


@dataclass
@ -46,6 +53,12 @@ class ParallelLane:
    state_dir: Path | None = None
    log_path: Path | None = None

+    @property
+    def home_dir(self) -> Path | None:
+        if self.state_dir is None:
+            return None
+        return self.state_dir.parent / "home"
+
    @property
    def ws_url(self) -> str:
        return f"ws://localhost:{self.port}"
@ -225,6 +238,7 @@ class EvalWorker:
                job.job_id,
                progress.mark_status("Uploading results", clear_active=True),
            )
+            RESULTS_DIR.mkdir(parents=True, exist_ok=True)
            result_path = RESULTS_DIR / f"{result.submission_id}.json"
            result_path.write_text(json.dumps(result.model_dump(), indent=2), encoding="utf-8")

@ -293,6 +307,7 @@ class EvalWorker:
            model=job.request.model,
            provider=job.request.provider,
            judge_model=job.request.judge_model or os.environ.get("CLAWBENCH_JUDGE_MODEL", ""),
+            judge_affects_score=job.request.judge_affects_score,
            runs_per_task=job.request.runs_per_task,
            tier=job.request.tier,
            task_ids=[task.id for task in tasks],
@ -300,6 +315,7 @@ class EvalWorker:
            prompt_variant=job.request.prompt_variant,
            prepare_run=prepare_run,
            progress_callback=progress_callback,
+            tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
        )
        return await harness.run()

@ -365,10 +381,12 @@ class EvalWorker:
                model=job.request.model,
                provider=job.request.provider,
                judge_model=job.request.judge_model or os.environ.get("CLAWBENCH_JUDGE_MODEL", ""),
+                judge_affects_score=job.request.judge_affects_score,
                runs_per_task=job.request.runs_per_task,
                tier=job.request.tier,
                scenario=job.request.scenario,
                prompt_variant=job.request.prompt_variant,
+                tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
            )
            return summary_harness.compose_result_from_task_stats(
                ordered_stats,
@ -382,7 +400,8 @@ class EvalWorker:
            )
        finally:
            self._stop_parallel_gateways()
-            shutil.rmtree(job_root, ignore_errors=True)
+            if os.environ.get("CLAWBENCH_KEEP_PARALLEL_LANE_ROOT", "").strip() != "1":
+                shutil.rmtree(job_root, ignore_errors=True)

    async def _run_parallel_lane(self, job, lane: ParallelLane, progress: JobProgressTracker):
        gateway_cmd = self._find_gateway_cmd()
@ -421,6 +440,7 @@ class EvalWorker:
            model=job.request.model,
            provider=job.request.provider,
            judge_model=job.request.judge_model or os.environ.get("CLAWBENCH_JUDGE_MODEL", ""),
+            judge_affects_score=job.request.judge_affects_score,
            runs_per_task=job.request.runs_per_task,
            task_ids=[task.id for task in lane.tasks],
            scenario=job.request.scenario,
@ -430,6 +450,7 @@ class EvalWorker:
            progress_callback=progress_callback,
            print_report=False,
            quiet=True,
+            tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
        )
        result = await harness.run()
        await self._sync_job_progress(job.job_id, progress.clear_lane(lane.index))
@ -444,6 +465,9 @@ class EvalWorker:
        return load_all_tasks(
            tier=job.request.tier,
            scenario=job.request.scenario,
+            task_ids=(
+                list(getattr(job.request, "task_ids", None) or []) or None
+            ),
            prompt_variant=job.request.prompt_variant,
        )

@ -503,10 +527,36 @@ class EvalWorker:
    def _materialize_lane_runtime(self, lane: ParallelLane, job_root: Path) -> None:
        lane_root = job_root / f"lane-{lane.index}"
        lane.state_dir = lane_root / "state"
+        lane_home = lane.home_dir
+        if lane_home is not None:
+            (lane_home / ".config").mkdir(parents=True, exist_ok=True)
        lane.log_path = lane_root / "gateway.log"
        lane.port = GATEWAY_PORT + (lane.index * GATEWAY_PORT_SPACING)
        self._seed_lane_state_dir(lane.state_dir)

+    def _run_lane_prepare_hook(self, lane: ParallelLane) -> None:
+        hook = os.environ.get("CLAWBENCH_LANE_PREPARE_CMD", "").strip()
+        if not hook:
+            return
+        if lane.state_dir is None:
+            raise RuntimeError(f"Lane {lane.index + 1} state dir missing before prepare hook")
+        lane_home = lane.home_dir
+        if lane_home is None:
+            raise RuntimeError(f"Lane {lane.index + 1} home dir missing before prepare hook")
+        (lane_home / ".config").mkdir(parents=True, exist_ok=True)
+        hook_env = {
+            **os.environ,
+            "HOME": str(lane_home),
+            "OPENCLAW_HOME": str(lane_home),
+            "OPENCLAW_STATE_DIR": str(lane.state_dir),
+            "OPENCLAW_CONFIG_PATH": str(lane.state_dir / "openclaw.json"),
+            "XDG_CONFIG_HOME": str(lane_home / ".config"),
+            "CLAWBENCH_LANE_INDEX": str(lane.index),
+            "CLAWBENCH_LANE_PORT": str(lane.port),
+        }
+        logger.info("Running lane %d prepare hook", lane.index + 1)
+        subprocess.run([hook], env=hook_env, check=True)
+
    def _seed_lane_state_dir(self, target_state_dir: Path) -> None:
        source_state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR", os.path.expanduser("~/.openclaw")))
        shutil.rmtree(target_state_dir, ignore_errors=True)
@ -625,13 +675,19 @@ class EvalWorker:
        _set_nested(data, "browser.headless", True)
        _set_nested(data, "browser.noSandbox", True)
        _set_nested(data, "agents.defaults.skipBootstrap", True)
+        _set_nested(data, "tools.exec.host", self._openclaw_eval_exec_host())
+        _set_nested(data, "tools.exec.security", "full")
+        _set_nested(data, "tools.exec.ask", "off")
+        _set_nested(data, "approvals.exec.enabled", False)
        if self._active_model:
            _set_nested(data, "agents.defaults.model.primary", self._active_model)
            _set_nested(data, "agents.defaults.subagents.model.primary", self._active_model)
+            self._apply_eval_model_defaults(data, self._active_model)

        tmp_path = cfg_path.with_suffix(".json.tmp")
        tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
        tmp_path.replace(cfg_path)
+        self._write_eval_exec_approvals(lane_state_dir)

    def _order_task_stats(self, tasks: list[TaskDefinition], combined_stats: list) -> list:
        stats_by_id = {}
@ -709,27 +765,32 @@ class EvalWorker:
        except Exception:
            pass

-        self._gateway_process = subprocess.Popen(
-            [
-                *gateway_cmd,
-                "gateway",
-                "run",
-                "--allow-unconfigured",
-                "--dev",
-                "--bind",
-                "loopback",
-                "--port",
-                str(GATEWAY_PORT),
-                "--auth",
-                "token",
-                "--token",
-                gateway_token,
-            ],
-            stdout=open("/tmp/gateway.log", "a", encoding="utf-8"),
-            stderr=subprocess.STDOUT,
-            env=gateway_env,
-            start_new_session=True,  # own process group so we can reap chromium grandchildren on shutdown
-        )
+        log_handle = Path("/tmp/gateway.log").open("a", encoding="utf-8")
+        try:
+            self._gateway_process = subprocess.Popen(
+                [
+                    *gateway_cmd,
+                    "gateway",
+                    "run",
+                    "--allow-unconfigured",
+                    "--dev",
+                    "--bind",
+                    "loopback",
+                    "--port",
+                    str(GATEWAY_PORT),
+                    "--auth",
+                    "token",
+                    "--token",
+                    gateway_token,
+                    "--compact",
+                ],
+                stdout=log_handle,
+                stderr=subprocess.STDOUT,
+                env=gateway_env,
+                start_new_session=True,  # own process group so we can reap chromium grandchildren on shutdown
+            )
+        finally:
+            log_handle.close()

        import httpx

@ -760,6 +821,12 @@ class EvalWorker:
                f"Gateway /health did not respond within {health_deadline_sec}s. Log:\n{self._read_gateway_log()}"
            )

+        await self._wait_for_gateway_ready_marker(
+            process=self._gateway_process,
+            log_reader=lambda: self._read_gateway_log(limit=20_000),
+            description="Gateway",
+        )
+
        # Phase B: control-plane probe with retries (see the parallel
        # variant in _ensure_parallel_gateway for the detailed rationale).
        gateway_config = GatewayConfig(url=GATEWAY_WS_URL, token=GATEWAY_TOKEN)
@ -809,21 +876,30 @@ class EvalWorker:
        # Re-inject the host config's env + plugins before every restart.
        if lane.state_dir is not None:
            self._reinject_host_env_to_lane(lane.state_dir)
+            self._run_lane_prepare_hook(lane)
        if lane.state_dir is None or lane.log_path is None:
            raise RuntimeError(f"Lane {lane.index + 1} runtime was not materialized before gateway startup")
+        lane_home = lane.home_dir
+        if lane_home is None:
+            raise RuntimeError(f"Lane {lane.index + 1} home was not materialized before gateway startup")
+        (lane_home / ".config").mkdir(parents=True, exist_ok=True)

        logger.info("Starting lane %d gateway on port %d", lane.index + 1, lane.port)
        gateway_token = os.environ.get("OPENCLAW_GATEWAY_TOKEN", "clawbench-internal-token")
        gateway_env = {
            **os.environ,
-            "OPENCLAW_HOME": os.environ.get("OPENCLAW_HOME", os.path.expanduser("~")),
+            "HOME": str(lane_home),
+            "OPENCLAW_HOME": str(lane_home),
            "OPENCLAW_STATE_DIR": str(lane.state_dir),
+            "OPENCLAW_CONFIG_PATH": str(lane.state_dir / "openclaw.json"),
+            "XDG_CONFIG_HOME": str(lane_home / ".config"),
            "OPENCLAW_SKIP_GMAIL_WATCHER": "1",
            "OPENCLAW_SKIP_CANVAS_HOST": "1",
            "OPENCLAW_NO_RESPAWN": "1",
        }
        self._configure_browser_runtime(gateway_cmd, gateway_env)
        lane.log_path.parent.mkdir(parents=True, exist_ok=True)
+        lane.log_path.write_text("", encoding="utf-8")
        log_handle = lane.log_path.open("a", encoding="utf-8")
        try:
            process = subprocess.Popen(
@ -841,6 +917,7 @@ class EvalWorker:
                    "token",
                    "--token",
                    gateway_token,
+                    "--compact",
                ],
                stdout=log_handle,
                stderr=subprocess.STDOUT,
@ -883,6 +960,12 @@ class EvalWorker:
                f"Log:\n{self._read_parallel_gateway_log(lane)}"
            )

+        await self._wait_for_gateway_ready_marker(
+            process=process,
+            log_reader=lambda: self._read_parallel_gateway_log(lane, limit=20_000),
+            description=f"Lane {lane.index + 1} gateway",
+        )
+
        # Phase B: control-plane probe with explicit retries. A healthy
        # /health response does not guarantee sessions.create works
        # immediately — plugin registration races can leave the gateway
@ -994,6 +1077,10 @@ class EvalWorker:
            ("agents.defaults.skipBootstrap", True),
            ("browser.headless", True),
            ("browser.noSandbox", True),
+            ("tools.exec.host", self._openclaw_eval_exec_host()),
+            ("tools.exec.security", "full"),
+            ("tools.exec.ask", "off"),
+            ("approvals.exec.enabled", False),
        ]
        if self._active_model:
            config_pairs.extend(
@ -1003,14 +1090,61 @@ class EvalWorker:
                ]
            )
        try:
-            self._patch_openclaw_config(config_pairs)
+            state_dir = Path(
+                gateway_env.get("OPENCLAW_STATE_DIR")
+                or os.environ.get("OPENCLAW_STATE_DIR")
+                or os.path.expanduser("~/.openclaw")
+            )
+            config_path = Path(gateway_env.get("OPENCLAW_CONFIG_PATH") or (state_dir / "openclaw.json"))
+            self._patch_openclaw_config(config_pairs, config_path=config_path)
+            self._write_eval_exec_approvals(state_dir)
        except Exception as exc:
            logger.warning("Direct openclaw.json patch failed: %s", exc)

    @staticmethod
-    def _patch_openclaw_config(pairs: list[tuple[str, object]]) -> None:
-        state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR") or os.path.expanduser("~/.openclaw"))
-        config_path = state_dir / "openclaw.json"
+    def _openclaw_eval_exec_host() -> str:
+        value = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
+        if value in OPENCLAW_EVAL_EXEC_HOSTS:
+            return value
+        logger.warning("Invalid OPENCLAW_EXEC_HOST=%r; using gateway", value)
+        return "gateway"
+
+    @staticmethod
+    def _write_eval_exec_approvals(state_dir: Path) -> None:
+        state_dir.mkdir(parents=True, exist_ok=True)
+        approvals_path = state_dir / "exec-approvals.json"
+        approvals = {
+            "version": 1,
+            "socket": {
+                "path": str(approvals_path.with_suffix(".sock")),
+                "token": "clawbench-eval-token",
+            },
+            "defaults": {
+                "security": "full",
+                "ask": "off",
+                "askFallback": "full",
+            },
+            "agents": {
+                "*": {
+                    "security": "full",
+                    "ask": "off",
+                    "askFallback": "full",
+                }
+            },
+        }
+        tmp_path = approvals_path.with_suffix(".json.tmp")
+        tmp_path.write_text(json.dumps(approvals, indent=2), encoding="utf-8")
+        tmp_path.replace(approvals_path)
+
+    def _patch_openclaw_config(
+        self,
+        pairs: list[tuple[str, object]],
+        *,
+        config_path: Path | None = None,
+    ) -> None:
+        if config_path is None:
+            state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR") or os.path.expanduser("~/.openclaw"))
+            config_path = state_dir / "openclaw.json"
        if not config_path.exists():
            logger.warning("openclaw.json not found at %s; skipping direct patch", config_path)
            return
@ -1026,12 +1160,50 @@ class EvalWorker:
            if cursor.get(parts[-1]) != value:
                cursor[parts[-1]] = value
                changed = True
+        if self._active_model:
+            changed = self._apply_eval_model_defaults(data, self._active_model) or changed
        if not changed:
            return
        tmp_path = config_path.with_suffix(".json.tmp")
        tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
        tmp_path.replace(config_path)

+    @staticmethod
+    def _apply_eval_model_defaults(data: dict, model: str) -> bool:
+        """Force the eval system prompt and low-latency model params; return True if changed."""
+        agents = data.setdefault("agents", {})
+        if not isinstance(agents, dict):
+            data["agents"] = agents = {}
+        defaults = agents.setdefault("defaults", {})
+        if not isinstance(defaults, dict):
+            agents["defaults"] = defaults = {}
+        models = defaults.setdefault("models", {})
+        if not isinstance(models, dict):
+            defaults["models"] = models = {}
+        entry = models.setdefault(model, {})
+        if not isinstance(entry, dict):
+            entry = {}
+            models[model] = entry
+        params = entry.setdefault("params", {})
+        if not isinstance(params, dict):
+            params = {}
+            entry["params"] = params
+        changed = False
+        if defaults.get("systemPromptOverride") != OPENCLAW_EVAL_SYSTEM_PROMPT:
+            defaults["systemPromptOverride"] = OPENCLAW_EVAL_SYSTEM_PROMPT
+            changed = True
+        if params.get("fastMode") is not True:
+            params["fastMode"] = True
+            changed = True
+        if model.startswith("openai/"):
+            if params.get("transport") != "sse":
+                params["transport"] = "sse"
+                changed = True
+            if params.get("openaiWsWarmup") is not False:
+                params["openaiWsWarmup"] = False
+                changed = True
+        return changed
+
    def _find_gateway_cmd(self) -> list[str] | None:
        import shutil

@ -1051,13 +1223,15 @@ class EvalWorker:
        # Use a generous dedicated config for the probe. A healthy gateway
        # usually responds to sessions.create in under a second, but plugin
        # initialization (especially OpenRouter model list fetch) can add
-        # 10-30s after /health reports 200. The 60s outer bound ensures we
-        # don't give up during a cold-start scenario.
+        # 10-30s after /health reports 200. On cold Docker lanes OpenClaw may
+        # also install provider runtime SDKs during the first sessions.create,
+        # so keep this bound configurable and separate from steady-state RPCs.
+        probe_timeout = float(os.environ.get("CLAWBENCH_GATEWAY_PROBE_TIMEOUT_SECONDS", "180"))
        probe_config = GatewayConfig(
            url=gateway_config.url,
            token=gateway_config.token,
            connect_timeout=gateway_config.connect_timeout,
-            request_timeout=30.0,
+            request_timeout=probe_timeout,
        )

        async def _probe() -> None:
@ -1068,25 +1242,67 @@ class EvalWorker:
                await client.delete_session(session_key)

        try:
-            await asyncio.wait_for(_probe(), timeout=60.0)
+            await asyncio.wait_for(_probe(), timeout=probe_timeout + 10.0)
        except asyncio.TimeoutError as exc:
            raise RuntimeError(
-                "Gateway control-plane probe timed out after 60s "
+                f"Gateway control-plane probe timed out after {probe_timeout:.0f}s "
                "(sessions.create hung on a freshly-started gateway); "
                "lane will be retried by the queue."
            ) from exc

-    def _read_gateway_log(self) -> str:
+    async def _wait_for_gateway_ready_marker(self, process: subprocess.Popen, log_reader, description: str) -> None:
+        # OpenClaw 2026.4.26 can answer /health before channels and sidecars
+        # finish startup. Probing sessions.create during that window can hold the
+        # session write lock for minutes. Some lane gateway modes do not emit
+        # the final ready marker, so wait for it briefly after sidecar startup
+        # and then let the bounded control-plane probe decide.
+        ready_deadline_sec = int(os.environ.get("CLAWBENCH_GATEWAY_READY_TIMEOUT_SECONDS", "420"))
+        marker_grace_sec = int(os.environ.get("CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS", "90"))
+        saw_sidecar_start = False
+        sidecar_start_elapsed: int | None = None
+        for elapsed in range(ready_deadline_sec):
+            if process.poll() is not None:
+                raise RuntimeError(
+                    f"{description} exited with code {process.returncode}. Log:\n{log_reader()[-4_000:]}"
+                )
+
+            log_text = log_reader()
+            if "[gateway] ready" in log_text:
+                logger.info("%s ready after %ss", description, elapsed)
+                return
+            if "[gateway] starting channels and sidecars" in log_text:
+                saw_sidecar_start = True
+                if sidecar_start_elapsed is None:
+                    sidecar_start_elapsed = elapsed
+            if sidecar_start_elapsed is not None and elapsed - sidecar_start_elapsed >= marker_grace_sec:
+                logger.info(
+                    "%s did not emit ready marker %ss after sidecar startup; probing control plane",
+                    description,
+                    marker_grace_sec,
+                )
+                return
+            if not saw_sidecar_start and elapsed >= 15:
+                return
+            await asyncio.sleep(1)
+
+        logger.warning(
+            "%s did not log ready within %ss; probing control plane anyway. Log:\n%s",
+            description,
+            ready_deadline_sec,
+            log_reader()[-4_000:],
+        )
+
+    def _read_gateway_log(self, limit: int = 4_000) -> str:
        try:
-            return Path("/tmp/gateway.log").read_text(encoding="utf-8", errors="replace")[-4_000:]
+            return Path("/tmp/gateway.log").read_text(encoding="utf-8", errors="replace")[-limit:]
        except Exception:
            return "(no gateway log)"

-    def _read_parallel_gateway_log(self, lane: ParallelLane) -> str:
+    def _read_parallel_gateway_log(self, lane: ParallelLane, limit: int = 4_000) -> str:
        if lane.log_path is None:
            return "(no gateway log)"
        try:
-            return lane.log_path.read_text(encoding="utf-8", errors="replace")[-4_000:]
+            return lane.log_path.read_text(encoding="utf-8", errors="replace")[-limit:]
        except Exception:
            return "(no gateway log)"
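
Review note: the exit conditions of `_wait_for_gateway_ready_marker` are easier to follow as a pure function over log snapshots. The sketch below is a simplified model, assuming one snapshot per second and omitting the process-exit check and the sleep; `ready_wait_outcome` and its return strings are hypothetical, while the two marker strings, the 90-second grace default, and the 15-second cutoff come from the diff.

```python
from typing import Iterable, Optional

READY = "[gateway] ready"
SIDECARS = "[gateway] starting channels and sidecars"


def ready_wait_outcome(
    log_snapshots: Iterable[str],
    marker_grace_sec: int = 90,
    no_sidecar_cutoff_sec: int = 15,
) -> str:
    """Replay one log snapshot per second and report why the wait ends."""
    sidecar_seen_at: Optional[int] = None
    for elapsed, log_text in enumerate(log_snapshots):
        if READY in log_text:
            return f"ready@{elapsed}"
        if SIDECARS in log_text and sidecar_seen_at is None:
            sidecar_seen_at = elapsed
        if sidecar_seen_at is not None and elapsed - sidecar_seen_at >= marker_grace_sec:
            # Sidecars started but the ready marker never followed.
            return f"grace-expired@{elapsed}"
        if sidecar_seen_at is None and elapsed >= no_sidecar_cutoff_sec:
            # This gateway mode apparently never logs the sidecar line.
            return f"no-sidecar-line@{elapsed}"
    return "deadline"


if __name__ == "__main__":
    print(ready_wait_outcome(["boot"] * 3 + ["boot\n" + READY]))
    print(ready_wait_outcome(["boot"] * 20))
    print(ready_wait_outcome(["boot"] * 2 + [SIDECARS] * 10, marker_grace_sec=5))
```

In every non-ready outcome the real method returns without raising and lets the bounded control-plane probe make the final call, so these branches only decide how long to wait before probing.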

--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -26,4 +26,4 @@ services:
    volumes:
      - ./data:/data  # Persistent storage (mimics HF /data mount)
      - ${HOME}/.openclaw:/home/node/.openclaw  # Reuse host gateway config (openrouter key + model registry)
-      - ./profiles:/home/node/app/profiles:ro  # Profiles aren't baked into the image
+      - ./profiles:/home/node/app/profiles:ro  # Optional local profile overrides
--- a/docs/kubernetes.md
+++ b/docs/kubernetes.md
@ -0,0 +1,367 @@
+# Running ClawBench on Kubernetes
+
+ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
+connects to the gateway over loopback (`ws://localhost:18789`), runs the
+19-task eval suite, and optionally logs results to MLflow.
+
+```
+┌─── OpenClaw Pod ─────────────────────────────┐
+│  gateway container  (ws://localhost:18789)   │
+│  clawbench sidecar  ──► gateway via loopback │
+└──────────────────────────────────────────────┘
+         │                          │
+         ▼                          ▼
+   Model provider API         MLflow (optional)
+```
+
+All commands use `scripts/k8s/deploy.sh`. The script has these modes:
+
+| Flag | What it does |
+|------|-------------|
+| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
+| `--openclaw-only` | Deploy OpenClaw gateway only |
+| `--mlflow-only` | Deploy MLflow only |
+| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
+| `--remove-sidecar` | Remove clawbench sidecar |
+| `--logs` | Tail sidecar logs |
+| `--teardown` | Delete eval namespace (keeps MLflow) |
+
+---
+
+## Prerequisites
+
+- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
+- A container image for ClawBench (see [Building images](#building-images))
+- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
+
+For local testing with Kind:
+https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
+
+---
+
+## Environment variables
+
+Set these **before** running `deploy.sh`.
+
+### Required
+
+| Variable | Purpose |
+|----------|---------|
+| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
+| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) |
+
+### Optional
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
+| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
+| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway |
+| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
+| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
+| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set |
+| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
+| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
+| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
+| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
+| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
+| `GEMINI_API_KEY` | | Added to K8s secret if set |
+| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
+
+### Model routing
+
+The gateway routes each model string by its provider prefix (the text before the first `/`):
+
+| Model string | Required variables |
+|-------------|-------------------|
+| `openai/gpt-5.5` | `OPENAI_API_KEY` |
+| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
+| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
+| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
+
+For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
+server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
+prefix for the model name:
+
+```bash
+export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
+export OPENAI_API_KEY="none"  # dummy value if the endpoint doesn't require auth
+export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
+```
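
A minimal sketch, not part of this change, of the prefix-to-variable lookup the routing table describes. The names `REQUIRED_VARS`, `provider_prefix`, and `missing_vars` are hypothetical, and the gateway's actual routing code is not shown here. The sketch only mirrors the table rows, so it could serve as a pre-flight check before running `deploy.sh`.

```python
# Provider prefix -> variables it needs, copied from the routing table above.
# Hypothetical helper for illustration; not the gateway's real routing logic.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openrouter": ["OPENROUTER_API_KEY"],
}


def provider_prefix(model: str) -> str:
    """Return the text before the first '/' in a model string."""
    prefix, sep, _rest = model.partition("/")
    if not sep or not prefix:
        raise ValueError(f"model string needs a provider prefix: {model!r}")
    return prefix


def missing_vars(model: str, env: dict[str, str]) -> list[str]:
    """List the variables required for `model` that are unset or empty in `env`."""
    prefix = provider_prefix(model)
    if prefix not in REQUIRED_VARS:
        raise ValueError(f"prefix not covered by this sketch: {prefix!r}")
    return [name for name in REQUIRED_VARS[prefix] if not env.get(name)]
```

Calling `missing_vars(os.environ.get("CLAWBENCH_MODEL", "openai/gpt-5.5"), dict(os.environ))` would list what still needs exporting. The sketch cannot tell a hosted `openai/` model from a local one, so it does not check `OPENAI_API_BASE`.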
+
+---
+
+## Full deploy (quick start)
+
+Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+
+# Export API keys before running. The script stores them in a K8s Secret
+# ("clawbench-secrets") that the gateway and sidecar containers read.
+export OPENAI_API_KEY="sk-..."
+
+# Model to evaluate (default: openai/gpt-5.5)
+# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
+
+./scripts/k8s/deploy.sh
+```
+
+Verify:
+
+```bash
+# Should show 2/2 containers (gateway + clawbench)
+kubectl get pods -n clawbench-eval
+
+# Follow eval progress
+./scripts/k8s/deploy.sh --logs
+```
+
+When the eval finishes, copy results and clean up:
+
+```bash
+# Copy results from the sidecar
+POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
+kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
+
+# Remove the sidecar (keeps OpenClaw + MLflow running)
+./scripts/k8s/deploy.sh --remove-sidecar
+
+# Or delete the eval namespace (MLflow keeps running)
+./scripts/k8s/deploy.sh --teardown
+```
+
+---
+
+## Existing cluster + existing MLflow
+
+If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
+you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
+required.
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+
+# API keys — export before running deploy.sh. The script creates a
+# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
+# At least one provider key is required.
+export OPENAI_API_KEY="sk-..."
+# export ANTHROPIC_API_KEY="sk-ant-..."
+# export OPENROUTER_API_KEY="sk-or-..."
+# export GEMINI_API_KEY="..."
+
+# Model to evaluate (default: openai/gpt-5.5)
+export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
+
+# If attaching to an existing OpenClaw gateway, this must match that gateway.
+# If deploy.sh creates OpenClaw, it generates this token for you.
+# export OPENCLAW_GATEWAY_TOKEN="..."
+
+# Point to your existing MLflow
+export MLFLOW_TRACKING_URI="https://mlflow.example.com"
+export MLFLOW_EXPERIMENT_NAME="clawbench-claude-sonnet-4-6"  # or use MLFLOW_EXPERIMENT_ID=42
+
+# Deploy OpenClaw gateway into your cluster
+./scripts/k8s/deploy.sh --openclaw-only
+```
+
+Verify OpenClaw is running:
+
+```bash
+kubectl get pods -n clawbench-eval
+# Expect: openclaw-xxxx  1/1  Running
+```
+
+Then start the eval:
+
+```bash
+./scripts/k8s/deploy.sh --add-sidecar
+./scripts/k8s/deploy.sh --logs
+```
+
+Because `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow
+deployment and patches the experiment name/ID into the clawbench ConfigMap.
+When the eval completes, `scripts/log_to_mlflow.py` logs results to your
+MLflow instance under that experiment.
+
+`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
+`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
+
+---
+
+## Step-by-step deploy
+
+Use this when you want to deploy components individually or bring your own
+OpenClaw/MLflow.
+
+### Step 1: Deploy OpenClaw gateway
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+export OPENAI_API_KEY="sk-..."
+./scripts/k8s/deploy.sh --openclaw-only
+```
+
+Verify:
+
+```bash
+kubectl get pods -n clawbench-eval
+# Expect: openclaw-xxxx  1/1  Running
+```
+
+This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
+auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
+token and creates the `clawbench-secrets` Secret automatically.
+
+**Skip this step** if you already have an OpenClaw deployment. Your existing
+gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
+
+```json
+{
+  "browser": {
+    "enabled": true,
+    "headless": true,
+    "noSandbox": true,
+    "ssrfPolicy": {
+      "allowedHostnames": ["localhost", "127.0.0.1"]
+    }
+  },
+  "tools": {
+    "profile": "coding",
+    "alsoAllow": ["browser"]
+  }
+}
+```
+
+Key requirements:
+- `browser.enabled: true` — activates the bundled browser plugin
+- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
+- `browser.ssrfPolicy` — several eval tasks need localhost access
+- Gateway must bind to loopback with token auth; export the matching
+  `OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar`
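
A minimal sketch, not part of this change, of checking an existing gateway config against the key requirements listed above. The function name `config_problems` is hypothetical, and the checks cover only the fields shown in the sample JSON. Bind address and token auth are not visible in that sample, so they are left out.

```python
def config_problems(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    browser = config.get("browser", {})
    tools = config.get("tools", {})

    # browser.enabled: true activates the bundled browser plugin.
    if browser.get("enabled") is not True:
        problems.append("browser.enabled must be true")

    # The coding profile does not include browser by default.
    if "browser" not in tools.get("alsoAllow", []):
        problems.append('tools.alsoAllow must include "browser"')

    # Several eval tasks need localhost access; hostnames taken from the sample.
    allowed = browser.get("ssrfPolicy", {}).get("allowedHostnames", [])
    missing = [h for h in ("localhost", "127.0.0.1") if h not in allowed]
    if missing:
        problems.append(f"browser.ssrfPolicy.allowedHostnames is missing {missing}")

    return problems
```

Feeding it `json.loads(...)` of the gateway config would give a quick list of what to fix before running `--add-sidecar`.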
+
+### Step 2: Deploy MLflow
+
+```bash
+./scripts/k8s/deploy.sh --mlflow-only
+```
+
+Verify:
+
+```bash
+kubectl get pods -n mlflow
+# Expect: mlflow-xxxx  1/1  Running
+```
+
+Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
+namespace. The clawbench ConfigMap defaults to
+`http://mlflow-service.mlflow.svc.cluster.local:5000`.
+
+**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
+
+```bash
+export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
+export MLFLOW_EXPERIMENT_ID=4  # or MLFLOW_EXPERIMENT_NAME
+```
+
+### Step 3: Run the eval
+
+```bash
+./scripts/k8s/deploy.sh --add-sidecar
+```
+
+This patches the OpenClaw deployment to inject a clawbench sidecar that:
+
+1. Waits for the gateway (TCP check on port 18789, up to 3 min)
+2. Checks MLflow connectivity if configured
+3. Runs `clawbench run` with settings from the ConfigMap
+4. Logs results to MLflow on success
+5. Sleeps indefinitely so you can retrieve logs and results
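
A minimal sketch, not part of this change, of the kind of TCP readiness check step 1 describes. The sidecar's actual entrypoint is not shown in this diff, so `wait_for_port` and its parameters are hypothetical. The 180-second default mirrors the "up to 3 min" wording.

```python
import socket
import time


def wait_for_port(
    host: str,
    port: int,
    timeout_sec: float = 180.0,
    interval_sec: float = 1.0,
) -> bool:
    """Return True once a TCP connect to host:port succeeds, False after timeout_sec."""
    deadline = time.monotonic() + timeout_sec
    while True:
        try:
            # A successful connect is enough; close the socket immediately.
            with socket.create_connection((host, port), timeout=interval_sec):
                return True
        except OSError:
            pass  # refused or timed out; retry until the deadline
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_sec)
```

For the gateway described here the call would be `wait_for_port("localhost", 18789)`. A TCP check only shows that the port is open, not that the gateway has finished starting.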
+
+Verify:
+
+```bash
+kubectl get pods -n $CLAWBENCH_NAMESPACE
+# Expect: openclaw-xxxx  2/2  Running  (gateway + clawbench)
+
+./scripts/k8s/deploy.sh --logs
+# Should show "Waiting for gateway..." then "Starting eval..."
+```
+
+When finished, remove the sidecar:
+
+```bash
+./scripts/k8s/deploy.sh --remove-sidecar
+```
+
+---
+
+## ConfigMap tuning
+
+The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
+behavior. Override at deploy time via env vars, or patch after deploy:
+
+| Key | Default | What it controls |
+|-----|---------|-----------------|
+| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
+| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks × 3 runs = 57 runs total) |
+| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
+| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
+| `CLAWBENCH_TASKS` | *(empty — runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
+| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
+| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
+| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
+| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
+| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
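
A minimal sketch, not part of this change, of how the ConfigMap values combine into a run count and a rough wall-time ceiling. The names `plan` and `_int_setting` are hypothetical. The ceiling assumes each lane handles one run at a time and every run uses its full budget; the harness's real scheduling is not shown here.

```python
import math

TASK_COUNT = 19  # suite size stated in this document

# Defaults copied from the ConfigMap table above.
DEFAULTS = {
    "CLAWBENCH_RUNS": 3,
    "CLAWBENCH_CONCURRENCY": 4,
    "CLAWBENCH_PER_RUN_BUDGET_SECONDS": 600,
}


def _int_setting(env: dict[str, str], key: str) -> int:
    """Read an integer setting, falling back to the table default when unset or empty."""
    raw = (env.get(key) or "").strip()
    return int(raw) if raw else DEFAULTS[key]


def plan(env: dict[str, str]) -> dict[str, int]:
    """Estimate the total runs and a worst-case wall time for one eval."""
    # CLAWBENCH_TASKS is space-separated; empty means the whole suite.
    task_ids = (env.get("CLAWBENCH_TASKS") or "").split()
    tasks = len(task_ids) if task_ids else TASK_COUNT
    runs = _int_setting(env, "CLAWBENCH_RUNS")
    lanes = _int_setting(env, "CLAWBENCH_CONCURRENCY")
    budget = _int_setting(env, "CLAWBENCH_PER_RUN_BUDGET_SECONDS")

    total_runs = tasks * runs
    waves = math.ceil(total_runs / lanes)  # assumed: lanes fill in waves
    return {
        "tasks": tasks,
        "total_runs": total_runs,
        "waves": waves,
        "worst_case_seconds": waves * budget,
    }
```

With the defaults this gives 57 runs in 15 waves, so at most 9000 seconds under the stated assumption.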
+
+---
+
+## MLflow integration
+
+Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
+
+**What gets logged:**
+- **Params**: model, provider, benchmark version, OpenClaw version, judge model
+- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
+  reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
+- **Tags**: submission ID, timestamp, certified flag
+- **Artifacts**: full benchmark result JSON
+
+---
+
+## Building images
+
+### ClawBench image
+
+The default image, `quay.io/sallyom/clawbench:latest`, is public.
+
+To build your own for Kubernetes, use the lightweight sidecar Dockerfile. The
+resulting image only includes the eval harness and MLflow client:
+
+```bash
+docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
+
+# For Kind clusters, load directly instead of pushing to a registry:
+kind load docker-image clawbench:latest --name openclaw
+
+# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
+# Ensure you build for the right architecture, usually amd64 for non-local k8s
+```
+
+Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
+
+---
+
+## Cleanup
+
+```bash
+# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
+./scripts/k8s/deploy.sh --remove-sidecar
+
+# Delete eval namespace (keeps MLflow running)
+./scripts/k8s/deploy.sh --teardown
+
+# Delete the Kind cluster entirely
+kind delete cluster --name openclaw
+```
--- a/pyproject.toml
+++ b/pyproject.toml
@ -10,7 +10,8 @@ dependencies = [
    "pydantic>=2.7,<3",
    "pyyaml>=6.0,<7",
    "datasets>=3.0,<4",
-    "gradio>=5.0,<6",
+    "gradio>=6.7.0,<7",
+    "pillow>=12.2.0,<13",
    "httpx>=0.27,<1",
    "numpy>=1.26,<3",
    "rich>=13.0,<14",
@ -18,8 +19,8 @@ dependencies = [
    # Runtime deps for the task completion verifier. The harness shells out
    # to `pytest -q` / `pytest-asyncio` inside per-task workspaces as the
    # execution check; the container must have them in PATH.
-    "pytest>=8.0,<9",
-    "pytest-asyncio>=0.24,<1",
+    "pytest>=9.0.3,<10",
+    "pytest-asyncio>=1,<2",
 ]

 [project.optional-dependencies]
@ -27,9 +28,22 @@ dev = [
    # Kept as an alias for historical `pip install .[dev]` invocations.
    # pytest + pytest-asyncio are now in the base [dependencies] since the
    # benchmark itself runs pytest in task workspaces.
-    "pytest>=8.0,<9",
-    "pytest-asyncio>=0.24,<1",
+    "pytest>=9.0.3,<10",
+    "pytest-asyncio>=1,<2",
+    "pre-commit>=4.0,<5",
+    "ruff>=0.9,<1",
 ]
+mlflow = [
+    "mlflow>=2.10,<3",
+]
+hermes = [
+    "hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main",
+]
+
+[project.urls]
+Homepage = "https://github.com/openclaw/clawbench"
+Repository = "https://github.com/openclaw/clawbench"
+"Bug Tracker" = "https://github.com/openclaw/clawbench/issues"

 [project.scripts]
 clawbench = "clawbench.cli:main"
@ -38,6 +52,22 @@ clawbench = "clawbench.cli:main"
 requires = ["hatchling"]
 build-backend = "hatchling.build"

+[tool.hatch.build.targets.wheel]
+packages = ["clawbench"]
+force-include = { "tasks-public" = "tasks-public", "tasks-domain" = "tasks-domain", "profiles" = "profiles", "baselines" = "baselines", "CLAWBENCH_V0_4_SPEC.md" = "CLAWBENCH_V0_4_SPEC.md", "PARTNER_TRACE_SPEC.md" = "PARTNER_TRACE_SPEC.md" }
+
+[tool.hatch.metadata]
+allow-direct-references = true
+
 [tool.pytest.ini_options]
 asyncio_mode = "auto"
+addopts = ["-p", "no:opik"]
 testpaths = ["tests"]
+
+[tool.ruff]
+line-length = 100
+target-version = "py311"
+
+[tool.ruff.lint]
+select = ["E4", "E7", "E9", "F"]
+ignore = ["E402"]
--- a/scripts/analyze_open_vs_closed.py
+++ b/scripts/analyze_open_vs_closed.py
@ -18,7 +18,6 @@ Usage:
 from __future__ import annotations

 import argparse
-import json
 import statistics
 import sys
 from collections import defaultdict
--- a/scripts/audit_per_run.py
+++ b/scripts/audit_per_run.py
@ -141,9 +141,9 @@ def main():
            for run_idx in range(3):
                key = (task, run_idx)
                a = data["archived"].get(key)
-                l = data["logged"].get(key)
+                logged = data["logged"].get(key)
                err = (key in data["errors"])
-                task_runs.append({"archived": a, "logged": l, "harness_err": err})
+                task_runs.append({"archived": a, "logged": logged, "harness_err": err})
            task_runs_by_model[pretty] = task_runs

        # Compute cross-model stats
@ -159,7 +159,8 @@ def main():
                    all_scores.append(a["run_score"])
                    all_cs.append(a["c"])
                    all_outputs.append(a["has_assistant_text"])
-                    if a["judge_infra_failed"]: all_judge_infra += 1
+                    if a["judge_infra_failed"]:
+                        all_judge_infra += 1
                elif r["logged"]:
                    all_scores.append(r["logged"]["score"])
                if r["harness_err"]:
@ -222,13 +223,15 @@ def main():
            for run_idx in range(3):
                key = (task, run_idx)
                a = data["archived"].get(key)
-                l = data["logged"].get(key)
+                logged = data["logged"].get(key)
                if a:
                    any_attempted = True
-                    if a["run_score"] > 0.01: all_three_zero = False
-                elif l:
+                    if a["run_score"] > 0.01:
+                        all_three_zero = False
+                elif logged:
                    any_attempted = True
-                    if l["score"] > 0.01: all_three_zero = False
+                    if logged["score"] > 0.01:
+                        all_three_zero = False
                else:
                    all_three_zero = False  # can't confirm
                    any_attempted = False
--- a/scripts/audit_runs.py
+++ b/scripts/audit_runs.py
@ -16,7 +16,6 @@ from __future__ import annotations

 import json
 import re
-from collections import defaultdict
 from pathlib import Path

 ROOT = Path(__file__).resolve().parent.parent
@ -109,7 +108,6 @@ def audit_model(label: str, cache_sub: str, pretty: str) -> dict:
    logged = parse_log(log_path)
    archived = scan_archive(cache_dir)

-    all_keys = set(logged.keys()) | set(archived.keys())
    n_log = len(logged)
    n_arch = len(archived)
    not_archived = [k for k in logged.keys() if k not in archived]
@ -144,7 +142,6 @@ def audit_model(label: str, cache_sub: str, pretty: str) -> dict:
    for k in not_archived:
        all_scores.append(logged[k]["score"])

-    n_total_attempts = max(n_log, len(all_scores))
    expected = 120

    clean_scores = [s for _, s in clean_runs]
--- a/scripts/ci-hydrate-live-auth.sh
+++ b/scripts/ci-hydrate-live-auth.sh
@ -0,0 +1,86 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+profile_path="${1:-${RUNNER_TEMP:-/tmp}/clawbench-live.profile}"
+
+mkdir -p "$(dirname "$profile_path")"
+: >"$profile_path"
+chmod 600 "$profile_path"
+
+first_env_value() {
+  local key
+  for key in "$@"; do
+    local value="${!key:-}"
+    if [[ -n "$value" && "$value" != "undefined" && "$value" != "null" ]]; then
+      printf '%s' "$value"
+      return 0
+    fi
+  done
+  return 1
+}
+
+append_profile_env() {
+  local key="$1"
+  local value="${!key:-}"
+  if [[ -z "$value" || "$value" == "undefined" || "$value" == "null" ]]; then
+    return
+  fi
+  printf 'export %s=%q\n' "$key" "$value" >>"$profile_path"
+}
+
+write_secret_file() {
+  local destination="$1"
+  shift
+  local value=""
+  value="$(first_env_value "$@" || true)"
+  if [[ -z "$value" ]]; then
+    return
+  fi
+  mkdir -p "$(dirname "$destination")"
+  printf '%s' "$value" >"$destination"
+  chmod 600 "$destination"
+}
+
+for env_key in \
+  HF_TOKEN \
+  HF_USERNAME \
+  CLAWBENCH_QUEUE_DATASET \
+  CLAWBENCH_JUDGE_MODEL \
+  ANTHROPIC_API_KEY \
+  ANTHROPIC_API_KEY_OLD \
+  ANTHROPIC_API_TOKEN \
+  CEREBRAS_API_KEY \
+  DEEPINFRA_API_KEY \
+  FIREWORKS_API_KEY \
+  GEMINI_API_KEY \
+  GOOGLE_API_KEY \
+  GROQ_API_KEY \
+  KIMI_API_KEY \
+  MINIMAX_API_KEY \
+  MISTRAL_API_KEY \
+  MOONSHOT_API_KEY \
+  OPENAI_API_KEY \
+  OPENAI_BASE_URL \
+  OPENROUTER_API_KEY \
+  QWEN_API_KEY \
+  TOGETHER_API_KEY \
+  XAI_API_KEY \
+  ZAI_API_KEY \
+  Z_AI_API_KEY
+do
+  append_profile_env "$env_key"
+done
+
+write_secret_file "$HOME/.codex/auth.json" CLAWBENCH_CODEX_AUTH_JSON OPENCLAW_CODEX_AUTH_JSON
+write_secret_file "$HOME/.codex/config.toml" CLAWBENCH_CODEX_CONFIG_TOML OPENCLAW_CODEX_CONFIG_TOML
+write_secret_file "$HOME/.claude.json" CLAWBENCH_CLAUDE_JSON OPENCLAW_CLAUDE_JSON
+write_secret_file "$HOME/.claude/.credentials.json" CLAWBENCH_CLAUDE_CREDENTIALS_JSON OPENCLAW_CLAUDE_CREDENTIALS_JSON
+write_secret_file "$HOME/.claude/settings.json" CLAWBENCH_CLAUDE_SETTINGS_JSON OPENCLAW_CLAUDE_SETTINGS_JSON
+write_secret_file "$HOME/.claude/settings.local.json" CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON
+write_secret_file "$HOME/.gemini/settings.json" CLAWBENCH_GEMINI_SETTINGS_JSON OPENCLAW_GEMINI_SETTINGS_JSON
+
+if [[ -n "${GITHUB_ENV:-}" ]]; then
+  {
+    echo "CLAWBENCH_PROFILE_FILE=$profile_path"
+  } >>"$GITHUB_ENV"
+fi
--- a/scripts/ci-hydrate-testbox-env.sh
+++ b/scripts/ci-hydrate-testbox-env.sh
@ -0,0 +1,32 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+profile_path="${1:-$HOME/.clawbench-testbox-live.profile}"
+helper_path="${2:-$HOME/.local/bin/clawbench-testbox-env}"
+
+mkdir -p "$(dirname "$helper_path")"
+
+bash scripts/ci-hydrate-live-auth.sh "$profile_path"
+
+cat >"$helper_path" <<'SH'
+#!/usr/bin/env bash
+set -euo pipefail
+
+profile_path="${CLAWBENCH_TESTBOX_PROFILE_FILE:-$HOME/.clawbench-testbox-live.profile}"
+if [[ ! -f "$profile_path" ]]; then
+  echo "Missing Testbox provider env profile: $profile_path" >&2
+  exit 1
+fi
+
+set -a
+# shellcheck disable=SC1090
+source "$profile_path"
+set +a
+
+if [[ "$#" -eq 0 ]]; then
+  exec "${SHELL:-/bin/bash}"
+fi
+
+exec "$@"
+SH
+chmod 700 "$helper_path"
--- a/scripts/classify_regimes.py
+++ b/scripts/classify_regimes.py
@ -1,140 +1,112 @@
-"""Classify each archived run's dynamical regime from its turn trajectory.
+#!/usr/bin/env python3
+"""Classify posterior run trajectories into dynamical regimes.

-Following "When LLMs Are Dreaming..." §What We Expect to See:
+We embed each assistant turn using bag-of-words text plus tool-call summaries,
+then compute simple geometric proxies:

-  TRAPPED/ATTRACTOR   — low support (Vol_log), high recurrence, high BOPS.
-                        Agent converged to a point; may be good (solved it)
-                        or bad (got stuck in a loop on a single idea).
+    drift_mean = mean ||x_t - x_{t-1}||
+    from_start = max ||x_t - x_0||
+    recurrence = max cosine(x_i, x_j) for non-adjacent turns
+    vol_log    = log det(Sigma + eps I)

-  LIMIT-CYCLE         — high recurrence + bounded drift + quasi-periodic revisits.
-                        Agent loops between a few states.
-
-  DIFFUSIVE/WANDERING — growing support, rising drift, low recurrence.
-                        Agent explores without converging; often "goal drift".
-
-  SENSITIVE           — (requires paraphrased-pair runs; skip here.)
-
-  TOO-SHORT           — trajectory < 3 assistant turns; can't classify dynamics.
-
-We work in a TF-IDF bag-of-words embedding space (same vocab as C(q)),
-with each turn's state vector = its assistant text + tool-call args.
-
-Metrics per run:
-  - drift_mean:  mean ||e_t − e_{t−1}|| across turns
-  - from_start:  max ||e_t − e_0||  (farthest the run drifted from origin)
-  - recurrence:  max_{i<j, j−i≥2} cos(e_i, e_j)  — best return-after-gap match
-  - vol_log:     log det(Σ + εI) over turn states — support volume proxy
-
-Classifier rules (tuned empirically on the distribution):
-  if n_turns < 3                              → too_short
-  elif drift_mean < 0.15 and vol_log < −6     → trapped
-  elif recurrence > 0.80 and drift_mean < 0.25 → limit_cycle
-  elif drift_mean > 0.35 and vol_log > −3     → diffusive
-  else                                         → mixed
-
-Output: reports/regimes.json with per-run classification.
-
-Usage:
-    .venv/bin/python3 scripts/classify_regimes.py
+Runs are then bucketed into coarse regimes such as trapped, limit_cycle, and
+diffusive using quartile-based thresholds estimated from the observed archive.
 """

 from __future__ import annotations

+import argparse
 import json
 import re
-from collections import Counter, defaultdict
+import sys
+from collections import Counter
 from pathlib import Path

 import numpy as np

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-MODELS = [
-    "anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
-    "anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
-    "google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
-    "openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
-    "openrouter_qwen_qwen3.6-plus",
-]
+from clawbench.dynamics_archive import load_task_runs_by_model

 WORD_RE = re.compile(r"[a-z]{3,}")
-STOPWORDS = set("the and that with this have from what your will can but not "
-                "was will are been one would there been they will their has "
-                "had its were only some than about these which into also each "
-                "when where them how who them very much more most other then "
-                "here such does like just make many like want need take".split())
+STOPWORDS = set(
+    "the and that with this have from what your will can but not "
+    "was are been one would there they their has had its were only some "
+    "than about these which into also each when where them how who very "
+    "much more most other then here such does like just make many want need take".split()
+)


 def tokenize(text: str) -> list[str]:
    return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]


-def build_vocab(all_turn_texts: list[str], top_k: int = 500) -> dict[str, int]:
-    c = Counter()
-    for t in all_turn_texts:
-        c.update(set(tokenize(t)))
-    return {w: i for i, (w, _) in enumerate(c.most_common(top_k))}
+def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
+    counter = Counter()
+    for text in texts:
+        counter.update(set(tokenize(text)))
+    return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}


 def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
-    v = np.zeros(len(vocab), dtype=np.float32)
-    for w, c in Counter(tokenize(text)).items():
-        if w in vocab:
-            v[vocab[w]] = c
-    n = np.linalg.norm(v)
-    return v / n if n > 0 else v
+    vec = np.zeros(len(vocab), dtype=np.float32)
+    for word, cnt in Counter(tokenize(text)).items():
+        if word in vocab:
+            vec[vocab[word]] = cnt
+    norm = np.linalg.norm(vec)
+    return vec / norm if norm > 0 else vec


-def turn_texts(run_data: dict) -> list[str]:
-    """Extract one text string per assistant turn (text + tool-call summary)."""
+def turn_texts(run, fallback_any_message: bool = False) -> list[str]:
+    source = run.transcript.messages if fallback_any_message else run.transcript.assistant_messages
    out = []
-    for m in run_data.get("transcript", {}).get("messages", []):
-        if m.get("role") != "assistant":
-            continue
+    for msg in source:
        parts = []
-        if m.get("text"):
-            parts.append(m["text"])
-        for tc in (m.get("tool_calls") or []):
-            name = tc.get("name", "")
-            args_str = json.dumps(tc.get("arguments", {}))[:200]
-            parts.append(f"{name} {args_str}")
+        if msg.text:
+            parts.append(msg.text)
+        for tc in msg.tool_calls:
+            parts.append(tc.name)
+            if tc.input:
+                parts.append(json.dumps(tc.input, sort_keys=True)[:200])
        if parts:
            out.append(" ".join(parts))
    return out


-def trajectory_metrics(vecs: np.ndarray) -> dict:
-    """Compute dynamical metrics over a (n_turns, d) trajectory matrix."""
+def trajectory_metrics(vecs: np.ndarray) -> dict[str, float]:
+    """Compute drift, recurrence, and support-volume proxies for one run."""
    n = vecs.shape[0]
    if n < 2:
-        return {"n_turns": n, "drift_mean": 0.0, "from_start": 0.0,
-                "recurrence": 0.0, "vol_log": -12.0}
-    # Drift: consecutive distances
+        return {
+            "n_turns": float(n),
+            "drift_mean": 0.0,
+            "from_start": 0.0,
+            "recurrence": 0.0,
+            "vol_log": -12.0,
+        }
+
    diffs = np.linalg.norm(np.diff(vecs, axis=0), axis=1)
    drift_mean = float(diffs.mean())
-    # From start: max distance from turn 0
-    dists_from_0 = np.linalg.norm(vecs - vecs[0:1], axis=1)
-    from_start = float(dists_from_0.max())
-    # Recurrence: best non-adjacent cosine similarity (ignoring immediate neighbors)
+    from_start = float(np.linalg.norm(vecs - vecs[0:1], axis=1).max())
+
    recurrence = 0.0
    for i in range(n):
        for j in range(i + 2, n):
-            ni, nj = np.linalg.norm(vecs[i]), np.linalg.norm(vecs[j])
+            ni = np.linalg.norm(vecs[i])
+            nj = np.linalg.norm(vecs[j])
            if ni > 0 and nj > 0:
-                c = float(vecs[i] @ vecs[j] / (ni * nj))
-                if c > recurrence:
-                    recurrence = c
-    # Vol_log: log det of turn-state covariance
+                sim = float(vecs[i] @ vecs[j] / (ni * nj))
+                recurrence = max(recurrence, sim)
+
    if n >= 3:
-        Sigma = np.cov(vecs.T)
-        # Use log|Σ + εI|; since d is large (500) we take eigenvalues + clip
-        eigs = np.linalg.eigvalsh(Sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
+        sigma = np.cov(vecs.T)
+        eigs = np.linalg.eigvalsh(sigma + 1e-6 * np.eye(vecs.shape[1], dtype=np.float32))
        vol_log = float(np.log(np.clip(eigs, 1e-12, None)).sum())
    else:
        vol_log = -12.0
+
    return {
-        "n_turns": n,
+        "n_turns": float(n),
        "drift_mean": drift_mean,
        "from_start": from_start,
        "recurrence": recurrence,
@ -142,109 +114,105 @@ def trajectory_metrics(vecs: np.ndarray) -> dict:
    }


-def classify(m: dict, thresholds: dict) -> str:
-    """Classify based on quartile thresholds of the actual distribution.
-
-    Thresholds (set empirically from observed distribution):
-      drift_low  = p25  drift_hi = p75
-      vol_low    = p25  vol_hi   = p75
-      rec_hi     = p75
-
-    Rules (priority order):
-      n_turns < 3             → too_short
-      drift < drift_low AND vol < vol_low  → trapped
-      rec > rec_hi AND drift < median       → limit_cycle
-      drift > drift_hi AND vol > vol_hi     → diffusive
-      else                                  → mixed
-    """
-    n = m["n_turns"]
-    if n < 3:
+def classify(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
+    """Map trajectory metrics to a coarse regime label."""
+    n_turns = int(metrics["n_turns"])
+    if n_turns < 3:
        return "too_short"
-    d = m["drift_mean"]
-    rec = m["recurrence"]
-    vol = m["vol_log"]
-    if d < thresholds["drift_low"] and vol < thresholds["vol_low"]:
+    drift = metrics["drift_mean"]
+    recurrence = metrics["recurrence"]
+    vol = metrics["vol_log"]
+
+    if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
        return "trapped"
-    if rec > thresholds["rec_hi"] and d < thresholds["drift_med"]:
+    if recurrence > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
        return "limit_cycle"
-    if d > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
+    if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
        return "diffusive"
    return "mixed"


 def main() -> None:
-    # First pass: collect turn texts to build vocab
+    parser = argparse.ArgumentParser(description="Classify cached run regimes")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
    all_turn_texts: list[str] = []
-    run_turns: dict[tuple, list[str]] = {}
-    for model in MODELS:
-        for rf in (ARCH / model).rglob("run*.json"):
-            try:
-                d = json.loads(rf.read_text())
-            except Exception:
-                continue
-            task = rf.parent.name
-            run_idx = int(re.match(r"run(\d+)", rf.stem).group(1))
-            ts = turn_texts(d)
-            run_turns[(model, task, run_idx)] = ts
-            all_turn_texts.extend(ts)
+    run_turns: dict[str, list[str]] = {}
+
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            for run in runs:
+                ts = turn_texts(run, fallback_any_message=False)
+                key = f"{model_name}/{task_id}/run{run.run_index}"
+                run_turns[key] = ts
+                all_turn_texts.extend(ts)
+
+    used_fallback_messages = False
+    if not all_turn_texts:
+        used_fallback_messages = True
+        all_turn_texts = []
+        run_turns = {}
+        for model_name, task_runs in grouped.items():
+            for task_id, runs in task_runs.items():
+                for run in runs:
+                    ts = turn_texts(run, fallback_any_message=True)
+                    key = f"{model_name}/{task_id}/run{run.run_index}"
+                    run_turns[key] = ts
+                    all_turn_texts.extend(ts)
+
+    if not all_turn_texts:
+        raise SystemExit("No usable turn text found in archive.")

    vocab = build_vocab(all_turn_texts, top_k=500)
-    print(f"Runs collected: {len(run_turns)}  vocab size: {len(vocab)}")

-    # Second pass: vectorize + compute metrics
-    per_run: dict[str, dict] = {}
+    per_run: dict[str, dict[str, float | str]] = {}
    for key, ts in run_turns.items():
-        model, task, run_idx = key
        if not ts:
            continue
-        vecs = np.stack([vectorize(t, vocab) for t in ts])
-        m = trajectory_metrics(vecs)
-        per_run[f"{model}/{task}/run{run_idx}"] = m
+        vecs = np.stack([vectorize(text, vocab) for text in ts])
+        per_run[key] = trajectory_metrics(vecs)

-    # Derive thresholds from actual distribution of n_turns>=3 runs
-    drifts = np.array([v["drift_mean"] for v in per_run.values() if v["n_turns"] >= 3])
-    recs = np.array([v["recurrence"] for v in per_run.values() if v["n_turns"] >= 3])
-    vols = np.array([v["vol_log"] for v in per_run.values() if v["n_turns"] >= 3])
-    thresholds = {
-        "drift_low": float(np.percentile(drifts, 25)),
-        "drift_med": float(np.percentile(drifts, 50)),
-        "drift_hi":  float(np.percentile(drifts, 75)),
-        "vol_low":   float(np.percentile(vols, 25)),
-        "vol_hi":    float(np.percentile(vols, 75)),
-        "rec_hi":    float(np.percentile(recs, 75)),
-    }
-    print(f"\nThresholds (quartile-based from observed distribution):")
-    for k, v in thresholds.items():
-        print(f"  {k:<12}  {v:>10.3f}")
+    eligible = [r for r in per_run.values() if int(r["n_turns"]) >= 3]
+    if eligible:
+        drifts = np.array([float(v["drift_mean"]) for v in eligible])
+        recs = np.array([float(v["recurrence"]) for v in eligible])
+        vols = np.array([float(v["vol_log"]) for v in eligible])
+        thresholds = {
+            "drift_low": float(np.percentile(drifts, 25)),
+            "drift_med": float(np.percentile(drifts, 50)),
+            "drift_hi": float(np.percentile(drifts, 75)),
+            "vol_low": float(np.percentile(vols, 25)),
+            "vol_hi": float(np.percentile(vols, 75)),
+            "rec_hi": float(np.percentile(recs, 75)),
+        }
+    else:
+        thresholds = {
+            "drift_low": 0.15,
+            "drift_med": 0.25,
+            "drift_hi": 0.35,
+            "vol_low": -6.0,
+            "vol_hi": -3.0,
+            "rec_hi": 0.8,
+        }

-    # Apply classifier with thresholds
-    for key in per_run:
-        per_run[key]["regime"] = classify(per_run[key], thresholds)
+    for key, metrics in per_run.items():
+        metrics["regime"] = classify(metrics, thresholds)
+        metrics["turn_source"] = "any_message" if used_fallback_messages else "assistant"

-    # Summary by regime
-    counts = Counter(v["regime"] for v in per_run.values())
-    print(f"\nRegime distribution (n={len(per_run)} runs):")
-    for regime, n in counts.most_common():
-        print(f"  {regime:<14} {n:>4}  ({100*n/len(per_run):>4.1f}%)")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out = args.reports_dir / "regimes.json"
+    out.write_text(json.dumps(per_run, indent=2), encoding="utf-8")

-    # Per-model regime breakdown
-    print(f"\n{'Model':<10}  " + " ".join(f"{r:>11}" for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]))
-    print("-" * 70)
-    pm_counts = defaultdict(Counter)
-    for key, v in per_run.items():
-        model = key.split("/")[0]
-        pm_counts[model][v["regime"]] += 1
-    for model in MODELS:
-        row = [f"{model.split('_')[-1][:9]:<10}"]
-        for r in ["too_short", "trapped", "limit_cycle", "diffusive", "mixed"]:
-            row.append(f"{pm_counts[model][r]:>11}")
-        print("  ".join(row))
-
-    # Write output
-    out = ROOT / "reports" / "regimes.json"
-    out.parent.mkdir(exist_ok=True)
-    out.write_text(json.dumps(per_run, indent=2))
-    print(f"\nWrote: {out}")
+    counts = Counter(str(v["regime"]) for v in per_run.values())
+    print(f"Wrote: {out}")
+    print(f"Regime counts: {dict(counts)}")


 if __name__ == "__main__":
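
The quartile thresholds and the rule ordering in the regime classifier above can be sketched with the standard library alone. This is a minimal illustration, not the script itself: `quartile_thresholds` is a hypothetical helper name, `statistics.quantiles(..., method="inclusive")` stands in for `np.percentile` (both interpolate linearly between order statistics), and only the rules visible in this hunk are reproduced.

```python
from statistics import quantiles


def quartile_thresholds(drifts, recs, vols):
    """Derive classifier thresholds from observed quartiles (hypothetical helper)."""
    d25, d50, d75 = quantiles(drifts, n=4, method="inclusive")
    v25, _, v75 = quantiles(vols, n=4, method="inclusive")
    _, _, r75 = quantiles(recs, n=4, method="inclusive")
    return {
        "drift_low": d25,
        "drift_med": d50,
        "drift_hi": d75,
        "vol_low": v25,
        "vol_hi": v75,
        "rec_hi": r75,
    }


def classify(metrics, thresholds):
    """Apply the rules in the same order as the diff; the first match wins."""
    drift = metrics["drift_mean"]
    recurrence = metrics["recurrence"]
    vol = metrics["vol_log"]
    if drift < thresholds["drift_low"] and vol < thresholds["vol_low"]:
        return "trapped"
    if recurrence > thresholds["rec_hi"] and drift < thresholds["drift_med"]:
        return "limit_cycle"
    if drift > thresholds["drift_hi"] and vol > thresholds["vol_hi"]:
        return "diffusive"
    return "mixed"


# Made-up metric values, chosen so the quartiles fall on round numbers.
thresholds = quartile_thresholds(
    drifts=[0.1, 0.2, 0.3, 0.4, 0.5],
    recs=[0.0, 0.25, 0.5, 0.75, 1.0],
    vols=[-8, -6, -4, -2, 0],
)
low_motion = classify({"drift_mean": 0.1, "recurrence": 0.5, "vol_log": -7}, thresholds)
looping = classify({"drift_mean": 0.25, "recurrence": 0.9, "vol_log": -4}, thresholds)
print(low_motion, looping)
```

Because the rules are checked in order, a run that satisfies both the "trapped" and "limit_cycle" conditions is reported as "trapped".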
--- a/scripts/compute_constraint_index.py
+++ b/scripts/compute_constraint_index.py
@ -1,145 +1,127 @@
-"""Compute Constraint Index C(q) per task from existing v4-19-full archive.
+#!/usr/bin/env python3
+"""Compute posterior Constraint Index C(q) from cached runs.

-Following "When LLMs Are Dreaming..." paper §Query-design:
+Task-level constraint index:

-  C(q) = z(PR(q)) + z(entropy(q)) + z(BOPS(q))
+    C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q))

 Where:
-  - PR(q): participation ratio = (tr Σ)² / tr(Σ²) of response embeddings
-           across all (model, run) responses to query q. Low PR = everyone
-           writes similar thing (prompt is constrained). High PR = responses
-           spread out (prompt is open-ended).
-  - entropy(q): Shannon entropy of (discretized) response-feature distribution.
-  - BOPS(q): Bayesian Optimal Prediction Score — how well can we predict
-             response given q? Proxied here as inter-run cosine similarity
-             for the same model (high similarity = high predictability).

-Since we don't have sentence-transformers, we use TF-IDF-style bag-of-words
-from the final assistant message per run. This is crude but measures the
-same signal — whether models produce similar vs divergent output.
+    PR(q)   = participation ratio of the task response covariance
+    H(q)    = Shannon entropy of the covariance eigenspectrum
+    BOPS(q) = within-model inter-run predictability proxy

-Output: reports/constraint_index.json with per-task C(q) components +
-        combined z-score.
+High C(q) means a task is more constrained: models and repeated runs tend to
+land in a narrower response manifold. Low C(q) means the task is more open or
+stylistically underconstrained.

-Usage:
-    .venv/bin/python3 scripts/compute_constraint_index.py
+This implementation uses a normalized bag-of-words representation built from
+the full assistant trajectory text plus tool-call names and compacted inputs.
 """

 from __future__ import annotations

+import argparse
 import json
 import re
-import glob
+import sys
 from collections import Counter, defaultdict
 from pathlib import Path

 import numpy as np
-from scipy.stats import entropy as shannon_entropy

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-MODELS = [
-    "anthropic_claude-opus-4-6", "anthropic_claude-opus-4-7",
-    "anthropic_claude-sonnet-4-6", "openai_gpt-5.4",
-    "google_gemini-3.1-pro-preview", "openrouter_z-ai_glm-5.1",
-    "openrouter_minimax_minimax-m2.7", "openrouter_moonshotai_kimi-k2.5",
-    "openrouter_qwen_qwen3.6-plus",
-]
+from clawbench.dynamics_archive import load_task_runs_by_model

 WORD_RE = re.compile(r"[a-z]{3,}")
-STOPWORDS = set("the and that with this have from what your will can but not "
-                "was will are been one would there been they will their has "
-                "had its were only some than about these which into also each "
-                "when where them how who them very much more most other then "
-                "here such does like just make many like want need take".split())
+STOPWORDS = set(
+    "the and that with this have from what your will can but not "
+    "was are been one would there they their has had its were only some "
+    "than about these which into also each when where them how who very "
+    "much more most other then here such does like just make many want need take".split()
+)


-def final_assistant_text(run_path: Path, max_chars: int = 4000) -> str:
-    """Extract the last assistant message text + tool-call arg summary."""
-    try:
-        d = json.loads(run_path.read_text())
-    except Exception:
-        return ""
-    msgs = d.get("transcript", {}).get("messages", [])
-    texts = []
-    for m in msgs:
-        if m.get("role") != "assistant":
-            continue
-        if m.get("text"):
-            texts.append(m["text"])
-        for tc in (m.get("tool_calls") or []):
-            name = tc.get("name", "")
-            args_str = json.dumps(tc.get("arguments", {}))[:200]
-            texts.append(f"{name} {args_str}")
-    blob = " ".join(texts)[:max_chars]
-    return blob
+def _assistant_trajectory_text(run, max_chars: int = 4000) -> str:
+    parts = []
+    for message in run.transcript.assistant_messages:
+        if message.text:
+            parts.append(message.text)
+        for call in message.tool_calls:
+            parts.append(call.name)
+            if call.input:
+                parts.append(json.dumps(call.input, sort_keys=True)[:200])
+    return " ".join(p for p in parts if p).strip()[:max_chars]
+
+
+def _fallback_text_from_any_message(run) -> str:
+    for msg in reversed(run.transcript.messages):
+        parts = []
+        if msg.text:
+            parts.append(msg.text)
+        for call in msg.tool_calls:
+            parts.append(call.name)
+            if call.input:
+                parts.append(json.dumps(call.input, sort_keys=True)[:200])
+        if parts:
+            return " ".join(parts).strip()
+    return ""


 def tokenize(text: str) -> list[str]:
-    return [w for w in WORD_RE.findall(text.lower()) if w not in STOPWORDS]
+    return [w for w in WORD_RE.findall((text or "").lower()) if w not in STOPWORDS]


 def build_vocab(texts: list[str], top_k: int = 500) -> dict[str, int]:
-    """Build a vocab of the top-k most common tokens across all texts."""
-    counter = Counter()
-    for t in texts:
-        counter.update(set(tokenize(t)))
-    return {w: i for i, (w, _) in enumerate(counter.most_common(top_k))}
+    counts = Counter()
+    for text in texts:
+        counts.update(set(tokenize(text)))
+    return {word: idx for idx, (word, _) in enumerate(counts.most_common(top_k))}


 def vectorize(text: str, vocab: dict[str, int]) -> np.ndarray:
-    """TF-IDF-ish: token frequency normalized to unit L2 for cosine geometry."""
-    v = np.zeros(len(vocab), dtype=np.float32)
+    vec = np.zeros(len(vocab), dtype=np.float32)
    toks = tokenize(text)
    if not toks:
-        return v
+        return vec
    counts = Counter(toks)
-    for w, c in counts.items():
-        if w in vocab:
-            v[vocab[w]] = c
-    n = np.linalg.norm(v)
-    return v / n if n > 0 else v
+    for word, cnt in counts.items():
+        if word in vocab:
+            vec[vocab[word]] = cnt
+    norm = np.linalg.norm(vec)
+    return vec / norm if norm > 0 else vec


 def participation_ratio(X: np.ndarray) -> float:
-    """PR(X) = (tr Σ)² / tr(Σ²). Measures effective dimensionality 1–d."""
+    """PR(X) = (tr Sigma)^2 / tr(Sigma^2), an effective dimensionality proxy."""
    if X.shape[0] < 2:
        return 1.0
-    Sigma = np.cov(X.T)
-    if Sigma.ndim == 0:
+    sigma = np.cov(X.T)
+    if sigma.ndim == 0:
        return 1.0
-    tr = np.trace(Sigma)
-    tr_sq = np.trace(Sigma @ Sigma)
+    tr = np.trace(sigma)
+    tr_sq = np.trace(sigma @ sigma)
    if tr_sq < 1e-12:
        return 1.0
-    return float(tr ** 2 / tr_sq)
+    return float((tr**2) / tr_sq)


-def response_entropy(X: np.ndarray, n_clusters: int = 8) -> float:
-    """Entropy of a k-means-like discretization of responses.
-
-    Since we have small n per task (~27 responses), we cluster by nearest-
-    centroid using the top-few PCA directions. Simpler: use normalized
-    eigenvalues of covariance as a proxy for entropy over principal modes.
-    """
+def response_entropy(X: np.ndarray) -> float:
+    """Entropy over normalized covariance eigenvalues, in bits."""
    if X.shape[0] < 2:
        return 0.0
-    Sigma = np.cov(X.T)
-    eigs = np.linalg.eigvalsh(Sigma)
+    sigma = np.cov(X.T)
+    eigs = np.linalg.eigvalsh(sigma)
    eigs = np.clip(eigs, 1e-12, None)
-    eigs = eigs / eigs.sum()
-    return float(shannon_entropy(eigs, base=2))
+    probs = eigs / eigs.sum()
+    return float(-np.sum(probs * np.log2(probs)))


 def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> float:
-    """BOPS proxy: inter-run cosine similarity within same model.
-
-    High similarity = predictable (high BOPS). Low similarity = novel each run.
-    Returns mean cosine across all pairs within each model, averaged across models.
-    """
+    """Mean within-model pairwise cosine similarity across repeated runs."""
    per_model_means = []
-    for _model, vecs in run_vecs.items():
+    for vecs in run_vecs.values():
        if len(vecs) < 2:
            continue
        sims = []
@ -154,91 +136,88 @@ def bops_inter_run_predictability(run_vecs: dict[str, list[np.ndarray]]) -> floa
    return float(np.mean(per_model_means)) if per_model_means else 0.0


+def zscore(value: float, arr: np.ndarray) -> float:
+    std = arr.std()
+    return float((value - arr.mean()) / std) if std > 1e-12 else 0.0
+
+
 def main() -> None:
-    # Gather: per-task list of texts + per-model list of per-run vectors
+    parser = argparse.ArgumentParser(description="Compute posterior constraint index per task")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
    per_task_texts: dict[str, list[str]] = defaultdict(list)
-    per_task_model_runs: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
-    for model in MODELS:
-        model_dir = ARCH / model
-        if not model_dir.exists():
-            continue
-        for task_dir in model_dir.iterdir():
-            if not task_dir.is_dir():
-                continue
-            task = task_dir.name
-            for rf in sorted(task_dir.glob("run*.json")):
-                text = final_assistant_text(rf)
+    per_task_model_texts: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
+
+    use_fallback_messages = False
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            for run in runs:
+                text = _assistant_trajectory_text(run)
                if text:
-                    per_task_texts[task].append(text)
-                    per_task_model_runs[task][model].append(text)
+                    per_task_texts[task_id].append(text)
+                    per_task_model_texts[task_id][model_name].append(text)

-    print(f"Tasks with responses: {len(per_task_texts)}")
+    all_texts = [text for texts in per_task_texts.values() for text in texts]
+    if not all_texts:
+        use_fallback_messages = True
+        for model_name, task_runs in grouped.items():
+            for task_id, runs in task_runs.items():
+                for run in runs:
+                    text = _fallback_text_from_any_message(run)
+                    if text:
+                        per_task_texts[task_id].append(text)
+                        per_task_model_texts[task_id][model_name].append(text)
+        all_texts = [text for texts in per_task_texts.values() for text in texts]
+
+    if not all_texts:
+        raise SystemExit("No usable text found in cached transcripts.")

-    # Build a GLOBAL vocab across all tasks for comparable vector spaces
-    all_texts = [t for ts in per_task_texts.values() for t in ts]
    vocab = build_vocab(all_texts, top_k=500)
-    print(f"Global vocab size: {len(vocab)}")
-
-    # Compute per-task metrics
-    per_task: dict[str, dict] = {}
-    for task, texts in sorted(per_task_texts.items()):
-        if len(texts) < 5:
-            continue
-        X = np.stack([vectorize(t, vocab) for t in texts])  # (n_responses, vocab_dim)
+    per_task: dict[str, dict[str, float | str]] = {}
+    for task_id, texts in sorted(per_task_texts.items()):
+        X = np.stack([vectorize(text, vocab) for text in texts])
        pr = participation_ratio(X)
        ent = response_entropy(X)
-        # BOPS: within-model run predictability
-        model_vecs: dict[str, list[np.ndarray]] = {}
-        for m, ts in per_task_model_runs[task].items():
-            model_vecs[m] = [vectorize(t, vocab) for t in ts]
+        model_vecs = {
+            model_name: [vectorize(text, vocab) for text in model_texts]
+            for model_name, model_texts in per_task_model_texts[task_id].items()
+        }
        bops = bops_inter_run_predictability(model_vecs)
-        per_task[task] = {
+        per_task[task_id] = {
            "n_responses": len(texts),
            "PR": pr,
            "entropy": ent,
            "BOPS": bops,
+            "data_source": "fallback_any_message" if use_fallback_messages else "assistant_trajectory",
        }

-    # Z-score each component across tasks → combine into C(q)
+    if not per_task:
+        raise SystemExit("Not enough data to compute C(q).")
+
    prs = np.array([v["PR"] for v in per_task.values()])
    ents = np.array([v["entropy"] for v in per_task.values()])
    bopss = np.array([v["BOPS"] for v in per_task.values()])

-    def z(x, arr):
-        return float((x - arr.mean()) / (arr.std() or 1.0))
+    for task_id, v in per_task.items():
+        z_pr = zscore(v["PR"], prs)
+        z_ent = zscore(v["entropy"], ents)
+        z_bops = zscore(v["BOPS"], bopss)
+        v["z_PR"] = z_pr
+        v["z_entropy"] = z_ent
+        v["z_BOPS"] = z_bops
+        v["C_q"] = -z_pr - z_ent + z_bops

-    for task, v in per_task.items():
-        zpr = z(v["PR"], prs)
-        zent = z(v["entropy"], ents)
-        zbops = z(v["BOPS"], bopss)
-        # Paper: higher PR/entropy = MORE open-ended. Higher BOPS = MORE predictable.
-        # "Constraint" = opposite of openness. C(q) high ⇒ constrained task.
-        # So: C(q) = −z(PR) − z(entropy) + z(BOPS)
-        v["z_PR"] = zpr
-        v["z_entropy"] = zent
-        v["z_BOPS"] = zbops
-        v["C_q"] = -zpr - zent + zbops
-
-    # Sort + print
-    ranked = sorted(per_task.items(), key=lambda kv: -kv[1]["C_q"])
-    print(f"\n{'Task':<38} {'n':>3}  {'PR':>5}  {'H':>5}  {'BOPS':>5}  {'C(q)':>6}  (constraint level)")
-    print("-" * 78)
-    for task, v in ranked:
-        print(f"{task:<38} {v['n_responses']:>3}  {v['PR']:>5.2f}  {v['entropy']:>5.2f}  "
-              f"{v['BOPS']:>5.2f}  {v['C_q']:>+6.2f}")
-
-    out_path = ROOT / "reports" / "constraint_index.json"
-    out_path.parent.mkdir(exist_ok=True)
-    out_path.write_text(json.dumps(per_task, indent=2))
-    print(f"\nWrote: {out_path}")
-
-    # Bucket summary
-    highs = [t for t, v in per_task.items() if v["C_q"] > 0.5]
-    lows = [t for t, v in per_task.items() if v["C_q"] < -0.5]
-    mids = [t for t, v in per_task.items() if -0.5 <= v["C_q"] <= 0.5]
-    print(f"\nHigh-constraint (C>+0.5): {len(highs)} tasks  (responses converge)")
-    print(f"Mid:                       {len(mids)} tasks")
-    print(f"Low-constraint (C<-0.5):   {len(lows)} tasks  (responses diverge — open-ended)")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out_path = args.reports_dir / "constraint_index.json"
+    out_path.write_text(json.dumps(per_task, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
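
As a worked check on the formula C(q) = -z(PR(q)) - z(H(q)) + z(BOPS(q)), the sketch below computes PR and H directly from a list of covariance eigenvalues, using the identities tr(Sigma) = sum of eigenvalues and tr(Sigma^2) = sum of squared eigenvalues. It is an assumption-laden illustration rather than the script's code: the `*_from_eigs` and `constraint_index` names are hypothetical, the eigenvalues and BOPS values are made up, and `statistics.pstdev` stands in for NumPy's default population standard deviation.

```python
import math
from statistics import fmean, pstdev


def participation_ratio_from_eigs(eigs):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    total = sum(eigs)
    total_sq = sum(e * e for e in eigs)
    return (total ** 2) / total_sq if total_sq > 1e-12 else 1.0


def spectral_entropy_bits(eigs):
    """Shannon entropy, in bits, of the eigenvalues normalized to sum to 1."""
    clipped = [max(e, 1e-12) for e in eigs]
    total = sum(clipped)
    return -sum((e / total) * math.log2(e / total) for e in clipped)


def zscore(value, values):
    """Population z-score; returns 0.0 when the spread is degenerate."""
    std = pstdev(values)
    return (value - fmean(values)) / std if std > 1e-12 else 0.0


def constraint_index(pr, ent, bops, prs, ents, bopss):
    """Higher PR or entropy lowers C(q); higher BOPS raises it."""
    return -zscore(pr, prs) - zscore(ent, ents) + zscore(bops, bopss)


# Variance spread evenly over four modes versus concentrated in one mode.
flat = [1.0, 1.0, 1.0, 1.0]
peaked = [1.0, 0.0, 0.0, 0.0]
print(participation_ratio_from_eigs(flat), spectral_entropy_bits(flat))
print(participation_ratio_from_eigs(peaked), spectral_entropy_bits(peaked))

# Three made-up tasks, ordered from most constrained to most open.
prs, ents, bopss = [1.0, 2.0, 3.0], [0.0, 1.0, 2.0], [0.9, 0.5, 0.1]
scores = [constraint_index(p, h, b, prs, ents, bopss) for p, h, b in zip(prs, ents, bopss)]
print(scores)
```

A flat spectrum gives the maximum PR and entropy for its dimension, which pushes C(q) down; a spectrum dominated by one mode does the opposite.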
--- a/scripts/container_cherry_single.sh
+++ b/scripts/container_cherry_single.sh
@ -0,0 +1,198 @@
+#!/bin/bash
+# Cherry-pick variant of container_sweep_single.sh: runs ONLY the tasks listed
+# in $CHERRY_TASKS (comma-separated task IDs), with state-dir isolation.
+#
+# Required env vars:
+#   SWEEP_LABEL   (e.g. opus47)
+#   SWEEP_MODEL   (e.g. anthropic/claude-opus-4-7)
+#   SWEEP_PROFILE (absolute path in container; only passed when USE_PROFILE is set)
+#   CHERRY_TASKS  (comma-separated task IDs, e.g. "t2-ctx-pronoun-resolve,t3-fin-budget-monthly")
+# Optional env vars:
+#   SWEEP_LOGDIR (default /data/drift_2026-04-20-cherry), SWEEP_OUT_TAG (default v2026-4-20-cherry)
+
+set -u
+
+: "${SWEEP_LABEL:?SWEEP_LABEL required}"
+: "${SWEEP_MODEL:?SWEEP_MODEL required}"
+: "${SWEEP_PROFILE:?SWEEP_PROFILE required}"
+: "${CHERRY_TASKS:?CHERRY_TASKS required (comma-separated task IDs)}"
+
+: "${SWEEP_LOGDIR:=/data/drift_2026-04-20-cherry}"
+: "${SWEEP_OUT_TAG:=v2026-4-20-cherry}"
+
+cd /data
+
+LOGDIR="$SWEEP_LOGDIR"
+mkdir -p "$LOGDIR"
+
+export OPENCLAW_GATEWAY_TOKEN="local-dev-token-for-testing"
+export CLAWBENCH_RUN_CACHE_DIR="/data/run_cache"
+mkdir -p "$CLAWBENCH_RUN_CACHE_DIR"
+export NODE_OPTIONS="--max-old-space-size=4096"
+# OpenClaw 4.22+ has slower agents.create / sessions.create on cold start
+# (we observed 72s for opus-4-7). Bump RPC timeouts so the harness doesn't
+# cancel mid-flight. Override defaults of 30s / 60s respectively.
+export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-120}"
+export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
+export CLAWBENCH_PER_RUN_BUDGET_SECONDS="${CLAWBENCH_PER_RUN_BUDGET_SECONDS:-900}"
+export HERMES_STEP_TIMEOUT_SECONDS="${HERMES_STEP_TIMEOUT_SECONDS:-180}"
+
+# State-dir isolation (same as container_sweep_single.sh)
+SRC_STATE="/home/node/.openclaw"
+FRESH_STATE="/tmp/openclaw-state-${SWEEP_LABEL}-$$"
+echo "[state-isolate] cloning config from $SRC_STATE to $FRESH_STATE"
+mkdir -p "$FRESH_STATE"
+[ -f "$SRC_STATE/openclaw.json" ] && cp "$SRC_STATE/openclaw.json" "$FRESH_STATE/openclaw.json"
+[ -f "$SRC_STATE/exec-approvals.json" ] && cp "$SRC_STATE/exec-approvals.json" "$FRESH_STATE/exec-approvals.json"
+for d in identity devices tasks subagents flows cron; do
+  [ -d "$SRC_STATE/$d" ] && cp -r "$SRC_STATE/$d" "$FRESH_STATE/$d"
+done
+mkdir -p "$FRESH_STATE/agents" "$FRESH_STATE/workspace" "$FRESH_STATE/logs" "$FRESH_STATE/memory" "$FRESH_STATE/cache"
+export OPENCLAW_STATE_DIR="$FRESH_STATE"
+export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
+echo "[state-isolate] OPENCLAW_STATE_DIR=$OPENCLAW_STATE_DIR"
+
+python - <<'PY'
+import json
+import os
+from pathlib import Path
+
+cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
+data = json.loads(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}
+
+def set_nested(root, dotted, value):
+    cursor = root
+    parts = dotted.split(".")
+    for part in parts[:-1]:
+        child = cursor.get(part)
+        if not isinstance(child, dict):
+            child = {}
+            cursor[part] = child
+        cursor = child
+    cursor[parts[-1]] = value
+
+exec_host = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
+if exec_host not in {"auto", "gateway", "sandbox", "node"}:
+    raise SystemExit(f"invalid OPENCLAW_EXEC_HOST={exec_host!r}")
+
+set_nested(data, "tools.exec.host", exec_host)
+set_nested(data, "tools.exec.security", "full")
+set_nested(data, "tools.exec.ask", "off")
+set_nested(data, "approvals.exec.enabled", False)
+cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
+
+approvals_path = cfg_path.with_name("exec-approvals.json")
+approvals = {
+    "version": 1,
+    "socket": {
+        "path": str(approvals_path.with_suffix(".sock")),
+        "token": "container-cherry-eval-token",
+    },
+    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
+    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
+}
+approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
+PY
+
+# Map model to cache subdir (for archiving)
+case "$SWEEP_MODEL" in
+  anthropic/claude-opus-4-7)        CACHE_SUB="anthropic_claude-opus-4-7" ;;
+  anthropic/claude-opus-4-6)        CACHE_SUB="anthropic_claude-opus-4-6" ;;
+  anthropic/claude-sonnet-4-6)      CACHE_SUB="anthropic_claude-sonnet-4-6" ;;
+  openai/gpt-5.5)                   CACHE_SUB="openai_gpt-5.5" ;;
+  openai/gpt-5.4)                   CACHE_SUB="openai_gpt-5.4" ;;
+  google/gemini-3.1-pro-preview)    CACHE_SUB="google_gemini-3.1-pro-preview" ;;
+  openrouter/z-ai/glm-5.1)          CACHE_SUB="openrouter_z-ai_glm-5.1" ;;
+  openrouter/qwen/qwen3.6-plus)     CACHE_SUB="openrouter_qwen_qwen3.6-plus" ;;
+  openrouter/minimax/minimax-m2.7)  CACHE_SUB="openrouter_minimax_minimax-m2.7" ;;
+  openrouter/moonshotai/kimi-k2.6)  CACHE_SUB="openrouter_moonshotai_kimi-k2.6" ;;
+  openrouter/moonshotai/kimi-k2.5)  CACHE_SUB="openrouter_moonshotai_kimi-k2.5" ;;
+  openrouter/deepseek/deepseek-v4-pro) CACHE_SUB="openrouter_deepseek_deepseek-v4-pro" ;;
+  deepseek/deepseek-v4-pro)         CACHE_SUB="deepseek_deepseek-v4-pro" ;;
+  deepseek/v4-pro)                  CACHE_SUB="deepseek_v4-pro" ;;
+  *) CACHE_SUB="" ;;
+esac
+
+OUT="$LOGDIR/docker_${SWEEP_LABEL}_${SWEEP_OUT_TAG}.json"
+LOG="$LOGDIR/docker_${SWEEP_LABEL}_${SWEEP_OUT_TAG}.log"
+GWLOG="$LOGDIR/gateway_${SWEEP_LABEL}.log"
+
+echo "===== CHERRY-PICK SWEEP $(date '+%Y-%m-%d %H:%M:%S') ====="
+echo "label:   $SWEEP_LABEL"
+echo "model:   $SWEEP_MODEL"
+echo "tasks:   $CHERRY_TASKS"
+echo "out:     $OUT"
+
+# Force-clear this model's run_cache (including fixed-task slots — so they
+# actually re-run against the new image instead of hitting old cache).
+if [ -n "$CACHE_SUB" ] && [ -d "$CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB" ]; then
+  echo "clearing cache: $CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB"
+  rm -rf "$CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB"
+fi
+[ -f "$OUT" ] && rm -f "$OUT"
+
+# Start gateway with bumped heap
+echo "Starting gateway on :18789 (heap=4GB) ..."
+openclaw gateway --port 18789 > "$GWLOG" 2>&1 &
+GATEWAY_PID=$!
+ready=0
+for i in $(seq 1 120); do
+  if curl -sf -H "Authorization: Bearer $OPENCLAW_GATEWAY_TOKEN" http://127.0.0.1:18789/ready > /dev/null 2>&1; then
+    echo "Gateway ready after ${i}s"
+    ready=1
+    break
+  fi
+  sleep 1
+done
+if [ $ready -ne 1 ]; then
+  echo "ERROR: gateway failed to become ready within 120s"
+  tail -30 "$GWLOG"
+  exit 1
+fi
+
+# Build -t args from comma-separated list
+TASK_ARGS=()
+IFS=',' read -ra TASK_ARR <<< "$CHERRY_TASKS"
+for t in "${TASK_ARR[@]}"; do
+  TASK_ARGS+=("-t" "$t")
+done
+
+echo "===== $(date '+%H:%M:%S') running clawbench with tasks: ${TASK_ARR[*]} ====="
+# NOTE: --profile is intentionally omitted unless USE_PROFILE is set. The
+# legacy frontier_*.yaml profile format is incompatible with OpenClaw 4.22+:
+# it loads n_tools_total=0 and starves the agent of tools, so all runs fail
+# with environment_unavailable or timeout. The default openclaw tool stack is
+# the same for all models, so the comparison stays apples-to-apples.
+PROFILE_ARG=""
+if [ -n "${USE_PROFILE:-}" ] && [ -f "$SWEEP_PROFILE" ]; then
+  PROFILE_ARG="--profile $SWEEP_PROFILE"
+fi
+clawbench run \
+  --model "$SWEEP_MODEL" \
+  --runs 3 \
+  --concurrency "${CLAWBENCH_CONCURRENCY:-1}" \
+  $PROFILE_ARG \
+  --judge-model "anthropic/claude-sonnet-4-6" \
+  "${TASK_ARGS[@]}" \
+  -o "$OUT" \
+  > "$LOG" 2>&1
+status=$?
+
+if [ $status -eq 0 ]; then
+  echo "===== $(date '+%H:%M:%S') done $SWEEP_LABEL (exit 0) ====="
+else
+  echo "===== $(date '+%H:%M:%S') FAILED $SWEEP_LABEL (exit $status) ====="
+  tail -20 "$LOG"
+fi
+
+# Archive cache to v2026-4-20-cherry tag
+# shellcheck disable=SC1091
+if source "$(dirname "$0")/_archive_cache.sh" 2>/dev/null; then archive_run_cache || echo "[archive] archive_run_cache failed"; else echo "[archive] helper missing"; fi
+
+kill $GATEWAY_PID 2>/dev/null
+wait $GATEWAY_PID 2>/dev/null
+
+# Clean up isolated state dir
+[ -n "${FRESH_STATE:-}" ] && [ -d "$FRESH_STATE" ] && rm -rf "$FRESH_STATE"
+
+exit $status
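
The `set_nested` helper embedded in the heredoc above creates intermediate dictionaries on demand and replaces any non-dict value it meets along the dotted path. A small standalone sketch of that behavior follows; the helper body mirrors the script, while the starting configs (`fresh`, `clobbered`) are made-up examples.

```python
def set_nested(root, dotted, value):
    """Set root[a][b][c] = value for dotted path 'a.b.c', creating dicts as needed."""
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            # Missing or non-dict intermediates are replaced with a new dict.
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value


# Starting from an empty config, sibling keys share the same nested dict.
fresh = {}
set_nested(fresh, "tools.exec.host", "gateway")
set_nested(fresh, "tools.exec.ask", "off")

# A scalar sitting where a dict is expected is silently overwritten.
clobbered = {"tools": "legacy-string"}
set_nested(clobbered, "tools.exec.security", "full")

print(fresh)
print(clobbered)
```

The silent overwrite is convenient for forcing benchmark settings, but it also means a malformed source `openclaw.json` will not raise an error at this step.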
--- a/scripts/container_lane_eval.sh
+++ b/scripts/container_lane_eval.sh
@ -0,0 +1,231 @@
+#!/bin/bash
+# Run one OpenClaw model/profile through the HF-style isolated lane worker.
+set -Eeuo pipefail
+
+: "${SWEEP_MODEL:?SWEEP_MODEL required}"
+: "${SWEEP_LABEL:?SWEEP_LABEL required}"
+: "${SWEEP_OUT_TAG:=lane-container}"
+: "${SWEEP_LANES:=3}"
+: "${SWEEP_RUNS:=1}"
+: "${SWEEP_LOGDIR:=/data/results}"
+: "${CLAWBENCH_PER_RUN_BUDGET_SECONDS:=900}"
+: "${CLAWBENCH_PER_TURN_TIMEOUT_SECONDS:=300}"
+: "${OPENCLAW_EXEC_HOST:=gateway}"
+
+cd /home/node/app
+export CLAWBENCH_LOCAL_QUEUE_DIR="${CLAWBENCH_LOCAL_QUEUE_DIR:-/data/queue/$SWEEP_LABEL}"
+mkdir -p "$SWEEP_LOGDIR" /data/results "$CLAWBENCH_LOCAL_QUEUE_DIR" /data/run_cache /data/lane_runtime
+
+export HF_TOKEN=""
+export OPENCLAW_GATEWAY_TOKEN="${OPENCLAW_GATEWAY_TOKEN:-local-dev-token-for-testing}"
+export OPENCLAW_SKIP_GMAIL_WATCHER=1
+export OPENCLAW_SKIP_CANVAS_HOST=1
+export OPENCLAW_NO_RESPAWN=1
+export CLAWBENCH_DISABLE_GATEWAY_DEVICE_IDENTITY=1
+export CLAWBENCH_PER_RUN_BUDGET_SECONDS
+export CLAWBENCH_PER_TURN_TIMEOUT_SECONDS
+export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-180}"
+export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
+export CLAWBENCH_GATEWAY_HEALTH_TIMEOUT_SECONDS="${CLAWBENCH_GATEWAY_HEALTH_TIMEOUT_SECONDS:-240}"
+export CLAWBENCH_LANE_STARTUP_STAGGER_SECONDS="${CLAWBENCH_LANE_STARTUP_STAGGER_SECONDS:-90}"
+export CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS="${CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS:-90}"
+export CLAWBENCH_KEEP_PARALLEL_LANE_ROOT="${CLAWBENCH_KEEP_PARALLEL_LANE_ROOT:-0}"
+export CLAWBENCH_PARALLEL_LANE_ROOT="/data/lane_runtime/$SWEEP_LABEL"
+export CLAWBENCH_TOOL_PROFILE_NAME="${CLAWBENCH_TOOL_PROFILE_NAME:-$SWEEP_LABEL}"
+export NODE_OPTIONS="${NODE_OPTIONS:-"--max-old-space-size=4096"}"
+if command -v npm >/dev/null 2>&1; then
+  export NODE_PATH="${NODE_PATH:-$(npm root -g 2>/dev/null || true)}"
+fi
+
+SRC_STATE="${OPENCLAW_CONFIG_SOURCE:-/config/openclaw}"
+if [ ! -d "$SRC_STATE" ]; then
+  SRC_STATE="/home/node/.openclaw"
+fi
+
+safe_model="${SWEEP_MODEL//\//_}"
+safe_model="${safe_model//:/_}"
+OUT="$SWEEP_LOGDIR/${SWEEP_LABEL}_openclaw_${safe_model}_${SWEEP_OUT_TAG}.json"
+LOG="$SWEEP_LOGDIR/${SWEEP_LABEL}_openclaw_${safe_model}_${SWEEP_OUT_TAG}.log"
+export SWEEP_OUTPUT_PATH="$OUT"
+
+FRESH_HOME="/tmp/openclaw-home-${SWEEP_LABEL}-$$"
+FRESH_STATE="$FRESH_HOME/.openclaw"
+rm -rf "$FRESH_HOME" "$CLAWBENCH_PARALLEL_LANE_ROOT"
+mkdir -p "$FRESH_STATE" "$FRESH_HOME/.config"
+if [ -f "$SRC_STATE/openclaw.json" ]; then
+  cp "$SRC_STATE/openclaw.json" "$FRESH_STATE/openclaw.json"
+fi
+if [ -d "$SRC_STATE/plugins" ]; then
+  mkdir -p "$FRESH_STATE/plugins"
+  cp -R "$SRC_STATE/plugins/." "$FRESH_STATE/plugins/" 2>/dev/null || true
+fi
+mkdir -p \
+  "$FRESH_STATE/agents" \
+  "$FRESH_STATE/workspace" \
+  "$FRESH_STATE/logs" \
+  "$FRESH_STATE/memory" \
+  "$FRESH_STATE/cache" \
+  "$FRESH_STATE/identity" \
+  "$FRESH_STATE/devices" \
+  "$FRESH_STATE/tasks" \
+  "$FRESH_STATE/subagents" \
+  "$FRESH_STATE/flows" \
+  "$FRESH_STATE/cron"
+
+export HOME="$FRESH_HOME"
+export OPENCLAW_HOME="$FRESH_HOME"
+export OPENCLAW_STATE_DIR="$FRESH_STATE"
+export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
+export XDG_CONFIG_HOME="$FRESH_HOME/.config"
+
+python - <<'PY'
+import json
+import os
+from pathlib import Path
+
+cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
+if not cfg_path.exists():
+    raise SystemExit("missing openclaw.json")
+data = json.loads(cfg_path.read_text(encoding="utf-8"))
+
+def set_nested(root, dotted, value):
+    cursor = root
+    parts = dotted.split(".")
+    for part in parts[:-1]:
+        child = cursor.get(part)
+        if not isinstance(child, dict):
+            child = {}
+            cursor[part] = child
+        cursor = child
+    cursor[parts[-1]] = value
+
+agents = data.setdefault("agents", {})
+if isinstance(agents, dict):
+    agents["list"] = []
+
+channels = data.get("channels")
+if isinstance(channels, dict):
+    for channel in channels.values():
+        if isinstance(channel, dict):
+            channel["enabled"] = False
+            exec_approvals = channel.get("execApprovals")
+            if not isinstance(exec_approvals, dict):
+                exec_approvals = {}
+                channel["execApprovals"] = exec_approvals
+            exec_approvals["enabled"] = False
+
+plugins = data.setdefault("plugins", {})
+stale = {"marxbiotech-git-tools", "lab"}
+allow = plugins.get("allow")
+if isinstance(allow, list):
+    plugins["allow"] = [item for item in allow if item not in stale]
+entries = plugins.get("entries")
+if isinstance(entries, dict):
+    for item in stale:
+        entries.pop(item, None)
+
+set_nested(data, "browser.headless", True)
+set_nested(data, "browser.noSandbox", True)
+set_nested(data, "gateway.reload.mode", "off")
+set_nested(data, "agents.defaults.skipBootstrap", True)
+set_nested(data, "agents.defaults.sandbox.mode", "off")
+set_nested(data, "agents.defaults.model.primary", os.environ["SWEEP_MODEL"])
+set_nested(data, "agents.defaults.subagents.model.primary", os.environ["SWEEP_MODEL"])
+set_nested(
+    data,
+    "agents.defaults.systemPromptOverride",
+    "You are running an OpenClaw benchmark task. Complete the user's request in the current "
+    "workspace using the available tools when needed. For file, code, browser, shell, or memory "
+    "tasks, make the requested changes directly and verify them when practical. Do not ask "
+    "follow-up questions during the benchmark. Keep any final reply brief.",
+)
+set_nested(data, "tools.exec.host", os.environ.get("OPENCLAW_EXEC_HOST", "gateway"))
+set_nested(data, "tools.exec.security", "full")
+set_nested(data, "tools.exec.ask", "off")
+set_nested(data, "approvals.exec.enabled", False)
+
+models = data.setdefault("agents", {}).setdefault("defaults", {}).setdefault("models", {})
+model_entry = models.setdefault(os.environ["SWEEP_MODEL"], {})
+params = model_entry.setdefault("params", {})
+params["fastMode"] = True
+if os.environ["SWEEP_MODEL"].startswith("openai/"):
+    params["transport"] = "sse"
+    params["openaiWsWarmup"] = False
+
+cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
+
+approvals_path = cfg_path.with_name("exec-approvals.json")
+approvals = {
+    "version": 1,
+    "socket": {
+        "path": str(approvals_path.with_suffix(".sock")),
+        "token": "container-lane-eval-token",
+    },
+    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
+    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
+}
+approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
+PY
+
+echo "===== CONTAINER LANE EVAL START $(date '+%Y-%m-%d %H:%M:%S') ====="
+echo "label:    $SWEEP_LABEL"
+echo "model:    $SWEEP_MODEL"
+echo "runs:     $SWEEP_RUNS"
+echo "lanes:    $SWEEP_LANES"
+echo "tasks:    ${SWEEP_TASKS:-${CHERRY_TASKS:-all}}"
+echo "out:      $OUT"
+echo "log:      $LOG"
+echo "home:     $HOME"
+echo "state:    $OPENCLAW_STATE_DIR"
+openclaw --version 2>/dev/null || true
+
+set +e
+python - <<'PY' > "$LOG" 2>&1
+import asyncio
+import json
+import logging
+import os
+import shutil
+from pathlib import Path
+
+from clawbench.queue import JobQueue, JobStatus, SubmissionRequest
+from clawbench.worker import EvalWorker, RESULTS_DIR
+
+logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
+
+async def main() -> int:
+    queue = JobQueue()
+    queue._jobs.clear()
+    queue._save_local()
+    task_ids_raw = os.environ.get("SWEEP_TASKS") or os.environ.get("CHERRY_TASKS") or ""
+    task_ids = [item.strip() for item in task_ids_raw.split(",") if item.strip()]
+    request = SubmissionRequest(
+        model=os.environ["SWEEP_MODEL"],
+        runs_per_task=int(os.environ["SWEEP_RUNS"]),
+        max_parallel_lanes=int(os.environ["SWEEP_LANES"]),
+        task_ids=task_ids,
+        prompt_variant=os.environ.get("SWEEP_PROMPT_VARIANT", "clear"),
+        judge_model=os.environ.get("CLAWBENCH_JUDGE_MODEL", ""),
+        notes=os.environ.get("SWEEP_LABEL", ""),
+    )
+    job = await queue.submit(request)
+    worker = EvalWorker(queue)
+    await worker._process_job(job)
+    final = await queue.get_status(job.job_id)
+    print(json.dumps(final.model_dump() if final else {}, indent=2), flush=True)
+    if final is None or final.status != JobStatus.FINISHED or not final.result_id:
+        return 1
+    result_path = RESULTS_DIR / f"{final.result_id}.json"
+    output_path = Path(os.environ["SWEEP_OUTPUT_PATH"])
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    shutil.copy2(result_path, output_path)
+    return 0
+
+raise SystemExit(asyncio.run(main()))
+PY
+status=$?
+set -e
+
+echo "===== lane eval exit=$status $(date '+%Y-%m-%d %H:%M:%S') ====="
+tail -n 120 "$LOG" 2>/dev/null || true
+exit "$status"
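Review note: both sweep scripts patch `openclaw.json` through the same `set_nested` helper. The sketch below reproduces that helper as it appears in the diff and runs it on a made-up config, to show what happens when an intermediate key is missing or is not a dict. The sample keys and values are illustrative assumptions, not a real OpenClaw config.

```python
def set_nested(root, dotted, value):
    """Set root[a][b][c] = value for dotted == "a.b.c", creating dicts as needed."""
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            # Missing or non-dict intermediates are replaced with a fresh dict.
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value


# Illustrative input only; not a real openclaw.json.
config = {"tools": "legacy-string", "browser": {"headless": False}}
set_nested(config, "tools.exec.ask", "off")
set_nested(config, "browser.headless", True)
print(config)
```

A non-dict intermediate (here the string stored under `tools`) is silently overwritten. That is convenient for a throwaway benchmark state directory, but it would discard data if the same helper were pointed at a hand-edited config.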
--- a/scripts/container_sweep_single.sh
+++ b/scripts/container_sweep_single.sh
@ -43,6 +43,13 @@ mkdir -p "$CLAWBENCH_RUN_CACHE_DIR"
 # OOM fix: give the gateway Node process a 4GB old-space ceiling instead of the default ~2GB.
 # Scoped via env so we don't stomp on other Node processes (clawbench itself is python).
 export NODE_OPTIONS="--max-old-space-size=4096"
+# OpenClaw 4.22+ has slower agents.create / sessions.create on cold start
+# (we observed 72s for opus-4-7). Raise the connect/request RPC timeouts from
+# their 30s / 60s defaults so the harness doesn't cancel mid-flight.
+export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-120}"
+export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
+export CLAWBENCH_PER_RUN_BUDGET_SECONDS="${CLAWBENCH_PER_RUN_BUDGET_SECONDS:-900}"
+export HERMES_STEP_TIMEOUT_SECONDS="${HERMES_STEP_TIMEOUT_SECONDS:-180}"

 # State-dir isolation: the shared /home/node/.openclaw mount accumulates cruft
 # across sweeps (agents/, workspace/, logs/, memory/, stale openclaw.json.*.tmp)
@ -73,23 +80,68 @@ done
 # Ensure runtime dirs exist but are empty
 mkdir -p "$FRESH_STATE/agents" "$FRESH_STATE/workspace" "$FRESH_STATE/logs" "$FRESH_STATE/memory" "$FRESH_STATE/cache"
 export OPENCLAW_STATE_DIR="$FRESH_STATE"
+export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
 echo "[state-isolate] OPENCLAW_STATE_DIR=$OPENCLAW_STATE_DIR"
 du -sh "$FRESH_STATE" 2>/dev/null | sed 's/^/[state-isolate] size: /'

+python - <<'PY'
+import json
+import os
+from pathlib import Path
+
+cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
+data = json.loads(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}
+
+def set_nested(root, dotted, value):
+    cursor = root
+    parts = dotted.split(".")
+    for part in parts[:-1]:
+        child = cursor.get(part)
+        if not isinstance(child, dict):
+            child = {}
+            cursor[part] = child
+        cursor = child
+    cursor[parts[-1]] = value
+
+exec_host = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
+if exec_host not in {"auto", "gateway", "sandbox", "node"}:
+    raise SystemExit(f"invalid OPENCLAW_EXEC_HOST={exec_host!r}")
+
+set_nested(data, "tools.exec.host", exec_host)
+set_nested(data, "tools.exec.security", "full")
+set_nested(data, "tools.exec.ask", "off")
+set_nested(data, "approvals.exec.enabled", False)
+cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
+
+approvals_path = cfg_path.with_name("exec-approvals.json")
+approvals = {
+    "version": 1,
+    "socket": {
+        "path": str(approvals_path.with_suffix(".sock")),
+        "token": "container-single-eval-token",
+    },
+    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
+    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
+}
+approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
+PY
+
 # Map label -> cache subdir (matches what clawbench writes)
 case "$SWEEP_MODEL" in
  anthropic/claude-opus-4-7)        CACHE_SUB="anthropic_claude-opus-4-7" ;;
  anthropic/claude-sonnet-4-7)      CACHE_SUB="anthropic_claude-sonnet-4-7" ;;
  anthropic/claude-opus-4-6)        CACHE_SUB="anthropic_claude-opus-4-6" ;;
  anthropic/claude-sonnet-4-6)      CACHE_SUB="anthropic_claude-sonnet-4-6" ;;
+  openai/gpt-5.5)                   CACHE_SUB="openai_gpt-5.5" ;;
  openai/gpt-5.4)                   CACHE_SUB="openai_gpt-5.4" ;;
  openai/gpt-5.2)                   CACHE_SUB="openai_gpt-5.2" ;;
  google/gemini-3.1-pro-preview)    CACHE_SUB="google_gemini-3.1-pro-preview" ;;
  openrouter/z-ai/glm-5.1)          CACHE_SUB="openrouter_z-ai_glm-5.1" ;;
  openrouter/qwen/qwen3.6-plus)     CACHE_SUB="openrouter_qwen_qwen3.6-plus" ;;
  openrouter/minimax/minimax-m2.7)  CACHE_SUB="openrouter_minimax_minimax-m2.7" ;;
+  openrouter/moonshotai/kimi-k2.6)  CACHE_SUB="openrouter_moonshotai_kimi-k2.6" ;;
  openrouter/moonshotai/kimi-k2.5)  CACHE_SUB="openrouter_moonshotai_kimi-k2.5" ;;
-  # kimi-k2.6 is not yet supported in the openclaw version under test — skip.
+  deepseek/v4-pro)                  CACHE_SUB="deepseek_v4-pro" ;;
  *) CACHE_SUB="" ;;
 esac

@ -139,11 +191,19 @@ if [ $ready -ne 1 ]; then
 fi

 echo "===== $(date '+%H:%M:%S') starting $SWEEP_LABEL ($SWEEP_MODEL) ====="
+# NOTE: --profile is intentionally omitted unless USE_PROFILE is non-empty and
+# $SWEEP_PROFILE exists. The legacy frontier_*.yaml profile format is
+# incompatible with OpenClaw 4.22+ (it loads n_tools_total=0), so runs use the
+# default openclaw tool stack, identical across models, keeping comparisons valid.
+PROFILE_ARG=""
+if [ -n "${USE_PROFILE:-}" ] && [ -f "$SWEEP_PROFILE" ]; then
+  PROFILE_ARG="--profile $SWEEP_PROFILE"
+fi
 clawbench run \
  --model "$SWEEP_MODEL" \
  --runs 3 \
-  --concurrency 4 \
-  --profile "$SWEEP_PROFILE" \
+  --concurrency "${CLAWBENCH_CONCURRENCY:-1}" \
+  $PROFILE_ARG \
  --judge-model "anthropic/claude-sonnet-4-6" \
  -o "$OUT" \
  > "$LOG" 2>&1
--- a/scripts/generate_dynamical_report.py
+++ b/scripts/generate_dynamical_report.py
@ -1,221 +1,144 @@
-"""Assemble a combined dynamical-systems report integrating:
-  - Constraint Index C(q) per task
-  - Regime classification per run
-  - Seed vs capability variance
-  - Survival / hazard analysis
+#!/usr/bin/env python3
+"""Assemble a combined posterior dynamical-systems markdown report.

-Requires: reports/constraint_index.json, reports/regimes.json,
-          reports/variance_decomposition.json, reports/survival_analysis.json
+Inputs:
+    - constraint_index.json
+    - regimes.json
+    - variance_decomposition.json
+    - survival_analysis.json
+    - snr_weighted_ranking.json (optional)

-Output: reports/EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md
+Output:
+    - EVAL_REPORT_DYNAMICAL.md
+
+The goal is to keep a compact human-readable summary next to the machine
+outputs produced by the posterior analysis pipeline.
 """

 from __future__ import annotations

+import argparse
 import json
 from collections import Counter, defaultdict
 from pathlib import Path
-from statistics import mean

-ROOT = Path(__file__).resolve().parent.parent
-REPORTS = ROOT / "reports"

-MODEL_MAP = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+def _read_json(path: Path):
+    if not path.exists():
+        raise SystemExit(f"Missing required report file: {path}")
+    return json.loads(path.read_text(encoding="utf-8"))


 def main() -> None:
-    cq = json.loads((REPORTS / "constraint_index.json").read_text())
-    regimes = json.loads((REPORTS / "regimes.json").read_text())
-    variance = json.loads((REPORTS / "variance_decomposition.json").read_text())
-    survival = json.loads((REPORTS / "survival_analysis.json").read_text())
-
-    lines = []
-    L = lines.append
-    L("# ClawBench — Dynamical Systems Analysis (v2026-4-19-full)")
-    L("")
-    L("Inspired by *\"When LLMs Are Dreaming, Where Do They Go?\"* — treats")
-    L("agent runs as dynamical systems and extracts signal ClawBench's flat")
-    L("run_score can't: task constraint level, per-run regime, noise vs")
-    L("signal ratio, and per-turn survival curves.")
-    L("")
-
-    # ----------------- 1. Constraint Index summary -----------------
-    L("## 1. Constraint Index C(q) per task")
-    L("")
-    L("C(q) = −z(PR) − z(entropy) + z(BOPS). High C(q) = task is constrained")
-    L("(responses converge); low C(q) = open-ended (responses diverge).")
-    L("")
-    high = sorted([(t, v) for t, v in cq.items() if v["C_q"] > 0.5],
-                  key=lambda kv: -kv[1]["C_q"])
-    low = sorted([(t, v) for t, v in cq.items() if v["C_q"] < -0.5],
-                 key=lambda kv: kv[1]["C_q"])
-    mid = [t for t, v in cq.items() if -0.5 <= v["C_q"] <= 0.5]
-    L(f"- **High-constraint ({len(high)} tasks, C>+0.5):** {', '.join(t for t, _ in high[:5])}, …")
-    L(f"- **Low-constraint ({len(low)} tasks, C<−0.5):** {', '.join(t for t, _ in low[:5])}, …")
-    L(f"- **Middle ({len(mid)} tasks):** {', '.join(mid[:5])}, …")
-    L("")
-    L("Top 5 most-constrained and most-divergent tasks:")
-    L("")
-    L("| Constraint | Task | PR | Entropy | BOPS | C(q) |")
-    L("|---|---|:---:|:---:|:---:|:---:|")
-    for t, v in high[:5]:
-        L(f"| HIGH | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
-    for t, v in low[:5]:
-        L(f"| LOW | `{t}` | {v['PR']:.2f} | {v['entropy']:.2f} | {v['BOPS']:.2f} | **{v['C_q']:+.2f}** |")
-    L("")
-
-    # ----------------- 2. Regime distribution -----------------
-    L("## 2. Dynamical regime per run")
-    L("")
-    L("Each run's turn-by-turn trajectory classified by drift, recurrence,")
-    L("and support volume thresholds (quartile-based).")
-    L("")
-    pm = defaultdict(Counter)
-    for key, v in regimes.items():
-        model_sub = key.split("/")[0]
-        # Reverse-map to label
-        label = next((l for l, (s, _) in MODEL_MAP.items() if s == model_sub), None)
-        if label:
-            pm[label][v["regime"]] += 1
-    L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
-    L("|---|:---:|:---:|:---:|:---:|:---:|")
-    for label, (_sub, pretty) in MODEL_MAP.items():
-        c = pm[label]
-        L(f"| {pretty} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
-          f"{c['diffusive']} | {c['mixed']} |")
-    L("")
-    L("**Interpretation:**")
-    L("- `trapped` = low drift + small support: agent converges to a point.")
-    L("  Often good on constrained tasks, sometimes 'stuck'.")
-    L("- `limit_cycle` = repeats similar states non-consecutively: tool-use loop.")
-    L("- `diffusive` = keeps exploring without converging. Goal drift risk.")
-    L("- `mixed` = no strong signature.")
-    L("")
-    L("Notable findings:")
-    L("")
-    # Find outliers
-    trap_counts = [(label, pm[label]["trapped"]) for label in MODEL_MAP]
-    cycle_counts = [(label, pm[label]["limit_cycle"]) for label in MODEL_MAP]
-    trap_counts.sort(key=lambda x: -x[1])
-    cycle_counts.sort(key=lambda x: -x[1])
-    L(f"- Most `trapped` runs: **{MODEL_MAP[trap_counts[0][0]][1]}** ({trap_counts[0][1]} runs) —")
-    L(f"  converges aggressively; often one-shot answer without iteration.")
-    L(f"- Most `limit_cycle` runs: **{MODEL_MAP[cycle_counts[0][0]][1]}** ({cycle_counts[0][1]} runs) —")
-    L(f"  repeats tool patterns between turns; check for productive vs stuck loops.")
-    L("")
-
-    # ----------------- 3. Variance decomposition -----------------
-    L("## 3. Seed-noise vs capability-signal")
-    L("")
-    agg = variance["aggregate"]
-    L(f"- **Seed-noise variance** (same model, 3 runs): **{agg['mean_seed_var']:.4f}**")
-    L(f"- **Capability variance** (across models): **{agg['mean_cap_var']:.4f}**")
-    L(f"- **Capability fraction: {agg['capability_fraction']:.1%}**")
-    L(f"  (= fraction of benchmark variance that reflects real model differences)")
-    L("")
-    L("**The other ~47% is seed noise.** Any ranking gap < √(2·seed_var) ≈")
-    L(f"0.20 between two models is within noise. Top-5 models' gap is 0.02 →")
-    L("**statistically indistinguishable.**")
-    L("")
-    L("### SNR tiers across 40 tasks")
-    L("")
-    per_task = variance["per_task"]
-    hi = [r for r in per_task if r["snr"] >= 5]
-    mid = [r for r in per_task if 1 <= r["snr"] < 5]
-    lo = [r for r in per_task if r["snr"] < 1]
-    L(f"- **High-SNR ({len(hi)} tasks, SNR ≥ 5):** reliably discriminate models")
-    for r in hi[:3]:
-        L(f"  - `{r['task']}` (SNR={r['snr']:.1f})")
-    L(f"- **Mid-SNR ({len(mid)} tasks, 1 ≤ SNR < 5):** moderate signal")
-    L(f"- **Low-SNR ({len(lo)} tasks, SNR < 1):** seed noise dominates; these")
-    L(f"  tasks give essentially random rankings")
-    for r in sorted(lo, key=lambda x: x['snr'])[:3]:
-        L(f"  - `{r['task']}` (SNR={r['snr']:.2f}) — random")
-    L("")
-
-    # ----------------- 4. Survival analysis -----------------
-    L("## 4. Per-turn survival: when do runs fail?")
-    L("")
-    L("T_F = first turn where agent emits empty response or run ends in failure.")
-    L("S(t) = fraction of runs still on-track past turn t. Low = dies early.")
-    L("")
-    L("| Model | Median fail turn | S(3) | S(5) | S(8) | S(12) | S(20) |")
-    L("|---|:---:|:---:|:---:|:---:|:---:|:---:|")
-    for label, (_sub, pretty) in MODEL_MAP.items():
-        d = survival.get(label, {})
-        surv = d.get("survival", [0]*20)
-        med = d.get("median_fail_turn", "—")
-        med_str = f"{med:.1f}" if isinstance(med, (int, float)) and med != float("inf") else str(med)
-        L(f"| {pretty} | {med_str} | {surv[2]:.2f} | {surv[4]:.2f} | "
-          f"{surv[7]:.2f} | {surv[11]:.2f} | {surv[19]:.2f} |")
-    L("")
-    # Narrative
-    surv_rank_t8 = sorted(
-        [(label, survival[label]["survival"][7])
-         for label in MODEL_MAP if label in survival],
-        key=lambda x: -x[1]
+    parser = argparse.ArgumentParser(description="Generate a combined dynamical report markdown")
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument(
+        "--output",
+        type=Path,
+        default=None,
+        help="Markdown output path; defaults to <reports-dir>/EVAL_REPORT_DYNAMICAL.md",
    )
-    best = MODEL_MAP[surv_rank_t8[0][0]][1]
-    worst = MODEL_MAP[surv_rank_t8[-1][0]][1]
-    L(f"- **{best}** survives longest — {surv_rank_t8[0][1]:.0%} of runs still")
-    L(f"  producing output at turn 8.")
-    L(f"- **{worst}** dies earliest — only {surv_rank_t8[-1][1]:.0%} make it to turn 8.")
+    args = parser.parse_args()
+
+    reports = args.reports_dir
+    output_path = args.output or (reports / "EVAL_REPORT_DYNAMICAL.md")
+    cq = _read_json(reports / "constraint_index.json")
+    regimes = _read_json(reports / "regimes.json")
+    variance = _read_json(reports / "variance_decomposition.json")
+    survival = _read_json(reports / "survival_analysis.json")
+    ranking_path = reports / "snr_weighted_ranking.json"
+    ranking = json.loads(ranking_path.read_text(encoding="utf-8")) if ranking_path.exists() else None
+
+    lines: list[str] = []
+    L = lines.append
+
+    L("# ClawBench Posterior Dynamical Report")
    L("")
-    L("This is signal invisible in flat run_score: two models can score")
-    L("similarly but have very different failure profiles. Pick accordingly")
-    L("for long-horizon deployments.")
+    L("This report combines posterior-only diagnostics from cached run artifacts.")
    L("")

-    # ----------------- 5. Integrated view -----------------
-    L("## 5. Integrated view — combining all four lenses")
+    L("## 1. Constraint Index C(q)")
    L("")
-    L("For a model to be **reliably good** at a task, we need:")
-    L("- (a) It scores well (run_score high)")
-    L("- (b) Variance across seeds is low (predictable)")
-    L("- (c) It doesn't exhibit pathological regime (trapped on wrong answer / cycling)")
-    L("- (d) It survives multi-turn without dying early")
+    values = [(task, float(data.get("C_q", 0.0))) for task, data in cq.items()]
+    values.sort(key=lambda row: row[1], reverse=True)
+    highs = [row for row in values if row[1] > 0.5]
+    lows = [row for row in values if row[1] < -0.5]
+    L(f"- High-constraint tasks (C > 0.5): {len(highs)}")
+    L(f"- Low-constraint tasks (C < -0.5): {len(lows)}")
    L("")
-    L("These lenses disagree constructively:")
+    if values:
+        L("Top tasks by C(q):")
+        L("")
+        L("| Task | C(q) |")
+        L("|---|---:|")
+        for task, c_q in values[:10]:
+            L(f"| {task} | {c_q:+.3f} |")
+        L("")
+
+    L("## 2. Regime Classification")
    L("")
-    L("- **Opus 4.6** tops flat run_score but median failure at turn 5.5 (earlier than Opus 4.7's 7).")
-    L("- **GPT 5.4** is mid-pack on flat score but has highest S(8)=0.60 — long-horizon champion.")
-    L("- **Sonnet 4.6** most `trapped` runs — it commits early and sticks. Good on")
-    L("  constrained tasks, bad on open-ended (cf. memory-recall-continuation 0.15).")
-    L("- **GLM 5.1** most balanced regime distribution; justifies broad performance.")
-    L("- **Kimi K2.5** median fail at turn 3 — it's not just low-scoring, it's")
-    L("  specifically fragile under multi-turn execution.")
+    by_model = defaultdict(Counter)
+    for key, row in regimes.items():
+        model = key.split("/")[0]
+        regime = row.get("regime", "unknown")
+        by_model[model][regime] += 1
+
+    L("| Model | too_short | trapped | limit_cycle | diffusive | mixed |")
+    L("|---|---:|---:|---:|---:|---:|")
+    for model in sorted(by_model):
+        c = by_model[model]
+        L(
+            f"| {model} | {c['too_short']} | {c['trapped']} | {c['limit_cycle']} | "
+            f"{c['diffusive']} | {c['mixed']} |"
+        )
    L("")

-    # ----------------- 6. What to do next -----------------
-    L("## 6. Implications for the benchmark")
+    L("## 3. Variance Decomposition")
+    L("")
+    agg = variance.get("aggregate", {})
+    L(f"- Mean seed variance: {agg.get('mean_seed_var', 0.0):.6f}")
+    L(f"- Mean capability variance: {agg.get('mean_cap_var', 0.0):.6f}")
+    L(f"- Capability fraction: {agg.get('capability_fraction', 0.0):.1%}")
+    L(f"- High-SNR tasks: {agg.get('high_snr_tasks', 0)}")
+    L(f"- Mid-SNR tasks: {agg.get('mid_snr_tasks', 0)}")
+    L(f"- Low-SNR tasks: {agg.get('low_snr_tasks', 0)}")
    L("")
-    L("- **47% seed noise** means any gap < 0.02 is meaningless. Treat top-5")
-    L("  as a statistical tie. Dropping the 21 low-SNR tasks would sharpen")
-    L("  remaining rankings considerably.")
-    L("- **Weight tasks by SNR × |C(q)|** instead of flat mean. High-SNR,")
-    L("  high-|C(q)| tasks give the cleanest capability signal.")
-    L("- **Report survival curves alongside run_score** to surface long-horizon")
-    L("  capability that single-number metrics hide.")
-    L("- **Flag 'trapped' runs that scored high** — the model may have")
-    L("  guessed-and-committed rather than reasoned; not same reliability.")
-    L("- **Add a Tier 6 long-horizon (100+ turn) task set** to actually")
-    L("  measure the dynamical regimes the paper proposes — current")
-    L("  trajectories are too short (median 6 assistant turns) for clean")
-    L("  Lyapunov or attractor diagnostics.")

-    out = REPORTS / "EVAL_REPORT_DYNAMICAL_v2026-4-19-full.md"
-    out.write_text("\n".join(lines) + "\n")
-    print(f"Wrote: {out}")
+    L("## 4. Survival Analysis")
+    L("")
+    L("| Model | Runs | Events | Median failure turn | S(3) | S(5) | S(8) |")
+    L("|---|---:|---:|---:|---:|---:|---:|")
+    for model in sorted(survival):
+        row = survival[model]
+        surv = row.get("survival", [0.0] * 8)
+        med = row.get("median_fail_turn")
+        try:
+            med_display = f"{float(med):.1f}"  # float("inf") formats as "inf"
+        except (TypeError, ValueError):
+            med_display = "inf"  # missing, null, or non-numeric median
+        L(
+            f"| {model} | {row.get('n_runs', 0)} | {row.get('n_events', 0)} | "
+            f"{med_display} | {surv[2] if len(surv) > 2 else 0.0:.2f} | "
+            f"{surv[4] if len(surv) > 4 else 0.0:.2f} | {surv[7] if len(surv) > 7 else 0.0:.2f} |"
+        )
+    L("")
+
+    if ranking is not None:
+        L("## 5. SNR-weighted Ranking")
+        L("")
+        L("| Rank | Model | Flat | SNR x \\|C(q)\\| | Winsorized | Coverage |")
+        L("|---:|---|---:|---:|---:|---:|")
+        for idx, row in enumerate(ranking.get("results", []), start=1):
+            L(
+                f"| {idx} | {row.get('model', '')} | {row.get('flat', 0.0):.4f} | "
+                f"{row.get('snr_x_abs_cq', 0.0):.4f} | {row.get('snr_x_abs_cq_winsorized', 0.0):.4f} | "
+                f"{row.get('coverage', 0)} |"
+            )
+        L("")
+
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    output_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
+    print(f"Wrote: {output_path}")


 if __name__ == "__main__":
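Review note: the report is assembled as GitHub-style markdown tables, where a literal `|` inside a cell (such as a `|C(q)|` column title) is parsed as a column separator unless it is escaped. A minimal sketch of a cell-escaping helper follows; `md_cell` and `md_row` are hypothetical names used for illustration and are not part of the script.

```python
def md_cell(text: object) -> str:
    """Render a value as a table cell, escaping pipes so they stay inside the cell."""
    return str(text).replace("|", "\\|")


def md_row(*cells: object) -> str:
    """Join already-separate cell values into one markdown table row."""
    return "| " + " | ".join(md_cell(c) for c in cells) + " |"


header = md_row("Rank", "Model", "SNR x |C(q)|")
print(header)
```

Routing every row through one helper keeps the header and separator column counts in sync even when model or task names happen to contain a pipe.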
--- a/scripts/ingest_real_run.py
+++ b/scripts/ingest_real_run.py
@ -23,7 +23,6 @@ from clawbench.profile import (
    PluginManifest,
    PluginProfile,
    PluginProfileEntry,
-    RegistrationTrace,
 )


--- a/scripts/inject_judge_rubrics.py
+++ b/scripts/inject_judge_rubrics.py
@ -12,7 +12,6 @@ being so specific that it leaks the answer to the agent's own model.

 from __future__ import annotations

-import sys
 from pathlib import Path

 import yaml
--- a/scripts/k8s/Dockerfile
+++ b/scripts/k8s/Dockerfile
@ -0,0 +1,33 @@
+# Lightweight ClawBench image for Kubernetes sidecar use.
+# Does NOT include the full OpenClaw server or Chromium — the gateway runs
+# in a separate container. Node.js is copied from the OpenClaw image for
+# the device-identity handshake required by the gateway protocol.
+FROM ghcr.io/openclaw/openclaw:latest AS openclaw
+
+FROM python:3.12-slim
+
+COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
+COPY clawbench/ clawbench/
+COPY tasks-public/ tasks-public/
+COPY tasks-domain/ tasks-domain/
+COPY profiles/ profiles/
+COPY baselines/ baselines/
+COPY scripts/ scripts/
+
+RUN pip install --no-cache-dir ".[mlflow]"
+
+RUN mkdir -p /results && chmod 777 /results
+
+RUN useradd -m -d /home/node clawbench
+USER clawbench
+ENV HOME=/home/node
+
+ENTRYPOINT ["clawbench"]
--- a/scripts/k8s/deploy.sh
+++ b/scripts/k8s/deploy.sh
@ -0,0 +1,486 @@
+#!/usr/bin/env bash
+# Deploy ClawBench evals on Kubernetes (works on OpenShift too).
+#
+# 0-to-hero pipeline:
+#   Step 0: Create a cluster (see --help for Kind instructions)
+#   Step 1: Deploy OpenClaw gateway         (optional — bring your own)
+#   Step 2: Deploy MLflow tracking server   (optional — bring your own)
+#   Step 3: Run evals via sidecar           (add / remove)
+#
+# Usage:
+#   ./scripts/k8s/deploy.sh                        # Full deploy: OpenClaw + MLflow + eval
+#   ./scripts/k8s/deploy.sh --openclaw-only         # Step 1: deploy OpenClaw gateway
+#   ./scripts/k8s/deploy.sh --mlflow-only           # Step 2: deploy MLflow
+#   ./scripts/k8s/deploy.sh --add-sidecar           # Step 3: add eval sidecar (starts eval)
+#   ./scripts/k8s/deploy.sh --remove-sidecar        # Step 3: remove eval sidecar
+#   ./scripts/k8s/deploy.sh --logs                  # Tail clawbench sidecar logs
+#   ./scripts/k8s/deploy.sh --teardown              # Delete eval namespace (keeps MLflow)
+#
+# Environment (required):
+#   CLAWBENCH_NAMESPACE            Namespace for OpenClaw + eval
+#   OPENAI_API_KEY                 Model provider API key (or another provider key)
+#
+# Environment (optional):
+#   CLAWBENCH_IMAGE                Clawbench image (default: quay.io/sallyom/clawbench:latest)
+#   OPENCLAW_IMAGE                 OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
+#   OPENCLAW_GATEWAY_TOKEN         Existing gateway token (generated if unset)
+#   CLAWBENCH_MODEL                Model to eval (default: openai/gpt-5.5)
+#   MLFLOW_NAMESPACE               MLflow namespace (default: mlflow)
+#   MLFLOW_TRACKING_URI            External MLflow URI (skips MLflow deploy if set)
+#   MLFLOW_EXPERIMENT_ID           MLflow experiment ID
+#   MLFLOW_EXPERIMENT_NAME         MLflow experiment name
+#   MLFLOW_IMAGE                   MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
+#   ANTHROPIC_API_KEY              Anthropic key (added to secret if set)
+#   OPENROUTER_API_KEY             OpenRouter key (added to secret if set)
+#   GEMINI_API_KEY                 Gemini key (added to secret if set)
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+NS="${CLAWBENCH_NAMESPACE:-}"
+MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}"
+CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}"
+OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}"
+MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}"
+
+# ---------------------------------------------------------------------------
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  cat <<'HELP'
+ClawBench Kubernetes Deployment
+===============================
+
+0-to-hero pipeline for running ClawBench evals on Kubernetes.
+
+  Step 0: Create a cluster
+          For local testing with Kind, see:
+          https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
+
+  Step 1: Deploy OpenClaw gateway (optional — skip if you have one)
+  Step 2: Deploy MLflow tracking server (optional — skip if you have one)
+  Step 3: Run evals via sidecar (add to / remove from the OpenClaw deployment)
+
+Usage:
+  ./scripts/k8s/deploy.sh                    Full deploy (steps 1+2+3)
+  ./scripts/k8s/deploy.sh --openclaw-only     Step 1: OpenClaw only
+  ./scripts/k8s/deploy.sh --mlflow-only       Step 2: MLflow only
+  ./scripts/k8s/deploy.sh --add-sidecar       Step 3: add eval sidecar (starts eval)
+  ./scripts/k8s/deploy.sh --remove-sidecar    Step 3: remove eval sidecar
+  ./scripts/k8s/deploy.sh --logs              Tail clawbench sidecar logs
+  ./scripts/k8s/deploy.sh --teardown          Delete eval namespace (keeps MLflow)
+
+Required environment:
+  CLAWBENCH_NAMESPACE          Namespace for OpenClaw + eval
+  OPENAI_API_KEY               Model provider API key (or ANTHROPIC_API_KEY, etc.)
+
+Optional environment:
+  CLAWBENCH_IMAGE              Clawbench image (default: quay.io/sallyom/clawbench:latest)
+  OPENCLAW_IMAGE               OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
+  OPENCLAW_GATEWAY_TOKEN       Existing gateway token (generated if unset)
+  CLAWBENCH_MODEL              Model to eval (default: openai/gpt-5.5)
+  MLFLOW_NAMESPACE             MLflow namespace (default: mlflow)
+  MLFLOW_TRACKING_URI          External MLflow URI (skips MLflow deploy)
+  MLFLOW_EXPERIMENT_ID         MLflow experiment ID
+  MLFLOW_EXPERIMENT_NAME       MLflow experiment name
+  MLFLOW_IMAGE                 MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
+  ANTHROPIC_API_KEY            Anthropic key (added to secret if set)
+  OPENROUTER_API_KEY           OpenRouter key (added to secret if set)
+  GEMINI_API_KEY               Gemini key (added to secret if set)
+
+Works on Kubernetes and OpenShift.
+HELP
+  exit 0
+fi
+
+for bin in kubectl python3; do command -v "$bin" &>/dev/null || { echo "Missing: $bin" >&2; exit 1; }; done
+
+if [[ -z "$NS" ]]; then
+  echo "CLAWBENCH_NAMESPACE is required." >&2
+  echo "  export CLAWBENCH_NAMESPACE=clawbench-eval" >&2
+  exit 1
+fi
+
+MODE="full"
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --openclaw-only)   MODE="openclaw-only" ;;
+    --mlflow-only)     MODE="mlflow-only" ;;
+    --add-sidecar)     MODE="add-sidecar" ;;
+    --remove-sidecar)  MODE="remove-sidecar" ;;
+    --logs)            MODE="logs" ;;
+    --teardown)        MODE="teardown" ;;
+    *) echo "Unknown option: $1" >&2; exit 1 ;;
+  esac
+  shift
+done
+
+kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; }
+
+# ---------------------------------------------------------------------------
+# --logs
+# ---------------------------------------------------------------------------
+if [[ "$MODE" == "logs" ]]; then
+  kubectl logs deploy/openclaw -c clawbench -n "$NS" -f
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# --teardown
+# ---------------------------------------------------------------------------
+if [[ "$MODE" == "teardown" ]]; then
+  echo "Deleting namespace '$NS'..."
+  kubectl delete namespace "$NS" --ignore-not-found
+  echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted."
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# --remove-sidecar
+# ---------------------------------------------------------------------------
+if [[ "$MODE" == "remove-sidecar" ]]; then
+  echo "Removing clawbench sidecar from openclaw in namespace '$NS'..."
+  INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))")
+  if [[ "$INDEX" == "-1" ]]; then
+    echo "No clawbench sidecar found."
+  else
+    kubectl patch deploy/openclaw -n "$NS" --type=json \
+      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]"
+    echo "Sidecar removed."
+  fi
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# Create namespace + secret
+# ---------------------------------------------------------------------------
+ensure_namespace_and_secret() {
+  if ! kubectl get namespace "$NS" &>/dev/null; then
+    echo "Creating namespace '$NS'..."
+    kubectl create namespace "$NS"
+  fi
+
+  if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then
+    echo "Creating clawbench-secrets..."
+    if [[ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]]; then
+      GATEWAY_TOKEN="$OPENCLAW_GATEWAY_TOKEN"
+      GATEWAY_TOKEN_SOURCE="from OPENCLAW_GATEWAY_TOKEN"
+    else
+      GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())")
+      GATEWAY_TOKEN_SOURCE="generated"
+    fi
+
+    SECRET_ARGS=(
+      --from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN"
+    )
+    [[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY")
+    [[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY")
+    [[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY")
+    [[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY")
+
+    if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then
+      echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2
+    fi
+
+    kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}"
+    echo "  Gateway token: $GATEWAY_TOKEN_SOURCE"
+    [[ -n "${OPENAI_API_KEY:-}" ]] && echo "  OPENAI_API_KEY: set"
+    [[ -n "${ANTHROPIC_API_KEY:-}" ]] && echo "  ANTHROPIC_API_KEY: set"
+    [[ -n "${OPENROUTER_API_KEY:-}" ]] && echo "  OPENROUTER_API_KEY: set"
+    [[ -n "${GEMINI_API_KEY:-}" ]] && echo "  GEMINI_API_KEY: set"
+  else
+    echo "Secret clawbench-secrets already exists in '$NS'."
+  fi
+  return 0
+}
+
+# ---------------------------------------------------------------------------
+# Step 1: Deploy OpenClaw
+# ---------------------------------------------------------------------------
+deploy_openclaw() {
+  echo ""
+  echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..."
+
+  kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS"
+
+  # Patch gateway config with custom OpenAI-compatible base URL
+  if [[ -n "${OPENAI_API_BASE:-}" ]]; then
+    echo "  Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE"
+    EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}')
+    PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c "
+import json, sys, os
+cfg = json.load(sys.stdin)
+openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {})
+openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE']
+openai_cfg.setdefault('models', [])
+json.dump(cfg, sys.stdout, indent=2)
+")
+    kubectl create configmap openclaw-config -n "$NS" \
+      --from-literal="openclaw.json=$PATCHED_JSON" \
+      --dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null
+  fi
+
+  kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS"
+  kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS"
+
+  # Apply the manifest once, then override the gateway image only when a
+  # non-default image was requested.
+  kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
+  if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then
+    kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS"
+  fi
+
+  echo "Waiting for OpenClaw rollout..."
+  kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \
+    echo "  (rollout still in progress)"
+  echo "OpenClaw deployed."
+}
+
+# ---------------------------------------------------------------------------
+# Step 2: Deploy MLflow
+# ---------------------------------------------------------------------------
+deploy_mlflow() {
+  if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then
+    echo ""
+    echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)"
+    return
+  fi
+
+  echo ""
+  echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..."
+
+  if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then
+    kubectl create namespace "$MLFLOW_NS"
+  fi
+
+  kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS"
+  kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS"
+
+  # Apply the manifest once, then override the MLflow image only when a
+  # non-default image was requested.
+  kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
+  if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then
+    kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS"
+  fi
+
+  echo "Waiting for MLflow rollout..."
+  kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \
+    echo "  (rollout still in progress)"
+
+  MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000"
+  echo "MLflow deployed: $MLFLOW_TRACKING_URI"
+}
+
+# ---------------------------------------------------------------------------
+# Step 3: Add clawbench sidecar (starts eval)
+# ---------------------------------------------------------------------------
+add_sidecar() {
+  echo ""
+  echo "Step 3: Adding clawbench eval sidecar..."
+
+  echo "Applying clawbench ConfigMap..."
+  kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null
+
+  if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then
+    kubectl patch configmap clawbench-config -n "$NS" \
+      --type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null
+    echo "  Model: $CLAWBENCH_MODEL"
+  fi
+
+  if [[ -n "${OPENAI_API_BASE:-}" ]]; then
+    kubectl patch configmap clawbench-config -n "$NS" \
+      --type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null
+    echo "  OpenAI API base: $OPENAI_API_BASE"
+  fi
+
+  # Patch MLflow settings into ConfigMap
+  # MLFLOW_TRACKING_URI is always patched; experiment ID and name only if set.
+  MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}"
+  PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\""
+  if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then
+    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\""
+  fi
+  if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then
+    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\""
+  fi
+  kubectl patch configmap clawbench-config -n "$NS" \
+    --type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null
+  echo "  MLflow URI: $MLFLOW_URI"
+  [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo "  MLflow experiment ID: $MLFLOW_EXPERIMENT_ID"
+  [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo "  MLflow experiment name: $MLFLOW_EXPERIMENT_NAME"
+
+  # Check if sidecar already exists
+  HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')")
+
+  if [[ "$HAS_SIDECAR" == "yes" ]]; then
+    echo "Removing existing clawbench sidecar..."
+    INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
+      | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))")
+    kubectl patch deploy/openclaw -n "$NS" --type=json \
+      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null
+  fi
+
+  # Find the OpenClaw home volume, and capture existing volumes so add-sidecar
+  # also works with bring-your-own deployments that lack this repo's PVC layout.
+  VOLUME_INFO=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "
+import json, sys
+spec = json.load(sys.stdin)['spec']['template']['spec']
+volume_names = [v.get('name') for v in spec.get('volumes', []) if v.get('name')]
+home_volume = 'openclaw-home'
+for c in spec['containers']:
+    if c['name'] == 'gateway':
+        for vm in c.get('volumeMounts', []):
+            if vm['mountPath'] == '/home/node/.openclaw':
+                home_volume = vm['name']
+                break
+print(json.dumps({
+    'home_volume': home_volume,
+    'volumes_present': 'volumes' in spec,
+    'volume_names': volume_names,
+}))
+")
+
+  echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..."
+
+  PATCH=$(VOLUME_INFO="$VOLUME_INFO" CLAWBENCH_IMG="$CLAWBENCH_IMG" python3 - <<'PY'
+import json
+import os
+
+info = json.loads(os.environ["VOLUME_INFO"])
+home_volume = info["home_volume"]
+
+command = r"""echo "Waiting for gateway on localhost:18789..."
+for i in $(seq 1 90); do
+  python3 -c "import socket; s=socket.create_connection((\"127.0.0.1\",18789),2); s.close()" 2>/dev/null && echo "Gateway ready" && break
+  sleep 2
+done
+
+if [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
+  echo "Checking MLflow at ${MLFLOW_TRACKING_URI}..."
+  python3 -c "import httpx,os; r=httpx.get(os.environ[\"MLFLOW_TRACKING_URI\"]+\"/health\"); print(\"MLflow OK:\",r.status_code)" 2>&1 || echo "MLflow pre-check failed (will retry at log time)"
+fi
+
+echo "Starting eval..."
+clawbench run \
+  --model "${CLAWBENCH_MODEL}" \
+  --gateway-token "${OPENCLAW_GATEWAY_TOKEN}" \
+  --runs "${CLAWBENCH_RUNS}" \
+  --concurrency "${CLAWBENCH_CONCURRENCY}" \
+  ${CLAWBENCH_JUDGE_MODEL:+--judge-model "${CLAWBENCH_JUDGE_MODEL}"} \
+  $([ -n "${CLAWBENCH_TASKS:-}" ] && for t in ${CLAWBENCH_TASKS}; do printf -- "-t %s " "$t"; done) \
+  -o /results/benchmark.json
+RC=$?
+if [ $RC -eq 0 ] && [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
+  python scripts/log_to_mlflow.py /results/benchmark.json
+fi
+echo "ClawBench finished (exit=$RC)"
+sleep infinity"""
+
+container = {
+    "name": "clawbench",
+    "image": os.environ["CLAWBENCH_IMG"],
+    "imagePullPolicy": "IfNotPresent",
+    "command": ["/bin/bash", "-c", command],
+    "envFrom": [{"configMapRef": {"name": "clawbench-config"}}],
+    "env": [
+        {
+            "name": "OPENCLAW_GATEWAY_TOKEN",
+            "valueFrom": {
+                "secretKeyRef": {
+                    "name": "clawbench-secrets",
+                    "key": "OPENCLAW_GATEWAY_TOKEN",
+                }
+            },
+        }
+    ],
+    "resources": {
+        "requests": {"memory": "1Gi", "cpu": "500m"},
+        "limits": {"memory": "4Gi", "cpu": "2"},
+    },
+    "volumeMounts": [
+        {"name": home_volume, "mountPath": "/home/node/.openclaw"},
+        {"name": "clawbench-results", "mountPath": "/results"},
+        {"name": "tmp-volume", "mountPath": "/tmp"},
+    ],
+    "securityContext": {
+        "allowPrivilegeEscalation": False,
+        "capabilities": {"drop": ["ALL"]},
+    },
+}
+
+patch = [{"op": "add", "path": "/spec/template/spec/containers/-", "value": container}]
+
+existing_volumes = set(info["volume_names"])
+required_volumes = [
+    {"name": home_volume, "emptyDir": {}},
+    {"name": "clawbench-results", "emptyDir": {}},
+    {"name": "tmp-volume", "emptyDir": {}},
+]
+missing_volumes = []
+for volume in required_volumes:
+    if volume["name"] not in existing_volumes and volume["name"] not in {
+        item["name"] for item in missing_volumes
+    }:
+        missing_volumes.append(volume)
+
+if missing_volumes:
+    if info["volumes_present"]:
+        patch.extend(
+            {"op": "add", "path": "/spec/template/spec/volumes/-", "value": volume}
+            for volume in missing_volumes
+        )
+    else:
+        patch.append(
+            {"op": "add", "path": "/spec/template/spec/volumes", "value": missing_volumes}
+        )
+
+print(json.dumps(patch))
+PY
+)
+
+  kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null
+
+  echo ""
+  echo "Waiting for rollout..."
+  kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \
+    echo "  (rollout timeout — eval runs for 30-60 min)"
+
+  echo ""
+  echo "Eval is running. Follow logs with:"
+  echo "  ./scripts/k8s/deploy.sh --logs"
+  echo ""
+  echo "When finished, remove the sidecar with:"
+  echo "  ./scripts/k8s/deploy.sh --remove-sidecar"
+}
+
+# ---------------------------------------------------------------------------
+# Execute
+# ---------------------------------------------------------------------------
+case "$MODE" in
+  full)
+    ensure_namespace_and_secret
+    deploy_openclaw
+    deploy_mlflow
+    add_sidecar
+    ;;
+  openclaw-only)
+    ensure_namespace_and_secret
+    deploy_openclaw
+    echo ""
+    echo "OpenClaw is running. Next steps:"
+    echo "  ./scripts/k8s/deploy.sh --mlflow-only       # Deploy MLflow"
+    echo "  ./scripts/k8s/deploy.sh --add-sidecar       # Start eval"
+    ;;
+  mlflow-only)
+    deploy_mlflow
+    ;;
+  add-sidecar)
+    if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then
+      echo "Deployment 'openclaw' not found in namespace '$NS'." >&2
+      echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2
+      exit 1
+    fi
+    ensure_namespace_and_secret
+    add_sidecar
+    ;;
+esac
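
Review note on scripts/k8s/deploy.sh: the script builds its ConfigMap merge patches by interpolating shell variables into JSON strings, and finds the sidecar's container index with an inline python3 one-liner. A minimal sketch of the same two payloads built with json.dumps follows; it keeps values that contain quotes or backslashes valid JSON. Both function names are hypothetical and are not part of this PR.

```python
from __future__ import annotations

import json


def configmap_merge_patch(data: dict[str, str]) -> str:
    """Return a JSON merge patch body for `kubectl patch configmap --type merge -p`."""
    # Skip empty values so optional settings are only patched when provided,
    # mirroring the `[[ -n ... ]]` guards in the shell script.
    return json.dumps({"data": {key: value for key, value in data.items() if value}})


def remove_container_patch(deployment: dict, name: str) -> str | None:
    """Return a JSON Patch that removes the named container, or None if it is absent."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for index, container in enumerate(containers):
        if container["name"] == name:
            path = f"/spec/template/spec/containers/{index}"
            return json.dumps([{"op": "remove", "path": path}])
    return None
```

The returned strings could be passed to `kubectl patch ... -p` in place of the hand-escaped literals; whether that is worth an extra helper file is a judgment call for the PR author.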
--- a/scripts/k8s/manifests/configmap.yaml
+++ b/scripts/k8s/manifests/configmap.yaml
@ -0,0 +1,18 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: clawbench-config
+  labels:
+    app: clawbench
+data:
+  CLAWBENCH_MODEL: "openai/gpt-5.5"
+  OPENAI_API_BASE: ""
+  CLAWBENCH_RUNS: "3"
+  CLAWBENCH_CONCURRENCY: "4"
+  CLAWBENCH_JUDGE_MODEL: ""
+  CLAWBENCH_TASKS: ""
+  CLAWBENCH_CONNECT_TIMEOUT: "120"
+  CLAWBENCH_REQUEST_TIMEOUT: "300"
+  CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600"
+  MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000"
+  MLFLOW_EXPERIMENT_NAME: "clawbench"
--- a/scripts/k8s/manifests/secret.yaml
+++ b/scripts/k8s/manifests/secret.yaml
@ -0,0 +1,15 @@
+# Reference template — do NOT apply directly.
+# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
+# from exported environment variables (OPENAI_API_KEY, etc.).
+apiVersion: v1
+kind: Secret
+metadata:
+  name: clawbench-secrets
+  labels:
+    app: clawbench
+type: Opaque
+stringData:
+  OPENAI_API_KEY: "REPLACE_ME"
+  # Add other provider keys as needed:
+  # ANTHROPIC_API_KEY: "REPLACE_ME"
+  # OPENROUTER_API_KEY: "REPLACE_ME"
--- a/scripts/k8s/mlflow/deployment.yaml
+++ b/scripts/k8s/mlflow/deployment.yaml
@ -0,0 +1,68 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: mlflow
+  labels:
+    app: mlflow
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: mlflow
+  template:
+    metadata:
+      labels:
+        app: mlflow
+    spec:
+      containers:
+        - name: mlflow
+          image: ghcr.io/mlflow/mlflow:v2.21.3
+          command:
+            - mlflow
+            - server
+            - --host
+            - "0.0.0.0"
+            - --port
+            - "5000"
+            - --backend-store-uri
+            - sqlite:////mlflow/mlflow.db
+            - --artifacts-destination
+            - /mlflow/artifacts
+            - --serve-artifacts
+          ports:
+            - name: http
+              containerPort: 5000
+              protocol: TCP
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 5000
+            initialDelaySeconds: 15
+            periodSeconds: 30
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 5000
+            initialDelaySeconds: 5
+            periodSeconds: 10
+          resources:
+            requests:
+              cpu: 100m
+              memory: 256Mi
+            limits:
+              cpu: 500m
+              memory: 1Gi
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
+          volumeMounts:
+            - name: mlflow-data
+              mountPath: /mlflow
+      volumes:
+        - name: mlflow-data
+          persistentVolumeClaim:
+            claimName: mlflow-data-pvc
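
Review note on scripts/k8s/mlflow/deployment.yaml: MLflow hands the --backend-store-uri value to SQLAlchemy, where the number of slashes after `sqlite:` decides whether the database path is relative to the process working directory or absolute. The sketch below only illustrates that convention; `sqlite_path` is a hypothetical helper, not an MLflow or SQLAlchemy function.

```python
def sqlite_path(uri: str) -> str:
    """Return the file path encoded in a SQLAlchemy-style SQLite URI."""
    prefix = "sqlite:///"
    if not uri.startswith(prefix):
        raise ValueError(f"not a file-backed sqlite URI: {uri}")
    # Everything after the third slash is the path. A fourth slash therefore
    # becomes the leading "/" of an absolute path.
    return uri[len(prefix):]


relative = sqlite_path("sqlite:///mlflow/mlflow.db")   # resolved against the working directory
absolute = sqlite_path("sqlite:////mlflow/mlflow.db")  # always the mounted /mlflow volume
```

With three slashes the database only lands on the PVC mounted at /mlflow if the container's working directory happens to be `/`.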
--- a/scripts/k8s/mlflow/pvc.yaml
+++ b/scripts/k8s/mlflow/pvc.yaml
@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: mlflow-data-pvc
+  labels:
+    app: mlflow
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 5Gi
--- a/scripts/k8s/mlflow/service.yaml
+++ b/scripts/k8s/mlflow/service.yaml
@ -0,0 +1,15 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: mlflow-service
+  labels:
+    app: mlflow
+spec:
+  type: ClusterIP
+  selector:
+    app: mlflow
+  ports:
+    - name: http
+      port: 5000
+      targetPort: 5000
+      protocol: TCP
--- a/scripts/k8s/openclaw/configmap.yaml
+++ b/scripts/k8s/openclaw/configmap.yaml
@ -0,0 +1,36 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: openclaw-config
+  labels:
+    app: openclaw
+data:
+  openclaw.json: |
+    {
+      "gateway": {
+        "mode": "local",
+        "bind": "loopback",
+        "port": 18789,
+        "auth": {
+          "mode": "token"
+        }
+      },
+      "browser": {
+        "enabled": true,
+        "headless": true,
+        "noSandbox": true,
+        "ssrfPolicy": {
+          "allowedHostnames": ["localhost", "127.0.0.1"]
+        }
+      },
+      "tools": {
+        "profile": "coding",
+        "alsoAllow": ["browser"]
+      },
+      "agents": {
+        "defaults": {
+          "workspace": "~/.openclaw/workspace"
+        }
+      },
+      "cron": { "enabled": false }
+    }
--- a/scripts/k8s/openclaw/deployment.yaml
+++ b/scripts/k8s/openclaw/deployment.yaml
@ -0,0 +1,146 @@
+# OpenClaw gateway deployment for ClawBench evals.
+#
+# Build the image with browser support:
+#   docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \
+#     -t quay.io/yourorg/openclaw:eval .
+#
+# Or use upstream without browser (browser eval tasks will score 0):
+#   image: ghcr.io/openclaw/openclaw:latest
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: openclaw
+  labels:
+    app: openclaw
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: openclaw
+  template:
+    metadata:
+      labels:
+        app: openclaw
+    spec:
+      initContainers:
+        - name: init-config
+          image: registry.access.redhat.com/ubi9-minimal:latest
+          command:
+            - sh
+            - -c
+            - |
+              cp /config/openclaw.json /home/node/.openclaw/openclaw.json
+              chmod 666 /home/node/.openclaw/openclaw.json
+              mkdir -p /home/node/.openclaw/workspace
+              mkdir -p /home/node/.openclaw/agents
+              chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents
+              echo "Config initialized"
+          volumeMounts:
+            - name: openclaw-home
+              mountPath: /home/node/.openclaw
+            - name: config-template
+              mountPath: /config
+          resources:
+            limits:
+              cpu: 200m
+              memory: 128Mi
+            requests:
+              cpu: 50m
+              memory: 64Mi
+      containers:
+        - name: gateway
+          image: ghcr.io/openclaw/openclaw:latest
+          imagePullPolicy: IfNotPresent
+          command:
+            - sh
+            - -c
+            - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured
+          env:
+            - name: HOME
+              value: /home/node
+            - name: NODE_ENV
+              value: production
+            - name: OPENCLAW_CONFIG_DIR
+              value: /home/node/.openclaw
+            - name: OPENCLAW_STATE_DIR
+              value: /home/node/.openclaw
+            - name: OPENCLAW_GATEWAY_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: OPENCLAW_GATEWAY_TOKEN
+            - name: OPENAI_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: OPENAI_API_KEY
+                  optional: true
+            - name: ANTHROPIC_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: ANTHROPIC_API_KEY
+                  optional: true
+            - name: OPENROUTER_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: OPENROUTER_API_KEY
+                  optional: true
+            - name: GEMINI_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: GEMINI_API_KEY
+                  optional: true
+          ports:
+            - name: gateway
+              containerPort: 18789
+              protocol: TCP
+          livenessProbe:
+            exec:
+              command:
+                - node
+                - -e
+                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
+            initialDelaySeconds: 60
+            periodSeconds: 30
+            timeoutSeconds: 10
+          readinessProbe:
+            exec:
+              command:
+                - node
+                - -e
+                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
+            initialDelaySeconds: 30
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            requests:
+              cpu: 250m
+              memory: 1Gi
+            limits:
+              cpu: "2"
+              memory: 4Gi
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
+          volumeMounts:
+            - name: openclaw-home
+              mountPath: /home/node/.openclaw
+            - name: tmp-volume
+              mountPath: /tmp
+      terminationGracePeriodSeconds: 30
+      volumes:
+        - name: openclaw-home
+          persistentVolumeClaim:
+            claimName: openclaw-home-pvc
+        - name: config-template
+          configMap:
+            name: openclaw-config
+        - name: tmp-volume
+          emptyDir: {}
--- a/scripts/k8s/openclaw/pvc.yaml
+++ b/scripts/k8s/openclaw/pvc.yaml
@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: openclaw-home-pvc
+  labels:
+    app: openclaw
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
--- a/scripts/k8s/openclaw/secret.yaml
+++ b/scripts/k8s/openclaw/secret.yaml
@ -0,0 +1,17 @@
+# Reference template — do NOT apply directly.
+# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
+# from exported environment variables (OPENAI_API_KEY, etc.).
+apiVersion: v1
+kind: Secret
+metadata:
+  name: clawbench-secrets
+  labels:
+    app: openclaw
+type: Opaque
+stringData:
+  OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME"
+  OPENAI_API_KEY: "REPLACE_ME"
+  # Add other provider keys as needed:
+  # ANTHROPIC_API_KEY: "REPLACE_ME"
+  # OPENROUTER_API_KEY: "REPLACE_ME"
+  # GEMINI_API_KEY: "REPLACE_ME"
--- a/scripts/k8s/openclaw/service.yaml
+++ b/scripts/k8s/openclaw/service.yaml
@ -0,0 +1,15 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: openclaw
+  labels:
+    app: openclaw
+spec:
+  type: ClusterIP
+  selector:
+    app: openclaw
+  ports:
+    - name: gateway
+      port: 18789
+      targetPort: 18789
+      protocol: TCP
--- a/scripts/log_to_mlflow.py
+++ b/scripts/log_to_mlflow.py
@ -0,0 +1,125 @@
+#!/usr/bin/env python3
+"""Log a ClawBench BenchmarkResult to MLflow.
+
+Standalone script -- not imported by the clawbench package.
+Requires: pip install mlflow  (or pip install clawbench[mlflow])
+
+Usage:
+    python scripts/log_to_mlflow.py /results/benchmark.json
+
+Environment:
+    MLFLOW_TRACKING_URI      MLflow tracking server URI, read by mlflow itself
+    MLFLOW_EXPERIMENT_ID     Experiment ID; if unset, MLFLOW_EXPERIMENT_NAME (default: clawbench) is used
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+
+def main(result_path: str) -> None:
+    try:
+        import mlflow
+    except ImportError:
+        print(
+            "mlflow is not installed. Install with: pip install mlflow"
+            "  (or pip install clawbench[mlflow])",
+            file=sys.stderr,
+        )
+        sys.exit(1)
+
+    from clawbench.schemas import BenchmarkResult
+
+    with open(result_path, encoding="utf-8") as f:
+        result = BenchmarkResult(**json.load(f))
+
+    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
+    if experiment_id:
+        experiment = mlflow.set_experiment(experiment_id=experiment_id)
+    else:
+        experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench"))
+
+    run_name = f"{result.model}-{result.submission_id[:8]}"
+    with mlflow.start_run(run_name=run_name):
+        mlflow.log_params(
+            {
+                "model": result.model,
+                "provider": result.provider,
+                "benchmark_version": result.benchmark_version,
+                "openclaw_version": result.openclaw_version or "unknown",
+                "judge_model": result.judge_model or "none",
+                "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown",
+            }
+        )
+
+        mlflow.log_metrics(
+            {
+                "overall_score": result.overall_score,
+                "overall_completion": result.overall_completion,
+                "overall_trajectory": result.overall_trajectory,
+                "overall_behavior": result.overall_behavior,
+                "overall_reliability": result.overall_reliability,
+                "overall_pass_hat_k": result.overall_pass_hat_k,
+                "overall_judge_score": result.overall_judge_score,
+                "overall_judge_confidence": result.overall_judge_confidence,
+                "overall_judge_pass_rate": result.overall_judge_pass_rate,
+                "judge_task_coverage": result.judge_task_coverage,
+                "overall_weighted_query_score": result.overall_weighted_query_score,
+                "overall_median_latency_ms": result.overall_median_latency_ms,
+                "overall_p95_latency_ms": result.overall_p95_latency_ms,
+                "overall_total_tokens": result.overall_total_tokens,
+                "overall_cost_usd": result.overall_cost_usd,
+                "overall_tokens_per_pass": result.overall_tokens_per_pass,
+                "overall_cost_per_pass": result.overall_cost_per_pass,
+                "overall_ci_lower": result.overall_ci_lower,
+                "overall_ci_upper": result.overall_ci_upper,
+            }
+        )
+
+        for tier in result.tier_results:
+            mlflow.log_metrics(
+                {
+                    f"{tier.tier}/score": tier.mean_task_score,
+                    f"{tier.tier}/completion": tier.mean_completion,
+                    f"{tier.tier}/trajectory": tier.mean_trajectory,
+                    f"{tier.tier}/behavior": tier.mean_behavior,
+                    f"{tier.tier}/reliability": tier.mean_reliability,
+                }
+            )
+
+        for i, task in enumerate(result.task_results):
+            mlflow.log_metrics(
+                {
+                    f"task/{task.task_id}/score": task.mean_task_score,
+                    f"task/{task.task_id}/reliability": task.reliability_score,
+                },
+                step=i,
+            )
+
+        mlflow.set_tags(
+            {
+                "submission_id": result.submission_id,
+                "timestamp": result.timestamp,
+                "certified": str(result.certified),
+            }
+        )
+
+        try:
+            mlflow.log_artifact(result_path)
+        except Exception as e:
+            print(f"Warning: artifact upload failed: {e}", file=sys.stderr)
+            print("Metrics and params were logged successfully.", file=sys.stderr)
+
+    print(f"Logged to MLflow: experiment={experiment.name} run={run_name}")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print(f"Usage: {sys.argv[0]} <result.json>", file=sys.stderr)
+        sys.exit(1)
+    main(sys.argv[1])
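
Review note on scripts/log_to_mlflow.py: mlflow.log_metrics expects numeric values. If any of the BenchmarkResult fields can be None (for example the judge metrics when no judge model is configured; this is an assumption, since the schema is not shown in this diff), a small filter in front of each call would keep one missing value from failing the whole run. The helper name below is hypothetical.

```python
from __future__ import annotations

from typing import Mapping, Optional


def loggable_metrics(metrics: Mapping[str, Optional[float]]) -> dict[str, float]:
    """Drop metrics whose value is None and coerce the rest to float."""
    clean: dict[str, float] = {}
    for key, value in metrics.items():
        if value is None:
            # Missing metric (e.g. no judge configured): skip it rather than log it.
            continue
        clean[key] = float(value)
    return clean
```

In the script this would wrap each dictionary, for example `mlflow.log_metrics(loggable_metrics({...}))`.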
--- a/scripts/refactor_verifiers.py
+++ b/scripts/refactor_verifiers.py
@ -10,7 +10,6 @@ look for "wherever the agent put it."

 from __future__ import annotations

-import sys
 from pathlib import Path
 from textwrap import dedent

--- a/scripts/rejudge_all.py
+++ b/scripts/rejudge_all.py
@ -18,7 +18,6 @@ Usage:
 from __future__ import annotations

 import argparse
-import asyncio
 import json
 import os
 import re
--- a/scripts/run_posterior_dynamics_pipeline.py
+++ b/scripts/run_posterior_dynamics_pipeline.py
@ -0,0 +1,89 @@
+#!/usr/bin/env python3
+"""Run the full posterior dynamical analysis pipeline."""
+
+from __future__ import annotations
+
+import argparse
+import subprocess
+import sys
+from pathlib import Path
+
+
+REPO_ROOT = Path(__file__).resolve().parent.parent
+sys.path.insert(0, str(REPO_ROOT))
+
+from clawbench.dynamics_archive import discover_model_roots, load_task_runs_archive, write_dynamics_report
+
+
+def _run(cmd: list[str]) -> None:
+    print("$", " ".join(cmd))
+    result = subprocess.run(cmd, cwd=REPO_ROOT)
+    if result.returncode != 0:
+        raise SystemExit(result.returncode)
+
+
+def _resolve_path(path: Path) -> Path:
+    return path if path.is_absolute() else (REPO_ROOT / path)
+
+
+def _write_dynamics_reports(
+    archive_dir: Path,
+    output_dir: Path,
+    tier: str | None,
+) -> None:
+    roots = discover_model_roots(archive_dir)
+    if not roots:
+        raise SystemExit(f"No cached runs found under {archive_dir}")
+
+    multiple_models = len(roots) > 1
+    wrote_any = False
+    for model_name, model_dir in roots.items():
+        task_runs = load_task_runs_archive(model_dir, tier=tier)
+        if not task_runs:
+            continue
+
+        wrote_any = True
+        model_output_dir = output_dir / model_name if multiple_models else output_dir
+        report_path, plots = write_dynamics_report(task_runs, model_output_dir)
+        n_runs = sum(len(runs) for runs in task_runs.values())
+
+        print(f"[dynamics] {model_name}: loaded {n_runs} cached runs across {len(task_runs)} tasks")
+        print(f"[dynamics] {model_name}: wrote {report_path}")
+        print(f"[dynamics] {model_name}: saved {len(plots)} plots to {model_output_dir}/")
+
+    if not wrote_any:
+        raise SystemExit(f"Model roots found under {archive_dir}, but none contained cached runs (tier={tier})")
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Run posterior dynamics pipeline end to end")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--output-dir", type=Path, default=Path("results/posterior_dynamics"))
+    parser.add_argument(
+        "--include-dynamics-report",
+        action="store_true",
+        help="Also build per-model dynamics.json files and plots from the archive.",
+    )
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    py = sys.executable
+    archive_dir = _resolve_path(args.archive_dir)
+    reports_dir = _resolve_path(args.reports_dir)
+    output_dir = _resolve_path(args.output_dir)
+    tier_args = ["--tier", args.tier] if args.tier else []
+    scripts_dir = REPO_ROOT / "scripts"
+
+    _run([py, str(scripts_dir / "compute_constraint_index.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "classify_regimes.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "variance_decomp.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "survival_analysis.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "snr_weighted_ranking.py"), "--archive-dir", str(archive_dir), "--reports-dir", str(reports_dir), *tier_args])
+    _run([py, str(scripts_dir / "generate_dynamical_report.py"), "--reports-dir", str(reports_dir)])
+    if args.include_dynamics_report:
+        _write_dynamics_reports(archive_dir, output_dir, args.tier)
+
+
+if __name__ == "__main__":
+    main()
--- a/scripts/snr_weighted_ranking.py
+++ b/scripts/snr_weighted_ranking.py
@ -1,148 +1,130 @@
-"""SNR × |C(q)|-weighted ranking — the dynamical-systems-informed metric.
+#!/usr/bin/env python3
+"""SNR x |C(q)| weighted ranking from posterior cached runs.

-Motivation: from variance_decomp.py we know 47% of run_score variance is
-seed noise. From compute_constraint_index.py we know some tasks are
-high-constraint (everyone converges) and others are open-ended (responses
-diverge for style reasons, not capability).
+Weighted headline score:

-Weighted mean:
-    w(task) = SNR(task) × |C(q)(task)|
-    score(model) = Σ_task w(task) · mean_run_score(task, model) / Σ_task w(task)
+    w(q) = max(0, SNR(q)) * |C(q)|
+    score(model) = sum_q w(q) * mean_run_score(model, q) / sum_q w(q)

-Why:
- High SNR tasks contribute more than low-SNR tasks (noise-weighted)
- |C(q)| amplifies tasks that are either strongly constrained OR strongly
-  open-ended (i.e. measures what they're supposed to measure, regardless
-  of polarity)
- Moderate C(q) tasks (C near 0) are inherently ambiguous — down-weighted
+We also report:

-Outputs:
-  - Per-model weighted score
-  - Comparison against flat-mean ranking
-  - Published to reports/snr_weighted_ranking.json
+    snr_only                = SNR-weighted mean
+    snr_x_abs_cq            = SNR x |C(q)| weighted mean
+    snr_x_abs_cq_winsorized = same, but top task weights are clamped at p95
+
+This keeps noisy low-SNR tasks from dominating and upweights tasks whose
+response geometry suggests a stronger capability signal.
 """

 from __future__ import annotations

-import glob
+import argparse
 import json
+import sys
 from collections import defaultdict
 from pathlib import Path
 from statistics import mean

 import numpy as np

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
-REPORTS = ROOT / "reports"
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-MODELS = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+from clawbench.dynamics_archive import load_task_runs_by_model


 def main() -> None:
-    cq = json.loads((REPORTS / "constraint_index.json").read_text())
-    var = json.loads((REPORTS / "variance_decomposition.json").read_text())
-    snr_by_task = {r["task"]: r["snr"] for r in var["per_task"]}
+    parser = argparse.ArgumentParser(description="Compute SNR-weighted posterior model ranking")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()

-    # Per (model, task): mean run_score over the 3 runs
-    per_mt: dict[str, dict[str, list[float]]] = defaultdict(dict)
-    for label, (sub, _) in MODELS.items():
-        for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
-            try:
-                d = json.loads(Path(p).read_text())
-            except Exception:
-                continue
-            task = p.split("/")[-2]
-            per_mt[label].setdefault(task, []).append(d.get("run_score", 0))
-    per_mt_mean = {
-        m: {t: mean(v) for t, v in d.items() if v} for m, d in per_mt.items()
+    cq_path = args.reports_dir / "constraint_index.json"
+    var_path = args.reports_dir / "variance_decomposition.json"
+    if not cq_path.exists() or not var_path.exists():
+        raise SystemExit("Missing prerequisite reports: run compute_constraint_index.py and variance_decomp.py first.")
+
+    cq = json.loads(cq_path.read_text(encoding="utf-8"))
+    var = json.loads(var_path.read_text(encoding="utf-8"))
+    snr_by_task = {row["task"]: row["snr"] for row in var.get("per_task", [])}
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
+    per_model_task_scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            per_model_task_scores[model_name][task_id] = [float(run.run_score) for run in runs]
+
+    per_model_task_mean = {
+        model_name: {
+            task_id: mean(vals)
+            for task_id, vals in task_scores.items()
+            if vals
+        }
+        for model_name, task_scores in per_model_task_scores.items()
    }

-    # Only consider tasks present in both C(q) and SNR
    common_tasks = sorted(set(cq) & set(snr_by_task))
-    print(f"Using {len(common_tasks)} tasks with both C(q) and SNR.")
+    if not common_tasks:
+        raise SystemExit("No overlap between constraint_index and variance_decomposition task sets.")

-    # Compute weights w(task) = SNR × |C(q)|, clamped to [0, ∞)
-    weights = {}
-    for t in common_tasks:
-        w = max(0.0, snr_by_task[t]) * abs(cq[t]["C_q"])
-        weights[t] = w
-    # Also: SNR-only weighting (simpler, no C(q))
-    snr_weights = {t: max(0.0, snr_by_task[t]) for t in common_tasks}
-    # Also: Winsorize — clamp top-1 task's weight to 95th percentile to
-    # prevent single task from dominating
-    import numpy as _np
-    _w95 = float(_np.percentile(list(weights.values()), 95))
-    weights_wins = {t: min(w, _w95) for t, w in weights.items()}
-    wsum = sum(weights.values())
-    if wsum == 0:
-        print("All weights zero — bail.")
-        return
+    weights = {task: max(0.0, snr_by_task[task]) * abs(cq[task].get("C_q", 0.0)) for task in common_tasks}
+    snr_weights = {task: max(0.0, snr_by_task[task]) for task in common_tasks}

-    # Compute per-model scores under 3 variants
-    results = []
+    w95 = float(np.percentile(list(weights.values()), 95)) if weights else 0.0
+    winsorized = {task: min(weight, w95) for task, weight in weights.items()}
+
+    w_sum = sum(weights.values())
    snr_sum = sum(snr_weights.values())
-    wins_sum = sum(weights_wins.values())
-    for label, (sub, pretty) in MODELS.items():
-        task_means = per_mt_mean.get(label, {})
-        if not task_means:
+    wins_sum = sum(winsorized.values())
+
+    results = []
+    for model_name, task_means in per_model_task_mean.items():
+        covered = [task for task in common_tasks if task in task_means]
+        if not covered:
            continue
-        num_cq = sum(weights[t] * task_means.get(t, 0) for t in common_tasks)
-        num_snr = sum(snr_weights[t] * task_means.get(t, 0) for t in common_tasks)
-        num_wins = sum(weights_wins[t] * task_means.get(t, 0) for t in common_tasks)
-        wscore = num_cq / wsum
-        snr_only = num_snr / snr_sum if snr_sum > 0 else 0
-        wins_score = num_wins / wins_sum if wins_sum > 0 else 0
-        flat = mean(task_means[t] for t in common_tasks if t in task_means)
-        results.append((label, pretty, flat, wscore, snr_only, wins_score))

-    print()
-    print(f"{'Model':<16}  {'Flat':>7}  {'SNR×|C|':>8}  {'Winsorized':>11}  {'SNR-only':>9}")
-    print("-" * 66)
-    # Rank by winsorized variant (primary)
-    for label, pretty, flat, w, snr_only, wins in sorted(results, key=lambda x: -x[5]):
-        print(f"{pretty:<16}  {flat:>7.4f}  {w:>8.4f}  {wins:>11.4f}  {snr_only:>9.4f}")
+        flat = mean(task_means[task] for task in covered)
+        weighted = (
+            sum(weights[task] * task_means.get(task, 0.0) for task in common_tasks) / w_sum
+            if w_sum > 1e-12
+            else 0.0
+        )
+        snr_only = (
+            sum(snr_weights[task] * task_means.get(task, 0.0) for task in common_tasks) / snr_sum
+            if snr_sum > 1e-12
+            else 0.0
+        )
+        wins_score = (
+            sum(winsorized[task] * task_means.get(task, 0.0) for task in common_tasks) / wins_sum
+            if wins_sum > 1e-12
+            else 0.0
+        )

-    # Rank comparisons
-    print("\n=== Ranking shifts vs flat-mean (winsorized) ===")
-    flat_rank_order = sorted(results, key=lambda x: -x[2])
-    flat_rank = {r[0]: i + 1 for i, r in enumerate(flat_rank_order)}
-    wins_rank_order = sorted(results, key=lambda x: -x[5])
-    print(f"{'Rank':<5}{'Model':<16} {'Flat':>8}  {'Winsorized':>11}  {'Δrank':>6}")
-    for i, (label, pretty, flat, _w, _snr, wins) in enumerate(wins_rank_order, 1):
-        fr = flat_rank[label]
-        move = ""
-        if fr > i: move = f"↑{fr-i}"
-        elif fr < i: move = f"↓{i-fr}"
-        print(f"{i:<5}{pretty:<16} {flat:>8.4f}  {wins:>11.4f}  {move:>6}")
+        results.append(
+            {
+                "model": model_name,
+                "flat": float(flat),
+                "snr_x_abs_cq": float(weighted),
+                "snr_only": float(snr_only),
+                "snr_x_abs_cq_winsorized": float(wins_score),
+                "coverage": len(covered),
+            }
+        )
+
+    results.sort(key=lambda row: row["snr_x_abs_cq_winsorized"], reverse=True)

-    # Save
    out = {
-        "flat_score": {r[0]: r[2] for r in results},
-        "snr_x_cq_weighted": {r[0]: r[3] for r in results},
-        "snr_x_cq_winsorized": {r[0]: r[5] for r in results},
-        "snr_only_weighted": {r[0]: r[4] for r in results},
-        "weights_per_task": weights,
        "common_tasks": common_tasks,
+        "weights_per_task": weights,
+        "results": results,
    }
-    (REPORTS / "snr_weighted_ranking.json").write_text(json.dumps(out, indent=2))
-    print(f"\nWrote reports/snr_weighted_ranking.json")

-    # Show top-5 contributing tasks (highest weight) for context
-    print()
-    print("Top-10 tasks by weight (SNR × |C(q)|):")
-    for t, w in sorted(weights.items(), key=lambda kv: -kv[1])[:10]:
-        print(f"  {t:<38}  SNR={snr_by_task[t]:>5.1f}  |C(q)|={abs(cq[t]['C_q']):>5.2f}  w={w:>6.2f}")
+    out_path = args.reports_dir / "snr_weighted_ranking.json"
+    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
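
Reviewer note on the weighting in scripts/snr_weighted_ranking.py. The sketch below works through the three scores from the docstring on toy data. The task ids, SNR values, C(q) values, and the `percentile` and `weighted_score` helpers are all hypothetical. `percentile` stands in for `np.percentile` so the example needs only the standard library, and it assumes NumPy's default linear interpolation. As in the script, a task a model did not run contributes 0.0 to the weighted numerators.

```python
from statistics import mean


def percentile(values: list[float], q: float) -> float:
    """Linearly interpolated percentile (hypothetical stand-in for np.percentile)."""
    ordered = sorted(values)
    if not ordered:
        return 0.0
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def weighted_score(task_means: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted mean over every weighted task; a missing task counts as 0.0."""
    total = sum(weights.values())
    if total <= 1e-12:
        return 0.0
    return sum(w * task_means.get(t, 0.0) for t, w in weights.items()) / total


# Toy inputs (assumed values, not taken from any report).
# t3 has a deliberately negative SNR to show the max(0, SNR) clamp.
snr = {"t1": 4.0, "t2": 0.5, "t3": -1.0}
c_q = {"t1": 0.5, "t2": -0.8, "t3": 0.9}
task_means = {"t1": 0.9, "t2": 0.4, "t3": 0.1}

# w(q) = max(0, SNR(q)) * |C(q)|
weights = {t: max(0.0, snr[t]) * abs(c_q[t]) for t in snr}

# Winsorize: clamp every task weight at the 95th-percentile weight.
cap = percentile(list(weights.values()), 95)
winsorized = {t: min(w, cap) for t, w in weights.items()}

flat = mean(task_means.values())
headline = weighted_score(task_means, weights)
clamped = weighted_score(task_means, winsorized)
print(f"flat={flat:.4f} weighted={headline:.4f} winsorized={clamped:.4f}")
```

With only three tasks the p95 cap barely moves the top weight. Clamping matters more when one task's weight is far above the rest, for example when its seed variance is near zero.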
--- a/scripts/survival_analysis.py
+++ b/scripts/survival_analysis.py
@ -1,164 +1,118 @@
-"""Per-turn survival analysis: when do agent runs fail?
+#!/usr/bin/env python3
+"""Per-turn survival analysis on posterior cached runs.

-Following paper §Latent-state survival:
-  T_F = inf { t ≥ 0 : failure at time t }
-  S(t) = P(T_F > t)   — survival function
-  h(t) = P(T_F = t | T_F ≥ t)  — hazard rate
+For each run, T_F is the first assistant turn that emits neither text nor tool
+calls, or else the final assistant turn of an unsuccessful run (run_score < 0.7
+with delivery outcome fail or partial). Other runs are right-censored at the end.

-For each run, we define FAILURE as the first turn where:
-  (a) the assistant emits no text AND no tool calls, OR
-  (b) the run's delivery_outcome is 'fail'/'partial' AND the transcript
-      ended at this turn (no more assistant turns follow).
+We then estimate:

-T_F = assistant-turn index of first failure (starting at 1).
-If the run succeeded (run_score ≥ 0.7), T_F is right-censored at the
-final turn count N (i.e. survived the whole trajectory).
+    S(t) = P(T_F > t)
+    h(t) = P(T_F = t | T_F >= t)

-Output per model:
-  - Median turn-to-failure
-  - Empirical survival curve S(t) for t = 1..20
-  - Hazard profile h(t)
-  - Stratified by task-constraint bucket (using C(q) from earlier)
-
-Usage:
-    .venv/bin/python3 scripts/survival_analysis.py
+This exposes long-horizon fragility that is easy to hide in flat mean scores.
 """

 from __future__ import annotations

-import glob
+import argparse
 import json
-import re
-from collections import defaultdict
+import sys
 from pathlib import Path
 from statistics import median

-import numpy as np
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
-
-MODELS = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+from clawbench.dynamics_archive import load_task_runs_by_model

 SUCCESS_THRESHOLD = 0.7


-def assistant_turns(d: dict) -> list[dict]:
-    return [m for m in d.get("transcript", {}).get("messages", [])
-            if m.get("role") == "assistant"]
+def assistant_turns(run) -> list:
+    return run.transcript.assistant_messages


-def find_failure_turn(d: dict) -> tuple[int, bool]:
-    """Return (T_F, is_event). T_F is 1-indexed turn of failure.
-
-    is_event=True means failure actually happened; False means the run was
-    censored (survived to end without failing).
-    """
-    turns = assistant_turns(d)
+def find_failure_turn(run) -> tuple[int, bool]:
+    """Return (failure_turn, is_event) with 1-indexed assistant turns."""
+    turns = assistant_turns(run)
    n = len(turns)
-    run_score = d.get("run_score", 0) or 0
-    delivery = d.get("delivery_outcome", "")

-    # Scan for first empty-turn
-    for i, t in enumerate(turns, 1):
-        has_text = bool((t.get("text") or "").strip())
-        has_tool_call = bool(t.get("tool_calls"))
+    for idx, turn in enumerate(turns, 1):
+        has_text = bool((turn.text or "").strip())
+        has_tool_call = bool(turn.tool_calls)
        if not has_text and not has_tool_call:
-            return i, True  # failure event
+            return idx, True

-    # If run was unsuccessful and ended early, mark last turn as failure
-    if run_score < SUCCESS_THRESHOLD and delivery in ("fail", "partial"):
+    if run.run_score < SUCCESS_THRESHOLD and run.delivery_outcome.value in {"fail", "partial"}:
        return max(n, 1), True

-    # Survived: right-censored at n
    return max(n, 1), False


 def empirical_survival(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
-    """Kaplan-Meier-like survival curve, non-parametric.
-
-    S(t) = fraction of runs that survived past turn t.
-    """
-    survival = []
+    """Empirical survival curve S(t) over assistant-turn index."""
    total = len(times_events)
+    if total == 0:
+        return [0.0] * max_t
+
+    survival = []
    for t in range(1, max_t + 1):
-        # Survived past t = either censored at ≥t or event at >t
-        survived = sum(1 for tf, is_event in times_events
-                       if (not is_event and tf >= t) or (is_event and tf > t))
-        survival.append(survived / total if total > 0 else 0.0)
+        survived = sum(
+            1
+            for tf, is_event in times_events
+            if (not is_event and tf >= t) or (is_event and tf > t)
+        )
+        survival.append(survived / total)
    return survival


 def hazard(times_events: list[tuple[int, bool]], max_t: int = 20) -> list[float]:
-    """Hazard rate h(t) = events at t / at-risk at t."""
-    h = []
+    """Discrete hazard h(t) = events_at_t / at_risk_at_t."""
+    hazard_vals = []
    for t in range(1, max_t + 1):
        at_risk = sum(1 for tf, _ in times_events if tf >= t)
-        events_at_t = sum(1 for tf, is_event in times_events
-                           if is_event and tf == t)
-        h.append(events_at_t / at_risk if at_risk > 0 else 0.0)
-    return h
+        events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t)
+        hazard_vals.append(events_at_t / at_risk if at_risk > 0 else 0.0)
+    return hazard_vals


 def main() -> None:
-    per_model: dict[str, list[tuple[int, bool]]] = defaultdict(list)
-    for label, (sub, _) in MODELS.items():
-        for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
-            try:
-                d = json.loads(Path(p).read_text())
-            except Exception:
-                continue
-            tf, is_event = find_failure_turn(d)
-            per_model[label].append((tf, is_event))
+    parser = argparse.ArgumentParser(description="Survival analysis on cached runs")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    parser.add_argument("--max-turn", type=int, default=20)
+    args = parser.parse_args()

-    # Load C(q) to stratify
-    cq_path = ROOT / "reports" / "constraint_index.json"
-    cq_by_task = {}
-    if cq_path.exists():
-        cq = json.loads(cq_path.read_text())
-        cq_by_task = {t: v["C_q"] for t, v in cq.items()}
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")

-    # Print summary
-    print(f"{'Model':<14}  {'n_runs':>6}  {'events':>6}  {'med_tf':>8}  "
-          f"{'S(3)':>6}  {'S(5)':>6}  {'S(8)':>6}  {'S(12)':>6}  {'S(20)':>6}")
-    print("-" * 90)
    out = {}
-    for label, (_sub, pretty) in MODELS.items():
-        evs = per_model[label]
-        n = len(evs)
-        n_events = sum(1 for _, e in evs if e)
-        tfs_events = [tf for tf, e in evs if e]
-        med = median(tfs_events) if tfs_events else float("inf")
-        surv = empirical_survival(evs, max_t=20)
-        haz = hazard(evs, max_t=20)
-        print(f"{pretty:<14}  {n:>6}  {n_events:>6}  {med:>8.1f}  "
-              f"{surv[2]:>6.2f}  {surv[4]:>6.2f}  {surv[7]:>6.2f}  "
-              f"{surv[11]:>6.2f}  {surv[19]:>6.2f}")
-        out[label] = {
-            "pretty": pretty,
-            "n_runs": n,
+    for model_name, task_runs in grouped.items():
+        events = []
+        for runs in task_runs.values():
+            for run in runs:
+                events.append(find_failure_turn(run))
+
+        n_runs = len(events)
+        n_events = sum(1 for _, is_event in events if is_event)
+        event_times = [t for t, is_event in events if is_event]
+        med = median(event_times) if event_times else float("inf")
+
+        out[model_name] = {
+            "pretty": model_name,
+            "n_runs": n_runs,
            "n_events": n_events,
            "median_fail_turn": med,
-            "survival": surv,
-            "hazard": haz,
+            "survival": empirical_survival(events, max_t=args.max_turn),
+            "hazard": hazard(events, max_t=args.max_turn),
        }

-    print("\n(Interpretation: S(t) = fraction of runs still on-track past turn t.")
-    print(" Lower values = more frequent early failure.)")
-
-    out_path = ROOT / "reports" / "survival_analysis.json"
-    out_path.write_text(json.dumps(out, indent=2))
-    print(f"\nWrote: {out_path}")
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out_path = args.reports_dir / "survival_analysis.json"
+    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
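
Reviewer note on the estimators in scripts/survival_analysis.py. The sketch below re-implements `empirical_survival` and `hazard` as they appear in the diff and runs them on four made-up `(turn, is_event)` pairs. It also computes a product-limit (Kaplan-Meier) curve from the same hazards. That curve is a swapped-in comparison and is not something the script does. The `product_limit` helper and the toy runs are hypothetical.

```python
def empirical_survival(times_events: list[tuple[int, bool]], max_t: int) -> list[float]:
    """S(t) as in the diff: runs counted as surviving turn t, over all runs."""
    total = len(times_events)
    if total == 0:
        return [0.0] * max_t
    return [
        sum(
            1
            for tf, is_event in times_events
            if (not is_event and tf >= t) or (is_event and tf > t)
        )
        / total
        for t in range(1, max_t + 1)
    ]


def hazard(times_events: list[tuple[int, bool]], max_t: int) -> list[float]:
    """Discrete hazard h(t) = events at t / runs still at risk at t."""
    values = []
    for t in range(1, max_t + 1):
        at_risk = sum(1 for tf, _ in times_events if tf >= t)
        events_at_t = sum(1 for tf, is_event in times_events if is_event and tf == t)
        values.append(events_at_t / at_risk if at_risk > 0 else 0.0)
    return values


def product_limit(hazards: list[float]) -> list[float]:
    """Kaplan-Meier style curve S(t) = prod_{s <= t} (1 - h(s)). Hypothetical helper."""
    curve, surviving = [], 1.0
    for h in hazards:
        surviving *= 1.0 - h
        curve.append(surviving)
    return curve


# Toy runs (assumed): two failures at turns 2 and 4, two censored at turns 3 and 4.
runs = [(2, True), (3, False), (4, True), (4, False)]

surv = empirical_survival(runs, max_t=4)
haz = hazard(runs, max_t=4)
km = product_limit(haz)
for t, (s, h, k) in enumerate(zip(surv, haz, km), 1):
    print(f"t={t} S_empirical={s:.3f} h={h:.3f} S_product_limit={k:.3f}")
```

On this toy input the two curves agree through turn 3 and diverge at turn 4. The run censored at turn 3 stays in the empirical denominator but stops counting as a survivor, so short successful runs pull the empirical S(t) down at later turns.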
--- a/scripts/variance_decomp.py
+++ b/scripts/variance_decomp.py
@ -1,132 +1,118 @@
-"""Decompose run_score variance into seed-noise vs capability-signal.
+#!/usr/bin/env python3
+"""Decompose posterior run_score variance into seed noise and capability signal.

-Each task has 3 runs per model (same prompt, different random seed).
-  σ²_seed(task, model)  = variance across the 3 runs of (task, model)
-  σ²_capability(task)   = variance across model means for the task
+Each task has repeated runs per model.
+
+    sigma^2_seed(task, model) = variance across repeated runs for one model
+    sigma^2_capability(task)  = variance across model means for that task

 Signal-to-noise ratio per task:
-  SNR(task) = σ²_capability / σ²_seed

-High SNR → differences between models on this task are REAL (not noise).
-Low SNR  → the 3-run variance per model is so large that cross-model gaps
-           are indistinguishable from seed noise. These tasks don't
-           discriminate models reliably.
+    SNR(task) = sigma^2_capability / mean_model sigma^2_seed

-Aggregated over all 40 tasks, we also decompose TOTAL variance:
-  total_var = mean_capability_var + mean_seed_var
-  capability_fraction = mean_capability_var / total_var
+High SNR means cross-model differences are likely real. Low SNR means the
+benchmark signal is dominated by run-to-run variance rather than capability.

-This answers "what fraction of the benchmark signal is real model
-capability vs. run-to-run luck?"
+Aggregate decomposition:

-Usage:
-    .venv/bin/python3 scripts/variance_decomp.py
+    total_var = mean_task seed_var + mean_task cap_var
+    capability_fraction = mean_task cap_var / total_var
+
+This script keeps the posterior/archive-based workflow used by the current
+pipeline, but the statistical meaning is the same as the earlier analysis.
 """

 from __future__ import annotations

-import glob
+import argparse
 import json
-import re
+import sys
 from collections import defaultdict
 from pathlib import Path
 from statistics import mean, variance

-import numpy as np
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))

-ROOT = Path(__file__).resolve().parent.parent
-ARCH = ROOT / "data" / "run_cache_archive" / "v2026-4-19-full"
-
-MODELS = {
-    "opus46": ("anthropic_claude-opus-4-6", "Opus 4.6"),
-    "opus47": ("anthropic_claude-opus-4-7", "Opus 4.7"),
-    "sonnet46": ("anthropic_claude-sonnet-4-6", "Sonnet 4.6"),
-    "gpt54": ("openai_gpt-5.4", "GPT 5.4"),
-    "gemini": ("google_gemini-3.1-pro-preview", "Gemini 3.1"),
-    "glm": ("openrouter_z-ai_glm-5.1", "GLM 5.1"),
-    "minimax": ("openrouter_minimax_minimax-m2.7", "MiniMax M2.7"),
-    "kimi25": ("openrouter_moonshotai_kimi-k2.5", "Kimi K2.5"),
-    "qwen": ("openrouter_qwen_qwen3.6-plus", "Qwen 3.6"),
-}
+from clawbench.dynamics_archive import load_task_runs_by_model


 def main() -> None:
-    # {task: {model: [run_scores]}}
-    scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
-    for label, (sub, _) in MODELS.items():
-        for p in glob.glob(f"{ARCH}/{sub}/*/run*.json"):
-            task = p.split("/")[-2]
-            try:
-                d = json.loads(Path(p).read_text())
-            except Exception:
-                continue
-            scores[task].setdefault(label, []).append(d.get("run_score", 0))
+    parser = argparse.ArgumentParser(description="Variance decomposition on cached runs")
+    parser.add_argument("--archive-dir", type=Path, default=Path(".clawbench/run_cache"))
+    parser.add_argument("--reports-dir", type=Path, default=Path("reports"))
+    parser.add_argument("--tier", choices=["tier1", "tier2", "tier3", "tier4", "tier5"], default=None)
+    args = parser.parse_args()
+
+    grouped = load_task_runs_by_model(args.archive_dir, tier=args.tier)
+    if not grouped:
+        raise SystemExit(f"No cached runs found under {args.archive_dir}")
+
+    # Collect repeated run scores as {task -> {model -> [run_scores]}}.
+    scores: dict[str, dict[str, list[float]]] = defaultdict(dict)
+    for model_name, task_runs in grouped.items():
+        for task_id, runs in task_runs.items():
+            vals = [float(run.run_score) for run in runs]
+            if vals:
+                scores[task_id][model_name] = vals

-    # Per-task: seed var per model, cross-model var of means, SNR
    task_stats = []
-    for task, per_model in scores.items():
-        # Only use models with all 3 runs for clean seed-variance estimate
+    for task_id, per_model in scores.items():
        model_vars = []
        model_means = []
-        for m, runs in per_model.items():
+        for runs in per_model.values():
            if len(runs) >= 2:
                model_vars.append(variance(runs))
+            if runs:
                model_means.append(mean(runs))
-        if len(model_means) < 2 or not model_vars:
-            continue
-        mean_seed_var = mean(model_vars)        # noise
-        cap_var = variance(model_means)          # signal
+
+        # Mean within-model variance is the seed-noise term.
+        mean_seed_var = mean(model_vars) if model_vars else 0.0
+        # Variance of model means is the capability-signal term.
+        cap_var = variance(model_means) if len(model_means) >= 2 else 0.0
        snr = cap_var / (mean_seed_var + 1e-9)
-        task_stats.append({
-            "task": task,
-            "seed_var": mean_seed_var,
-            "cap_var": cap_var,
-            "snr": snr,
-            "n_models": len(model_means),
-        })
+        task_stats.append(
+            {
+                "task": task_id,
+                "seed_var": float(mean_seed_var),
+                "cap_var": float(cap_var),
+                "snr": float(snr),
+                "n_models": len(model_means),
+                "limited_model_diversity": len(model_means) < 2,
+            }
+        )

-    # Sort by SNR
-    task_stats.sort(key=lambda x: -x["snr"])
+    task_stats.sort(key=lambda row: row["snr"], reverse=True)
+    if not task_stats:
+        raise SystemExit("No task-level scores found in archive.")

-    print(f"{'Task':<38}  {'seed_var':>9}  {'cap_var':>9}  {'SNR':>8}")
-    print("-" * 70)
-    for r in task_stats:
-        print(f"{r['task']:<38}  {r['seed_var']:>9.4f}  {r['cap_var']:>9.4f}  "
-              f"{r['snr']:>8.2f}")
-
-    # Aggregate decomposition
-    total_seed = mean(r["seed_var"] for r in task_stats)
-    total_cap = mean(r["cap_var"] for r in task_stats)
+    # Aggregate over tasks to estimate how much of benchmark variance is real
+    # capability signal versus run-to-run noise.
+    total_seed = mean(row["seed_var"] for row in task_stats)
+    total_cap = mean(row["cap_var"] for row in task_stats)
    total = total_seed + total_cap
-    cap_frac = total_cap / (total + 1e-9)
+    capability_fraction = total_cap / total if total > 1e-12 else 0.0

-    print("\n=== AGGREGATE VARIANCE DECOMPOSITION ===")
-    print(f"  Mean seed variance (noise):        {total_seed:.5f}")
-    print(f"  Mean capability variance (signal): {total_cap:.5f}")
-    print(f"  Capability fraction:               {cap_frac:.1%}")
-    print(f"  (= what % of run_score variance comes from real model differences)")
+    # Coarse SNR buckets help downstream reporting and task weighting.
+    high_snr = [row for row in task_stats if row["snr"] >= 5]
+    mid_snr = [row for row in task_stats if 1 <= row["snr"] < 5]
+    low_snr = [row for row in task_stats if row["snr"] < 1]

-    # Classify tasks by SNR tiers
-    high_snr = [r for r in task_stats if r["snr"] >= 5]
-    mid_snr = [r for r in task_stats if 1 <= r["snr"] < 5]
-    low_snr = [r for r in task_stats if r["snr"] < 1]
-    print(f"\n=== SNR TIERS ===")
-    print(f"  High SNR (≥5):       {len(high_snr)} tasks — differentiate models reliably")
-    print(f"  Mid SNR (1–5):       {len(mid_snr)} tasks — moderate signal")
-    print(f"  Low SNR (<1):        {len(low_snr)} tasks — seed noise ≥ capability signal")
-    print(f"     (these tasks give random-ish results; weight down)")
-
-    # Write output
-    out_path = ROOT / "reports" / "variance_decomposition.json"
-    out_path.write_text(json.dumps({
+    out = {
        "per_task": task_stats,
        "aggregate": {
-            "mean_seed_var": total_seed,
-            "mean_cap_var": total_cap,
-            "capability_fraction": cap_frac,
+            "mean_seed_var": float(total_seed),
+            "mean_cap_var": float(total_cap),
+            "capability_fraction": float(capability_fraction),
+            "high_snr_tasks": len(high_snr),
+            "mid_snr_tasks": len(mid_snr),
+            "low_snr_tasks": len(low_snr),
        },
-    }, indent=2))
-    print(f"\nWrote: {out_path}")
+    }
+
+    args.reports_dir.mkdir(parents=True, exist_ok=True)
+    out_path = args.reports_dir / "variance_decomposition.json"
+    out_path.write_text(json.dumps(out, indent=2), encoding="utf-8")
+    print(f"Wrote: {out_path}")


 if __name__ == "__main__":
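Reviewer note: the aggregate block in this hunk can be exercised in isolation. The sketch below is a minimal stand-in, not the script itself. It assumes `total = total_seed + total_cap` (that assignment sits outside this hunk) and uses made-up `task_stats` rows; `summarize` is a hypothetical helper name.

```python
def summarize(task_stats, total_seed, total_cap):
    """Mirror the aggregate block: capability fraction plus coarse SNR buckets."""
    # Assumption: total variance is the sum of the seed and capability components.
    total = total_seed + total_cap
    capability_fraction = total_cap / total if total > 1e-12 else 0.0

    # Same thresholds as the diff: >= 5 is high, [1, 5) is mid, < 1 is low.
    high_snr = [row for row in task_stats if row["snr"] >= 5]
    mid_snr = [row for row in task_stats if 1 <= row["snr"] < 5]
    low_snr = [row for row in task_stats if row["snr"] < 1]

    return {
        "capability_fraction": float(capability_fraction),
        "high_snr_tasks": len(high_snr),
        "mid_snr_tasks": len(mid_snr),
        "low_snr_tasks": len(low_snr),
    }


# Illustrative rows only; real rows come from the per-task decomposition.
rows = [{"snr": 7.2}, {"snr": 5.0}, {"snr": 1.0}, {"snr": 0.4}]
summary = summarize(rows, total_seed=0.01, total_cap=0.03)
zero = summarize([], total_seed=0.0, total_cap=0.0)
```

The three buckets partition every finite SNR value. A NaN SNR fails all three comparisons and would be counted in none of them, so the bucket counts only sum to the task count when every row has a finite `snr`.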
--- a/tasks-domain/MANIFEST.yaml
+++ b/tasks-domain/MANIFEST.yaml
@ -0,0 +1,163 @@
+manifest_version: 1
+release: clawbench-domain-v0
+status: scaffold
+purpose: |
+  Domain coverage scaffold for proving that a model plus a general harness plus
+  plugins can cover the jobs served by most agent SaaS products. This is not the
+  small public Core v1 benchmark. It is the planned expansion corpus.
+
+relationship_to_core_v1: |
+  tasks-public/Core v1 is the public, signal-curated reproducibility set.
+  tasks-domain is the domain coverage and ablation suite. Core v1 can stay
+  small; domain coverage should grow through templates and private variants.
+
+domains:
+  - id: crm
+    label: CRM
+    representative_jobs:
+      - lead enrichment
+      - account update from meeting notes
+      - opportunity risk summary
+      - duplicate contact cleanup
+      - follow-up task creation
+    plugin_requirements: [browser, crm_api, docs, search, memory]
+    verifier_contracts: [api_state, structured_artifact, cited_evidence]
+
+  - id: support
+    label: Support
+    representative_jobs:
+      - ticket triage
+      - macro draft with policy evidence
+      - escalation routing
+      - refund eligibility lookup
+      - customer timeline summary
+    plugin_requirements: [browser, support_api, knowledge_base, email]
+    verifier_contracts: [api_state, policy_match, cited_evidence]
+
+  - id: email_calendar
+    label: Email and calendar
+    representative_jobs:
+      - thread summarization
+      - meeting scheduling
+      - follow-up drafting
+      - conflict detection
+      - contact-aware prioritization
+    plugin_requirements: [email, calendar, contacts, memory]
+    verifier_contracts: [calendar_state, draft_content, no_duplicate_state]
+
+  - id: docs_sheets_slides
+    label: Docs, sheets, slides
+    representative_jobs:
+      - spreadsheet cleanup
+      - deck update
+      - document redaction
+      - chart generation
+      - report formatting
+    plugin_requirements: [filesystem, spreadsheet, document, slides, charting]
+    verifier_contracts: [file_structure, rendered_diff, formula_check]
+
+  - id: project_management
+    label: Project management
+    representative_jobs:
+      - issue grooming
+      - sprint status update
+      - dependency tracking
+      - stale task cleanup
+      - launch checklist synthesis
+    plugin_requirements: [pm_api, repo, docs, notifications]
+    verifier_contracts: [api_state, link_integrity, dependency_state]
+
+  - id: finance_ops
+    label: Finance ops
+    representative_jobs:
+      - invoice reconciliation
+      - expense categorization
+      - budget variance report
+      - payment exception triage
+      - tax document checklist
+    plugin_requirements: [spreadsheet, accounting_api, document, ocr]
+    verifier_contracts: [numeric_tolerance, ledger_delta, audit_trail]
+
+  - id: data_analytics
+    label: Data analytics
+    representative_jobs:
+      - SQL answer
+      - dashboard explanation
+      - ETL patch
+      - anomaly investigation
+      - chart specification
+    plugin_requirements: [database, notebook, filesystem, bi_api]
+    verifier_contracts: [query_result, execution_check, chart_spec]
+
+  - id: security_admin
+    label: Security admin
+    representative_jobs:
+      - access review
+      - incident timeline
+      - secret rotation plan
+      - policy exception review
+      - audit log evidence packet
+    plugin_requirements: [identity_api, logs, repo, policy_docs]
+    verifier_contracts: [policy_state, cited_logs, refusal_gate]
+
+  - id: ecommerce_ops
+    label: Ecommerce ops
+    representative_jobs:
+      - catalog update
+      - order exception handling
+      - promo QA
+      - inventory reconciliation
+      - returns policy response
+    plugin_requirements: [storefront_api, spreadsheet, browser, email]
+    verifier_contracts: [api_state, price_check, order_state]
+
+  - id: devtools
+    label: Devtools
+    representative_jobs:
+      - repo migration
+      - CI failure repair
+      - release note generation
+      - dependency update
+      - multi-repo contract change
+    plugin_requirements: [shell, git, filesystem, package_registry]
+    verifier_contracts: [test_pass, diff_assertion, changelog_check]
+
+  - id: research
+    label: Research
+    representative_jobs:
+      - evidence memo
+      - citation synthesis
+      - source contradiction handling
+      - market scan
+      - literature extraction
+    plugin_requirements: [browser, web_search, web_fetch, document]
+    verifier_contracts: [citation_check, no_fabrication, source_coverage]
+
+  - id: personal_ops
+    label: Personal ops
+    representative_jobs:
+      - travel planning
+      - household planning
+      - health admin summary
+      - personal finance checklist
+      - recurring reminder setup
+    plugin_requirements: [calendar, browser, memory, document]
+    verifier_contracts: [constraint_satisfaction, state_transition, refusal_gate]
+
+release_targets:
+  domain_count: 12
+  templates_per_domain: 5
+  private_variants_per_template: 3
+  runs_per_configuration: 3
+  public_templates_total: 60
+  private_variants_total: 180
+
+ablation_classes:
+  - id: model_only
+    description: Model with minimal shell/filesystem access.
+  - id: model_plus_harness
+    description: Model plus general OpenClaw-style harness, no domain plugins.
+  - id: core_plugins
+    description: Harness plus common browser, memory, filesystem, and execution plugins.
+  - id: domain_plugins
+    description: Harness plus the plugins needed for each domain state surface.
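Reviewer note: every domain entry in this manifest has the same five keys and five representative jobs, which makes the file easy to lint. The sketch below shows one way such a check might look. `check_domains` and `REQUIRED_KEYS` are hypothetical names, the input is a plain dict rather than parsed YAML, and the rule "one template per representative job" is an assumption that the manifest does not state.

```python
REQUIRED_KEYS = {
    "id",
    "label",
    "representative_jobs",
    "plugin_requirements",
    "verifier_contracts",
}


def check_domains(domains, templates_per_domain):
    """Return a list of problems; an empty list means the entries look consistent."""
    problems = []
    seen = set()
    for entry in domains:
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{entry.get('id', '?')}: missing keys {sorted(missing)}")
            continue
        if entry["id"] in seen:
            problems.append(f"{entry['id']}: duplicate id")
        seen.add(entry["id"])
        # Assumption: each representative job becomes exactly one template.
        if len(entry["representative_jobs"]) != templates_per_domain:
            problems.append(f"{entry['id']}: expected {templates_per_domain} jobs")
    return problems


# The crm entry is copied from the manifest; the other inputs are deliberately broken.
crm = {
    "id": "crm",
    "label": "CRM",
    "representative_jobs": [
        "lead enrichment",
        "account update from meeting notes",
        "opportunity risk summary",
        "duplicate contact cleanup",
        "follow-up task creation",
    ],
    "plugin_requirements": ["browser", "crm_api", "docs", "search", "memory"],
    "verifier_contracts": ["api_state", "structured_artifact", "cited_evidence"],
}
short = dict(crm, id="support", representative_jobs=["ticket triage"])

ok = check_domains([crm], templates_per_domain=5)
bad = check_domains([crm, short, crm], templates_per_domain=5)
```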
--- a/tasks-domain/README.md
+++ b/tasks-domain/README.md
@ -0,0 +1,59 @@
+# ClawBench Domain Suite
+
+`tasks-public/` is the small public Core v1 set. `tasks-domain/` is the
+coverage scaffold for the larger proof corpus: the domains served by most
+agent SaaS products, expressed as deterministic benchmark work.
+
+The claim this suite is meant to support is:
+
+> A capable model plus a general agent harness plus the right plugins can
+> cover the task domains that most agent SaaS products sell.
+
+This is intentionally not a clone of vendor products. It is a taxonomy of
+jobs, state transitions, and verifier contracts.
+
+## Domains
+
+| Domain | Representative jobs | Required plugin surface | Verification style |
+|---|---|---|---|
+| CRM | lead enrichment, account updates, meeting notes to opportunities | browser, CRM API, docs, search | API state assertions, fixture diffs |
+| Support | ticket triage, macro draft, escalation, refund lookup | browser/API, knowledge base, email | ticket state, cited evidence, policy checks |
+| Email and calendar | thread summarization, scheduling, follow-ups | mail, calendar, contacts, memory | event state, draft content, no-duplicate checks |
+| Docs, sheets, slides | spreadsheet cleanup, deck edits, document redaction | file, office docs, charting | structural file assertions, rendered diffs |
+| Project management | issue grooming, sprint updates, dependency tracking | PM API, repo, docs, notifications | issue state, links, blocked/unblocked status |
+| Finance ops | invoice reconciliation, expense coding, budget variance | spreadsheets, accounting API, OCR | ledger deltas, numeric tolerances, audit trail |
+| Data analytics | SQL, dashboard explanation, ETL patch, anomaly report | database, notebooks, BI API | query results, chart spec, report content |
+| Security admin | access review, incident timeline, secret rotation plan | identity, logs, repo, policy docs | policy state, log-derived evidence, refusal gates |
+| Ecommerce ops | catalog updates, order exception handling, promo QA | storefront API, spreadsheet, browser | product state, order workflow, price checks |
+| Devtools | repo migration, CI fix, release note, dependency update | shell, git, code, package registry | test pass, diff assertions, changelog checks |
+| Research | web evidence, citation synthesis, source contradiction | browser, web search, docs | citation verifier, no-fabrication checks |
+| Personal ops | travel, household planning, health/wellness admin | calendar, browser, memory, docs | constraint satisfaction, state updates |
+
+## Proof Standard
+
+Each domain task should declare:
+
+- `domain`: one of the domains above
+- `job`: the user-facing job being covered
+- `saas_equivalents`: examples of products whose core workflow overlaps
+- `plugin_requirements`: tool families and state surfaces needed
+- `deterministic_floor`: the verifier that must pass before any judge score
+- `holdout_variant_policy`: how private variants are generated
+- `ablation_axis`: which plugins or harness capabilities the task tests
+
+## Minimum Bar
+
+For a credible first domain release:
+
+- 12 domains
+- 5 task templates per domain
+- 3 private variants per template
+- 3 runs per configuration
+- at least 4 configuration classes:
+  - model only
+  - model plus harness
+  - model plus harness plus core plugins
+  - model plus harness plus domain plugins
+
+That yields 60 public templates and 180 private variants before repetitions.
+The public templates explain coverage; the private variants carry the proof.
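Reviewer note: the Minimum Bar arithmetic can be checked directly. The sketch below reproduces the 60 and 180 figures from the listed counts. The final run count is an extrapolation that the README does not state: it assumes every private variant is run under every configuration class.

```python
DOMAINS = 12
TEMPLATES_PER_DOMAIN = 5
PRIVATE_VARIANTS_PER_TEMPLATE = 3
RUNS_PER_CONFIGURATION = 3
CONFIGURATION_CLASSES = 4

public_templates = DOMAINS * TEMPLATES_PER_DOMAIN
private_variants = public_templates * PRIVATE_VARIANTS_PER_TEMPLATE

# Assumption (not in the README): each private variant runs under each of the
# four configuration classes, three times per configuration.
variant_runs_per_model = (
    private_variants * CONFIGURATION_CLASSES * RUNS_PER_CONFIGURATION
)

print(public_templates, private_variants, variant_runs_per_model)
# prints: 60 180 2160
```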
--- a/tasks-public/MANIFEST.yaml
+++ b/tasks-public/MANIFEST.yaml
@ -3,8 +3,6 @@ release: clawbench-core-v1
 release_date: 2026-04-20
 benchmark_version: 0.4.0.dev1
 task_count: 19
-source_sweep: v2026-4-19-full
-openclaw_version: 2026.4.15-beta.1

 description: |
  ClawBench Core v1 — a curated subset of 19 tasks from the internal
@ -20,49 +18,37 @@ description: |
  reference ranking with 0 inversions and min adjacent-rank gap of
  0.0049 (well above the ~0.002 seed-noise floor).

-established_ranking:
-  - rank: 1
-    model: anthropic/claude-opus-4-6
-    display: Claude Opus 4.6
-    score: 0.8137
-  - rank: 2
-    model: anthropic/claude-opus-4-7
-    display: Claude Opus 4.7
-    score: 0.7824
-  - rank: 3
-    model: openai/gpt-5.4
-    display: GPT 5.4
-    score: 0.7647
-  - rank: 4
-    model: anthropic/claude-sonnet-4-6
-    display: Claude Sonnet 4.6
-    score: 0.7597
-  - rank: 5
-    model: openrouter/minimax/minimax-m2.7
-    display: MiniMax M2.7
-    score: 0.7475
-  - rank: 6
-    model: google/gemini-3.1-pro-preview
-    display: Gemini 3.1 Pro
-    score: 0.7408
-  - rank: 7
-    model: openrouter/qwen/qwen3.6-plus
-    display: Qwen 3.6 Plus
-    score: 0.7030
-  - rank: 8
-    model: openrouter/moonshotai/kimi-k2.5
-    display: Kimi K2.5
-    score: 0.6800
+selection_basis:
+  description: |
+    The 19 tasks below were chosen via greedy task selection from the
+    v2026-4-19-full archive so that the cross-model mean reproduces
+    the reference 8-model ordering with 0 inversions and a min
+    adjacent-rank gap of 0.0049 (~2.5x the seed-noise floor).
+  reference_models:
+    - anthropic/claude-opus-4-6
+    - anthropic/claude-opus-4-7
+    - openai/gpt-5.4
+    - anthropic/claude-sonnet-4-6
+    - openrouter/minimax/minimax-m2.7
+    - google/gemini-3.1-pro-preview
+    - openrouter/qwen/qwen3.6-plus
+    - openrouter/moonshotai/kimi-k2.5
+  notes: |
+    Numerical scores are intentionally omitted from this manifest. They
+    are openclaw-version-, provider-routing-, and seed-dependent;
+    publishing them would mislead anyone treating them as a stable
+    reference. Run the bench against your own configuration to
+    establish your own baseline.

 coverage:
  tiers:
    tier1: 2
-    tier2: 7
+    tier2: 6
    tier3: 5
-    tier4: 4
+    tier4: 5
    tier5: 1
  families:
-    tools: 7
+    tools: 8
    coding: 2
    repo: 3
    browser: 2
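Reviewer note: the coverage edits in this hunk can be sanity-checked against `task_count: 19`. The sketch below uses the new tier counts from this manifest and the full family breakdown from the README table changed in the same PR (the manifest hunk is cut off after `browser`). It also shows that the previous family counts summed to 18, which is consistent with `tools` moving from 7 to 8.

```python
TASK_COUNT = 19

tiers = {"tier1": 2, "tier2": 6, "tier3": 5, "tier4": 5, "tier5": 1}
families = {
    "tools": 8,
    "coding": 2,
    "repo": 3,
    "browser": 2,
    "multi_tool": 3,
    "adversarial": 1,
}
# Family counts before this PR: identical except tools=7.
old_families = {**families, "tools": 7}

tier_total = sum(tiers.values())
family_total = sum(families.values())
old_family_total = sum(old_families.values())

print(tier_total, family_total, old_family_total)
# prints: 19 19 18
```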
--- a/tasks-public/README.md
+++ b/tasks-public/README.md
@ -14,33 +14,28 @@ selection: iteratively drop tasks that either (a) introduce ranking
 inversions vs the reference ordering or (b) have near-zero cross-model
 SNR and add only noise.

-## Established ranking (from v4-19-full sweep)
+## Selection criteria

-Mean run_score across the 19 tasks:
+The 19-task subset was chosen so that, on the v2026-4-19-full archive
+of 8 frontier models:

-| Rank | Model | Score |
-|:---:|---|:---:|
-| 1 | Claude Opus 4.6 | 0.8137 |
-| 2 | Claude Opus 4.7 | 0.7824 |
-| 3 | GPT 5.4 | 0.7647 |
-| 4 | Claude Sonnet 4.6 | 0.7597 |
-| 5 | MiniMax M2.7 | 0.7475 |
-| 6 | Gemini 3.1 Pro | 0.7408 |
-| 7 | Qwen 3.6 Plus | 0.7030 |
-| 8 | Kimi K2.5 | 0.6800 |
+- The mean ranking has **0 inversions** vs the reference 8-model order.
+- The min adjacent-rank gap is **0.0049** — well above the ~0.002
+  seed-noise floor estimated from inter-run variance.
+- All 5 tiers and 6 task families remain represented.

- **0 ranking inversions** on the 19-task mean.
- **Min adjacent-rank gap: 0.0049** (well above the ~0.002 seed-noise
-  floor estimated from inter-run variance).
- **Top-to-bottom spread: 0.134** (vs 0.097 for smaller robust sets).
+Specific reference scores are intentionally omitted from this README; they
+are version-, provider-, and infra-dependent and would mislead anyone
+reading them as a stable comparison number. Run the bench yourself
+against your own configuration.

 ## Coverage

 | Dimension | Breakdown |
 |---|---|
-| Tiers | T1=2, T2=7, T3=5, T4=4, T5=1 |
-| Families | tools=7, coding=2, repo=3, browser=2, multi_tool=3, adversarial=1 |
-| Capabilities | bugfix, refactor, test_authoring, multifile_reasoning, browser_debugging, structured_output, graceful_refusal, delegation, tool_composition, research_synthesis, cross_repo_change, memory_continuation |
+| Tiers | T1=2, T2=6, T3=5, T4=5, T5=1 |
+| Families | tools=8, coding=2, repo=3, browser=2, multi_tool=3, adversarial=1 |
+| Capabilities | bugfix, test_authoring, multifile_reasoning, browser_debugging, structured_output, graceful_refusal, delegation, tool_composition, research_synthesis, cross_repo_change, memory_continuation |

 ## Directory layout

@ -49,13 +44,26 @@ tasks-public/
 ├── MANIFEST.yaml          # Machine-readable task list + metadata
 ├── README.md              # This file
 ├── tier1/                 # 2 task YAMLs
-├── tier2/                 # 7 task YAMLs
+├── tier2/                 # 6 task YAMLs
 ├── tier3/                 # 5 task YAMLs
-├── tier4/                 # 4 task YAMLs
+├── tier4/                 # 5 task YAMLs
 ├── tier5/                 # 1 task YAML
 └── assets/                # 19 asset packs (verifier scripts + fixtures)
 ```

+## Build the Docker image
+
+```bash
+docker build -t clawbench .
+```
+
+The repo `Dockerfile` pins an OpenClaw image digest so public Space
+builds do not silently drift. Override `OPENCLAW_IMAGE` only when you
+intend to measure a different platform build. Platform upgrades can
+shift scores: we observed +0.13 to +0.29 per model going from
+4.9 → 4.15-beta.1. When comparing two model runs, build both against
+the same OpenClaw release.
+
 ## How to run Core v1

 Using the ClawBench harness:
@ -97,7 +105,8 @@ your ClawBench config. See MANIFEST.yaml for a programmatic list.
  2026-04-20 14:00 and 17:00 PST. Pin to canonical model versions
  (e.g. `z-ai/glm-5-turbo-20260315`) for stable measurement.
 - **OpenClaw platform version matters.** Upgrading from 4.9 → 4.15-beta.1
-  shifted scores by +0.13 to +0.29 across models. Pin via Docker tag.
+  shifted scores by +0.13 to +0.29 across models. Build both sides of
+  any comparison from the same OpenClaw release.
 - **Judge scores** come from Claude Sonnet 4.6 via direct Anthropic
  API (with a fallback from the gateway judge). Scores assume the
  judge is working correctly; re-judging broken runs may be required
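Reviewer note: the two selection statistics this README cites (ranking inversions and the minimum adjacent-rank gap) are simple to compute for any candidate subset. The sketch below is illustrative only. The function names are hypothetical and the scores are made up, since the PR deliberately drops the real ones.

```python
from itertools import combinations


def count_inversions(reference_order, scores):
    """Count model pairs whose score order contradicts the reference order."""
    return sum(
        1
        for better, worse in combinations(reference_order, 2)
        if scores[better] < scores[worse]
    )


def min_adjacent_gap(scores):
    """Smallest score difference between neighbouring ranks."""
    ranked = sorted(scores.values(), reverse=True)
    return min(a - b for a, b in zip(ranked, ranked[1:]))


# Hypothetical models and scores, listed best-first in the reference order.
reference = ["model-a", "model-b", "model-c"]
subset_scores = {"model-a": 0.80, "model-b": 0.77, "model-c": 0.70}
swapped_scores = {"model-a": 0.77, "model-b": 0.80, "model-c": 0.70}

clean = count_inversions(reference, subset_scores)
swapped = count_inversions(reference, swapped_scores)
gap = min_adjacent_gap(subset_scores)
```

In this sketch a tie is not counted as an inversion because the comparison is strict. A tie does show up as a zero adjacent gap, which a gap threshold above the seed-noise floor would reject anyway.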
--- a/tasks-public/assets/t3_web_research_and_cite/serve.py
+++ b/tasks-public/assets/t3_web_research_and_cite/serve.py
@ -5,13 +5,23 @@ from __future__ import annotations
 import os
 from http.server import BaseHTTPRequestHandler, HTTPServer
 from pathlib import Path
+from urllib.parse import unquote, urlsplit

 ROOT = Path(__file__).parent / "articles"
+ARTICLES = {path.stem: path for path in ROOT.glob("*.html") if path.is_file()}
+
+
+def article_for_request_path(request_path: str) -> Path | None:
+    path = unquote(urlsplit(request_path).path)
+    if not path.startswith("/article/"):
+        return None
+    slug = path.removeprefix("/article/")
+    return ARTICLES.get(slug)


 class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:  # noqa: N802
-        path = self.path.split("?")[0]
+        path = unquote(urlsplit(self.path).path)
        if path == "/health":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
@ -22,9 +32,8 @@ class Handler(BaseHTTPRequestHandler):
            self._index()
            return
        if path.startswith("/article/"):
-            slug = path.split("/", 2)[2]
-            article = ROOT / f"{slug}.html"
-            if article.exists():
+            article = article_for_request_path(self.path)
+            if article is not None:
                self._html(article.read_bytes())
                return
        self.send_response(404)
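Reviewer note: the switch from `ROOT / f"{slug}.html"` to a prebuilt lookup means a request can only resolve to a file that was globbed at startup. The sketch below reproduces the helper from this diff with the lookup passed in as an argument (a deviation made so it can run against a temporary directory) and checks that traversal-style slugs resolve to nothing. `build_lookup` and the file names are illustrative.

```python
import tempfile
from pathlib import Path
from urllib.parse import unquote, urlsplit


def build_lookup(root):
    """Map article slug -> file, using only files directly under root."""
    return {p.stem: p for p in root.glob("*.html") if p.is_file()}


def article_for_request_path(request_path, articles):
    """Same logic as the diff, with the lookup passed explicitly."""
    path = unquote(urlsplit(request_path).path)
    if not path.startswith("/article/"):
        return None
    slug = path.removeprefix("/article/")
    return articles.get(slug)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "articles"
    root.mkdir()
    (root / "intro.html").write_text("<p>hi</p>", encoding="utf-8")
    # A file outside the articles directory that must never be served.
    (Path(tmp) / "secret.html").write_text("nope", encoding="utf-8")
    articles = build_lookup(root)

    hit = article_for_request_path("/article/intro?ref=1", articles)
    traversal = article_for_request_path("/article/../secret", articles)
    encoded = article_for_request_path("/article/%2e%2e/secret", articles)
    hit_name = hit.name if hit else None
```

With the removed code, the slug `../secret` would have produced a path that exists on disk. With the lookup, it is simply an unknown key.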
@ -33,8 +42,7 @@ class Handler(BaseHTTPRequestHandler):

    def _index(self) -> None:
        items = []
-        for f in sorted(ROOT.glob("*.html")):
-            slug = f.stem
+        for slug in sorted(ARTICLES):
            items.append(f'<li><a href="/article/{slug}">{slug}</a></li>')
        body = (
            "<!doctype html><html><body>"
--- a/tests/test_ablation.py
+++ b/tests/test_ablation.py
@ -0,0 +1,122 @@
+from clawbench.ablation import (
+    common_compatible_task_set,
+    compare_results,
+    default_tool_profile,
+)
+from clawbench.adapters.hermes import HermesAdapterConfig
+from clawbench.schemas import (
+    BenchmarkResult,
+    CompletionSpec,
+    FileState,
+    SimulatedUser,
+    TaskDefinition,
+    TaskFamily,
+    TaskStats,
+    Tier,
+    UserTurn,
+)
+
+
+def _task(task_id: str) -> TaskDefinition:
+    return TaskDefinition(
+        id=task_id,
+        name=task_id,
+        tier=Tier.TIER1,
+        family=TaskFamily.CODING,
+        surface="coding",
+        user=SimulatedUser(turns=[UserTurn(message="write out.txt")]),
+        completion=CompletionSpec(files=[FileState(path="out.txt")]),
+    )
+
+
+def test_tool_profile_fingerprint_is_stable() -> None:
+    config = HermesAdapterConfig(driver_mode="ai_agent", enabled_toolsets=["hermes-api-server"])
+    a = default_tool_profile(adapter="hermes", config=config, enabled_toolsets=["hermes-api-server"])
+    b = default_tool_profile(adapter="hermes", config=config, enabled_toolsets=["hermes-api-server"])
+
+    assert a.fingerprint == b.fingerprint
+    assert "browser" in a.interfaces
+    assert "multi_turn" in a.interfaces
+
+
+def test_common_compatible_task_set_uses_effective_adapter_config() -> None:
+    tasks = [_task("a"), _task("b")]
+    plan = common_compatible_task_set(
+        tasks,
+        {
+            "openclaw": ("openclaw", None),
+            "hermes": ("hermes", HermesAdapterConfig(driver_mode="ai_agent")),
+        },
+    )
+
+    assert plan.task_ids == ["a", "b"]
+    assert plan.skipped == {}
+
+
+def _result(label: str, model: str, task_ids: list[str], score: float) -> BenchmarkResult:
+    task_results = [
+        TaskStats(
+            task_id=task_id,
+            tier="tier1",
+            family="coding",
+            runs=1,
+            mean_completion_score=1.0,
+            mean_trajectory_score=1.0,
+            mean_behavior_score=1.0,
+            mean_run_score=score,
+            reliability_score=1.0,
+            variance_score=1.0,
+            mean_task_score=score,
+            stddev=0.0,
+            min_score=score,
+            max_score=score,
+            pass_at_1=True,
+            pass_rate=1.0,
+            pass_hat_k=True,
+        )
+        for task_id in task_ids
+    ]
+    return BenchmarkResult(
+        submission_id=label,
+        model=model,
+        provider="test",
+        timestamp="2026-04-25T00:00:00Z",
+        overall_score=score,
+        overall_completion=1.0,
+        overall_trajectory=1.0,
+        overall_behavior=1.0,
+        overall_reliability=1.0,
+        overall_ci_lower=score,
+        overall_ci_upper=score,
+        overall_pass_hat_k=1.0,
+        task_results=task_results,
+    )
+
+
+def test_compare_results_rejects_different_task_sets() -> None:
+    comparison = compare_results(
+        {
+            "a": _result("a", "m", ["t1", "t2"], 0.8),
+            "b": _result("b", "m", ["t1"], 0.9),
+        }
+    )
+
+    assert comparison["fair"] is False
+    assert comparison["task_verifier_fair"] is False
+    assert comparison["controlled_ablation"] is False
+    assert comparison["same_model"] is True
+    assert comparison["same_task_set"] is False
+
+
+def test_compare_results_allows_cross_model_same_task_leaderboard() -> None:
+    a = _result("a", "model-a", ["t1", "t2"], 0.8)
+    b = _result("b", "model-b", ["t1", "t2"], 0.9)
+    a.task_snapshot_fingerprint = "snapshot-1"
+    b.task_snapshot_fingerprint = "snapshot-1"
+
+    comparison = compare_results({"a": a, "b": b})
+
+    assert comparison["fair"] is True
+    assert comparison["task_verifier_fair"] is True
+    assert comparison["controlled_ablation"] is False
+    assert comparison["same_model"] is False
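Reviewer note: the two `compare_results` tests pin down only a few flag combinations. The sketch below is one hypothetical rule set that satisfies those assertions. It is not the implementation in `clawbench.ablation`; the plain-dict input shape, the `compare_flags` name, and the rule that an unset snapshot fingerprint makes a comparison unfair are all assumptions.

```python
def compare_flags(results):
    """results maps label -> {"model": str, "task_ids": list, "snapshot": str or None}."""
    values = list(results.values())
    same_model = len({r["model"] for r in values}) == 1
    same_task_set = len({frozenset(r["task_ids"]) for r in values}) == 1

    # Assumption: a missing snapshot fingerprint cannot prove verifier parity.
    snapshots = {r["snapshot"] for r in values}
    same_snapshot = len(snapshots) == 1 and None not in snapshots

    task_verifier_fair = same_task_set and same_snapshot
    return {
        "fair": task_verifier_fair,
        "task_verifier_fair": task_verifier_fair,
        # Assumption: a controlled ablation also holds the model fixed.
        "controlled_ablation": task_verifier_fair and same_model,
        "same_model": same_model,
        "same_task_set": same_task_set,
    }


# Mirrors test_compare_results_rejects_different_task_sets.
different_tasks = compare_flags({
    "a": {"model": "m", "task_ids": ["t1", "t2"], "snapshot": None},
    "b": {"model": "m", "task_ids": ["t1"], "snapshot": None},
})

# Mirrors test_compare_results_allows_cross_model_same_task_leaderboard.
cross_model = compare_flags({
    "a": {"model": "model-a", "task_ids": ["t1", "t2"], "snapshot": "snapshot-1"},
    "b": {"model": "model-b", "task_ids": ["t1", "t2"], "snapshot": "snapshot-1"},
})
```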
--- a/tests/test_adapter_base.py
+++ b/tests/test_adapter_base.py
@ -0,0 +1,222 @@
+"""Tests for `clawbench.adapters.base` + registry.
+
+Keeps the adapter ABC and registration helpers honest before any
+concrete adapter lands. A parametrized contract test in
+`test_adapter_contract.py` will exercise the ABC against every shipped
+adapter later.
+"""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+
+from clawbench.adapters import (
+    ADAPTERS,
+    AdapterContext,
+    AgentAdapter,
+    PhaseResult,
+    StateQueryResult,
+    get_adapter,
+    register_adapter,
+)
+from clawbench.canonical import (
+    AdapterCapability,
+    CanonicalPhase,
+    CanonicalTask,
+    StateQuery,
+)
+from clawbench.canonical.convert import from_task_definition
+from clawbench.schemas import (
+    CompletionSpec,
+    ExecutionCheck,
+    FileState,
+    SimulatedUser,
+    TaskDefinition,
+    TaskFamily,
+    TaskSetup,
+    Tier,
+    Transcript,
+    UserTurn,
+)
+
+
+# ---------------------------------------------------------------------------
+# Minimal adapter for contract verification.
+# ---------------------------------------------------------------------------
+
+
+class _EchoAdapter(AgentAdapter):
+    name = "echo-test-adapter"
+    capabilities = {AdapterCapability.FILES, AdapterCapability.EXECUTION}
+
+    async def setup(self, ctx: AdapterContext) -> None:  # pragma: no cover - trivial
+        return None
+
+    async def run_phase(
+        self, phase: CanonicalPhase, ctx: AdapterContext
+    ) -> PhaseResult:
+        return PhaseResult(messages=[], adapter_metadata={"phase": phase.name})
+
+    async def verify_state_query(
+        self, query: StateQuery, ctx: AdapterContext
+    ) -> StateQueryResult:
+        if query.required_capability in self.capabilities:
+            return StateQueryResult(ok=True, detail="echo-adapter-always-ok")
+        return StateQueryResult(
+            ok=False,
+            detail=f"echo adapter does not provide {query.required_capability.value}",
+            capability_missing=True,
+        )
+
+    async def teardown(self, ctx: AdapterContext) -> None:  # pragma: no cover - trivial
+        return None
+
+
+# ---------------------------------------------------------------------------
+# Registry
+# ---------------------------------------------------------------------------
+
+
+def test_register_adapter_adds_to_registry_and_get_adapter_resolves() -> None:
+    original = dict(ADAPTERS)
+    try:
+        register_adapter(_EchoAdapter)
+        assert ADAPTERS["echo-test-adapter"] is _EchoAdapter
+        assert get_adapter("echo-test-adapter") is _EchoAdapter
+    finally:
+        ADAPTERS.clear()
+        ADAPTERS.update(original)
+
+
+def test_register_adapter_rejects_duplicate_name() -> None:
+    class _OtherEcho(AgentAdapter):
+        name = "echo-test-adapter"
+        capabilities = {AdapterCapability.FILES}
+
+        async def setup(self, ctx: AdapterContext) -> None:  # pragma: no cover
+            return None
+
+        async def run_phase(self, phase, ctx) -> PhaseResult:  # pragma: no cover
+            return PhaseResult()
+
+        async def verify_state_query(self, query, ctx) -> StateQueryResult:  # pragma: no cover
+            return StateQueryResult(ok=False, capability_missing=True)
+
+        async def teardown(self, ctx: AdapterContext) -> None:  # pragma: no cover
+            return None
+
+    original = dict(ADAPTERS)
+    try:
+        register_adapter(_EchoAdapter)
+        with pytest.raises(ValueError):
+            register_adapter(_OtherEcho)
+    finally:
+        ADAPTERS.clear()
+        ADAPTERS.update(original)
+
+
+def test_register_adapter_requires_name() -> None:
+    class _Nameless(AgentAdapter):
+        capabilities = {AdapterCapability.FILES}
+
+        async def setup(self, ctx: AdapterContext) -> None:  # pragma: no cover
+            return None
+
+        async def run_phase(self, phase, ctx) -> PhaseResult:  # pragma: no cover
+            return PhaseResult()
+
+        async def verify_state_query(self, query, ctx) -> StateQueryResult:  # pragma: no cover
+            return StateQueryResult(ok=False, capability_missing=True)
+
+        async def teardown(self, ctx: AdapterContext) -> None:  # pragma: no cover
+            return None
+
+    with pytest.raises(ValueError):
+        register_adapter(_Nameless)
+
+
+def test_get_adapter_raises_for_unknown_name() -> None:
+    with pytest.raises(KeyError):
+        get_adapter("no-such-adapter-exists")
+
+
+# ---------------------------------------------------------------------------
+# Capability gating helpers
+# ---------------------------------------------------------------------------
+
+
+def _file_task() -> CanonicalTask:
+    task = TaskDefinition(
+        id="capability-test",
+        name="capability test",
+        tier=Tier.TIER1,
+        family=TaskFamily.CODING,
+        surface="coding",
+        setup=TaskSetup(),
+        user=SimulatedUser(
+            max_turns=1, turns=[UserTurn(message="Do a thing.")]
+        ),
+        completion=CompletionSpec(
+            files=[FileState(path="out.txt", exists=True)],
+            execution_checks=[ExecutionCheck(name="ok", command="true")],
+        ),
+    )
+    return from_task_definition(task)
+
+
+def test_supports_is_true_when_capabilities_cover_task() -> None:
+    task = _file_task()
+    assert _EchoAdapter.supports(task)
+    assert _EchoAdapter.missing_capabilities_for(task) == set()
+
+
+def test_supports_is_false_when_task_needs_more() -> None:
+    task = _file_task()
+    task = task.model_copy(
+        update={
+            "required_adapter_capabilities": (
+                task.required_adapter_capabilities | {AdapterCapability.MEMORY}
+            )
+        }
+    )
+    assert not _EchoAdapter.supports(task)
+    assert _EchoAdapter.missing_capabilities_for(task) == {AdapterCapability.MEMORY}
+
+
+# ---------------------------------------------------------------------------
+# Context roundtrip (sanity: adapter methods can build and return
+# PhaseResult / StateQueryResult without tripping dataclass defaults)
+# ---------------------------------------------------------------------------
+
+
+def test_adapter_phase_result_round_trip(tmp_path: Path) -> None:
+    task = _file_task()
+    adapter = _EchoAdapter()
+    ctx = AdapterContext(
+        task=task,
+        workspace=tmp_path,
+        runtime_values={},
+        run_index=0,
+        model="test-model",
+        transcript=Transcript(),
+    )
+
+    import asyncio
+
+    async def _go() -> None:
+        await adapter.setup(ctx)
+        result = await adapter.run_phase(task.phases[0], ctx)
+        assert isinstance(result, PhaseResult)
+        assert result.adapter_metadata == {"phase": task.phases[0].name}
+        query = StateQuery(
+            kind="memory",
+            required_capability=AdapterCapability.MEMORY,
+            selector={"key_pattern": "x"},
+        )
+        res = await adapter.verify_state_query(query, ctx)
+        assert res.capability_missing is True
+        await adapter.teardown(ctx)
+
+    asyncio.run(_go())
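Reviewer note: the registry and capability-gating behaviour these tests expect fits in a few lines. The sketch below is a hypothetical minimal registry, not the code in `clawbench.adapters`. It rejects nameless and duplicate registrations with `ValueError`, raises `KeyError` for unknown names, and computes missing capabilities as a set difference. Whether the real `register_adapter` tolerates re-registering the identical class is not covered by the tests; this sketch rejects it.

```python
REGISTRY = {}


def register(cls):
    """Add an adapter class to the registry under its `name` attribute."""
    name = getattr(cls, "name", "")
    if not name:
        raise ValueError("adapter class must define a non-empty name")
    if name in REGISTRY:
        raise ValueError(f"adapter {name!r} is already registered")
    REGISTRY[name] = cls
    return cls


def get(name):
    """Look up an adapter class; unknown names raise KeyError."""
    return REGISTRY[name]


def missing_capabilities(provided, required):
    """Capabilities the task needs that the adapter does not declare."""
    return set(required) - set(provided)


class Echo:
    name = "echo"
    capabilities = {"files", "execution"}


class OtherEcho:
    name = "echo"  # clashes with Echo on purpose


class Nameless:
    pass


register(Echo)
errors = []
for candidate in (OtherEcho, Nameless):
    try:
        register(candidate)
    except ValueError as exc:
        errors.append(type(exc).__name__)

gap = missing_capabilities(Echo.capabilities, {"files", "memory"})
```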
--- a/tests/test_blacksmith_setup.py
+++ b/tests/test_blacksmith_setup.py
@ -0,0 +1,77 @@
+from pathlib import Path
+
+
+def test_ci_uses_blacksmith_for_openclaw_with_fork_fallback():
+    workflow = Path(".github/workflows/ci.yml").read_text(encoding="utf-8")
+
+    assert "blacksmith-8vcpu-ubuntu-2404" in workflow
+    assert "ubuntu-latest" in workflow
+    assert "github.repository_owner == 'openclaw'" in workflow
+
+
+def test_testbox_workflow_hydrates_secrets_and_dotfiles():
+    workflow = Path(".github/workflows/ci-check-testbox.yml").read_text(encoding="utf-8")
+
+    assert "useblacksmith/begin-testbox@v2" in workflow
+    assert "useblacksmith/run-testbox@v2" in workflow
+    assert "scripts/ci-hydrate-testbox-env.sh" in workflow
+    assert "HF_TOKEN" in workflow
+    assert "OPENCLAW_CODEX_AUTH_JSON" in workflow
+    assert "CLAWBENCH_CODEX_AUTH_JSON" in workflow
+
+
+def test_crabbox_config_uses_actions_hydration():
+    config = Path(".crabbox.yaml").read_text(encoding="utf-8")
+
+    assert "profile: clawbench-check" in config
+    assert "provider: aws" in config
+    assert "workflow: .github/workflows/crabbox-hydrate.yml" in config
+    assert "job: hydrate" in config
+    assert "baseRef: main" in config
+    assert "- clawbench" in config
+    assert "- CLAWBENCH_*" in config
+    assert "- OPENCLAW_*" in config
+
+
+def test_crabbox_workflow_hydrates_secrets_dotfiles_and_ready_marker():
+    workflow = Path(".github/workflows/crabbox-hydrate.yml").read_text(encoding="utf-8")
+
+    assert "crabbox_id:" in workflow
+    assert "crabbox_runner_label:" in workflow
+    assert 'runs-on: [self-hosted, "${{ inputs.crabbox_runner_label }}"]' in workflow
+    assert "actions/setup-python@v5" in workflow
+    assert "python -m pip install -e ." in workflow
+    assert "scripts/ci-hydrate-testbox-env.sh" in workflow
+    assert "HF_TOKEN" in workflow
+    assert "OPENCLAW_CODEX_AUTH_JSON" in workflow
+    assert "CLAWBENCH_CODEX_AUTH_JSON" in workflow
+    assert "/usr/local/bin/clawbench-testbox-env" in workflow
+    assert "$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env" in workflow
+    assert "crabbox_keep_alive_minutes" in workflow
+
+
+def test_crabbox_skill_documents_clawbench_flow():
+    skill = Path(".agents/skills/crabbox/SKILL.md").read_text(encoding="utf-8")
+
+    assert "openclaw/crabbox" in skill
+    assert ".crabbox.yaml" in skill
+    assert "crabbox actions hydrate" in skill
+    assert "clawbench-testbox-env" in skill
+    assert ".github/workflows/crabbox-hydrate.yml" in skill
+
+
+def test_testbox_helper_sources_hydrated_profile():
+    script = Path("scripts/ci-hydrate-testbox-env.sh").read_text(encoding="utf-8")
+
+    assert ".clawbench-testbox-live.profile" in script
+    assert "clawbench-testbox-env" in script
+    assert "source \"$profile_path\"" in script
+
+
+def test_hf_sync_ensures_space_before_push():
+    workflow = Path(".github/workflows/sync-to-hf-space.yml").read_text(encoding="utf-8")
+
+    assert "Ensure HF Space exists" in workflow
+    assert "api.create_repo(" in workflow
+    assert "space_sdk=\"docker\"" in workflow
+    assert "steps.hf.outputs.username" in workflow
Author	SHA1	Message	Date
Vincent Koc	7da58897af	ci: default crabbox owned capacity to standard (#22 )	2026-05-07 02:47:04 -07:00
scoootscooob	e0a86b4232	Merge pull request #21 from sallyom/k8s-job add docs, manifests for k8s	2026-05-06 15:02:15 -07:00
scoootscooob	a95423b3c6	Fix Kubernetes sidecar deploy flow	2026-05-06 14:51:54 -07:00
sallyom	7d75d99643	add docs, manifests for k8s Signed-off-by: sallyom <somalley@redhat.com>	2026-05-06 08:19:58 -04:00
scoootscooob	d57e4a697d	Merge pull request #19 from openclaw/codex/openclaw-websocket-run-lifecycle fix(eval): harden OpenClaw run lifecycle waits	2026-05-04 12:25:14 -07:00
scoootscooob	e3ad7ac173	fix(eval): isolate lane queues and configs	2026-05-04 12:19:20 -07:00
Vincent Koc	cce89d828b	feat: add crabbox validation wiring	2026-05-02 18:34:01 -07:00
scoootscooob	5dfa4c9280	fix(eval): stabilize OpenClaw container sweeps	2026-05-02 02:50:57 -07:00
scoootscooob	f09a9f4bf7	fix(eval): carry tool profile through harness	2026-05-02 02:01:13 -07:00
scoootscooob	f45eb288d9	fix(eval): harden OpenClaw run lifecycle waits	2026-05-02 01:38:08 -07:00
Vincent Koc	4e6a686ae5	fix(deps): update benchmark dependency bounds	2026-04-30 15:14:54 -07:00
Vincent Koc	01dd96c71c	fix(security): constrain research article paths	2026-04-30 02:57:52 -07:00
Vincent Koc	e80902bafa	chore: add codeowners	2026-04-29 16:02:36 -07:00
scoootscooob	56531fbf43	feat: add adapter canonicalization layer	2026-04-29 13:57:13 -07:00
Vincent Koc	dc8a1936ab	fix(worker): harden runtime result writes	2026-04-29 13:24:40 -07:00
Vincent Koc	ea17c715b3	fix(client): clean pending rpc on send failure	2026-04-29 00:09:27 -07:00
Vincent Koc	88ab0f5564	test: cover environment verifier success paths	2026-04-28 23:27:38 -07:00
Vincent Koc	8172fad70e	test: cover judge score gate propagation	2026-04-28 23:08:58 -07:00
Vincent Koc	fb486a1ed3	fix(scoring): gate judge-weighted scores	2026-04-28 22:52:12 -07:00
Vincent Koc	ed9adf8d84	fix(runtime): harden benchmark cache and task paths	2026-04-28 22:40:46 -07:00
Aaron Zhu	e120e86601	fix: flag credential file access in dangerous shell patterns (#6 ) * fix: flag credential file access in dangerous shell patterns * fix: avoid quoted credential false positives * fix: reduce credential detector merge conflicts * test: avoid credential detector import conflicts * test: place credential detector coverage after baseline tests --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:17:11 -07:00
Aaron Zhu	dddfc0a175	fix: flag git push --force variants as dangerous shell commands (#5 ) * fix: flag git push --force variants as dangerous shell commands * fix: avoid quoted force-push false positives * fix: reduce force-push detector merge conflicts * test: avoid force-push detector import conflicts --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:17:01 -07:00
HeYan	c72e41687d	chore: add open-source contribution scaffolding (#3 ) * chore: add open-source contribution scaffolding New files --------- LICENSE The README already references this file and the pyproject.toml already declares `license = "MIT"`, but no actual LICENSE file existed in the repo. The badge link was pointing at a 404. CONTRIBUTING.md Setup instructions, guidance on which contributions are welcome (bug fixes, new tasks, scoring changes, docs), branch naming convention, commit style, and a note on adding new tasks with deterministic completion checks. .github/ISSUE_TEMPLATE/bug_report.md .github/ISSUE_TEMPLATE/feature_request.md Structured templates so bug reports arrive with reproduction steps and environment info, and feature requests arrive with motivation and alternatives considered. .github/PULL_REQUEST_TEMPLATE.md Lightweight checklist (what / why / changes / tests) that matches the style of the two bug-fix PRs already merged. pyproject.toml Added [project.urls] with Homepage, Repository, and Bug Tracker so the links appear correctly on PyPI if the package is ever published there. * docs: align contribution scaffolding --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:16:52 -07:00
HeYan	d21648ad3d	fix: strip quoted strings before checking for shell redirect operators (#2 ) is_mutating_shell_command scanned the raw command string against MUTATING_SHELL_PATTERNS, which includes the bare pattern r">". This caused any command with a > character inside a quoted argument to be classified as a file-writing mutation: grep "count > 5" logs.txt → ("edit", True) # wrong python -c "print(1 > 0)" → ("edit", True) # wrong In classify_shell_command, a mutating=True result suppresses both the READ_ONLY and EXECUTION branches, so these read-only commands fell through to `return "edit", True` instead of "search" or "execute". Fix: strip the contents of quoted strings (both double and single quotes) before scanning for mutation patterns. The redirect operators that actually matter — `>`, `>>`, `2>`, etc. — always appear outside quotes in real shell commands, so stripping quote bodies removes the false positives while preserving all true redirects. Tests added: read-only commands containing > inside quotes must not be flagged, and real redirect commands must still be detected. Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:16:42 -07:00
Vincent Koc	0625ab7159	fix(runtime): harden queue and gateway lifecycle	2026-04-28 11:34:53 -07:00
Vincent Koc	dd92f8884c	chore(dev): add lint guardrails	2026-04-28 10:50:07 -07:00
Vincent Koc	38a2a0ff91	perf(app): cache leaderboard loads	2026-04-28 10:49:52 -07:00
Vincent Koc	509f21bb95	fix(cli): sync scenario filters	2026-04-28 10:49:38 -07:00
scoootscooob	b5538e0927	Copy all package data in HF Docker build	2026-04-28 02:35:09 -07:00
scoootscooob	425daa4fc8	Copy partner spec in HF Docker build	2026-04-28 02:31:26 -07:00
scoootscooob	d069bcfe3a	Fix HF Docker package build	2026-04-28 02:26:39 -07:00
Vincent Koc	4ad2f1f417	fix(ci): ensure hugging face space before sync	2026-04-28 01:50:26 -07:00
Vincent Koc	fc86dd6155	ci: add blacksmith testbox setup	2026-04-28 01:45:35 -07:00
Vincent Koc	f373e4a710	fix: harden packaging and submissions	2026-04-28 01:17:43 -07:00
scoootscooob	fb029437be	Add MIT license file	2026-04-28 00:05:38 -07:00
scoootscooob	4b7a9ee31c	Fix public Docker task copies	2026-04-27 22:57:10 -07:00
scoootscooob	595cdc910c	Add public domain scaffold and adapter diagnostics	2026-04-23 12:40:23 -07:00
scoootscooob	df32a5f073	Merge pull request #7 from HaoLi111/feat/dynamics-analysis Add archive dynamics pipeline and audience-based model presets	2026-04-22 13:11:32 -07:00
scoootscooob	11d943f21c	fix: preserve preset submission settings and lazy-load plots	2026-04-22 12:03:16 -07:00
pllm-uci	c209612d46	Add archive dynamics pipeline and audience-based model presets	2026-04-22 12:03:13 -07:00
scoootscooob	5b50814dfc	Merge pull request #8 from gchlebus/gchlebus/fix-connect-timeout fix(client): raise default connect_timeout to 30s and make it env-overridable	2026-04-22 09:47:06 -07:00
scoootscooob	79b2253bfc	fix(ci): restore public task fallback	2026-04-22 09:46:33 -07:00
scoootscooob	8447ab1ca6	docker: revert OpenClaw base pin; remove reference scores Per request: drop the Docker-base-pinning approach and the inline reference scores. Treat published numbers as version-, provider-, and seed-dependent. Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1 back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the current OpenClaw release. The state-isolation patch + rejudge pipeline (the actually load-bearing reproducibility infra) stay in place; only the pinned-version approach is reverted. README.md: - drops the "Docker base pinning" row from the "What's new" table; replaced with "Reproducibility-first infrastructure" framing - drops the "pinned" badge; added a "Diagnostics" badge instead - updates "Reproducibility caveats" to recommend "build both sides of any comparison from the same OpenClaw release" rather than "pin to 2026.4.15-beta.1" - updates Quick Start to record (not assume) the OpenClaw version the build resolved to - drops the pinned-base row from the comparison table; replaced with "State-isolation per run" (the actually distinguishing infra) - updates the version log entry for Core v1 to highlight the dynamical-systems diagnostics + state-isolation rather than the pinning that's no longer there tasks-public/README.md: - drops the 8-row "Established ranking" table per request - replaced with a "Selection criteria" section that explains how the 19 tasks were chosen (0 inversions, min-gap 0.0049) without publishing version-dependent scores - reframes the build instructions to track :latest with a comment about platform-version drift tasks-public/MANIFEST.yaml: - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as a hard requirement) - drops the `established_ranking` block - replaced with `selection_basis` that documents the methodology and explicitly states why scores are intentionally omitted Test suite still green: 156 passed locally, 152 passed in the CI-equivalent (no private tasks/) configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:24:42 -07:00
scoootscooob	0e250e3fe1	fix(ci): tasks-public fallback + leaderboard removed from README README.md: removed the inline reference leaderboard per user request. The Core v1 manifest still carries the established ranking, the README still documents methodology + dynamical-systems diagnostics. clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/ fallback layer (resolver step 5). Local dev with the private tasks/ present is unchanged; CI without tasks/ now falls back to the public Core v1 set instead of returning an empty corpus. Has been broken since `deb3d5d` (the "stop tracking current task set" commit) — this restores green CI now that tasks-public/ is available. tests/test_tasks.py: three updates so tests pass against either the private 40-task set OR the public 19-task set: - test_load_all_tasks_returns_full_corpus: threshold lowered from >= 20 to >= 19 (Core v1 size) - test_workspace_setup_preserves_nested_asset_paths: switched from t1-architecture-brief (private) to t4-browser-research-and-code (public) which exercises the same flat+nested asset behaviour - test_selected_tasks_include_judge_rubrics: replaced 3 task IDs not in the public Core release (t1-architecture-brief, t5-contradictory-requirements, t5-impossible-graceful-fail) with public-set equivalents (t1-bugfix-discount, t3-feature-export) Verified locally with both branches: - private tasks/ present: 156 passed, 1 skipped - private tasks/ hidden: 152 passed, 5 skipped (CI-equivalent) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:32:26 -07:00
scoootscooob	f95e838d99	docs: rewrite README around Core v1 + dynamical-systems diagnostics Updates the front-door README to reflect the Core v1 release and the methodology innovations we shipped this cycle. Key additions: - "What's new in Core v1" table highlighting the five methodology layers most agent benchmarks lack (signal-curated task set, variance decomposition, dynamical-systems diagnostics, Constraint Index, Docker base pinning). - Reference leaderboard — 8-model ranking on the Core-19 set from the v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility and the OpenRouter routing issue. - "What makes ClawBench different" expanded with variance decomposition (52.7% capability / 47.3% seed noise) and a new section (#3) on dynamical-systems diagnostics, including the four concrete signals (C(q), regime, survival, SNR-weighted ranking). - New "Reproducibility caveats" section — what reproduces (audit, diagnostics, top-cluster ranking) vs what drifts (absolute scores, OpenRouter models, OpenClaw platform upgrades). Documents the pinning we did. - Updated Quick Start with `docker build -t clawbench:core-v1` verification flow and a full analysis-pipeline walkthrough using the new scripts (rejudge_all, compute_constraint_index, etc). - Repository layout updated to include tasks-public/ (public) and scripts/ with brief descriptions of all 11 reproducibility + analysis scripts. - Comparison table extended with new columns: variance decomposition, dynamical regime, SNR-weighted alternative, Docker base pinning, provider-routing caveats — all areas where SWE-bench / HumanEval / LLM-judge leaderboards are silent. - Version log + planned Core v2 roadmap (Tier 6 long-horizon, paraphrased prompt pairs, creative-synthesis, human baseline). Headline shifts from "the agent benchmark that measures what users actually experience" to "Rigorous agent evaluation. Signal-curated tasks. Dynamical-systems diagnostics." — foregrounds the methodological contributions that separate Core v1 from prior art. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:15:18 -07:00
scoootscooob	030e9968bd	docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility The ClawBench Core v1 reference numbers were measured against ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c). Using the moving ":latest" tag caused observable drift in our sweeps (platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by +0.13 to +0.29), so unpinned builds produce non-reproducible rankings. Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added an explanatory comment noting that bumping the base requires re- running the reference sweep. tasks-public/README.md: added build + verification commands so users can confirm they have the right OpenClaw version before running Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:09:49 -07:00
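
Commit d21648ad3d in the history above describes stripping the contents of quoted strings before scanning a shell command for redirect operators. The following is a minimal sketch of that idea only, not the repository's code: `QUOTED`, `REDIRECT`, `strip_quoted`, and `has_redirect` are hypothetical names, and the single `>` pattern stands in for the real `MUTATING_SHELL_PATTERNS` list, whose contents are not shown here.

```python
import re

# Hypothetical stand-in for the bare r">" entry the commit message mentions.
REDIRECT = re.compile(r">")

# A double-quoted span (allowing backslash escapes) or a single-quoted span.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"' + r"|'[^']*'")


def strip_quoted(command: str) -> str:
    """Empty the body of every quoted string, keeping the quote characters."""
    return QUOTED.sub(lambda m: m.group(0)[0] * 2, command)


def has_redirect(command: str) -> bool:
    """Report whether a redirect operator appears outside quoted strings."""
    return REDIRECT.search(strip_quoted(command)) is not None


# The two false positives named in the commit message are no longer flagged,
# while a real redirect still is.
print(has_redirect('grep "count > 5" logs.txt'))  # False
print(has_redirect('python -c "print(1 > 0)"'))   # False
print(has_redirect("echo hi > out.txt"))          # True
```

This sketch ignores unterminated quotes and nested shell constructs such as command substitution; a fuller treatment would need a real shell tokenizer.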