[BREAKGLASS] The agent benchmark that scores the full stack — harness, config, and model — not just the LLM. Trace-based scoring, reliability metrics, configuration diagnostics. https://huggingface.co/spaces/ScoootScooob/clawbench

Go to file

scoootscooob abf3500f69 Some checks failed CI / Python ${{ matrix.python-version }} test suite (3.11) (push) Has been cancelled Details CI / Python ${{ matrix.python-version }} test suite (3.12) (push) Has been cancelled Details fix(harness): keep gateway RPC sockets alive		2026-05-02 14:51:52 -07:00
.agents/skills/blacksmith-testbox	ci: add blacksmith testbox setup	2026-04-28 01:45:35 -07:00
.github	fix(ci): ensure hugging face space before sync	2026-04-28 01:50:26 -07:00
baselines	baselines: merge provenance docs into BASELINE_SOURCES.md	2026-04-10 20:36:18 -07:00
clawbench	fix(harness): keep gateway RPC sockets alive	2026-05-02 14:51:52 -07:00
docs	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
patches	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
profiles	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
scripts	chore(repo): clean public benchmark surface	2026-05-02 12:18:58 -07:00
tasks-domain	Add public domain scaffold and adapter diagnostics	2026-04-23 12:40:23 -07:00
tasks-public	chore(repo): clean public benchmark surface	2026-05-02 12:18:58 -07:00
tests	fix(harness): keep gateway RPC sockets alive	2026-05-02 14:51:52 -07:00
.dockerignore	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
.gitignore	tasks: stop tracking current task set; fix t2 integration test for emptyNote	2026-04-19 12:29:52 -07:00
.python-version	fix: preserve preset submission settings and lazy-load plots	2026-04-22 12:03:16 -07:00
app.py	fix: harden packaging and submissions	2026-04-28 01:17:43 -07:00
CLAWBENCH_V0_4_SPEC.md	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
docker-compose.yml	worker: harden gateway runtime and resume behavior	2026-04-11 15:27:14 -07:00
Dockerfile	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
Dockerfile.clawbench-426-agent-hotfix	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
Dockerfile.gbrain	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
Dockerfile.main	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
Dockerfile.openclaw-426-agent-hotfix	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
LICENSE	Add MIT license file	2026-04-28 00:05:38 -07:00
PARTNER_TRACE_SPEC.md	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
pyproject.toml	feat(eval): stabilize full-suite adapter runs	2026-05-02 10:24:03 -07:00
README.md	chore(repo): clean public benchmark surface	2026-05-02 12:18:58 -07:00
SPACE_README.md	chore(repo): clean public benchmark surface	2026-05-02 12:18:58 -07:00

README.md

title	emoji	colorFrom	colorTo	sdk	app_port	pinned	license
ClawBench	🦞	red	yellow	docker	7860	true	mit

ClawBench

Trace-scored agent evaluation for OpenClaw.

What This Repo Contains

ClawBench evaluates AI agents by running real local tasks, capturing the execution trace, and scoring both the final state and the process used to get there.

The public repository contains:

tasks-public/: Core v1, a 19-task public reproducibility suite.
clawbench/: the benchmark harness, adapters, canonical task conversion, scoring, statistics, and diagnostics.
profiles/: example model/profile definitions.
scripts/: reusable analysis and container runner utilities.
tests/: unit and integration coverage for the public harness.

The private holdout is intentionally not included:

private task YAML files,
private task assets and verifier scripts,
private expected outputs,
private run traces, logs, and per-task reports.

Internal hidden-suite runs can restore a private tasks/ directory locally. The public code is designed to run without that directory by falling back to tasks-public/.

Core v1

Core v1 is a signal-curated 19-task public release selected from the internal development pool. It preserves tier and family coverage while avoiding tasks whose public release would leak holdout material or add mostly run-to-run noise.

Dimension	Breakdown
Tasks	19
Runs per official comparison	3 per task
Total runs per model	57
Tiers	T1=2, T2=6, T3=5, T4=5, T5=1
Families	tools=8, coding=2, repo=3, browser=2, multi_tool=3, adversarial=1

The manifest is the source of truth:

python3 - <<'PY'
import yaml
manifest = yaml.safe_load(open("tasks-public/MANIFEST.yaml"))
for task in manifest["tasks"]:
    print(task["id"])
PY

Scoring

Each run is scored from four signals:

Axis	Weight	What it measures
Completion	40%	Deterministic task checks such as tests, exact outputs, DOM assertions, and file verification
Trajectory	30%	Tool-use quality such as read-before-write, self-verification, recovery, and tool-family fit
Behavior	20%	Planning, progress updates, blocker handling, and destructive-command avoidance
Judge	Up to 10%	Optional semantic quality, gated so it cannot rescue failed deterministic checks

Reliability is first-class. Official comparisons run each task three times and report per-task variance, pass rate, pass^k, confidence intervals, and worst-of-n style robustness signals.

Quick Start

Install locally:

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .

List public tasks:

clawbench list-tasks --tasks-dir tasks-public

Run a small public smoke:

export OPENCLAW_GATEWAY_TOKEN=<your-token>

clawbench run \
  --model anthropic/claude-opus-4-6 \
  --runs 1 \
  --task t1-bugfix-discount \
  --task t1-fs-quick-note \
  --output results/public_smoke.json

Run the full Core v1 task list:

TASK_ARGS=$(python3 - <<'PY'
import yaml
manifest = yaml.safe_load(open("tasks-public/MANIFEST.yaml"))
print(" ".join(f"--task {task['id']}" for task in manifest["tasks"]))
PY
)

clawbench run \
  --model anthropic/claude-opus-4-6 \
  --runs 3 \
  --concurrency 4 \
  $TASK_ARGS \
  --output results/core_v1_opus46.json

Build the public Space image:

docker build -t clawbench .
docker run --rm --entrypoint openclaw clawbench --version

Hidden-Suite Reproduction

The hidden full-suite runner is public, but the task content is not. To rerun an internal hidden-suite comparison, restore the private task archive into ./tasks/ before building the hidden eval image. Do not commit that directory, its logs, or generated per-task traces.

docker build -f Dockerfile.openclaw-426-agent-hotfix \
  -t openclaw-426-agent-hotfix:latest .

docker build -f Dockerfile.clawbench-426-agent-hotfix \
  -t clawbench-openclaw-426-agent-hotfix:latest .

The public repo intentionally does not include exact private task IDs, prompts, assets, expected artifacts, or trace-derived private reports.

Analysis Tools

Reusable scripts that operate on public or private result archives:

scripts/container_lane_eval.sh: isolated OpenClaw lane runner.
scripts/container_adapter_eval.sh: adapter/model runner for fair adapter comparisons.
scripts/run_posterior_dynamics_pipeline.py: one-shot offline dynamics analysis.
scripts/compute_constraint_index.py: task-level constraint index.
scripts/variance_decomp.py: seed-noise vs capability-signal decomposition.
scripts/survival_analysis.py: per-turn failure survival curves.
scripts/snr_weighted_ranking.py: SNR-weighted ranking.

Generated data, traces, and reports are local artifacts and are ignored by Git.

Repository Layout

clawbench/
├── clawbench/                  # Harness, adapters, scoring, diagnostics
├── tasks-public/               # Core v1 public task suite
├── tasks-domain/               # Domain expansion scaffold
├── profiles/                   # Model/profile definitions
├── scripts/                    # Reusable runners and offline analysis
├── tests/                      # Public test suite
├── Dockerfile                  # Public HF Space image
├── Dockerfile.main             # Main-variant public image
├── Dockerfile.openclaw-426-agent-hotfix
├── Dockerfile.clawbench-426-agent-hotfix
├── CLAWBENCH_V0_4_SPEC.md
└── PARTNER_TRACE_SPEC.md

Testing

python -m pytest -q

The test suite includes public-surface checks to keep the README and Space description aligned with tasks-public/MANIFEST.yaml.

License

MIT. See LICENSE.

Citation

@software{clawbench,
  title  = {ClawBench: Trace-Scored Agent Benchmark with Dynamical-Systems Diagnostics},
  author = {ScoootScooob},
  year   = {2026},
  url    = {https://github.com/openclaw/clawbench}
}