[BREAKGLASS] The agent benchmark that scores the full stack — harness, config, and model — not just the LLM. Trace-based scoring, reliability metrics, configuration diagnostics. https://huggingface.co/spaces/ScoootScooob/clawbench

---
title: ClawBench
emoji: 🦞
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: true
license: mit
---

ClawBench

Trace-scored agent evaluation for OpenClaw.

Python 3.11+ | License: MIT | Core v1: 19 tasks | HF Dataset


What This Repo Contains

ClawBench evaluates AI agents by running real local tasks, capturing the execution trace, and scoring both the final state and the process used to get there.

The public repository contains:

  • tasks-public/: Core v1, a 19-task public reproducibility suite.
  • clawbench/: the benchmark harness, adapters, canonical task conversion, scoring, statistics, and diagnostics.
  • profiles/: example model/profile definitions.
  • scripts/: reusable analysis and container runner utilities.
  • tests/: unit and integration coverage for the public harness.

The private holdout is intentionally not included:

  • private task YAML files,
  • private task assets and verifier scripts,
  • private expected outputs,
  • private run traces, logs, and per-task reports.

Internal hidden-suite runs can restore a private tasks/ directory locally. The public code is designed to run without that directory by falling back to tasks-public/.
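
The fallback described above can be pictured with a minimal sketch. The helper name `resolve_tasks_dir` and the lookup order are assumptions for illustration only; the harness's actual resolution logic may differ.

```python
from pathlib import Path


def resolve_tasks_dir(root: Path) -> Path:
    """Prefer a locally restored private tasks/ directory, else tasks-public/.

    Hypothetical helper: the name and lookup order are assumptions,
    not the harness's real implementation.
    """
    private_dir = root / "tasks"
    public_dir = root / "tasks-public"
    # Use the private directory only when it has actually been restored.
    if private_dir.is_dir():
        return private_dir
    return public_dir
```

In a fresh public checkout there is no tasks/ directory, so a helper like this would return tasks-public/.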

Core v1

Core v1 is a signal-curated 19-task public release selected from the internal development pool. It preserves tier and family coverage while avoiding tasks whose public release would leak holdout material or add mostly run-to-run noise.

| Dimension | Breakdown |
|---|---|
| Tasks | 19 |
| Runs per official comparison | 3 per task |
| Total runs per model | 57 |
| Tiers | T1=2, T2=6, T3=5, T4=5, T5=1 |
| Families | tools=8, coding=2, repo=3, browser=2, multi_tool=3, adversarial=1 |
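
As a quick sanity check on the breakdown above, the tier and family counts should each sum to 19, and three runs per task should give 57 runs per model. The snippet below only restates the figures from the table; the variable names are illustrative.

```python
# Figures copied from the Core v1 breakdown table.
TIERS = {"T1": 2, "T2": 6, "T3": 5, "T4": 5, "T5": 1}
FAMILIES = {
    "tools": 8, "coding": 2, "repo": 3,
    "browser": 2, "multi_tool": 3, "adversarial": 1,
}
TASKS = 19
RUNS_PER_TASK = 3

# Both breakdowns must partition the same 19 tasks.
assert sum(TIERS.values()) == TASKS
assert sum(FAMILIES.values()) == TASKS

total_runs = TASKS * RUNS_PER_TASK
print(total_runs)  # 57
```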

The manifest is the source of truth:

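python3 - <<'PY'
# Print every task ID listed in the public manifest (requires PyYAML).
import yaml

with open("tasks-public/MANIFEST.yaml", encoding="utf-8") as fh:
    manifest = yaml.safe_load(fh)
for task in manifest["tasks"]:
    print(task["id"])
PY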

Scoring

Each run is scored from four signals:

| Axis | Weight | What it measures |
|---|---|---|
| Completion | 40% | Deterministic task checks such as tests, exact outputs, DOM assertions, and file verification |
| Trajectory | 30% | Tool-use quality such as read-before-write, self-verification, recovery, and tool-family fit |
| Behavior | 20% | Planning, progress updates, blocker handling, and destructive-command avoidance |
| Judge | Up to 10% | Optional semantic quality, gated so it cannot rescue failed deterministic checks |
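
One way to read the weights above is as a linear combination in which the judge axis contributes only when the deterministic checks passed. The sketch below is an assumption about how such gating could work, not the harness's actual scoring code; `AxisScores`, `combine`, and the `completion_passed` flag are hypothetical names.

```python
from dataclasses import dataclass

# Weights taken from the scoring table; each axis score is assumed to lie in [0, 1].
WEIGHTS = {"completion": 0.40, "trajectory": 0.30, "behavior": 0.20, "judge": 0.10}


@dataclass
class AxisScores:
    completion: float
    trajectory: float
    behavior: float
    judge: float = 0.0               # optional axis, defaults to no contribution
    completion_passed: bool = False  # hypothetical gate flag


def combine(scores: AxisScores) -> float:
    """Weighted sum with the judge axis gated on deterministic completion."""
    # Assumed gating rule: a failed completion check zeroes the judge term,
    # so semantic quality cannot rescue a failed run.
    judge = scores.judge if scores.completion_passed else 0.0
    return (
        WEIGHTS["completion"] * scores.completion
        + WEIGHTS["trajectory"] * scores.trajectory
        + WEIGHTS["behavior"] * scores.behavior
        + WEIGHTS["judge"] * judge
    )
```

Under this reading, a run that fails its deterministic checks gets no credit from the judge axis regardless of the judge's opinion.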

Reliability is a first-class metric. Official comparisons run each task three times and report per-task variance, pass rate, pass^k, confidence intervals, and worst-of-n-style robustness signals.
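
To make these reliability terms concrete, here is a small sketch of how they could be computed for one task from its repeated runs. The choice of estimators is an assumption: pass^k uses the combinatorial estimator C(c, k) / C(n, k) for c passes out of n runs, and the confidence interval is a Wilson score interval. ClawBench's own statistics module may use different definitions.

```python
import math
from statistics import mean, pvariance


def pass_rate(passes: list[bool]) -> float:
    """Fraction of runs that passed."""
    return mean(1.0 if p else 0.0 for p in passes)


def pass_hat_k(passes: list[bool], k: int) -> float:
    """Estimated probability that k runs drawn without replacement all pass."""
    n, c = len(passes), sum(passes)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of runs")
    return math.comb(c, k) / math.comb(n, k)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def worst_of_n(scores: list[float]) -> float:
    """Most pessimistic score across the repeated runs."""
    return min(scores)


def run_variance(scores: list[float]) -> float:
    """Population variance of the per-run scores for one task."""
    return pvariance(scores)
```

With three runs of which two pass, `pass_hat_k(passes, 3)` is zero even though the pass rate is two thirds, which is why pass^k is a stricter signal than the average.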

Quick Start

Install locally:

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .

List public tasks:

clawbench list-tasks --tasks-dir tasks-public

Run a small public smoke test:

export OPENCLAW_GATEWAY_TOKEN="<your-token>"  # replace with your gateway token

clawbench run \
  --model anthropic/claude-opus-4-6 \
  --runs 1 \
  --task t1-bugfix-discount \
  --task t1-fs-quick-note \
  --output results/public_smoke.json

Run the full Core v1 task list:

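TASK_ARGS=$(python3 - <<'PY'
# Emit one "--task <id>" pair per manifest entry, space-separated.
import yaml

with open("tasks-public/MANIFEST.yaml", encoding="utf-8") as fh:
    manifest = yaml.safe_load(fh)
print(" ".join(f"--task {task['id']}" for task in manifest["tasks"]))
PY
)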

clawbench run \
  --model anthropic/claude-opus-4-6 \
  --runs 3 \
  --concurrency 4 \
  $TASK_ARGS \
  --output results/core_v1_opus46.json

Build the public Space image:

docker build -t clawbench .
docker run --rm --entrypoint openclaw clawbench --version

Hidden-Suite Reproduction

The hidden full-suite runner is public, but the task content is not. To rerun an internal hidden-suite comparison, restore the private task archive into ./tasks/ before building the hidden eval image. Do not commit that directory, its logs, or generated per-task traces.

docker build -f Dockerfile.openclaw-426-agent-hotfix \
  -t openclaw-426-agent-hotfix:latest .

docker build -f Dockerfile.clawbench-426-agent-hotfix \
  -t clawbench-openclaw-426-agent-hotfix:latest .

The public repo intentionally does not include exact private task IDs, prompts, assets, expected artifacts, or trace-derived private reports.

Analysis Tools

Reusable scripts that operate on public or private result archives:

  • scripts/container_lane_eval.sh: isolated OpenClaw lane runner.
  • scripts/container_adapter_eval.sh: adapter/model runner for fair adapter comparisons.
  • scripts/run_posterior_dynamics_pipeline.py: one-shot offline dynamics analysis.
  • scripts/compute_constraint_index.py: task-level constraint index.
  • scripts/variance_decomp.py: seed-noise vs capability-signal decomposition.
  • scripts/survival_analysis.py: per-turn failure survival curves.
  • scripts/snr_weighted_ranking.py: SNR-weighted ranking.
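
To illustrate the idea behind a seed-noise versus capability-signal split, the sketch below treats the average within-task run variance as noise and the variance of per-model mean scores as signal. This is an assumed, simplified decomposition for illustration only. It is not the method implemented in scripts/variance_decomp.py or scripts/snr_weighted_ranking.py, and the function name and input layout are hypothetical.

```python
from statistics import mean, pvariance

# Hypothetical input layout: model name -> task id -> list of per-run scores.
Scores = dict[str, dict[str, list[float]]]


def decompose(scores: Scores) -> dict[str, float]:
    """Split score variation into run-to-run noise and between-model signal."""
    # Noise: average variance across repeated runs of the same (model, task).
    within = [
        pvariance(runs)
        for tasks in scores.values()
        for runs in tasks.values()
    ]
    noise = mean(within)

    # Signal: variance of each model's mean score across its tasks.
    model_means = [
        mean(mean(runs) for runs in tasks.values())
        for tasks in scores.values()
    ]
    signal = pvariance(model_means)

    snr = signal / noise if noise > 0 else float("inf")
    return {"noise": noise, "signal": signal, "snr": snr}
```

A task set where models separate cleanly and repeated runs agree yields a high signal-to-noise ratio; one where repeated runs disagree as much as models do yields a ratio near or below one.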

Generated data, traces, and reports are local artifacts and are ignored by Git.

Repository Layout

clawbench/
├── clawbench/                  # Harness, adapters, scoring, diagnostics
├── tasks-public/               # Core v1 public task suite
├── tasks-domain/               # Domain expansion scaffold
├── profiles/                   # Model/profile definitions
├── scripts/                    # Reusable runners and offline analysis
├── tests/                      # Public test suite
├── Dockerfile                  # Public HF Space image
├── Dockerfile.main             # Main-variant public image
├── Dockerfile.openclaw-426-agent-hotfix
├── Dockerfile.clawbench-426-agent-hotfix
├── CLAWBENCH_V0_4_SPEC.md
└── PARTNER_TRACE_SPEC.md

Testing

python -m pytest -q

The test suite includes public-surface checks to keep the README and Space description aligned with tasks-public/MANIFEST.yaml.
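
A public-surface check of this kind might look like the following sketch. It is not the repository's actual test. The helper names are hypothetical, and it assumes the README states the task count in the form "19-task" and that the manifest has a top-level `tasks` list, as in the manifest snippets elsewhere in this README.

```python
import re


def claimed_task_count(readme_text: str) -> int:
    """Extract the task count from a phrase such as '19-task' in the README."""
    match = re.search(r"(\d+)-task", readme_text)
    if match is None:
        raise ValueError("README does not state a task count")
    return int(match.group(1))


def readme_matches_manifest(readme_text: str, manifest: dict) -> bool:
    """True when the README's claimed count equals the manifest's task list length."""
    return claimed_task_count(readme_text) == len(manifest["tasks"])
```

In a real test the README text and manifest would be loaded from README.md and tasks-public/MANIFEST.yaml before calling `readme_matches_manifest`.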

License

MIT. See LICENSE.

Citation

@software{clawbench,
  title  = {ClawBench: Trace-Scored Agent Benchmark with Dynamical-Systems Diagnostics},
  author = {ScoootScooob},
  year   = {2026},
  url    = {https://github.com/openclaw/clawbench}
}