clawbench/SPACE_README.md
scoootscooob cebd1c8026
chore(repo): clean public benchmark surface
2026-05-02 12:18:58 -07:00

| title | emoji | colorFrom | colorTo | sdk | app_port | pinned | license |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ClawBench | 🦞 | red | yellow | docker | 7860 | true | mit |

ClawBench

Execution-first benchmark for AI models acting as OpenClaw agents.

Benchmark Shape

public suite   : Core v1
tasks          : 19
runs/model     : 57 for official Core v1 comparisons
tiers          : 5
browser tasks  : 2
primary metric : trace-scored task score plus reliability

What Gets Scored

| Layer | Verification style |
| --- | --- |
| Completion | pytest, exact output checks, browser flow checks, file checks, and verifier scripts |
| Trajectory | read-before-write, self-verification, recovery quality, tool-family fit, and safety rules |
| Behavior | deterministic transcript checks for planning, progress, blockers, and safe handling |
| Reliability | repeated runs with pass^k, pass rate, and score variance |
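As an illustration of the Trajectory layer, a read-before-write rule can be checked deterministically over a recorded tool-call trace. The sketch below is only an assumption about how such a check might look: the trace schema (`tool` and `path` keys), the function name `writes_without_prior_read`, and the choice to exempt newly created files are all hypothetical and not taken from the ClawBench harness.

```python
from typing import Iterable


def writes_without_prior_read(
    trace: Iterable[dict], preexisting: set[str]
) -> list[str]:
    """Return paths that were written before ever being read.

    `trace` is a hypothetical list of tool calls, each a dict with
    "tool" ("read" or "write") and "path". Only files that existed
    before the session (`preexisting`) are checked, so creating a
    brand-new file is not counted as a violation.
    """
    seen_reads: set[str] = set()
    violations: list[str] = []
    for call in trace:
        path = call["path"]
        if call["tool"] == "read":
            seen_reads.add(path)
        elif call["tool"] == "write":
            if path in preexisting and path not in seen_reads:
                violations.append(path)
    return violations


example_trace = [
    {"tool": "read", "path": "a.py"},
    {"tool": "write", "path": "a.py"},     # fine: read first
    {"tool": "write", "path": "b.py"},     # flagged: never read
    {"tool": "write", "path": "new.txt"},  # fine: new file
]
print(writes_without_prior_read(example_trace, {"a.py", "b.py"}))
```

A rule like this needs no model judgment, which is what makes it usable as a deterministic trajectory signal.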

The advisory judge is optional and cannot replace deterministic verification.
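The reliability metrics can be sketched in a few lines of Python. This is a minimal sketch, not the harness implementation: it assumes pass^k means the probability that k runs drawn from the recorded runs all pass, estimated as C(c, k) / C(n, k) for c passes out of n runs, and it assumes population variance for scores. The function names are hypothetical. The 57 runs per model listed under Benchmark Shape equal 19 tasks × 3 runs, so k = 3 is a plausible reading, but that is an inference rather than something stated here.

```python
from math import comb
from statistics import pvariance


def pass_rate(passes: list[bool]) -> float:
    """Fraction of runs that passed."""
    return sum(passes) / len(passes)


def pass_hat_k(passes: list[bool], k: int) -> float:
    """Estimate pass^k: the chance that k runs sampled without
    replacement from the recorded runs all pass (assumed definition)."""
    n = len(passes)
    c = sum(passes)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of runs")
    return comb(c, k) / comb(n, k)


def score_variance(scores: list[float]) -> float:
    """Population variance of per-run task scores (assumed convention)."""
    return pvariance(scores)


# Three runs of one task: two passes and one failure.
runs = [True, True, False]
print(pass_rate(runs))      # 2 of 3 runs passed
print(pass_hat_k(runs, 3))  # 0.0, because one of the three runs failed
```

With this definition, pass^k rewards consistency: a task that passes two runs out of three has a pass rate of about 0.67 but a pass^3 of zero.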

Runtime Flow

task yaml + assets
  -> isolated workspace
  -> optional local background services
  -> OpenClaw agent session
  -> transcript + tool-result capture
  -> completion / trajectory / behavior scoring
  -> reliability aggregation
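The flow above can be read as a linear pipeline in which one stage, the local background services, is optional. The sketch below only models that ordering. `RunRecord`, `STAGES`, `run_task`, and the `needs_services` flag are hypothetical names for illustration and do not describe the actual harness API.

```python
from dataclasses import dataclass, field

# (stage name, optional?) in the order given by the runtime flow.
STAGES: list[tuple[str, bool]] = [
    ("isolated workspace", False),
    ("local background services", True),
    ("OpenClaw agent session", False),
    ("transcript + tool-result capture", False),
    ("completion / trajectory / behavior scoring", False),
    ("reliability aggregation", False),
]


@dataclass
class RunRecord:
    """Hypothetical record of which stages ran for one task."""
    task_id: str
    stages_done: list[str] = field(default_factory=list)


def run_task(task_id: str, needs_services: bool = False) -> RunRecord:
    """Walk the stages in order, skipping optional ones that are not needed."""
    record = RunRecord(task_id)
    for name, optional in STAGES:
        if optional and not needs_services:
            continue
        # A real harness would do the stage's work here; the sketch
        # only records that the stage was reached.
        record.stages_done.append(name)
    return record


print(run_task("t1-fs-quick-note").stages_done)
print(run_task("t2-browser-form-fix", needs_services=True).stages_done)
```

Whether a given task needs background services is an assumption in this example; the task YAML would be the place to look for that.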

Public Task Inventory

The Space uses tasks-public/MANIFEST.yaml as the source of truth. Current Core v1 tasks are:

| Task | Tier | Family |
| --- | --- | --- |
| t1-bugfix-discount | tier1 | coding |
| t1-fs-quick-note | tier1 | tools |
| t2-add-tests-normalizer | tier2 | coding |
| t2-browser-form-fix | tier2 | browser |
| t2-config-loader | tier2 | repo |
| t2-fs-find-that-thing | tier2 | tools |
| t2-msg-summarize-thread | tier2 | tools |
| t2-priv-redact-doc | tier2 | tools |
| t3-data-pipeline-report | tier3 | multi_tool |
| t3-data-sql-query | tier3 | tools |
| t3-feature-export | tier3 | repo |
| t3-msg-inbox-triage | tier3 | tools |
| t3-web-research-and-cite | tier3 | tools |
| t4-browser-research-and-code | tier4 | browser |
| t4-cross-repo-migration | tier4 | repo |
| t4-delegation-repair | tier4 | multi_tool |
| t4-life-trip-plan | tier4 | tools |
| t4-memory-recall-continuation | tier4 | multi_tool |
| t5-hallucination-resistant-evidence | tier5 | adversarial |
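The inventory can be cross-checked against the Benchmark Shape figures (19 tasks, 5 tiers, 2 browser tasks) with a short tally. The sketch below hard-codes the rows from the table above rather than reading tasks-public/MANIFEST.yaml, because the manifest schema is not shown here. The `CORE_V1` name and the rule that a task ID prefix `tN-` should match `tierN` are assumptions made for this check.

```python
from collections import Counter

# (task, tier, family) rows copied from the Core v1 inventory table.
CORE_V1 = [
    ("t1-bugfix-discount", "tier1", "coding"),
    ("t1-fs-quick-note", "tier1", "tools"),
    ("t2-add-tests-normalizer", "tier2", "coding"),
    ("t2-browser-form-fix", "tier2", "browser"),
    ("t2-config-loader", "tier2", "repo"),
    ("t2-fs-find-that-thing", "tier2", "tools"),
    ("t2-msg-summarize-thread", "tier2", "tools"),
    ("t2-priv-redact-doc", "tier2", "tools"),
    ("t3-data-pipeline-report", "tier3", "multi_tool"),
    ("t3-data-sql-query", "tier3", "tools"),
    ("t3-feature-export", "tier3", "repo"),
    ("t3-msg-inbox-triage", "tier3", "tools"),
    ("t3-web-research-and-cite", "tier3", "tools"),
    ("t4-browser-research-and-code", "tier4", "browser"),
    ("t4-cross-repo-migration", "tier4", "repo"),
    ("t4-delegation-repair", "tier4", "multi_tool"),
    ("t4-life-trip-plan", "tier4", "tools"),
    ("t4-memory-recall-continuation", "tier4", "multi_tool"),
    ("t5-hallucination-resistant-evidence", "tier5", "adversarial"),
]

tier_counts = Counter(tier for _, tier, _ in CORE_V1)
family_counts = Counter(family for _, _, family in CORE_V1)

# Assumed naming rule: a task in tierN has an ID starting with "tN-".
mismatched = [
    task
    for task, tier, _ in CORE_V1
    if not task.startswith(f"t{tier.removeprefix('tier')}-")
]

print(len(CORE_V1), len(tier_counts), family_counts["browser"])  # 19 5 2
print(dict(tier_counts))
print(mismatched)  # []
```

The totals agree with the Benchmark Shape block: 19 tasks across 5 tiers, 2 of them in the browser family.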

Holdout Policy

Private task bodies, assets, expected outputs, verifier details, run traces, logs, and per-task private reports are not part of the public Space. Public Core v1 is intended for reproducibility and development; hidden-suite runs use the same harness with a private task directory restored locally.