openclaw/clawbench

scoootscooob cebd1c8026

chore(repo): clean public benchmark surface

2026-05-02 12:18:58 -07:00

title	emoji	colorFrom	colorTo	sdk	app_port	pinned	license
ClawBench	🦞	red	yellow	docker	7860	true	mit

ClawBench

Execution-first benchmark for AI models acting as OpenClaw agents.

Benchmark Shape

public suite   : Core v1
tasks          : 19
runs/model     : 57 for official Core v1 comparisons
tiers          : 5
browser tasks  : 2
primary metric : trace-scored task score plus reliability

What Gets Scored

Layer	Verification style
Completion	`pytest`, exact output checks, browser flow checks, file checks, and verifier scripts
Trajectory	read-before-write, self-verification, recovery quality, tool-family fit, and safety rules
Behavior	deterministic transcript checks for planning, progress, blockers, and safe handling
Reliability	repeated runs with pass^k, pass rate, and score variance

The advisory judge is optional and cannot replace deterministic verification.
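The reliability row can be illustrated with a minimal sketch. The source does not define how ClawBench computes pass^k, so this assumes the common unbiased estimator C(c, k) / C(n, k) for c passing runs out of n. The function names and the `pass_threshold` parameter are hypothetical, not part of the harness. The 57 runs over 19 tasks would be consistent with three runs per task, which is why the example uses three scores, but the source does not state that split.

```python
from math import comb
from statistics import pvariance


def pass_hat_k(outcomes: list[bool], k: int) -> float:
    """Estimate pass^k: the chance that k runs drawn without replacement all pass.

    Assumed estimator (not confirmed by the source): C(c, k) / C(n, k).
    """
    n, c = len(outcomes), sum(outcomes)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of runs")
    return comb(c, k) / comb(n, k)


def reliability(scores: list[float], pass_threshold: float, k: int) -> dict[str, float]:
    """Summarize repeated runs of one task for one model.

    `pass_threshold` is a hypothetical cutoff that turns a task score into pass/fail.
    """
    outcomes = [score >= pass_threshold for score in scores]
    return {
        "pass_rate": sum(outcomes) / len(outcomes),
        "pass_hat_k": pass_hat_k(outcomes, k),
        "score_variance": pvariance(scores),  # population variance of the raw scores
    }


# Three hypothetical runs of a single task.
summary = reliability([0.9, 0.4, 1.0], pass_threshold=0.8, k=2)
print(summary)
```

Unlike pass rate, pass^k drops quickly when even one run fails, so it rewards models that succeed consistently rather than occasionally.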

Runtime Flow

task yaml + assets
  -> isolated workspace
  -> optional local background services
  -> OpenClaw agent session
  -> transcript + tool-result capture
  -> completion / trajectory / behavior scoring
  -> reliability aggregation
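The flow above can be sketched as a single function that runs one task. This is an illustration under stated assumptions, not the harness code: `RunRecord`, `run_task`, the agent callable, and the scorer signature are all hypothetical, and the optional background services and the reliability aggregation step are omitted.

```python
import tempfile
from collections.abc import Callable
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical signatures: an agent acts inside a workspace and returns transcript lines;
# a scorer inspects the workspace and transcript and returns a score in [0, 1].
Agent = Callable[[Path], list[str]]
Scorer = Callable[[Path, list[str]], float]


@dataclass
class RunRecord:
    task_id: str
    transcript: list[str] = field(default_factory=list)
    scores: dict[str, float] = field(default_factory=dict)


def run_task(task_id: str, assets: dict[str, str], agent: Agent,
             scorers: dict[str, Scorer]) -> RunRecord:
    record = RunRecord(task_id)
    # Isolated workspace: a fresh temporary directory seeded with the task assets.
    with tempfile.TemporaryDirectory(prefix=f"{task_id}-") as tmp:
        workspace = Path(tmp)
        for name, content in assets.items():
            (workspace / name).write_text(content)
        # Agent session with transcript capture.
        record.transcript = agent(workspace)
        # Score each layer while the workspace still exists.
        for layer, scorer in scorers.items():
            record.scores[layer] = scorer(workspace, record.transcript)
    return record


def toy_agent(workspace: Path) -> list[str]:
    """Stand-in for an agent session: reads the task, then writes a note."""
    (workspace / "task.md").read_text()
    (workspace / "note.txt").write_text("done")
    return ["read task.md", "write note.txt"]


toy_scorers: dict[str, Scorer] = {
    # Completion: a file check on the expected output.
    "completion": lambda ws, t: 1.0 if (ws / "note.txt").read_text() == "done" else 0.0,
    # Trajectory: read-before-write, checked on the transcript order.
    "trajectory": lambda ws, t: 1.0 if t and t[0].startswith("read") else 0.0,
}

record = run_task("toy-task", {"task.md": "Write note.txt"}, toy_agent, toy_scorers)
print(record.scores)
```

Scoring inside the `with` block matters in this sketch because the temporary workspace is deleted as soon as the block exits.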

Public Task Inventory

The Space uses `tasks-public/MANIFEST.yaml` as the source of truth. Current Core v1 tasks are:

Task	Tier	Family
`t1-bugfix-discount`	tier1	coding
`t1-fs-quick-note`	tier1	tools
`t2-add-tests-normalizer`	tier2	coding
`t2-browser-form-fix`	tier2	browser
`t2-config-loader`	tier2	repo
`t2-fs-find-that-thing`	tier2	tools
`t2-msg-summarize-thread`	tier2	tools
`t2-priv-redact-doc`	tier2	tools
`t3-data-pipeline-report`	tier3	multi_tool
`t3-data-sql-query`	tier3	tools
`t3-feature-export`	tier3	repo
`t3-msg-inbox-triage`	tier3	tools
`t3-web-research-and-cite`	tier3	tools
`t4-browser-research-and-code`	tier4	browser
`t4-cross-repo-migration`	tier4	repo
`t4-delegation-repair`	tier4	multi_tool
`t4-life-trip-plan`	tier4	tools
`t4-memory-recall-continuation`	tier4	multi_tool
`t5-hallucination-resistant-evidence`	tier5	adversarial
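As a quick consistency check, the table above can be tallied against the figures under Benchmark Shape. The sketch below hardcodes the rows from this README rather than parsing `tasks-public/MANIFEST.yaml`, because the manifest schema is not shown here; the variable names are illustrative.

```python
from collections import Counter

# (task id, tier, family), copied from the Core v1 table in this README.
CORE_V1 = [
    ("t1-bugfix-discount", "tier1", "coding"),
    ("t1-fs-quick-note", "tier1", "tools"),
    ("t2-add-tests-normalizer", "tier2", "coding"),
    ("t2-browser-form-fix", "tier2", "browser"),
    ("t2-config-loader", "tier2", "repo"),
    ("t2-fs-find-that-thing", "tier2", "tools"),
    ("t2-msg-summarize-thread", "tier2", "tools"),
    ("t2-priv-redact-doc", "tier2", "tools"),
    ("t3-data-pipeline-report", "tier3", "multi_tool"),
    ("t3-data-sql-query", "tier3", "tools"),
    ("t3-feature-export", "tier3", "repo"),
    ("t3-msg-inbox-triage", "tier3", "tools"),
    ("t3-web-research-and-cite", "tier3", "tools"),
    ("t4-browser-research-and-code", "tier4", "browser"),
    ("t4-cross-repo-migration", "tier4", "repo"),
    ("t4-delegation-repair", "tier4", "multi_tool"),
    ("t4-life-trip-plan", "tier4", "tools"),
    ("t4-memory-recall-continuation", "tier4", "multi_tool"),
    ("t5-hallucination-resistant-evidence", "tier5", "adversarial"),
]

by_tier = Counter(tier for _, tier, _ in CORE_V1)
by_family = Counter(family for _, _, family in CORE_V1)

# Each task id prefix (t1..t5) should agree with its tier column.
prefix_matches_tier = all(
    task_id.split("-")[0] == "t" + tier.removeprefix("tier")
    for task_id, tier, _ in CORE_V1
)

print(len(CORE_V1), dict(by_tier), dict(by_family), prefix_matches_tier)
```

The tally gives 19 tasks across 5 tiers, with 2 tasks in the browser family, matching the summary figures.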

Holdout Policy

Private task bodies, assets, expected outputs, verifier details, run traces, logs, and per-task private reports are not part of the public Space. Public Core v1 is intended for reproducibility and development; hidden-suite runs use the same harness with a private task directory restored locally.
