* chore: add open-source contribution scaffolding
New files
---------
LICENSE
The README already references this file and the pyproject.toml already
declares `license = "MIT"`, but no actual LICENSE file existed in the
repo. The badge link was pointing at a 404.
CONTRIBUTING.md
Setup instructions, guidance on which contributions are welcome (bug
fixes, new tasks, scoring changes, docs), branch naming convention,
commit style, and a note on adding new tasks with deterministic
completion checks.
.github/ISSUE_TEMPLATE/bug_report.md
.github/ISSUE_TEMPLATE/feature_request.md
Structured templates so bug reports arrive with reproduction steps and
environment info, and feature requests arrive with motivation and
alternatives considered.
.github/PULL_REQUEST_TEMPLATE.md
Lightweight checklist (what / why / changes / tests) that matches the
style of the two bug-fix PRs already merged.
pyproject.toml
Added [project.urls] with Homepage, Repository, and Bug Tracker so the
links appear correctly on PyPI if the package is ever published there.
* docs: align contribution scaffolding
---------
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
is_mutating_shell_command scanned the raw command string against
MUTATING_SHELL_PATTERNS, which includes the bare pattern r">". This
caused any command with a > character inside a quoted argument to be
classified as a file-writing mutation:
    grep "count > 5" logs.txt   → ("edit", True)  # wrong
    python -c "print(1 > 0)"    → ("edit", True)  # wrong
In classify_shell_command, a mutating=True result suppresses both the
READ_ONLY and EXECUTION branches, so these read-only commands fell
through to `return "edit", True` instead of "search" or "execute".
Fix: strip the contents of quoted strings (both double and single
quotes) before scanning for mutation patterns. The redirect operators
that actually matter — `>`, `>>`, `2>`, etc. — always appear outside
quotes in real shell commands, so stripping quote bodies removes the
false positives while preserving all true redirects.
Tests added: read-only commands containing > inside quotes must not be
flagged, and real redirect commands must still be detected.
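
A minimal sketch of the quote-stripping idea described above. The
pattern list is a hypothetical stand-in (only the bare r">" pattern is
confirmed by this message), and the helper name strip_quoted is
illustrative rather than the name used in the codebase. Escaped quotes
inside strings are not handled here.

```python
import re

# Hypothetical stand-in for the real MUTATING_SHELL_PATTERNS; only the bare
# redirect pattern r">" is taken from the commit message.
MUTATING_SHELL_PATTERNS = [r">", r"\brm\b", r"\bmv\b"]

# A double-quoted or single-quoted span (no escape handling in this sketch).
_QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')


def strip_quoted(command: str) -> str:
    """Replace each quoted span with an empty pair of the same quote char."""
    return _QUOTED.sub(lambda m: m.group(0)[0] * 2, command)


def is_mutating_shell_command(command: str) -> bool:
    """Scan for mutation patterns only outside quoted arguments."""
    scannable = strip_quoted(command)
    return any(re.search(p, scannable) for p in MUTATING_SHELL_PATTERNS)
```

With this shape, `grep "count > 5" logs.txt` is scanned as
`grep "" logs.txt`, while `echo "hi" > out.txt` keeps its redirect.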
Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
The default connect_timeout=15.0 is shorter than the
observed first-session setup time against a freshly started
OpenClaw gateway (we've measured phase0_session_setup
~20-25s during containerised benchmark runs), which creates a
race where the client gives up before the gateway is ready for
the first turn. Downstream the adapter then surfaces this as
an ``empty_response`` with zero transcript steps, which looks
like a model failure when it's really an environment timing
issue.
Concrete repro from a 19-task public_dev run:
  task:   t4-life-trip-plan
  failed: reward=0, failure_category=empty_response,
          duration_ms=0, total_ms=16352,
          response hash = SHA256 of empty string
  rerun:  score=0.927 standalone, phase0_session_setup=21.2s
Change:
* GatewayConfig.connect_timeout default 15.0 -> 30.0
* GatewayConfig.request_timeout default kept at 60.0 but
now explicitly documented and overridable for symmetry
* Both are now overridable via environment variables
CLAWBENCH_CONNECT_TIMEOUT / CLAWBENCH_REQUEST_TIMEOUT
so ops can tune further without a code change.
* Invalid env values are logged and fall back to the default
rather than blowing up benchmark runs.
* Adds three unit tests covering default, env override, and
invalid-env fallback behaviour.
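
A minimal sketch of the env-override-with-fallback behaviour listed
above. The helper name timeout_from_env is hypothetical, and treating
non-positive or non-finite values as invalid is an assumption; the
class, field, and variable names follow this message.

```python
import logging
import math
import os
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


def timeout_from_env(name: str, default: float) -> float:
    """Read a timeout (seconds) from the environment; fall back on bad input."""
    raw = os.environ.get(name)
    if raw is None or not raw.strip():
        return default
    try:
        value = float(raw)
    except ValueError:
        value = math.nan
    # Assumption: non-finite and non-positive values are treated as invalid.
    if not math.isfinite(value) or value <= 0:
        logger.warning("Invalid %s=%r; falling back to %.1fs", name, raw, default)
        return default
    return value


@dataclass
class GatewayConfig:
    # Defaults are resolved at construction time so env changes are picked up.
    connect_timeout: float = field(
        default_factory=lambda: timeout_from_env("CLAWBENCH_CONNECT_TIMEOUT", 30.0)
    )
    request_timeout: float = field(
        default_factory=lambda: timeout_from_env("CLAWBENCH_REQUEST_TIMEOUT", 60.0)
    )
```

Resolving the defaults through default_factory rather than at import
time keeps the override testable by setting the variable before
constructing the config.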
Reported-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.
Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.
README.md:
- replaces the "Docker base pinning" row in the "What's new" table
with "Reproducibility-first infrastructure" framing
- replaces the "pinned" badge with a "Diagnostics" badge
- updates "Reproducibility caveats" to recommend "build both sides
of any comparison from the same OpenClaw release" rather than
"pin to 2026.4.15-beta.1"
- updates Quick Start to record (not assume) the OpenClaw version
the build resolved to
- replaces the pinned-base row in the comparison table with
"State-isolation per run" (the infra that actually distinguishes it)
- updates the version log entry for Core v1 to highlight the
dynamical-systems diagnostics + state-isolation rather than the
pinning that's no longer there
tasks-public/README.md:
- replaces the 8-row "Established ranking" table, per request, with
a "Selection criteria" section that explains how the 19 tasks were
chosen (0 inversions, min-gap 0.0049) without publishing
version-dependent scores
- reframes the build instructions to track :latest with a comment
about platform-version drift
tasks-public/MANIFEST.yaml:
- drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
a hard requirement)
- replaces the `established_ranking` block with `selection_basis`,
which documents the methodology and states explicitly why scores
are intentionally omitted
Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md: removed the inline reference leaderboard per user request.
The Core v1 manifest still carries the established ranking, the
README still documents methodology + dynamical-systems diagnostics.
clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/
fallback layer (resolver step 5). Local dev with the private tasks/
present is unchanged; CI without tasks/ now falls back to the public
Core v1 set instead of returning an empty corpus. CI has been broken
since deb3d5d (the "stop tracking current task set" commit); this
restores green CI now that tasks-public/ is available.
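
A rough sketch of the fallback ordering described above, not the actual
_resolve_tasks_dir(): the earlier resolver steps are not spelled out in
this message, so they are collapsed into one hypothetical `explicit`
argument. Only the "private tasks/ first, then tasks-public/" ordering
is taken from the text.

```python
from pathlib import Path
from typing import Optional


def resolve_tasks_dir(repo_root: Path, explicit: Optional[Path] = None) -> Optional[Path]:
    """Return the first existing candidate directory, or None if none exist."""
    candidates = []
    if explicit is not None:
        # Stand-in for the earlier resolver steps (assumed, not from the source).
        candidates.append(explicit)
    candidates.append(repo_root / "tasks")         # private dev set, if present
    candidates.append(repo_root / "tasks-public")  # public Core v1 fallback
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Because the private directory is checked first, local development is
unaffected; the public set is only reached when tasks/ is absent.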
tests/test_tasks.py: three updates so tests pass against either the
private 40-task set OR the public 19-task set:
- test_load_all_tasks_returns_full_corpus: threshold lowered from
>= 20 to >= 19 (Core v1 size)
- test_workspace_setup_preserves_nested_asset_paths: switched from
t1-architecture-brief (private) to t4-browser-research-and-code
(public) which exercises the same flat+nested asset behaviour
- test_selected_tasks_include_judge_rubrics: replaced 3 task IDs
not in the public Core release (t1-architecture-brief,
t5-contradictory-requirements, t5-impossible-graceful-fail) with
public-set equivalents (t1-bugfix-discount, t3-feature-export)
Verified locally in both configurations:
- private tasks/ present: 156 passed, 1 skipped
- private tasks/ hidden: 152 passed, 5 skipped (CI-equivalent)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the front-door README to reflect the Core v1 release and the
methodology innovations we shipped this cycle. Key additions:
- "What's new in Core v1" table highlighting the five methodology
layers most agent benchmarks lack (signal-curated task set,
variance decomposition, dynamical-systems diagnostics, Constraint
Index, Docker base pinning).
- Reference leaderboard — 8-model ranking on the Core-19 set from the
v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility
and the OpenRouter routing issue.
- "What makes ClawBench different" expanded with variance
decomposition (52.7% capability / 47.3% seed noise) and a new
section (#3) on dynamical-systems diagnostics, including the four
concrete signals (C(q), regime, survival, SNR-weighted ranking).
- New "Reproducibility caveats" section — what reproduces (audit,
diagnostics, top-cluster ranking) vs what drifts (absolute scores,
OpenRouter models, OpenClaw platform upgrades). Documents the
pinning we did.
- Updated Quick Start with `docker build -t clawbench:core-v1`
verification flow and a full analysis-pipeline walkthrough using
the new scripts (rejudge_all, compute_constraint_index, etc).
- Repository layout updated to include tasks-public/ (public) and
scripts/ with brief descriptions of all 11 reproducibility +
analysis scripts.
- Comparison table extended with new columns: variance decomposition,
dynamical regime, SNR-weighted alternative, Docker base pinning,
provider-routing caveats — all areas where SWE-bench / HumanEval /
LLM-judge leaderboards are silent.
- Version log + planned Core v2 roadmap (Tier 6 long-horizon,
paraphrased prompt pairs, creative-synthesis, human baseline).
Headline shifts from "the agent benchmark that measures what users
actually experience" to "Rigorous agent evaluation. Signal-curated
tasks. Dynamical-systems diagnostics." — foregrounds the
methodological contributions that separate Core v1 from prior art.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.
Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires re-
running the reference sweep.
tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:
(a) mean run_score across these 19 tasks reproduces the established
8-model ranking with zero inversions and min adjacent-rank gap
of 0.0049 (well above the ~0.002 seed-noise floor);
(b) coverage is preserved across tiers 1-5 and across the tools,
coding, repo, browser, multi_tool, and adversarial families;
(c) tasks with broken verifiers or near-zero cross-model SNR are
dropped (21 tasks retained as private holdout, not published).
Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):
  1. Claude Opus 4.6     0.8137
  2. Claude Opus 4.7     0.7824
  3. GPT 5.4             0.7647
  4. Claude Sonnet 4.6   0.7597
  5. MiniMax M2.7        0.7475
  6. Gemini 3.1 Pro      0.7408
  7. Qwen 3.6 Plus       0.7030
  8. Kimi K2.5           0.6800
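
The zero-inversion and min-gap criteria in (a) can be checked
mechanically. A minimal sketch using the rounded scores listed above;
the function names are hypothetical and not taken from the repo. With
rounded inputs the smallest adjacent gap comes out near 0.0050; the
0.0049 figure presumably reflects unrounded means.

```python
def count_inversions(reference_order, scores):
    """Count model pairs whose order under `scores` contradicts the reference."""
    inversions = 0
    for i, higher in enumerate(reference_order):
        for lower in reference_order[i + 1:]:
            if scores[higher] < scores[lower]:
                inversions += 1
    return inversions


def min_adjacent_gap(scores):
    """Smallest score difference between neighbours in the sorted ranking."""
    ordered = sorted(scores.values(), reverse=True)
    return min(a - b for a, b in zip(ordered, ordered[1:]))


# Rounded scores as listed in this commit message.
ranking = {
    "Claude Opus 4.6": 0.8137,
    "Claude Opus 4.7": 0.7824,
    "GPT 5.4": 0.7647,
    "Claude Sonnet 4.6": 0.7597,
    "MiniMax M2.7": 0.7475,
    "Gemini 3.1 Pro": 0.7408,
    "Qwen 3.6 Plus": 0.7030,
    "Kimi K2.5": 0.6800,
}
reference = list(ranking)  # dict order matches the established ranking
```

A candidate subset would be accepted only if count_inversions returns 0
against the reference order and min_adjacent_gap stays above the
seed-noise floor.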
Deliverables:
tasks-public/MANIFEST.yaml — machine-readable task list + metadata
tasks-public/README.md — rationale, usage, reproducibility notes
tasks-public/tier{1..5}/*.yaml — 19 task definitions
tasks-public/assets/*/ — 19 asset packs (verifiers + fixtures)
The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published:
- 9 ceiling tasks (all frontier models score >0.85)
- 9 noise tasks (cross-model SNR < 0.5)
- 3 ranking-breaker tasks (e.g. t2-node-search-patch,
t5-contradictory-requirements)
Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.
Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.
scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS), where PR is
the participation ratio of the response covariance, entropy is the
eigenvalue entropy, and BOPS is the inter-run cosine similarity
(predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
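
A minimal sketch of the C(q) combination, assuming the usual
definitions PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues)
and Shannon entropy over normalized eigenvalues, with z-scores taken
across tasks using the population standard deviation. These details
are assumptions; the script may differ. Function names are
illustrative.

```python
import math
from statistics import mean, pstdev


def participation_ratio(eigenvalues):
    """Effective number of dimensions carrying the variance."""
    total = sum(eigenvalues)
    return total * total / sum(v * v for v in eigenvalues)


def eigenvalue_entropy(eigenvalues):
    """Shannon entropy (nats) of the normalized eigenvalue spectrum."""
    total = sum(eigenvalues)
    probs = [v / total for v in eigenvalues if v > 0]
    return -sum(p * math.log(p) for p in probs)


def zscores(values):
    """Standardize across tasks; a constant column maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def constraint_index(pr, entropy, bops):
    """C(q) = -z(PR) - z(entropy) + z(BOPS), one value per task."""
    return [-a - b + c
            for a, b, c in zip(zscores(pr), zscores(entropy), zscores(bops))]
```

Tasks with a concentrated spectrum (low PR, low entropy) and high
inter-run similarity end up with the largest C(q).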
scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.
scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
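
A minimal per-task sketch of the decomposition, assuming population
variance and an unweighted mean of per-model seed variances; the
script's exact estimator is not stated here. The function name and
the example scores are illustrative only.

```python
from statistics import mean, pvariance


def decompose(task_scores):
    """Split one task's variance into seed and capability parts.

    task_scores maps model name -> list of run_score values (one per seed).
    Returns (seed_var, cap_var, snr).
    """
    # Seed variance: spread across repeated runs of the same model.
    seed_var = mean([pvariance(runs) for runs in task_scores.values()])
    # Capability variance: spread of the per-model means.
    cap_var = pvariance([mean(runs) for runs in task_scores.values()])
    snr = cap_var / seed_var if seed_var > 0 else float("inf")
    return seed_var, cap_var, snr


# Illustrative numbers, not taken from any sweep.
discriminating = {"model_a": [0.9, 0.7, 0.8], "model_b": [0.5, 0.3, 0.4]}
noisy = {"model_a": [0.9, 0.1, 0.5], "model_b": [0.5, 0.6, 0.4]}
```

In this toy data the first task separates the two models well, while
the second has nearly identical model means and large seed spread, so
its SNR falls below 1.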
scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where assistant emits empty response
or run ends in failure. Reveals long-horizon capability that flat
scores hide: Kimi's median failure turn is 3, while 60% of GPT runs
survive to turn 8.
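
A minimal sketch of the empirical S(t) and h(t), assuming every run is
observed through the full horizon (no censoring of short runs). The
function name and the example failure turns are illustrative.

```python
def survival_curve(failure_turns, horizon):
    """Empirical survival S(t) and hazard h(t) for t = 1..horizon.

    failure_turns holds T_F per run, or None for runs that never failed.
    """
    n = len(failure_turns)
    if n == 0:
        raise ValueError("need at least one run")
    survival, hazard = [], []
    at_risk = n
    for t in range(1, horizon + 1):
        failed_now = sum(1 for f in failure_turns if f == t)
        # Hazard: share of runs still alive entering turn t that fail at t.
        hazard.append(failed_now / at_risk if at_risk else 0.0)
        at_risk -= failed_now
        # Survival: share of all runs still alive after turn t.
        survival.append(at_risk / n)
    return survival, hazard


# Illustrative data: two runs fail at turn 3, one at turn 5, two never fail.
S, h = survival_curve([3, 3, None, 5, None], horizon=5)
```

The median failure turn reported per model corresponds to the first t
at which S(t) drops to 0.5 or below.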
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.
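
A minimal sketch of the weighting, assuming the raw task weight is
SNR × |C(q)|, capped at its 95th percentile (linear interpolation),
and that model scores are weight-normalized means. Names and example
values are illustrative, not taken from the script.

```python
def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def snr_weighted_ranking(task_means, snr, cq):
    """Rank models by the weighted mean of their per-task scores.

    task_means maps model -> {task: mean run_score}; snr and cq map task -> value.
    """
    raw = {task: snr[task] * abs(cq[task]) for task in snr}
    cap = percentile(list(raw.values()), 95)
    # Winsorize so that one extreme task cannot dominate the ranking.
    weights = {task: min(w, cap) for task, w in raw.items()}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("all task weights are zero")
    scores = {
        model: sum(weights[t] * per_task[t] for t in weights) / total
        for model, per_task in task_means.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative: t1 is discriminating, t2 is mostly noise.
ranking = snr_weighted_ranking(
    {"A": {"t1": 0.9, "t2": 0.2}, "B": {"t1": 0.6, "t2": 0.9}},
    snr={"t1": 4.0, "t2": 0.1},
    cq={"t1": 1.0, "t2": -0.5},
)
```

In this toy case B has the higher flat mean, but A ranks first once
the noisy task is down-weighted.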
scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tools for auditing archive coverage, rejudging judge-infra failures
via direct Anthropic API (bypasses the gateway path that sometimes
returns "Gateway is restarting" / empty judge results), and producing
fair multi-model comparison reports.
scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs
and archive JSONs side-by-side. Reports coverage %, clean mean,
coverage-normalized score, infra-zero count, judge-infra remaining
vs rejudged.
scripts/audit_per_run.py: per-run cross-model audit. Flags tasks
where all models score zero (broken task or verifier), verifiers
that reject valid outputs (C=0 although the agent produced text),
harness-error clusters, and model-specific pathologies.
scripts/rejudge_all.py: re-runs judge scoring on archive runs where
the gateway judge failed. Uses the Anthropic SDK directly against
claude-sonnet-4-6, rewrites judge_result fields in place, and
recomputes run_score per the C+T+B+J weighting.
scripts/generate_fair_report.py: produces an 8/9-model comparison
markdown report. Supports --exclude to drop specific models and
headlines the "clean" mean (across 120 archived runs). Reports
per-tier scores, C=1.0 task pass counts, and coverage parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.
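
The actual change lives in a shell script; the following is only a
Python sketch of the same isolation idea, using the entry names listed
above. The function names are hypothetical, and the workspace*/ glob
is simplified to a single workspace/ directory.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Config carried over into the isolated state dir (names from this message).
CONFIG_ENTRIES = ["openclaw.json", "identity", "devices", "exec-approvals.json",
                  "tasks", "subagents", "flows", "cron"]
# Runtime dirs that start empty for every sweep (workspace*/ simplified).
RUNTIME_DIRS = ["agents", "workspace", "logs", "memory", "cache"]


def clone_pristine_state(source: Path, sweep_tag: str) -> Path:
    """Copy config into a fresh temp dir and create empty runtime dirs."""
    target = Path(tempfile.mkdtemp(prefix=f"openclaw-state-{sweep_tag}-"))
    for name in CONFIG_ENTRIES:
        src = source / name
        if src.is_dir():
            shutil.copytree(src, target / name)
        elif src.is_file():
            shutil.copy2(src, target / name)
    for name in RUNTIME_DIRS:
        (target / name).mkdir()
    return target


def gateway_env(state_dir: Path) -> dict:
    """Environment for the gateway process, pointing at the isolated state."""
    return dict(os.environ, OPENCLAW_STATE_DIR=str(state_dir))
```

Stale agents in the shared source directory are never copied, so each
sweep starts with an empty agents/ directory.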
profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus
(no dash). OpenRouter does not recognize the dashed slug and fails
silently; the no-dash slug is the canonical one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context:
The current 40-task set is being split into a private holdout set plus a
new public set. The public repo will ship a different task set that
doesn't give away the holdout; in the meantime, stop tracking the current
tasks/ directory so benchmarking can continue locally without exposing
the set externally.
Changes:
- .gitignore: add tasks/ and lab-pr68627/ (vendored PR content, also
moving out of the public repo).
- git rm --cached tasks/: remove from tracking (files remain on disk
locally).
- tests/test_integration_checks.py:
* Module-level pytest.mark.skipif that skips the whole file when
tasks/ is absent — so CI against the public repo (no tasks)
stays green once the private set moves out.
* Update the t2-node-search-patch fixture to also define emptyNote()
since the task was hardened with that distractor. Without this, the
integration test asserts score==1.0 but gets 0.0 (the new
"emptyNote stays empty" test fails against a fixture that never
defines emptyNote).
Follow-up (separate work):
Public task set lands in a subsequent commit. Holdout access path
(encrypted-in-repo or private-repo) gets wired into the harness's
private_tasks_root / hidden_tasks_dir plumbing.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
All 5 of these tasks were clearing at 0.85-1.00 across the frontier-4
on v4.14 — narrow spread means they don't differentiate models. Each
now has a specific trap that catches naive approaches:
- t1-refactor-csv-loader: introduces divergent normalization requirements
between load_rows (lowercase) and summarize_inventory (preserve first-
seen case). Naive "lowercase everywhere in parse_inventory_row" fails
2 of 3 tests. Proper refactor returns original case in the helper.
- t3-node-multifile-refactor: adds a 3rd caller (audit.js) requiring
preserved userId case + minute-precision timestamp, diverging from
auth.js and report.js. Single-function extraction fails 2 of 4 tests;
agent must handle two normalization modes.
- t4-browser-research-and-code: docs rewritten with distractors —
v1/v2/v3 versions, required/optional/cross-endpoint headers, rate
limits, payload limits. Tests check 6 facts including negative-match
for X-Admin-Token distractor (scoped to /v2/admin only).
- t2-node-search-patch: adds emptyNote() factory in render.js with
legitimate empty body: "" that MUST NOT be patched. Naive grep-replace
of `body: ""` now fails the emptyNote test. Also adds whitespace-
trimming test for filterNotes.
- t4-memory-recall-continuation: requires storing 3 SEPARATE memory
entries (beta-regions, retry-budget, apac-gating) instead of one.
Release notes include operational-notes distractors that must NOT
be codified. flags.py gains APAC_GATED_UNTIL field. Handoff verifier
added to check all 3 facts in the handoff artifact.
All 5 tasks verified: properly implemented starter patches pass all
tests, and the new traps specifically fail naive implementations.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Every one of these tasks has an execution_check script (verify_*.py) that
already does a recursive workspace search — it greps for required content
across every agent-written .md/.txt/.csv regardless of filename. The
completion.files block was redundant and actively penalized models that
wrote to reasonable alternate paths (analysis.md vs budget_report.md).
Before: total=1 (file) + N (exec) → if file path didn't match, score was
capped at N/(N+1). On t3-fin-budget-monthly, 14 of 15 prior sweep runs
failed specifically on "FILE budget_report.md: File does not exist".
After: total=N. Verifier is the source of truth. Judge rubric already
tells graders "don't penalize non-standard paths" — this aligns completion
scoring with that stated policy.
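
The before/after arithmetic can be sketched as follows. This is a
simplified model of completion scoring under the assumption that every
check carries equal weight; the function name is illustrative.

```python
from typing import Optional


def completion_score(exec_passed: int, exec_total: int,
                     file_check: Optional[bool] = None) -> float:
    """Fraction of completion checks passed.

    file_check=None models the new behaviour (no file-path check);
    True/False models the old behaviour with one extra file-path check.
    """
    passed, total = exec_passed, exec_total
    if file_check is not None:
        total += 1
        passed += int(file_check)
    return passed / total


# A run that satisfies all N=4 verifier checks but writes analysis.md
# instead of budget_report.md (illustrative N).
before = completion_score(4, 4, file_check=False)  # capped at N/(N+1)
after = completion_score(4, 4)                      # verifier is the source of truth
```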
Fixed tasks (all had recursive verifiers):
t1-fs-quick-note, t1-life-translate, t2-ctx-pronoun-resolve,
t2-err-instruction-ambig, t2-fs-cleanup-downloads, t2-fs-find-that-thing,
t2-msg-summarize-thread, t2-priv-redact-doc, t2-skill-excel-rollup,
t2-sys-memory-roundtrip, t2-web-quick-fact, t3-cal-reschedule-cascade,
t3-data-sql-query, t3-fin-budget-monthly, t3-msg-inbox-triage,
t3-social-bill-split, t3-web-research-and-cite, t4-ctx-long-recall,
t4-life-trip-plan
Spot-checked that each verifier's required-content set already covered
the content_contains constraints that were also dropped.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
- scripts/_archive_cache.sh: snapshot run_cache/<model>/ to
run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json.
Sourced by sweep scripts so transcripts survive the next sweep's
cache wipe and stay available for audits.
- scripts/container_sweep_single.sh: base multi-model sweep.
Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so
their caches are force-cleared at sweep start. Calls archive helper
on exit.
- scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast
fix validation (~20 min) instead of full 3-run sweep (~60 min).
- Dockerfile.main: parametrized clawbench-on-openclaw image with
ARG BASE for pinning to any openclaw tag.
- scripts/git_checkpoint.py + README: documented checkpoint workflow
for tagging known-good states during risky work.
- .gitignore: un-ignore scripts/, keep targeted ignores for
__pycache__, .tmp, .local.py.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>