The default connect_timeout=15.0 is shorter than the
observed first-session setup time against a freshly started
OpenClaw gateway (we've measured phase0_session_setup
~20-25s during containerised benchmark runs), which creates a
race where the client gives up before the gateway is ready for
the first turn. Downstream the adapter then surfaces this as
an ``empty_response`` with zero transcript steps, which looks
like a model failure when it's really an environment timing
issue.
Concrete repro from a 19-task public_dev run:
task: t4-life-trip-plan
failed: reward=0, failure_category=empty_response,
duration_ms=0, total_ms=16352, response hash
= SHA256 of empty string
rerun: score=0.927 standalone, phase0_session_setup=21.2s
Change:
* GatewayConfig.connect_timeout default 15.0 -> 30.0
* GatewayConfig.request_timeout default kept at 60.0 but
now explicitly documented and overridable for symmetry
* Both are now overridable via environment variables
CLAWBENCH_CONNECT_TIMEOUT / CLAWBENCH_REQUEST_TIMEOUT
so ops can tune further without a code change.
* Invalid env values are logged and fall back to the default
rather than blowing up benchmark runs.
* Adds three unit tests covering default, env override, and
invalid-env fallback behaviour.
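A minimal sketch of the env-override behaviour described above. The helper name env_float and the rule that non-positive or non-finite values count as invalid are assumptions for illustration; only the two variable names and the 30.0 / 60.0 defaults come from this change.

```python
import logging
import math
import os
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


def env_float(name: str, default: float) -> float:
    """Read a timeout from the environment; log and fall back on bad input."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        logger.warning("Ignoring invalid %s=%r; using %.1f", name, raw, default)
        return default
    # Assumption: zero, negative, NaN and infinite timeouts are also invalid.
    if not math.isfinite(value) or value <= 0:
        logger.warning("Ignoring out-of-range %s=%r; using %.1f", name, raw, default)
        return default
    return value


@dataclass
class GatewayConfig:
    # default_factory defers the env lookup to construction time, so tests
    # can change the environment between instances.
    connect_timeout: float = field(
        default_factory=lambda: env_float("CLAWBENCH_CONNECT_TIMEOUT", 30.0)
    )
    request_timeout: float = field(
        default_factory=lambda: env_float("CLAWBENCH_REQUEST_TIMEOUT", 60.0)
    )
```

With this shape, exporting CLAWBENCH_CONNECT_TIMEOUT=45 before a run raises the connect budget without touching code, and a typo in the value degrades to the default instead of aborting the sweep.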
Reported-by: Grzegorz Chlebus <gchlebus@nvidia.com>
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.
Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.
README.md:
- drops the "Docker base pinning" row from the "What's new" table;
replaces it with "Reproducibility-first infrastructure" framing
- drops the "pinned" badge; adds a "Diagnostics" badge instead
- updates "Reproducibility caveats" to recommend "build both sides
of any comparison from the same OpenClaw release" rather than
"pin to 2026.4.15-beta.1"
- updates Quick Start to record (not assume) the OpenClaw version
the build resolved to
- drops the pinned-base row from the comparison table; replaces it
with "State-isolation per run" (the actually distinguishing infra)
- updates the version log entry for Core v1 to highlight the
dynamical-systems diagnostics + state-isolation rather than the
pinning that's no longer there
tasks-public/README.md:
- drops the 8-row "Established ranking" table per request
- replaces it with a "Selection criteria" section that explains how
the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
publishing version-dependent scores
- reframes the build instructions to track :latest with a comment
about platform-version drift
tasks-public/MANIFEST.yaml:
- drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
a hard requirement)
- drops the `established_ranking` block
- replaces it with `selection_basis`, which documents the methodology
and explicitly states why scores are intentionally omitted
Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
README.md: removed the inline reference leaderboard per user request.
The Core v1 manifest still carries the established ranking, and the
README still documents methodology + dynamical-systems diagnostics.
clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/
fallback layer (resolver step 5). Local dev with the private tasks/
present is unchanged; CI without tasks/ now falls back to the public
Core v1 set instead of returning an empty corpus. CI has been broken
since deb3d5d (the "stop tracking current task set" commit); this
restores green CI now that tasks-public/ is available.
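A rough sketch of the layered lookup with the new fallback appended. The first four layers follow the resolver order introduced when the package was made pip-installable (env var, sibling-of-source, cwd, Docker layout candidates); the function names, the docker_candidates parameter, and the assumption that tasks-public/ sits next to the source root are illustrative, not the actual _resolve_tasks_dir() code.

```python
import os
from pathlib import Path
from typing import Iterator, Optional, Sequence

DOCKER_CANDIDATES = ("/home/node/app/tasks", "/home/user/app/tasks", "/app/tasks")


def candidate_dirs(
    source_root: Path,
    cwd: Path,
    docker_candidates: Sequence[str] = DOCKER_CANDIDATES,
) -> Iterator[Path]:
    """Yield task-directory candidates in priority order."""
    env = os.environ.get("CLAWBENCH_TASKS_DIR")
    if env:
        yield Path(env)                   # 1. explicit override
    yield source_root / "tasks"           # 2. sibling of the source checkout
    yield cwd / "tasks"                   # 3. current working directory
    for candidate in docker_candidates:   # 4. known Docker layouts
        yield Path(candidate)
    # 5. public Core v1 set (location relative to source_root is an assumption)
    yield source_root / "tasks-public"


def resolve_tasks_dir(
    source_root: Path,
    cwd: Path,
    docker_candidates: Sequence[str] = DOCKER_CANDIDATES,
) -> Optional[Path]:
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidate_dirs(source_root, cwd, docker_candidates):
        if candidate.is_dir():
            return candidate
    return None
```

Because the public set is tried last, a checkout that still has the private tasks/ directory resolves exactly as before.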
tests/test_tasks.py: three updates so tests pass against either the
private 40-task set OR the public 19-task set:
- test_load_all_tasks_returns_full_corpus: threshold lowered from
>= 20 to >= 19 (Core v1 size)
- test_workspace_setup_preserves_nested_asset_paths: switched from
t1-architecture-brief (private) to t4-browser-research-and-code
(public) which exercises the same flat+nested asset behaviour
- test_selected_tasks_include_judge_rubrics: replaced 3 task IDs
not in the public Core release (t1-architecture-brief,
t5-contradictory-requirements, t5-impossible-graceful-fail) with
public-set equivalents (t1-bugfix-discount, t3-feature-export)
Verified locally with both branches:
- private tasks/ present: 156 passed, 1 skipped
- private tasks/ hidden: 152 passed, 5 skipped (CI-equivalent)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates the front-door README to reflect the Core v1 release and the
methodology innovations we shipped this cycle. Key additions:
- "What's new in Core v1" table highlighting the five methodology
layers most agent benchmarks lack (signal-curated task set,
variance decomposition, dynamical-systems diagnostics, Constraint
Index, Docker base pinning).
- Reference leaderboard — 8-model ranking on the Core-19 set from the
v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility
and the OpenRouter routing issue.
- "What makes ClawBench different" expanded with variance
decomposition (52.7% capability / 47.3% seed noise) and a new
section (#3) on dynamical-systems diagnostics, including the four
concrete signals (C(q), regime, survival, SNR-weighted ranking).
- New "Reproducibility caveats" section — what reproduces (audit,
diagnostics, top-cluster ranking) vs what drifts (absolute scores,
OpenRouter models, OpenClaw platform upgrades). Documents the
pinning we did.
- Updated Quick Start with `docker build -t clawbench:core-v1`
verification flow and a full analysis-pipeline walkthrough using
the new scripts (rejudge_all, compute_constraint_index, etc).
- Repository layout updated to include tasks-public/ (public) and
scripts/ with brief descriptions of all 11 reproducibility +
analysis scripts.
- Comparison table extended with new columns: variance decomposition,
dynamical regime, SNR-weighted alternative, Docker base pinning,
provider-routing caveats — all areas where SWE-bench / HumanEval /
LLM-judge leaderboards are silent.
- Version log + planned Core v2 roadmap (Tier 6 long-horizon,
paraphrased prompt pairs, creative-synthesis, human baseline).
Headline shifts from "the agent benchmark that measures what users
actually experience" to "Rigorous agent evaluation. Signal-curated
tasks. Dynamical-systems diagnostics." — foregrounds the
methodological contributions that separate Core v1 from prior art.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.
Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires re-
running the reference sweep.
tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:
(a) mean run_score across these 19 tasks reproduces the established
8-model ranking with zero inversions and min adjacent-rank gap
of 0.0049 (well above the ~0.002 seed-noise floor);
(b) coverage is preserved across tiers 1-5 and across the tools,
coding, repo, browser, multi_tool, and adversarial families;
(c) tasks with broken verifiers or near-zero cross-model SNR are
dropped (21 tasks retained as private holdout, not published).
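Criterion (a) can be checked mechanically. The sketch below is illustrative only: the function names are hypothetical, ties are not counted as inversions, and it does not reproduce the greedy elimination itself.

```python
from typing import Dict, List


def count_inversions(reference: List[str], scores: Dict[str, float]) -> int:
    """Count model pairs that the subset scores order differently from the
    reference ranking (reference[0] is the top-ranked model)."""
    inversions = 0
    for i, better in enumerate(reference):
        for worse in reference[i + 1:]:
            if scores[better] < scores[worse]:
                inversions += 1
    return inversions


def min_adjacent_gap(scores: Dict[str, float]) -> float:
    """Smallest score difference between neighbours in the sorted ranking."""
    ordered = sorted(scores.values(), reverse=True)
    return min(hi - lo for hi, lo in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    # Toy numbers, not benchmark results.
    reference = ["model-a", "model-b", "model-c"]
    subset_scores = {"model-a": 0.90, "model-b": 0.70, "model-c": 0.80}
    print(count_inversions(reference, subset_scores))
    print(min_adjacent_gap(subset_scores))
```

A candidate subset passes (a) when count_inversions is 0 and min_adjacent_gap stays above the seed-noise floor.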
Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):
1. Claude Opus 4.6 0.8137
2. Claude Opus 4.7 0.7824
3. GPT 5.4 0.7647
4. Claude Sonnet 4.6 0.7597
5. MiniMax M2.7 0.7475
6. Gemini 3.1 Pro 0.7408
7. Qwen 3.6 Plus 0.7030
8. Kimi K2.5 0.6800
Deliverables:
tasks-public/MANIFEST.yaml — machine-readable task list + metadata
tasks-public/README.md — rationale, usage, reproducibility notes
tasks-public/tier{1..5}/*.yaml — 19 task definitions
tasks-public/assets/*/ — 19 asset packs (verifiers + fixtures)
The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published:
- 9 ceiling tasks (all frontier models score >0.85)
- 9 noise tasks (cross-model SNR < 0.5)
- 3 ranking-breaker tasks (e.g. t2-node-search-patch,
t5-contradictory-requirements)
Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.
Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.
scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is
participation ratio of response covariance, entropy is eigenvalue
entropy, and BOPS is inter-run cosine (predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
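A dependency-free sketch of the C(q) formula. It assumes the per-task covariance eigenvalues and BOPS values have already been computed, and it reads z(.) as a z-score taken across tasks; both are assumptions about the script, and the function names are illustrative.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def participation_ratio(eigs: Sequence[float]) -> float:
    """PR = (sum of eigenvalues)^2 / sum of squared eigenvalues."""
    total = sum(eigs)
    return total * total / sum(e * e for e in eigs)


def eigen_entropy(eigs: Sequence[float]) -> float:
    """Shannon entropy of the normalised eigenvalue spectrum."""
    total = sum(eigs)
    probs = [e / total for e in eigs if e > 0]
    return -sum(p * math.log(p) for p in probs)


def zscores(values: Sequence[float]) -> List[float]:
    """Population z-scores; all zeros when the values do not vary."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


def constraint_index(
    pr: Sequence[float], entropy: Sequence[float], bops: Sequence[float]
) -> List[float]:
    """C(q) = -z(PR) - z(entropy) + z(BOPS), one value per task."""
    return [
        -z_pr - z_ent + z_bops
        for z_pr, z_ent, z_bops in zip(zscores(pr), zscores(entropy), zscores(bops))
    ]
```

A flat spectrum (many equally strong directions) gives high PR and entropy and therefore low C(q); a spectrum dominated by one direction with high inter-run cosine gives high C(q).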
scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.
scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
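The per-task decomposition can be sketched as follows. Using sample variance, and averaging the within-model variances to get the seed term, are assumptions; the script may aggregate differently.

```python
import math
from statistics import mean, variance
from typing import Dict, Sequence


def decompose(scores_by_model: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Split one task's run_score variance into seed and capability parts.

    scores_by_model maps a model name to its repeated run scores on the task
    (at least two runs per model and at least two models).
    """
    # Seed variance: spread across repeated runs of the same model, averaged.
    seed_var = mean(variance(runs) for runs in scores_by_model.values())
    # Capability variance: spread of the per-model mean scores.
    cap_var = variance([mean(runs) for runs in scores_by_model.values()])
    snr = cap_var / seed_var if seed_var > 0 else math.inf
    return {"seed_var": seed_var, "cap_var": cap_var, "snr": snr}


if __name__ == "__main__":
    # Toy numbers, not benchmark results.
    print(decompose({"model-a": [0.8, 0.9, 1.0], "model-b": [0.4, 0.5, 0.6]}))
```

An SNR below 1 means the run-to-run spread of a single model is larger than the spread between models, which is why such tasks cannot support a stable ranking.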
scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where assistant emits empty response
or run ends in failure. Reveals long-horizon capability that flat
scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at
60% rate.
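A small sketch of a per-turn survival estimate. It uses a Kaplan-Meier-style product of (1 - hazard), treating runs that end without failure as censored; that estimator choice and the (last_turn, failed) input shape are assumptions, not necessarily what the script does.

```python
from typing import List, Tuple


def survival_curve(
    runs: List[Tuple[int, bool]], horizon: int
) -> List[Tuple[int, float, float]]:
    """Return (turn, hazard, survival) for turns 1..horizon.

    Each run is (last_turn, failed): failed=True means T_F == last_turn,
    failed=False means the run ended cleanly at last_turn (censored).
    """
    curve = []
    survival = 1.0
    for turn in range(1, horizon + 1):
        at_risk = sum(1 for last, _ in runs if last >= turn)
        deaths = sum(1 for last, failed in runs if failed and last == turn)
        hazard = deaths / at_risk if at_risk else 0.0
        survival *= 1.0 - hazard
        curve.append((turn, hazard, survival))
    return curve


if __name__ == "__main__":
    # Toy data: two runs fail at turn 3, one at turn 5, one finishes at turn 8.
    for turn, hazard, survival in survival_curve(
        [(3, True), (3, True), (5, True), (8, False)], horizon=8
    ):
        print(turn, hazard, survival)
```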
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.
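A sketch of the weighting step. Clipping the product SNR x |C(q)| (rather than SNR alone) at p95 and using a nearest-rank percentile are assumptions; the function names are illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile (q in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def task_weights(
    snr: Dict[str, float], cq: Dict[str, float], cap_pct: float = 95.0
) -> Dict[str, float]:
    """Weight each task by SNR * |C(q)|, winsorized at the cap_pct percentile."""
    raw = {task: snr[task] * abs(cq[task]) for task in snr}
    cap = percentile(list(raw.values()), cap_pct)
    return {task: min(weight, cap) for task, weight in raw.items()}


def weighted_scores(
    task_scores: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> Dict[str, float]:
    """Weighted mean score per model; task_scores[model][task] is a mean run_score."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {
        model: sum(weights[task] * score for task, score in per_task.items()) / total
        for model, per_task in task_scores.items()
    }
```

With a nearest-rank p95, clipping only takes effect once there are at least 20 tasks; on smaller sets the cap equals the largest weight.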
scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tools for auditing archive coverage, rejudging judge-infra failures
via direct Anthropic API (bypasses the gateway path that sometimes
returns "Gateway is restarting" / empty judge results), and producing
fair multi-model comparison reports.
scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs
and archive JSONs side-by-side. Reports coverage %, clean mean,
coverage-normalized score, infra-zero count, judge-infra remaining
vs rejudged.
scripts/audit_per_run.py: per-run cross-model audit. Flags tasks
where all models score zero (broken task/verifier), verifiers that
reject valid outputs (C=0 although the agent produced text),
harness-error clusters, and model-specific pathologies.
scripts/rejudge_all.py: re-runs judge scoring on archive runs where
the gateway judge failed. Uses direct anthropic SDK against
claude-sonnet-4-6, rewrites judge_result fields in place, recomputes
run_score per the C+T+B+J weighting.
scripts/generate_fair_report.py: produces an 8/9-model comparison
markdown report. Supports --exclude to drop specific models and
headlines the "clean" score (mean across 120 archived runs). Reports
per-tier scores, C=1.0 task pass counts, and coverage parity.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.
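The isolation step lives in a shell script; the Python sketch below only illustrates the same idea and is not a port of it. Function names are hypothetical, and whether the real script pre-creates empty runtime directories is not shown here: this version simply does not carry them over.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Dict, Optional

# Config entries carried over into the isolated state dir (from the list above).
CONFIG_ENTRIES = (
    "openclaw.json", "identity", "devices", "exec-approvals.json",
    "tasks", "subagents", "flows", "cron",
)


def clone_pristine_state(source: Path, parent: Optional[Path] = None) -> Path:
    """Copy only config entries into a fresh temp dir; runtime dirs such as
    agents/ and logs/ are not copied, so the gateway starts without stale state."""
    dest = Path(
        tempfile.mkdtemp(prefix="openclaw-state-", dir=str(parent) if parent else None)
    )
    for name in CONFIG_ENTRIES:
        src = source / name
        if src.is_dir():
            shutil.copytree(src, dest / name)
        elif src.is_file():
            shutil.copy2(src, dest / name)
    return dest


def gateway_env(state_dir: Path) -> Dict[str, str]:
    """Environment for the gateway process, pointing it at the isolated dir."""
    return dict(os.environ, OPENCLAW_STATE_DIR=str(state_dir))
```

Each sweep gets its own directory, so agents created in one sweep cannot slow down agents.create in the next.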
profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus
(no dash). The dashed slug is unknown to OpenRouter and silently fails;
the no-dash slug is the canonical one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context:
The current 40-task set is being split into a private holdout set plus a
new public set. The public repo will ship a different task set that
doesn't give away the holdout; in the meantime, stop tracking the current
tasks/ directory so benchmarking can continue locally without exposing
the set externally.
Changes:
- .gitignore: add tasks/ and lab-pr68627/ (vendored PR content, also
moving out of the public repo).
- git rm --cached tasks/: remove from tracking (files remain on disk
locally).
- tests/test_integration_checks.py:
* Module-level pytest.mark.skipif that skips the whole file when
tasks/ is absent — so CI against the public repo (no tasks)
stays green once the private set moves out.
* Update the t2-node-search-patch fixture to also define emptyNote()
since the task was hardened with that distractor. Without this, the
integration test asserts score==1.0 but gets 0.0 (the new
"emptyNote stays empty" test fails against a fixture that never
defines emptyNote).
Follow-up (separate work):
Public task set lands in a subsequent commit. Holdout access path
(encrypted-in-repo or private-repo) gets wired into the harness's
private_tasks_root / hidden_tasks_dir plumbing.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
All 5 of these tasks were clearing at 0.85-1.00 across the frontier-4
on v4.14 — narrow spread means they don't differentiate models. Each
now has a specific trap that catches naive approaches:
- t1-refactor-csv-loader: introduces divergent normalization requirements
between load_rows (lowercase) and summarize_inventory (preserve first-
seen case). Naive "lowercase everywhere in parse_inventory_row" fails
2 of 3 tests. Proper refactor returns original case in the helper.
- t3-node-multifile-refactor: adds a 3rd caller (audit.js) requiring
preserved userId case + minute-precision timestamp, diverging from
auth.js and report.js. Single-function extraction fails 2 of 4 tests;
agent must handle two normalization modes.
- t4-browser-research-and-code: docs rewritten with distractors —
v1/v2/v3 versions, required/optional/cross-endpoint headers, rate
limits, payload limits. Tests check 6 facts including negative-match
for X-Admin-Token distractor (scoped to /v2/admin only).
- t2-node-search-patch: adds emptyNote() factory in render.js with
legitimate empty body: "" that MUST NOT be patched. Naive grep-replace
of `body: ""` now fails the emptyNote test. Also adds whitespace-
trimming test for filterNotes.
- t4-memory-recall-continuation: requires storing 3 SEPARATE memory
entries (beta-regions, retry-budget, apac-gating) instead of one.
Release notes include operational-notes distractors that must NOT
be codified. flags.py gains APAC_GATED_UNTIL field. Handoff verifier
added to check all 3 facts in the handoff artifact.
All 5 tasks verified: properly-implemented starter patches pass all
tests, the new traps specifically fail naive implementations.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
Every one of these tasks has an execution_check script (verify_*.py) that
already does a recursive workspace search — it greps for required content
across every agent-written .md/.txt/.csv regardless of filename. The
completion.files block was redundant and actively penalized models that
wrote to reasonable alternate paths (analysis.md vs budget_report.md).
Before: total=1 (file) + N (exec) → if file path didn't match, score was
capped at N/(N+1). On t3-fin-budget-monthly, 14 of 15 prior sweep runs
failed specifically on "FILE budget_report.md: File does not exist".
After: total=N. Verifier is the source of truth. Judge rubric already
tells graders "don't penalize non-standard paths" — this aligns completion
scoring with that stated policy.
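The arithmetic behind the cap, as a tiny sketch (function names are illustrative, and real completion scoring may include other check types):

```python
def completion_before(file_ok: bool, exec_passed: int, exec_total: int) -> float:
    """Old scoring: one file-existence check plus N execution checks."""
    return (int(file_ok) + exec_passed) / (1 + exec_total)


def completion_after(exec_passed: int, exec_total: int) -> float:
    """New scoring: only the N execution checks count."""
    return exec_passed / exec_total


if __name__ == "__main__":
    # A run that passes all 4 verifier checks but wrote analysis.md
    # instead of budget_report.md.
    print(completion_before(file_ok=False, exec_passed=4, exec_total=4))
    print(completion_after(exec_passed=4, exec_total=4))
```

With N=4 the old rule tops out at 4/5 for any run that used an alternate filename, even when every verifier check passes.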
Fixed tasks (all had recursive verifiers):
t1-fs-quick-note, t1-life-translate, t2-ctx-pronoun-resolve,
t2-err-instruction-ambig, t2-fs-cleanup-downloads, t2-fs-find-that-thing,
t2-msg-summarize-thread, t2-priv-redact-doc, t2-skill-excel-rollup,
t2-sys-memory-roundtrip, t2-web-quick-fact, t3-cal-reschedule-cascade,
t3-data-sql-query, t3-fin-budget-monthly, t3-msg-inbox-triage,
t3-social-bill-split, t3-web-research-and-cite, t4-ctx-long-recall,
t4-life-trip-plan
Spot-checked that each verifier's required-content set already covered
the content_contains constraints that were also dropped.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
- scripts/_archive_cache.sh: snapshot run_cache/<model>/ to
run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json.
Sourced by sweep scripts so transcripts survive the next sweep's
cache wipe and stay available for audits.
- scripts/container_sweep_single.sh: base multi-model sweep.
Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so
their caches are force-cleared at sweep start. Calls archive helper
on exit.
- scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast
fix validation (~20 min) instead of full 3-run sweep (~60 min).
- Dockerfile.main: parametrized clawbench-on-openclaw image with
ARG BASE for pinning to any openclaw tag.
- scripts/git_checkpoint.py + README: documented checkpoint workflow
for tagging known-good states during risky work.
- .gitignore: un-ignore scripts/, keep targeted ignores for
__pycache__, .tmp, .local.py.
Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
classify_tool_call matched tool names against a fixed set of verb
patterns. The pattern for the "edit" family was:
r"write|edit|patch|apply|create|delete|rename"
This omitted "replace" and "insert", so tools like str_replace,
replace_in_file, insert_text, and insert_at_line all fell through
every check and were returned as ("unknown", False) – classified as
non-mutating with unknown family.
Consequences for any agent that edits via str_replace:
- distinct_mutation_targets stayed empty → min_distinct_mutation_targets
requirement always failed
- read_before_write_ratio was 1.0 for the wrong reason (no mutations
detected, so denominator collapsed to 1)
- "edit" never appeared in distinct_families → required_families check
always reported it as missing
Fix: extend the edit pattern with "replace" and "insert".
Tests added: unit test for classify_tool_call directly and an end-to-end
trajectory test using a str_replace-based edit transcript.
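A reduced sketch of the fixed check. Only the edit family is shown, and the use of re.search on the lowercased tool name is an assumption about how the real classify_tool_call matches; the other families are omitted.

```python
import re
from typing import Tuple

# Edit-family pattern after the fix: "replace" and "insert" are now included.
EDIT_PATTERN = re.compile(
    r"write|edit|patch|apply|create|delete|rename|replace|insert"
)


def classify_tool_call(tool_name: str) -> Tuple[str, bool]:
    """Return (family, is_mutating) for a tool name.

    Anything that is not recognised as an edit falls through to
    ("unknown", False) in this reduced sketch.
    """
    if EDIT_PATTERN.search(tool_name.lower()):
        return ("edit", True)
    return ("unknown", False)


if __name__ == "__main__":
    for name in ("str_replace", "replace_in_file", "insert_text", "insert_at_line"):
        print(name, classify_tool_call(name))
```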
Re-ran Sonnet 4.6 judge on all 60 GPT 5.4 runs that had auth errors
during the original sweep. Called the Anthropic API directly using
cached transcripts. Results:
- judge_task_coverage: 0.6 -> 1.0 (all 40 tasks fully judged)
- judge_error_count: 60 -> 0
- overall_judge_score: 0.438 -> 0.239 (was inflated by excluding errors)
- overall_score: 0.456 -> 0.457 (effectively unchanged; judge is gated on C >= 0.9999)
No judge caveat remains. All 6 models now have complete, unbiased
judge coverage across all 720 runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrote README to focus on why trace-based scoring matters for
user-perceived agent capabilities, how ablation works, and the 13
failure modes. Removed results pending finalization.
Worker changes: re-inject host env/plugins into lane configs after
gateway restart (fixes judge auth stripping), increase control-plane
probe tolerance for slow gateway startups.
Added 6-model leaderboard and sweep reports from April 11-12 runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Queue submissions were failing intermittently with "Parallel lane N
failed for tasks [...]" after a ~5 minute stall. Root cause traced
to the interaction of three things:
1. GatewayClient.config.request_timeout defaulted to 300 seconds,
meaning every WebSocket RPC would block for 5 full minutes
before raising TimeoutError.
2. worker._assert_gateway_control_plane calls sessions.create on
a freshly-started lane gateway as a readiness probe. A transient
plugin-load race can leave the new gateway accepting /health HTTP
requests but hanging on WebSocket RPCs, so the probe would block
for the full 300s default timeout.
3. _ensure_parallel_gateway called the probe inside the same for-
loop that was polling /health, with `except Exception: pass`
silently swallowing probe failures. Each probe attempt consumed
30-300 seconds of the 60-iteration budget, so effectively only
1-2 probe retries fit before the outer loop gave up and the
whole lane batch was marked failed.
Fixes (all in clawbench/client.py + clawbench/worker.py):
- Drop `GatewayConfig.request_timeout` default from 300.0 to 60.0.
60 seconds is still generous for what are sub-second calls in
steady state (agents.create, sessions.create, etc.); a healthy
gateway responds in milliseconds. Long-running calls like
send_and_wait already pass an explicit per-call timeout.
- Add a `timeout=None` kwarg to `GatewayClient._rpc()` so callers
can override the default when they need tighter or looser
per-call bounds. Error message now includes the effective
timeout so debugging is clearer.
- `_assert_gateway_control_plane` now constructs a dedicated
GatewayConfig with request_timeout=15s and wraps the whole probe
(including WebSocket connect + session create + session delete)
in `asyncio.wait_for(..., timeout=30.0)` as a belt-and-suspenders
ceiling. A probe hang now fails in 30s instead of 300s.
- Split `_ensure_parallel_gateway` and `_ensure_gateway` into two
explicit phases:
Phase A: poll /health over HTTP until 200, up to 60s (fast)
Phase B: call _assert_gateway_control_plane up to 3 times,
sleeping 2s between attempts, before raising
The probe-retry loop lets transient plugin warmup races
self-recover without killing the whole lane batch.
- Added 2s HTTP client timeout to the /health polls so a single
wedged HTTP request can't swallow 30s of the budget.
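The two-phase startup can be sketched with injected callables. The function name, the 1s poll interval, and the health_ok / probe callables are hypothetical stand-ins for the /health request and _assert_gateway_control_plane; the timeouts mirror the values listed above.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def wait_until_ready(
    health_ok: Callable[[], Awaitable[bool]],
    probe: Callable[[], Awaitable[None]],
    *,
    health_budget_s: float = 60.0,
    health_poll_s: float = 1.0,      # assumption: poll interval
    health_timeout_s: float = 2.0,
    probe_attempts: int = 3,
    probe_sleep_s: float = 2.0,
    probe_ceiling_s: float = 30.0,
) -> int:
    """Phase A: poll health until it reports ready. Phase B: retry the
    control-plane probe. Returns the probe attempt number that succeeded."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + health_budget_s
    while True:
        try:
            # Per-request timeout so one wedged call cannot eat the budget.
            if await asyncio.wait_for(health_ok(), timeout=health_timeout_s):
                break
        except (asyncio.TimeoutError, OSError):
            pass
        if loop.time() >= deadline:
            raise TimeoutError("gateway health check never succeeded")
        await asyncio.sleep(health_poll_s)

    last_error: Optional[BaseException] = None
    for attempt in range(1, probe_attempts + 1):
        try:
            # Hard ceiling around the whole probe, not just one RPC.
            await asyncio.wait_for(probe(), timeout=probe_ceiling_s)
            return attempt
        except Exception as exc:  # probe failures are retried, not swallowed
            last_error = exc
            if attempt < probe_attempts:
                await asyncio.sleep(probe_sleep_s)
    raise RuntimeError(
        f"control-plane probe failed after {probe_attempts} attempts"
    ) from last_error
```

Keeping the probe out of the health loop means a slow probe no longer consumes health-poll iterations, and the final error carries the last probe failure as its cause.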
Also clears the 11 historical failed jobs from the queue dataset at
ScoootScooob/clawbench-results so the leaderboard starts clean on
the next Space rebuild.
All 20 tests in test_worker, test_client, and test_parallel_harness
pass with the new code paths.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: datasets.Dataset.push_to_hub(repo, split="submissions")
writes a single parquet shard to
data/submissions-00000-of-00001.parquet, REPLACING whatever was
there. Every call to upload_result() was clobbering the previous
submission. After 7 sequential uploads the dataset contained only
the last row; the HF Space leaderboard therefore displayed only 1
model entry despite 7 having been pushed.
Fix: upload_result() now loads the existing submissions split
first, appends the new row, dedupes by submission_id (so retried
uploads of the same run don't double-count), and pushes the
combined row list as a fresh parquet shard.
This works at ClawBench's current submission rate (1-2 concurrent
jobs). If cross-worker concurrency ever becomes material, we should
move to a genuinely append-only layout where each submission writes
its own parquet shard under
data/submission-<submission_id>-of-NNNNN.parquet
and readers scan the whole data/ directory. File a follow-up when
submission volume warrants it.
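The append-and-dedupe step, reduced to plain row dicts. Loading and pushing the dataset are left out, and letting the newest row win on a duplicate submission_id is an assumption about the intended behaviour.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_submissions(existing: List[Row], new_row: Row) -> List[Row]:
    """Append new_row to the existing rows, keeping one row per submission_id.

    A retried upload with the same submission_id replaces the earlier row
    instead of adding a second one.
    """
    by_id: Dict[str, Row] = {}
    for row in [*existing, new_row]:
        by_id[row["submission_id"]] = row  # later rows overwrite earlier ones
    return list(by_id.values())


if __name__ == "__main__":
    rows = [{"submission_id": "run-1", "score": 0.5}]
    rows = merge_submissions(rows, {"submission_id": "run-2", "score": 0.7})
    rows = merge_submissions(rows, {"submission_id": "run-2", "score": 0.7})
    print(len(rows))
```

The merge is read-modify-write, so two workers uploading at the same moment can still lose a row, which is the concurrency limit noted above.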
Also backfilled the 6 frontier bake-off rows that had been lost
during the earlier overwrite sequence. The dataset at
https://huggingface.co/datasets/ScoootScooob/clawbench-results now
holds all 7 rows: Kimi K2.5, Opus 4.6, GPT-5.4, Gemini 3.1 Pro,
GLM-5.1, Qwen3.6-Plus, MiniMax M2.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Empty commit to re-queue .github/workflows/sync-to-hf-space.yml
now that HF_TOKEN and HF_USERNAME repository secrets are in place.
GitHub Actions secrets are snapshotted at workflow queue time, so
previously-failed runs cannot retroactively see secrets added after
their queue timestamp. This commit triggers a fresh run that will
pick up the current secrets state.
Documents the one-time secrets setup (HF_TOKEN + HF_USERNAME) that
the sync-to-hf-space.yml workflow needs before it can mirror GitHub
main to the HF Space. Also explains the --force semantics and the
"GitHub is the single source of truth" contract.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds .github/workflows/sync-to-hf-space.yml which force-pushes main
to the HF Space git remote whenever a commit lands on GitHub main.
This eliminates the dual-push friction: GitHub becomes the single
source of truth, and the HF Space deployed at
https://huggingface.co/spaces/ScoootScooob/clawbench always tracks
the latest GitHub main without manual `git push hf main` calls.
Requires two repository secrets (Settings -> Secrets -> Actions):
HF_TOKEN — write-scoped HF token
(https://huggingface.co/settings/tokens)
HF_USERNAME — HF account username that owns the Space
Optional repo variable:
HF_SPACE_ID — defaults to "ScoootScooob/clawbench", override if
mirroring to a different Space.
Uses --force to replace any Space-side commits that were created by
editing files in the HF UI. This is intentional — the workflow's
contract is that GitHub is authoritative.
Guarded by concurrency group so two simultaneous pushes serialize
instead of racing into a non-fast-forward rejection.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two fixes for the HF Space UI:
1. Leaderboard crashed with "'str' object has no attribute 'get'"
because upload_result() serializes BenchmarkResult.environment as
str(result.environment) when pushing to the HF Dataset, but
_flatten_result called .get() on it as if it were a dict.
Defensive parse: accept dict, stringified dict, or JSON object.
2. Stats ribbon (Tasks/Tiers/Browser/Judge counts) was hardcoded to
the v0.3 values (20/5/2/6). Replaced with _compute_stats() which
calls load_all_tasks() at startup and derives the numbers from
the live task corpus, so the ribbon stays in sync with the
tasks/ directory without manual edits.
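A sketch of the defensive parse from fix 1. The function name and the choice to return an empty dict for unparseable input are assumptions; the three accepted shapes are the ones listed above.

```python
import ast
import json
from typing import Any, Dict


def parse_environment(value: Any) -> Dict[str, Any]:
    """Accept a dict, a JSON object string, or a str(dict) string."""
    if isinstance(value, dict):
        return value
    if isinstance(value, str):
        # Try strict JSON first, then a Python literal such as "{'os': 'linux'}".
        for loader in (json.loads, ast.literal_eval):
            try:
                parsed = loader(value)
            except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
                continue
            if isinstance(parsed, dict):
                return parsed
    return {}


if __name__ == "__main__":
    print(parse_environment(str({"os": "linux"})))
    print(parse_environment('{"os": "linux"}'))
    print(parse_environment("not a dict"))
```

ast.literal_eval only evaluates literals, so a stringified dict coming back from the dataset is parsed without executing code.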
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Found and fixed three blockers preventing the HF Space Docker container
from running the eval suite end-to-end, verified by building the image
locally with Docker Desktop and running a tier-1 task against Qwen3-32B
through the HF Inference API inside the container.
1. pytest was in [project.optional-dependencies].dev, not [project].
The Dockerfile does `pip install .` which only installs runtime
deps, so every task whose completion verifier runs `pytest -q`
would fail with exit 127 (command not found). Moved pytest +
pytest-asyncio into the base dependencies so the container gets
them by default. The [dev] extra is kept as an alias for
existing `pip install .[dev]` invocations.
2. clawbench/tasks.py resolved TASKS_DIR via
`Path(__file__).parent.parent / "tasks"`, which works only for
source checkouts. When pip installs the package into
/usr/local/lib/python3.11/dist-packages/clawbench, the sibling
`tasks/` directory no longer exists at that path, so
`load_all_tasks()` returned empty and `clawbench run` died with
"No tasks to run". Added a fallback resolver that tries, in
order: $CLAWBENCH_TASKS_DIR env var, sibling-of-source,
Path.cwd() / "tasks", and known Docker layout candidates
(/home/node/app/tasks, /home/user/app/tasks, /app/tasks).
Verified inside the container that `TASKS_DIR` now resolves to
/home/node/app/tasks and load_all_tasks() returns 40 tasks.
3. Tier-1 task timeouts were at 180s, which is enough for Qwen3-32B
(52.9s wall time) but causes Llama-3.3-70B to hit the wall on
t1-bugfix-discount. Raised tier-1 timeouts to 360s so slower HF
models can complete tasks within the deterministic timeout and
produce a capability signal instead of an infrastructure
timeout signal.
Also fixed a pre-existing stale test (tests/test_tasks.py expected
20 tasks, but the corpus has held 40 since the v0.5 expansion) that
was failing on every test run.
Verified inside the container image:
- `clawbench list-tasks` returns all 40 tasks
- sessions.create passes for all 11 preset models
(9 huggingface/* + 2 anthropic/*)
- `clawbench run --model huggingface/Qwen/Qwen3-32B
--task t1-bugfix-discount --runs 1`
scored 1.000 / C=1.000 T=1.000 B=1.000 in 52.9s
with 279,702 tokens captured.
Remaining architectural note (not a blocker): the CLI path
`clawbench run` assumes the gateway is already running. Only the
queue/worker path (`app.py` → `EvalWorker._ensure_gateway`) spawns
its own gateway. For HF Space deployment this is fine because all
user submissions go through the Gradio UI → queue → worker path;
local CLI invocations inside the container need to start the
gateway manually first.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single markdown document that
captures every empirical source informing ClawBench's task design.
baselines/BASELINE_SOURCES.md covers:
1. The 24 public Hugging Face datasets tagged format:agent-traces,
with owner/name, row counts, cluster classification (Pi sessions,
custom agent traces, Claude Code, demo), and how each cluster
maps onto ClawBench's tier/family/trajectory design decisions.
Aggregate ~3,049 rows, ~1,168 unique sessions after mirror
deduplication.
2. The Hermes agent reasoning trace aggregate (14,701 sessions,
24.3 avg turns, category distribution) with the direct mapping
from observed categories to ClawBench task families.
3. The internal personal-agent use-case corpus (72 queries, 12
primary scenarios, 139 atomic capabilities) that contributes
the scenario_weight_defaults in query_catalog.py. The source
is not a public dataset and is only referred to as "the internal
personal-agent use-case corpus" — no filename reference.
4. A full source-to-design-decision mapping table showing which
design choice (tier ladder, family mix, tool diversity,
recovery expectations, browser task count, scenario weights,
difficulty tags, adversarial tier-5) is driven by which source.
Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.
No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the 192-line v0.3 README with a 724-line v0.5 writeup that
documents:
- The three-layer scoring model (deterministic first, process second,
semantic residue last) with full Mermaid diagram of the harness ->
verifier -> scorer -> v0.5 diagnostic pipeline
- The v0.5 Configuration Diagnostic positioning: first agent benchmark
that measures the configuration, not just the model, with a 5-stage
flow diagram (profile -> fingerprint -> predict -> validate -> explain)
- All three mathematical pillars (k-NN composite Jaccard, fANOVA with
Random Forest surrogate + lite fallback, Taguchi S/N) with formulas
and the rationale for why each technique is in the stack
- The 7-model frontier baseline results with ASCII bar chart,
per-bucket Taguchi S/N, calibration MAE, and honest caveats about
tier-1 coding tasks being too easy at n=1 to separate frontier models
- Task suite breakdown (40 tasks, 5 tiers, 14 capability tags, 4 pools)
- Full scoring model including the gated judge weighting invariant
- CLI reference for `clawbench run` + `clawbench diagnose`
- Repository layout with line counts for all 18 package modules
- Comparison table vs SWE-bench / HumanEval / pass-rate leaderboards
- Testing section (107 tests, key files with line counts)
- Historical data + HF Dataset integration
- Contributing guide stub with task YAML shape
- Citation block
Preserves the HF Space frontmatter (title, emoji, colorFrom, colorTo,
sdk, app_port, pinned, license) so the Space rendering still works.
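For reference, the Taguchi larger-is-better signal-to-noise ratio named among
the three pillars has the standard form S/N = -10 * log10(mean(1 / y_i^2)).
A minimal sketch, assuming strictly positive scores; the function name is
illustrative and is not taken from clawbench/stats.py:

```python
import math
from typing import Sequence


def taguchi_sn_larger_is_better(scores: Sequence[float]) -> float:
    """Return the larger-is-better S/N ratio in decibels.

    Hypothetical helper for illustration; scores must be strictly positive.
    """
    if not scores:
        raise ValueError("need at least one score")
    if any(s <= 0 for s in scores):
        raise ValueError("larger-is-better S/N is undefined for scores <= 0")
    # Low scores dominate the mean of inverse squares, so erratic
    # configurations are penalised even when their average looks fine.
    mean_inverse_square = sum(1.0 / (s * s) for s in scores) / len(scores)
    return -10.0 * math.log10(mean_inverse_square)
```

Two score sets with the same mean separate under this metric: the consistent
set gets the higher S/N because a single near-zero run inflates the mean of
inverse squares.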
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Profiles (profiles/):
- frontier_opus_4_6.yaml (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)
All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.
Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile
through the harness and generates a comparison table. Wraps
`clawbench run --profile` via an inline Click entry (the package
has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer reporting per-bucket
mean, Taguchi S/N, per-task win rate, calibration, and fANOVA. Classifies
OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/
Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
scale_timeouts.py, seed_historical_db.py: task-corpus tooling.
Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run
(3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6
scored 63.9% with real token streaming (174K tokens, $0.18 cost).
The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too
easy to separate frontier models at n=1. Documents
infrastructure findings around gateway plugin allowlist behavior,
token streaming gaps for non-Anthropic providers, and hot-reload
cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning
Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
(committed snapshot for reproducibility; runtime results still
go to results/ which remains gitignored)
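The inner-vendor classification used by analyze_open_vs_closed.py can be
sketched roughly as follows. This assumes route strings shaped like
"openrouter/<vendor>/<model>"; the vendor slugs and the function name are
illustrative assumptions, not the analyzer's actual table:

```python
# Vendor slugs are assumptions for illustration only.
OPEN_WEIGHT_VENDORS = {"z-ai", "qwen", "minimax", "moonshotai"}


def classify_bucket(model_route: str) -> str:
    """Return "open" or "closed" for a model route string."""
    parts = model_route.lower().split("/")
    if parts[0] == "openrouter" and len(parts) >= 3:
        # Classify by the inner vendor, not by the OpenRouter wrapper.
        return "open" if parts[1] in OPEN_WEIGHT_VENDORS else "closed"
    # Direct provider routes are treated as closed in this sketch.
    return "closed"
```

Keying on the inner prefix keeps a closed model routed through OpenRouter out
of the open bucket.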
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests:
- tests/test_v05_framework.py (646 lines): end-to-end synthetic ecosystem
covering profile parsing, fingerprint computation, k-NN prediction,
surprise detection, factor analysis, diagnostic rendering
- tests/test_v05_extensions.py (552 lines): unit tests for Taguchi S/N
robustness profile, plugin utilization audit, manifest-vs-reality gap,
calibration tracking, surprise cause attribution, recommendations
generator, insights publishing, end-to-end diagnostic with all sections
- tests/test_scorer.py: judge gating tests (judge cannot rescue failed
deterministic completion; judge capped at 10% when deterministic
verifier exists and floor met; judge dominates at 50% on semantic-
only tasks)
- tests/test_e2e_significance.py, test_parallel_harness.py:
additional coverage for harness behavior
Task corpus expansion:
- 20 new task YAMLs across tier1-4 covering fs, web, calendar,
messaging, data processing, social coordination, life assistance,
context continuation, error boundary, skill calling, privacy
redaction scenarios
- Fresh asset packs for each new task (test fixtures + reference
inputs/outputs)
- Lower tier-1 coding task timeouts from 360s to 180s to cut time spent
waiting for a final state (the gateway emits no chat.state:final event,
so the wait is pure overhead; 180s is plenty for any tier-1 task)
- Modify tier2-5 task YAMLs for verifier robustness and judge rubric
updates
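The judge gating behaviour exercised by tests/test_scorer.py can be sketched
as below. Only the 10% cap, the 0.9999 floor, and the 50% semantic-only weight
come from this log; the function names, the signature, and the single
`base_score` input standing in for all non-judge components are assumptions,
and the sketch applies the cap as a fixed weight:

```python
DETERMINISTIC_FLOOR = 0.9999  # completion floor quoted in this log
JUDGE_CAP = 0.10              # judge weight when a deterministic verifier exists
JUDGE_SEMANTIC_ONLY = 0.50    # judge weight when no deterministic verifier exists


def judge_weight(completion: float, has_deterministic_verifier: bool) -> float:
    """Return the share of the final score the judge may contribute."""
    if not has_deterministic_verifier:
        return JUDGE_SEMANTIC_ONLY
    if completion < DETERMINISTIC_FLOOR:
        # The judge cannot rescue a failed deterministic completion.
        return 0.0
    return JUDGE_CAP


def combine(base_score: float, judge_score: float, completion: float,
            has_deterministic_verifier: bool) -> float:
    """Blend the non-judge score with the judge score (hypothetical helper)."""
    w = judge_weight(completion, has_deterministic_verifier)
    return (1.0 - w) * base_score + w * judge_score
```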
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add the v0.5 plugin-profile diagnostic system on top of v0.4:
- profile.py: PluginProfile, PluginManifest, RegistrationTrace,
ProfileFingerprint, fingerprint_similarity (Jaccard composite over
capability coverage, hook footprint, tool family surface, tags, slots,
base model)
- prediction.py: HistoricalDatabase with JSON persistence, k-NN cold-start
prediction with confidence bands, calibration metrics (MAE/RMSE/bias),
surprise cause attribution
- factor_analysis.py: fANOVA with Random Forest surrogate when sklearn
is available, fANOVA-lite fallback that decomposes variance via
SSB/SST with pairwise interaction residuals
- diagnostic.py / diagnose_cli.py: Configuration Diagnostic Report that
ties together profile -> fingerprint -> prediction -> run -> surprises -> insights
- utilization.py: plugin utilization audit (dead-weight detection) +
manifest-vs-reality gap per plugin
- recommendations.py: evidence-backed profile change generator
(add_plugin, remove_plugin, fill_slot, add_capability) with
confidence scaled by sample size
- insights.py: publishes plugin leaderboard, factor importance,
interactions, capability gaps, calibration history to JSON files
- stats.py: Taguchi larger-is-better signal-to-noise ratio and
RobustnessProfile with per-tier means (the third mathematical
pillar of v0.5 alongside k-NN and fANOVA)
- scorer.py: fix judge weighting per spec. Judge now capped at 10%
when the task has a deterministic completion verifier and only
contributes when the deterministic floor (completion >= 0.9999)
is met. When no deterministic verifier exists, judge dominates
at 50% (semantic-only regime). This enforces CLAWBENCH_V0_4_SPEC.md
"Disallowed Primary Verifiers" and "Judge Gating" sections.
- cli.py: wire --profile flag into clawbench run; add clawbench diagnose
subcommand
- harness.py: pass has_deterministic_verifier to combine_run_score
- CLAWBENCH_V0_4_SPEC.md: add v0.5 Direction section
.gitignore: exclude .clawbench/ runtime state and .DS_Store
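As a rough illustration of the composite Jaccard similarity in profile.py,
the sketch below averages per-facet Jaccard scores. The dict-of-sets
fingerprint shape, the equal facet weights, and both function names are
assumptions; the real fingerprint_similarity may weight facets differently:

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Intersection over union, with two empty sets treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def composite_similarity(fp_a: Dict[str, Set[str]],
                         fp_b: Dict[str, Set[str]]) -> float:
    """Unweighted mean of per-facet Jaccard scores (weights are assumed)."""
    facets = sorted(set(fp_a) | set(fp_b))
    if not facets:
        return 1.0
    total = sum(jaccard(fp_a.get(f, set()), fp_b.get(f, set())) for f in facets)
    return total / len(facets)
```

A single-valued facet such as the base model can be passed as a one-element
set, which makes its Jaccard score 1.0 on a match and 0.0 otherwise.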
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous redesign passed radius_size/block_radius/shadow_drop/shadow_spread
to themes.Base().set(), but those kwargs are either constructor-only or
version-specific, so the HF Space hit a runtime error at startup. Drop them and wrap
the whole theme build in a try/except that falls back to plain Base() so any
future unknown kwarg degrades gracefully instead of crashing the Space.
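The fallback pattern can be sketched without Gradio installed. StubBase is a
stand-in for a theme class whose set() rejects unknown kwargs with a
TypeError, and build_theme is a hypothetical name; the actual app code may
be structured differently:

```python
class StubBase:
    """Stand-in theme class; set() accepts only the tokens it knows."""

    KNOWN = {"body_background_fill", "button_primary_background_fill"}

    def __init__(self):
        self.tokens = {}

    def set(self, **kwargs):
        unknown = set(kwargs) - self.KNOWN
        if unknown:
            raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
        self.tokens.update(kwargs)
        return self


def build_theme(theme_cls, **tokens):
    """Build a customised theme, degrading to a plain instance on failure."""
    try:
        return theme_cls().set(**tokens)
    except Exception as exc:  # broad on purpose: startup must not crash
        print(f"theme customization failed, using defaults: {exc}")
        return theme_cls()
```

Catching broadly trades precise error reporting for availability, which suits
a hosted demo where an unstyled UI is better than a crashed one.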
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply the shared OpenClaw aesthetic — dark backgrounds (#0e1015 layered),
signature red accent (#ff5c5c), Inter + JetBrains Mono typography,
whisper-thin color-mix borders, pill tab switcher, and rise animations.
Replaces the default Gradio Base theme with custom CSS and theme tokens.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>