Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.
Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.
README.md:
- drops the "Docker base pinning" row from the "What's new" table and
  replaces it with "Reproducibility-first infrastructure" framing
- drops the "pinned" badge and adds a "Diagnostics" badge instead
- updates "Reproducibility caveats" to recommend "build both sides
  of any comparison from the same OpenClaw release" rather than
  "pin to 2026.4.15-beta.1"
- updates Quick Start to record (not assume) the OpenClaw version
  the build resolved to
- drops the pinned-base row from the comparison table and replaces it
  with "State-isolation per run" (the infrastructure that actually
  sets it apart)
- updates the version log entry for Core v1 to highlight the
  dynamical-systems diagnostics and state isolation rather than the
  pinning, which has been removed
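A minimal sketch of what "record, not assume" could look like for the
Quick Start item above. It reads the image ID and repo digests that
:latest resolved to from `docker image inspect`, which prints a JSON
array with `Id` and `RepoDigests` fields. The helper names are
hypothetical and not part of this repo, and how OpenClaw reports its own
version string is not covered here.

```python
import json
import subprocess

# Tag taken from the Dockerfile line in this commit.
IMAGE = "ghcr.io/openclaw/openclaw:latest"


def parse_inspect(raw_json):
    """Extract the image ID and repo digests from `docker image inspect` output."""
    entry = json.loads(raw_json)[0]  # inspect prints a JSON array, one entry per image
    return {"id": entry["Id"], "repo_digests": entry.get("RepoDigests", [])}


def record_resolved_image(image=IMAGE):
    """Ask the local Docker daemon what `image` currently resolves to.

    Hypothetical helper: requires a working `docker` CLI and a pulled image.
    """
    raw = subprocess.run(
        ["docker", "image", "inspect", image],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_inspect(raw)
```

Storing the returned dictionary next to each result file would let two
sweeps be compared only when their digests match.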
tasks-public/README.md:
- drops the 8-row "Established ranking" table per request and
  replaces it with a "Selection criteria" section that explains how
  the 19 tasks were chosen (zero rank inversions, minimum
  adjacent-rank gap 0.0049) without publishing version-dependent
  scores
- reframes the build instructions to track :latest, with a comment
  about platform-version drift
tasks-public/MANIFEST.yaml:
- drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
  a hard requirement)
- drops the `established_ranking` block and replaces it with
  `selection_basis`, which documents the methodology and explicitly
  states why scores are intentionally omitted
Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
The moving ":latest" tag caused observable drift in our sweeps: the
platform upgrade from 4.9 to 4.15-beta.1 shifted scores for all models
by +0.13 to +0.29, so unpinned builds produce non-reproducible rankings.
Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Add
an explanatory comment noting that bumping the base requires
re-running the reference sweep.
tasks-public/README.md: add build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:
(a) mean run_score across these 19 tasks reproduces the established
8-model ranking with zero inversions and min adjacent-rank gap
of 0.0049 (well above the ~0.002 seed-noise floor);
(b) coverage is preserved across tiers 1-5 and across the tools,
coding, repo, browser, multi_tool, and adversarial families;
(c) tasks with broken verifiers or near-zero cross-model SNR are
dropped (21 tasks retained as private holdout, not published).
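The selection in (a) can be sketched roughly as below. This is an
illustrative greedy loop, not the script that produced the release: the
function names, the `task_scores` layout (task -> model -> mean
run_score), and the tie-breaking rule are assumptions, and the coverage
constraint in (b) and the verifier/SNR filter in (c) are left out.

```python
from statistics import mean


def ranking_quality(task_scores, kept, reference_order):
    """Score a task subset as (-inversions, min adjacent gap); larger is better."""
    means = {m: mean(task_scores[t][m] for t in kept) for m in reference_order}
    # An inversion is a model pair whose subset means contradict the reference order.
    inversions = sum(
        1
        for i, a in enumerate(reference_order)
        for b in reference_order[i + 1:]
        if means[a] < means[b]
    )
    ordered = sorted(means.values(), reverse=True)
    min_gap = min(x - y for x, y in zip(ordered, ordered[1:]))
    return (-inversions, min_gap)


def greedy_eliminate(task_scores, reference_order, target_size):
    """Repeatedly drop the task whose removal leaves the best-ranked subset."""
    kept = set(task_scores)
    while len(kept) > target_size:
        # sorted() makes ties deterministic: the first task by name wins.
        drop = max(
            sorted(kept),
            key=lambda t: ranking_quality(task_scores, kept - {t}, reference_order),
        )
        kept.remove(drop)
    return sorted(kept)
```

Greedy removal is cheap but not guaranteed to find the best subset, so
the resulting subset should still be checked against the inversion and
gap targets.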
Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):
  1. Claude Opus 4.6     0.8137
  2. Claude Opus 4.7     0.7824
  3. GPT 5.4             0.7647
  4. Claude Sonnet 4.6   0.7597
  5. MiniMax M2.7        0.7475
  6. Gemini 3.1 Pro      0.7408
  7. Qwen 3.6 Plus       0.7030
  8. Kimi K2.5           0.6800
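For anyone re-checking a sweep against this list, the two summary
statistics (inversion count and minimum adjacent-rank gap) can be
computed as in the sketch below. The helper names are illustrative, and
the scores are the 4-decimal values printed above, so the gap computed
from them may differ from the reported 0.0049 in the last digit.

```python
# Scores copied from the ranking above (rounded to 4 decimals as published).
REFERENCE = [
    ("Claude Opus 4.6", 0.8137),
    ("Claude Opus 4.7", 0.7824),
    ("GPT 5.4", 0.7647),
    ("Claude Sonnet 4.6", 0.7597),
    ("MiniMax M2.7", 0.7475),
    ("Gemini 3.1 Pro", 0.7408),
    ("Qwen 3.6 Plus", 0.7030),
    ("Kimi K2.5", 0.6800),
]


def count_inversions(reference_order, scores):
    """Count model pairs ranked in the opposite order to `reference_order`."""
    return sum(
        1
        for i, a in enumerate(reference_order)
        for b in reference_order[i + 1:]
        if scores[a] < scores[b]  # ties are not counted as inversions
    )


def min_adjacent_gap(scores):
    """Smallest score difference between neighbours in the sorted ranking."""
    ordered = sorted(scores.values(), reverse=True)
    return min(x - y for x, y in zip(ordered, ordered[1:]))


reference_order = [name for name, _ in REFERENCE]
reference_scores = dict(REFERENCE)
```

Passing a new sweep's per-model means as `scores` to
`count_inversions(reference_order, scores)` reports how many pairs
flipped relative to this list.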
Deliverables:
tasks-public/MANIFEST.yaml — machine-readable task list + metadata
tasks-public/README.md — rationale, usage, reproducibility notes
tasks-public/tier{1..5}/*.yaml — 19 task definitions
tasks-public/assets/*/ — 19 asset packs (verifiers + fixtures)
The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published (21 tasks):
- 9 ceiling tasks (all frontier models score >0.85)
- 9 noise tasks (cross-model SNR < 0.5)
- 3 ranking-breaker tasks (e.g. t2-node-search-patch,
t5-contradictory-requirements)
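A rough sketch of how the ceiling and noise buckets above could be
assigned per task. The 0.85 and 0.5 thresholds come from the list; the
SNR definition used here (spread of per-model means divided by the
average within-model run spread) is an assumption, since this message
does not define it, and ranking-breaker detection is not covered.

```python
from statistics import mean, pstdev


def classify_task(runs_by_model, ceiling=0.85, snr_floor=0.5):
    """Label one task "ceiling", "noise", or "keep" from per-model run scores.

    `runs_by_model` maps model name -> list of run scores for this task.
    """
    model_means = [mean(runs) for runs in runs_by_model.values()]
    if min(model_means) > ceiling:
        return "ceiling"  # every model clears the ceiling threshold

    signal = pstdev(model_means)  # assumed: between-model spread
    noise = mean(pstdev(runs) for runs in runs_by_model.values())  # assumed: run-to-run spread
    if noise == 0:
        snr = float("inf") if signal > 0 else 0.0
    else:
        snr = signal / noise
    return "noise" if snr < snr_floor else "keep"
```

With three runs per task the within-model spread is itself a noisy
estimate, so a borderline SNR is better treated as a prompt for manual
review than as a hard cut.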
Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>