Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.
Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.
README.md:
- drops the "Docker base pinning" row from the "What's new" table;
replaces it with "Reproducibility-first infrastructure" framing
- drops the "pinned" badge; adds a "Diagnostics" badge instead
- updates "Reproducibility caveats" to recommend "build both sides
of any comparison from the same OpenClaw release" rather than
"pin to 2026.4.15-beta.1"
- updates Quick Start to record (not assume) the OpenClaw version
the build resolved to
- drops the pinned-base row from the comparison table; replaces it
with "State-isolation per run" (the actually distinguishing infra)
- updates the version log entry for Core v1 to highlight the
dynamical-systems diagnostics + state-isolation rather than the
pinning that's no longer there
tasks-public/README.md:
- drops the 8-row "Established ranking" table per request
- adds a "Selection criteria" section in its place, explaining how
the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
publishing version-dependent scores
- reframes the build instructions to track :latest with a comment
about platform-version drift
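
For readers unfamiliar with the two selection figures, here is a minimal sketch of how such checks could be computed. The commit does not define "inversions" or "min-gap"; this assumes an inversion is a model pair whose order flips between two score maps, and min-gap is the smallest score difference between neighbours in a ranking. The function names, model names, and scores are hypothetical and are not taken from the repository.

```python
from itertools import combinations


def count_inversions(reference, observed):
    """Count model pairs ranked in opposite order by the two score maps."""
    inversions = 0
    for a, b in combinations(sorted(reference), 2):
        # Opposite signs mean the pair swapped order between the two runs.
        if (reference[a] - reference[b]) * (observed[a] - observed[b]) < 0:
            inversions += 1
    return inversions


def min_adjacent_gap(scores):
    """Smallest score difference between neighbours in the ranking."""
    ordered = sorted(scores.values(), reverse=True)
    return min(hi - lo for hi, lo in zip(ordered, ordered[1:]))


# Hypothetical scores for illustration only; not measured results.
run_a = {"model-x": 0.90, "model-y": 0.80, "model-z": 0.70}
run_b = {"model-x": 0.85, "model-y": 0.88, "model-z": 0.60}
```

Under this reading, a task set with 0 inversions keeps every model pair in the same order across the compared runs, and a larger min-gap leaves more headroom before drift can reorder neighbours.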
tasks-public/MANIFEST.yaml:
- drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
a hard requirement)
- drops the `established_ranking` block
- adds `selection_basis` in its place, documenting the methodology
and explicitly stating why scores are intentionally omitted
Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.
Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires
re-running the reference sweep.
tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.
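
The actual commands live in tasks-public/README.md and are not reproduced here. As a rough illustration of the kind of check involved, the sketch below extracts the tag from an image reference and compares it with the version named in this commit. The helper names `image_tag` and `matches_reference` are hypothetical; the only external fact relied on is that Docker defaults to the `latest` tag when none is given.

```python
# Version named in this commit as the Core v1 reference base.
REFERENCE_TAG = "2026.4.15-beta.1"


def image_tag(ref):
    """Return the tag of an image reference, or 'latest' if none is given."""
    name = ref.split("@", 1)[0]          # drop any @sha256:... digest
    last_part = name.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if ":" in last_part:
        return last_part.split(":", 1)[1]
    return "latest"                      # Docker's default tag


def matches_reference(ref, expected=REFERENCE_TAG):
    """True when the image reference is pinned to the expected tag."""
    return image_tag(ref) == expected
```

A tag comparison like this only confirms what was requested, not what was pulled; checking the resolved image digest is the stronger test.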
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix gateway: --allow-unconfigured + token auth for headless container
- Fix client: use cli client ID/mode + full operator scopes
- Add 11 preset models with Submit All button
- Open-source models use HF Inference API (no extra keys needed)