Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. The subset was selected by greedy task
elimination from the v2026-4-19-full sweep archive so that:
(a) mean run_score across these 19 tasks reproduces the established
    8-model ranking with zero inversions and min adjacent-rank gap
    of 0.0049 (well above the ~0.002 seed-noise floor);
(b) coverage is preserved across tiers 1-5 and across the tools,
    coding, repo, browser, multi_tool, and adversarial families;
(c) tasks with broken verifiers or near-zero cross-model SNR are
    dropped (21 tasks retained as private holdout, not published).
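
For reference, a minimal sketch of a greedy elimination loop of this
kind. It is not the script that produced the subset. It assumes a
hypothetical scores[model][task] table of mean run_score values and a
hypothetical objective: at each step, drop the task whose removal keeps
zero inversions against the reference order and leaves the largest
minimum adjacent gap. The coverage constraint in (b) and the filter in
(c) are omitted, and all names and toy numbers are illustrative.

```python
from statistics import mean


def ranking_stats(scores, tasks, reference):
    """Return (inversions, min_adjacent_gap) for the subset `tasks`.

    scores:    {model: {task: mean run_score}}  (hypothetical layout)
    reference: model names ordered best-first
    """
    means = [mean(scores[m][t] for t in tasks) for m in reference]
    gaps = [a - b for a, b in zip(means, means[1:])]
    # Ties are counted as inversions so the order must be strict.
    inversions = sum(
        1
        for i in range(len(means))
        for j in range(i + 1, len(means))
        if means[i] <= means[j]
    )
    return inversions, min(gaps)


def greedy_eliminate(scores, reference, target_size):
    """Drop one task per step while the reference order stays intact."""
    tasks = sorted(next(iter(scores.values())))
    while len(tasks) > target_size:
        best = None  # (min_gap, task_to_drop)
        for t in tasks:
            remaining = [x for x in tasks if x != t]
            inversions, gap = ranking_stats(scores, remaining, reference)
            if inversions == 0 and (best is None or gap > best[0]):
                best = (gap, t)
        if best is None:  # every possible removal breaks the ranking
            break
        tasks.remove(best[1])
    return tasks


# Toy data: 3 models, 4 tasks (made-up numbers, not ClawBench results).
toy_scores = {
    "A": {"t1": 0.9, "t2": 0.8, "t3": 0.5, "t4": 0.7},
    "B": {"t1": 0.8, "t2": 0.7, "t3": 0.6, "t4": 0.7},
    "C": {"t1": 0.6, "t2": 0.6, "t3": 0.7, "t4": 0.7},
}
reference = ["A", "B", "C"]
subset = greedy_eliminate(toy_scores, reference, target_size=2)
print(subset)
```

In the toy data, t3 runs against the reference order and t4 does not
separate the models, so both are eliminated first.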

Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):
1. Claude Opus 4.6    0.8137
2. Claude Opus 4.7    0.7824
3. GPT 5.4            0.7647
4. Claude Sonnet 4.6  0.7597
5. MiniMax M2.7       0.7475
6. Gemini 3.1 Pro     0.7408
7. Qwen 3.6 Plus      0.7030
8. Kimi K2.5          0.6800
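
The two ranking criteria (zero inversions, minimum adjacent-rank gap)
can be checked mechanically. The sketch below applies them to the
full-sweep scores listed above, so the gap it reports belongs to the
full sweep and not to the 19-task subset, whose per-model means are not
listed here. Variable names are illustrative.

```python
NOISE_FLOOR = 0.002  # approximate seed-noise floor quoted in this message

# Scores copied from the established ranking above, best-first.
established = [
    ("Claude Opus 4.6", 0.8137),
    ("Claude Opus 4.7", 0.7824),
    ("GPT 5.4", 0.7647),
    ("Claude Sonnet 4.6", 0.7597),
    ("MiniMax M2.7", 0.7475),
    ("Gemini 3.1 Pro", 0.7408),
    ("Qwen 3.6 Plus", 0.7030),
    ("Kimi K2.5", 0.6800),
]

scores = [s for _, s in established]
# Round to 4 places to match the precision of the listed scores.
gaps = [round(hi - lo, 4) for hi, lo in zip(scores, scores[1:])]

# A list with no non-positive adjacent gap is strictly descending,
# which means it has no inversions at all.
inversions = sum(1 for g in gaps if g <= 0)
min_gap = min(gaps)
i = gaps.index(min_gap)
closest_pair = (established[i][0], established[i + 1][0])

print(inversions, min_gap, closest_pair, min_gap > NOISE_FLOOR)
```

On these numbers the tightest pair is GPT 5.4 and Claude Sonnet 4.6,
0.0050 apart.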

Deliverables:
tasks-public/MANIFEST.yaml — machine-readable task list + metadata
tasks-public/README.md — rationale, usage, reproducibility notes
tasks-public/tier{1..5}/*.yaml — 19 task definitions
tasks-public/assets/*/ — 19 asset packs (verifiers + fixtures)

The internal dev set remains in tasks/ (gitignored) and retains all 40
tasks for future expansion. The 21 unpublished tasks break down as:
- 9 ceiling tasks (all frontier models score >0.85)
- 9 noise tasks (cross-model SNR < 0.5)
- 3 ranking-breaker tasks (e.g. t2-node-search-patch,
  t5-contradictory-requirements)
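
A rough sketch of how the ceiling and noise buckets could be assigned
per task. The 0.85 and 0.5 thresholds come from the list above. The SNR
definition is an assumption, since this message does not define it:
spread of per-model means divided by the mean within-model spread
across runs. The sketch also treats every model in the input as a
"frontier" model. Ranking-breaker detection is not covered.

```python
from statistics import mean, pstdev

CEILING = 0.85  # threshold from the list above
MIN_SNR = 0.5   # threshold from the list above


def classify(runs):
    """Classify one task from its per-model run scores.

    runs: {model: [run_score, ...]}  (hypothetical layout)
    Returns "ceiling", "noise", or "keep".
    """
    model_means = [mean(r) for r in runs.values()]
    if min(model_means) > CEILING:
        return "ceiling"
    # Assumed SNR: between-model spread over mean within-model spread.
    signal = pstdev(model_means)
    noise = mean(pstdev(r) for r in runs.values())
    snr = float("inf") if noise == 0 else signal / noise
    return "noise" if snr < MIN_SNR else "keep"


# Made-up run scores for illustration only.
print(classify({"A": [0.90, 0.95, 0.90], "B": [0.88, 0.90, 0.92]}))
print(classify({"A": [0.2, 0.8, 0.5], "B": [0.5, 0.3, 0.7]}))
print(classify({"A": [0.80, 0.80, 0.82], "B": [0.40, 0.42, 0.40]}))
```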

Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks. None of these are present in Core v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>