clawbench


t5-hallucination-resistant-evidence.yaml	tasks: add Core v1 public task set (19 tasks)	2026-04-20 20:06:36 -07:00

scoootscooob 50959fa670 tasks: add Core v1 public task set (19 tasks)

Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:

  (a) mean run_score across these 19 tasks reproduces the established
      8-model ranking with zero inversions and a minimum adjacent-rank
      gap of 0.0049 (well above the ~0.002 seed-noise floor);
  (b) coverage is preserved across tiers 1-5 and across the tools,
      coding, repo, browser, multi_tool, and adversarial families;
  (c) tasks with broken verifiers or near-zero cross-model SNR are
      dropped (21 tasks retained as private holdout, not published).
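
The selection procedure described above can be sketched roughly as follows. This is a minimal illustration, not the script used for the release: the function names, the per-step rule (drop the task whose removal leaves the largest minimum gap), and the toy scores are all assumptions, and the tier/family coverage constraint from (b) is left out for brevity.

```python
from statistics import mean

def ranking(scores, tasks):
    """Return (models ordered best-first, mean score per model) over `tasks`."""
    means = {m: mean(per_task[t] for t in tasks) for m, per_task in scores.items()}
    return sorted(means, key=means.get, reverse=True), means

def min_adjacent_gap(order, means):
    """Smallest score difference between neighbouring ranks."""
    return min(means[a] - means[b] for a, b in zip(order, order[1:]))

def greedy_eliminate(scores, target_order, keep_n, noise_floor=0.002):
    """Drop one task at a time while the target ranking is preserved.

    Hypothetical rule: at each step remove the task whose removal leaves the
    largest minimum adjacent-rank gap, skipping removals that reorder the
    models or leave a gap at or below the noise floor.
    """
    tasks = set(next(iter(scores.values())))
    while len(tasks) > keep_n:
        best_task, best_gap = None, noise_floor
        for t in sorted(tasks):
            order, means = ranking(scores, tasks - {t})
            if order != target_order:
                continue  # this removal would introduce an inversion
            gap = min_adjacent_gap(order, means)
            if gap > best_gap:
                best_task, best_gap = t, gap
        if best_task is None:
            break  # no admissible removal left; stop early
        tasks.remove(best_task)
    return sorted(tasks)

# Toy data (made up): t3 flips models A and B when it is included.
toy_scores = {
    "A": {"t1": 0.9, "t2": 0.8, "t3": 0.5, "t4": 0.7},
    "B": {"t1": 0.8, "t2": 0.7, "t3": 0.9, "t4": 0.6},
    "C": {"t1": 0.6, "t2": 0.5, "t3": 0.6, "t4": 0.5},
}
kept = greedy_eliminate(toy_scores, ["A", "B", "C"], keep_n=3)
print(kept)
```

In the toy data only the removal of t3 yields the target order, so it is the task that gets dropped.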

Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):

  1. Claude Opus 4.6         0.8137
  2. Claude Opus 4.7         0.7824
  3. GPT 5.4                 0.7647
  4. Claude Sonnet 4.6       0.7597
  5. MiniMax M2.7            0.7475
  6. Gemini 3.1 Pro          0.7408
  7. Qwen 3.6 Plus           0.7030
  8. Kimi K2.5               0.6800
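
As a sanity check on the table above, the adjacent-rank gaps can be recomputed directly from the listed scores. The sketch below uses only the numbers shown; the helper name is hypothetical. The table is labeled as the full sweep, so the tightest gap it gives need not equal the 0.0049 quoted for the 19-task subset.

```python
# Scores copied from the ranking table, best first.
established = [
    ("Claude Opus 4.6", 0.8137),
    ("Claude Opus 4.7", 0.7824),
    ("GPT 5.4", 0.7647),
    ("Claude Sonnet 4.6", 0.7597),
    ("MiniMax M2.7", 0.7475),
    ("Gemini 3.1 Pro", 0.7408),
    ("Qwen 3.6 Plus", 0.7030),
    ("Kimi K2.5", 0.6800),
]

def adjacent_gaps(ranked):
    """Return (higher model, lower model, score gap) for each neighbouring pair."""
    return [
        (hi_name, lo_name, round(hi - lo, 4))
        for (hi_name, hi), (lo_name, lo) in zip(ranked, ranked[1:])
    ]

gaps = adjacent_gaps(established)
inversions = sum(1 for _, _, gap in gaps if gap < 0)
tightest = min(gaps, key=lambda g: g[2])

print("inversions:", inversions)
print("tightest pair:", tightest)
```

On these numbers the closest pair is GPT 5.4 and Claude Sonnet 4.6, 0.0050 apart.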

Deliverables:
  tasks-public/MANIFEST.yaml   — machine-readable task list + metadata
  tasks-public/README.md       — rationale, usage, reproducibility notes
  tasks-public/tier{1..5}/*.yaml  — 19 task definitions
  tasks-public/assets/*/       — 19 asset packs (verifiers + fixtures)

The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published:
  - 9 ceiling tasks (all frontier models score >0.85)
  - 9 noise tasks (cross-model SNR < 0.5)
  - 3 ranking-breaker tasks (e.g. t2-node-search-patch,
    t5-contradictory-requirements)
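
The ceiling and noise filters above could look roughly like the sketch below. The commit does not define SNR, so the definition used here (spread of per-model means divided by the average within-model spread across runs) is an assumption, as are the function name and the choice to treat every supplied model as a frontier model. Ranking-breaker detection needs the full ranking and is not shown.

```python
from statistics import mean, pstdev

def classify_task(runs_by_model, ceiling=0.85, snr_floor=0.5):
    """Label one task as 'ceiling', 'noise', or 'keep'.

    runs_by_model maps a model name to its list of per-run scores on the task.
    """
    model_means = [mean(runs) for runs in runs_by_model.values()]

    # Ceiling: every model clears the threshold, so the task cannot separate them.
    if min(model_means) > ceiling:
        return "ceiling"

    # Assumed SNR: between-model spread over mean within-model (run-to-run) spread.
    signal = pstdev(model_means)
    noise = mean(pstdev(runs) for runs in runs_by_model.values())
    snr = float("inf") if noise == 0 else signal / noise
    if snr < snr_floor:
        return "noise"
    return "keep"

# Made-up examples, three runs per model as in the sweep.
print(classify_task({"A": [0.90, 0.95, 0.90], "B": [0.88, 0.90, 0.92]}))
print(classify_task({"A": [0.20, 0.80, 0.50], "B": [0.50, 0.30, 0.70]}))
print(classify_task({"A": [0.90, 0.90, 0.88], "B": [0.40, 0.42, 0.41]}))
```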

Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-04-20 20:06:36 -07:00
