clawbench

Author	SHA1	Message	Date
Vincent Koc	01dd96c71c	fix(security): constrain research article paths	2026-04-30 02:57:52 -07:00
Vincent Koc	f373e4a710	fix: harden packaging and submissions	2026-04-28 01:17:43 -07:00
scoootscooob	595cdc910c	Add public domain scaffold and adapter diagnostics	2026-04-23 12:40:23 -07:00
scoootscooob	8447ab1ca6	docker: revert OpenClaw base pin; remove reference scores Per request: drop the Docker-base-pinning approach and the inline reference scores. Treat published numbers as version-, provider-, and seed-dependent. Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1 back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the current OpenClaw release. The state-isolation patch + rejudge pipeline (the actually load-bearing reproducibility infra) stay in place; only the pinned-version approach is reverted. README.md: - drops the "Docker base pinning" row from the "What's new" table; replaced with "Reproducibility-first infrastructure" framing - drops the "pinned" badge; added a "Diagnostics" badge instead - updates "Reproducibility caveats" to recommend "build both sides of any comparison from the same OpenClaw release" rather than "pin to 2026.4.15-beta.1" - updates Quick Start to record (not assume) the OpenClaw version the build resolved to - drops the pinned-base row from the comparison table; replaced with "State-isolation per run" (the actually distinguishing infra) - updates the version log entry for Core v1 to highlight the dynamical-systems diagnostics + state-isolation rather than the pinning that's no longer there tasks-public/README.md: - drops the 8-row "Established ranking" table per request - replaced with a "Selection criteria" section that explains how the 19 tasks were chosen (0 inversions, min-gap 0.0049) without publishing version-dependent scores - reframes the build instructions to track :latest with a comment about platform-version drift tasks-public/MANIFEST.yaml: - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as a hard requirement) - drops the `established_ranking` block - replaced with `selection_basis` that documents the methodology and explicitly states why scores are intentionally omitted Test suite still green: 156 passed locally, 152 passed in the CI-equivalent (no private tasks/) configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:24:42 -07:00
scoootscooob	030e9968bd	docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility The ClawBench Core v1 reference numbers were measured against ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c). Using the moving ":latest" tag caused observable drift in our sweeps (platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by +0.13 to +0.29), so unpinned builds produce non-reproducible rankings. Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added an explanatory comment noting that bumping the base requires re-running the reference sweep. tasks-public/README.md: added build + verification commands so users can confirm they have the right OpenClaw version before running Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:09:49 -07:00
scoootscooob	50959fa670	tasks: add Core v1 public task set (19 tasks) Stages a curated 19-task subset of the internal 40-task dev pool as the public ClawBench release. Selected via greedy task elimination from the v2026-4-19-full sweep archive so that: (a) mean run_score across these 19 tasks reproduces the established 8-model ranking with zero inversions and min adjacent-rank gap of 0.0049 (well above the ~0.002 seed-noise floor); (b) coverage is preserved across tiers 1-5 and across the tools, coding, repo, browser, multi_tool, and adversarial families; (c) tasks with broken verifiers or near-zero cross-model SNR are dropped (21 tasks retained as private holdout, not published). Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs per task, C+T+B+J weighted score): 1. Claude Opus 4.6 0.8137 2. Claude Opus 4.7 0.7824 3. GPT 5.4 0.7647 4. Claude Sonnet 4.6 0.7597 5. MiniMax M2.7 0.7475 6. Gemini 3.1 Pro 0.7408 7. Qwen 3.6 Plus 0.7030 8. Kimi K2.5 0.6800 Deliverables: tasks-public/MANIFEST.yaml — machine-readable task list + metadata tasks-public/README.md — rationale, usage, reproducibility notes tasks-public/tier{1..5}/.yaml — 19 task definitions tasks-public/assets// — 19 asset packs (verifiers + fixtures) The internal dev set remains in tasks/ (gitignored) and retains 40 tasks for future expansion. Not published: - 9 ceiling tasks (all frontier models score >0.85) - 9 noise tasks (cross-model SNR < 0.5) - 3 ranking-breaker tasks (e.g. t2-node-search-patch, t5-contradictory-requirements) Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs for perturbation-sensitivity measurement, and creative-synthesis tasks — all currently absent from Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:06:36 -07:00
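
The two selection criteria named in commit 50959fa670 (zero ranking inversions and a minimum adjacent-rank gap) can be illustrated with a short sketch. This is not the project's selection code: the function names `ranking`, `count_inversions`, and `min_adjacent_gap` are hypothetical, and the scores are the 4-decimal values copied from that commit message. Because those values are rounded, the gap computed here comes out near 0.0050 rather than the 0.0049 quoted in the message, which was presumably computed from unrounded scores.

```python
from itertools import combinations

# Scores copied from the commit message (rounded to 4 decimals there).
SCORES = {
    "Claude Opus 4.6": 0.8137,
    "Claude Opus 4.7": 0.7824,
    "GPT 5.4": 0.7647,
    "Claude Sonnet 4.6": 0.7597,
    "MiniMax M2.7": 0.7475,
    "Gemini 3.1 Pro": 0.7408,
    "Qwen 3.6 Plus": 0.7030,
    "Kimi K2.5": 0.6800,
}


def ranking(scores):
    """Return model names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def count_inversions(reference_order, scores):
    """Count model pairs that `scores` orders opposite to `reference_order`."""
    # combinations() yields (a, b) with a ranked above b in the reference,
    # so the pair is inverted when a scores strictly lower than b.
    return sum(
        1 for a, b in combinations(reference_order, 2) if scores[a] < scores[b]
    )


def min_adjacent_gap(scores):
    """Return the smallest score difference between neighbouring ranks."""
    ordered = sorted(scores.values(), reverse=True)
    return min(hi - lo for hi, lo in zip(ordered, ordered[1:]))


if __name__ == "__main__":
    reference = ranking(SCORES)
    print("inversions:", count_inversions(reference, SCORES))
    print("min adjacent gap: %.4f" % min_adjacent_gap(SCORES))
```

A candidate task subset would be checked by recomputing per-model mean scores over that subset and passing them as `scores` against the full-sweep `reference` order; a subset passes the first criterion when the inversion count is 0.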

6 Commits