clawbench

Author	SHA1	Message	Date
scoootscooob	5dfa4c9280	fix(eval): stabilize OpenClaw container sweeps	2026-05-02 02:50:57 -07:00
scoootscooob	b5538e0927	Copy all package data in HF Docker build	2026-04-28 02:35:09 -07:00
scoootscooob	425daa4fc8	Copy partner spec in HF Docker build	2026-04-28 02:31:26 -07:00
scoootscooob	d069bcfe3a	Fix HF Docker package build	2026-04-28 02:26:39 -07:00
Vincent Koc	f373e4a710	fix: harden packaging and submissions	2026-04-28 01:17:43 -07:00
scoootscooob	4b7a9ee31c	Fix public Docker task copies	2026-04-27 22:57:10 -07:00
scoootscooob	8447ab1ca6	docker: revert OpenClaw base pin; remove reference scores Per request: drop the Docker-base-pinning approach and the inline reference scores. Treat published numbers as version-, provider-, and seed-dependent. Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1 back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the current OpenClaw release. The state-isolation patch + rejudge pipeline (the actually load-bearing reproducibility infra) stay in place; only the pinned-version approach is reverted. README.md: - drops the "Docker base pinning" row from the "What's new" table; replaced with "Reproducibility-first infrastructure" framing - drops the "pinned" badge; added a "Diagnostics" badge instead - updates "Reproducibility caveats" to recommend "build both sides of any comparison from the same OpenClaw release" rather than "pin to 2026.4.15-beta.1" - updates Quick Start to record (not assume) the OpenClaw version the build resolved to - drops the pinned-base row from the comparison table; replaced with "State-isolation per run" (the actually distinguishing infra) - updates the version log entry for Core v1 to highlight the dynamical-systems diagnostics + state-isolation rather than the pinning that's no longer there tasks-public/README.md: - drops the 8-row "Established ranking" table per request - replaced with a "Selection criteria" section that explains how the 19 tasks were chosen (0 inversions, min-gap 0.0049) without publishing version-dependent scores - reframes the build instructions to track :latest with a comment about platform-version drift tasks-public/MANIFEST.yaml: - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as a hard requirement) - drops the `established_ranking` block - replaced with `selection_basis` that documents the methodology and explicitly states why scores are intentionally omitted Test suite still green: 156 passed locally, 152 passed in the CI-equivalent (no private tasks/) configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:24:42 -07:00
scoootscooob	030e9968bd	docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility The ClawBench Core v1 reference numbers were measured against ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c). Using the moving ":latest" tag caused observable drift in our sweeps (platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by +0.13 to +0.29), so unpinned builds produce non-reproducible rankings. Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added an explanatory comment noting that bumping the base requires re-running the reference sweep. tasks-public/README.md: added build + verification commands so users can confirm they have the right OpenClaw version before running Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:09:49 -07:00
Codex	f2ba2a5238	Docker: detect Playwright Chromium path across architectures	2026-04-09 13:32:14 -07:00
Codex	f309da64d9	Docker: base HF Space on official OpenClaw image	2026-04-09 13:24:52 -07:00
Codex	843d31b1a2	Docker: use published OpenClaw runtime on HF	2026-04-09 13:12:38 -07:00
Codex	d68c1ba1ec	Docker: harden HF Space build path	2026-04-09 12:57:36 -07:00
Codex	a1fbfb3731	Docker: pin OpenClaw source and simplify runtime	2026-04-09 12:49:03 -07:00
scoootscooob	2e39d5ccb2	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
scoootscooob	fab8f96cf4	Fix: install lsof, drop --force (fuser not found)	2026-04-08 09:28:00 -07:00
scoootscooob	8d5a28d283	Fix build: pnpm + stub canvas bundle + tsdown directly	2026-04-08 09:18:13 -07:00
scoootscooob	7c804525ad	Fix build: tsdown + runtime-postbuild only, skip canvas bundle	2026-04-08 09:14:17 -07:00
scoootscooob	5d622ee77e	Build OpenClaw from source in Docker — all extension deps included	2026-04-08 09:11:01 -07:00
scoootscooob	feb9f6344f	Pin openclaw@2026.4.5	2026-04-08 09:04:30 -07:00
scoootscooob	67995dd39b	Fix: pin openclaw@2026.4.8 (v4.5 doesn't exist on npm)	2026-04-08 09:03:27 -07:00
scoootscooob	3f989b0371	Pin openclaw to v4.5	2026-04-08 08:58:50 -07:00
scoootscooob	b621e64a15	Fix: install @buape/carbon for gateway Discord extension	2026-04-08 08:57:20 -07:00
scoootscooob	4765d6e5aa	Add preset models: Qwen 3.5, DeepSeek R1, Kimi K2.5, MiniMax M2.5, GLM-4/Z1, Gemma 4, Claude - Fix gateway: --allow-unconfigured + token auth for headless container - Fix client: use cli client ID/mode + full operator scopes - Add 11 preset models with Submit All button - Open-source models use HF Inference API (no extra keys needed)	2026-04-07 13:27:43 -07:00
scoootscooob	1df8c430f3	Initial ClawBench: three-axis agent harness benchmark - Environment state verification (filesystem, memory, gateway queries) - Trajectory evaluation (precision/recall/F1 on tool call sequences) - Simulated users (static, adaptive LLM, adversarial) - pass^k reliability as primary metric - 14 tasks: 6 general, 5 OpenClaw, 3 adversarial - HF Docker Space with job queue and background eval worker - Gradio leaderboard with submission form	2026-04-07 12:48:31 -07:00

24 Commits