Commit Graph

24 Commits

Author SHA1 Message Date
scoootscooob
5dfa4c9280 fix(eval): stabilize OpenClaw container sweeps
2026-05-02 02:50:57 -07:00
scoootscooob
b5538e0927 Copy all package data in HF Docker build 2026-04-28 02:35:09 -07:00
scoootscooob
425daa4fc8 Copy partner spec in HF Docker build 2026-04-28 02:31:26 -07:00
scoootscooob
d069bcfe3a Fix HF Docker package build 2026-04-28 02:26:39 -07:00
Vincent Koc
f373e4a710 fix: harden packaging and submissions 2026-04-28 01:17:43 -07:00
scoootscooob
4b7a9ee31c Fix public Docker task copies 2026-04-27 22:57:10 -07:00
scoootscooob
8447ab1ca6 docker: revert OpenClaw base pin; remove reference scores
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.

Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.

README.md:
  - drops the "Docker base pinning" row from the "What's new" table;
    replaced with "Reproducibility-first infrastructure" framing
  - drops the "pinned" badge; added a "Diagnostics" badge instead
  - updates "Reproducibility caveats" to recommend "build both sides
    of any comparison from the same OpenClaw release" rather than
    "pin to 2026.4.15-beta.1"
  - updates Quick Start to record (not assume) the OpenClaw version
    the build resolved to
  - drops the pinned-base row from the comparison table; replaced
    with "State-isolation per run" (the actually distinguishing infra)
  - updates the version log entry for Core v1 to highlight the
    dynamical-systems diagnostics + state-isolation rather than the
    pinning that's no longer there

tasks-public/README.md:
  - drops the 8-row "Established ranking" table per request
  - replaced with a "Selection criteria" section that explains how
    the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
    publishing version-dependent scores
  - reframes the build instructions to track :latest with a comment
    about platform-version drift

tasks-public/MANIFEST.yaml:
  - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
    a hard requirement)
  - drops the `established_ranking` block
  - replaced with `selection_basis` that documents the methodology
    and explicitly states why scores are intentionally omitted

Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:24:42 -07:00
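
Note on the selection criteria cited in the commit above ("0 inversions, min-gap 0.0049"): the commit does not define either term, so the following is only a sketch under assumed definitions. It treats an inversion as a pair of models whose scores disagree with a reference ordering, and the min-gap as the smallest score difference between neighbours once the scores are sorted. The function names and the sample scores are hypothetical and are not taken from the repository.

```python
from itertools import combinations


def count_inversions(scores_in_reference_order):
    """Count pairs where a model listed earlier (expected to be better)
    scores lower than a model listed later. Assumed definition."""
    return sum(
        1
        for earlier, later in combinations(scores_in_reference_order, 2)
        if earlier < later
    )


def min_adjacent_gap(scores):
    """Smallest difference between neighbouring scores after sorting
    from best to worst. Assumed definition of "min-gap"."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to compute a gap")
    ordered = sorted(scores, reverse=True)
    return min(hi - lo for hi, lo in zip(ordered, ordered[1:]))


# Illustrative values only, listed in a hypothetical reference order.
example_scores = [0.9, 0.7, 0.8, 0.5]
inversions = count_inversions(example_scores)
gap = min_adjacent_gap(example_scores)
```

Under these assumptions, a task would qualify when `count_inversions` returns 0 and the gap is large enough that run-to-run noise is unlikely to reorder neighbouring models.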
scoootscooob
030e9968bd docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.

Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires re-
running the reference sweep.

tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:09:49 -07:00
Codex
f2ba2a5238 Docker: detect Playwright Chromium path across architectures 2026-04-09 13:32:14 -07:00
Codex
f309da64d9 Docker: base HF Space on official OpenClaw image 2026-04-09 13:24:52 -07:00
Codex
843d31b1a2 Docker: use published OpenClaw runtime on HF 2026-04-09 13:12:38 -07:00
Codex
d68c1ba1ec Docker: harden HF Space build path 2026-04-09 12:57:36 -07:00
Codex
a1fbfb3731 Docker: pin OpenClaw source and simplify runtime 2026-04-09 12:49:03 -07:00
scoootscooob
2e39d5ccb2 Bench: redesign v0.4 benchmark and HF runtime 2026-04-09 11:15:30 -07:00
scoootscooob
fab8f96cf4 Fix: install lsof, drop --force (fuser not found) 2026-04-08 09:28:00 -07:00
scoootscooob
8d5a28d283 Fix build: pnpm + stub canvas bundle + tsdown directly 2026-04-08 09:18:13 -07:00
scoootscooob
7c804525ad Fix build: tsdown + runtime-postbuild only, skip canvas bundle 2026-04-08 09:14:17 -07:00
scoootscooob
5d622ee77e Build OpenClaw from source in Docker — all extension deps included 2026-04-08 09:11:01 -07:00
scoootscooob
feb9f6344f Pin openclaw@2026.4.5 2026-04-08 09:04:30 -07:00
scoootscooob
67995dd39b Fix: pin openclaw@2026.4.8 (v4.5 doesn't exist on npm) 2026-04-08 09:03:27 -07:00
scoootscooob
3f989b0371 Pin openclaw to v4.5 2026-04-08 08:58:50 -07:00
scoootscooob
b621e64a15 Fix: install @buape/carbon for gateway Discord extension 2026-04-08 08:57:20 -07:00
scoootscooob
4765d6e5aa Add preset models: Qwen 3.5, DeepSeek R1, Kimi K2.5, MiniMax M2.5, GLM-4/Z1, Gemma 4, Claude
- Fix gateway: --allow-unconfigured + token auth for headless container
- Fix client: use cli client ID/mode + full operator scopes
- Add 11 preset models with Submit All button
- Open-source models use HF Inference API (no extra keys needed)
2026-04-07 13:27:43 -07:00
scoootscooob
1df8c430f3 Initial ClawBench: three-axis agent harness benchmark
- Environment state verification (filesystem, memory, gateway queries)
- Trajectory evaluation (precision/recall/F1 on tool call sequences)
- Simulated users (static, adaptive LLM, adversarial)
- pass^k reliability as primary metric
- 14 tasks: 6 general, 5 OpenClaw, 3 adversarial
- HF Docker Space with job queue and background eval worker
- Gradio leaderboard with submission form
2026-04-07 12:48:31 -07:00
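
Note on the metrics named in the initial commit: the commit message lists trajectory precision/recall/F1 over tool call sequences and pass^k reliability but gives no formulas. The sketch below is one plausible reading, not the repository's implementation. It assumes tool calls are compared as an unordered multiset of tool names, and it assumes pass^k means the probability that k independent trials of a task all succeed, estimated from c successes in n trials as C(c, k) / C(n, k). All function names and tool names are hypothetical.

```python
from collections import Counter
from math import comb


def trajectory_prf(expected_calls, actual_calls):
    """Multiset precision, recall and F1 over tool-call names.

    Call order is ignored here; an order-sensitive comparison is an
    equally plausible design and would give different numbers.
    """
    expected, actual = Counter(expected_calls), Counter(actual_calls)
    overlap = sum((expected & actual).values())  # multiset intersection size
    precision = overlap / len(actual_calls) if actual_calls else 0.0
    recall = overlap / len(expected_calls) if expected_calls else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def pass_hat_k(n_trials, n_successes, k):
    """Estimate the chance that k trials all succeed, given n_successes
    out of n_trials observed runs (assumed definition of pass^k)."""
    if not 0 < k <= n_trials:
        raise ValueError("k must satisfy 0 < k <= n_trials")
    if not 0 <= n_successes <= n_trials:
        raise ValueError("n_successes must lie between 0 and n_trials")
    return comb(n_successes, k) / comb(n_trials, k)


# Hypothetical tool names, for illustration only.
expected = ["read_file", "edit_file", "run_tests"]
actual = ["read_file", "run_tests", "run_tests", "web_search"]
precision, recall, f1 = trajectory_prf(expected, actual)
reliability = pass_hat_k(n_trials=4, n_successes=3, k=2)
```

Unlike pass@k, which rewards a single success among k attempts, this estimator falls quickly as k grows unless nearly every run succeeds, which is why it is a natural choice for a reliability metric.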