
6 Commits

Author SHA1 Message Date
Vincent Koc
01dd96c71c
fix(security): constrain research article paths
2026-04-30 02:57:52 -07:00
Vincent Koc
f373e4a710
fix: harden packaging and submissions
2026-04-28 01:17:43 -07:00
scoootscooob
595cdc910c
Add public domain scaffold and adapter diagnostics
2026-04-23 12:40:23 -07:00
scoootscooob
8447ab1ca6
docker: revert OpenClaw base pin; remove reference scores
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.

Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.

README.md:
  - drops the "Docker base pinning" row from the "What's new" table
    and replaces it with "Reproducibility-first infrastructure" framing
  - drops the "pinned" badge and adds a "Diagnostics" badge instead
  - updates "Reproducibility caveats" to recommend "build both sides
    of any comparison from the same OpenClaw release" rather than
    "pin to 2026.4.15-beta.1"
  - updates Quick Start to record (not assume) the OpenClaw version
    the build resolved to
  - drops the pinned-base row from the comparison table and replaces
    it with "State-isolation per run" (the actually distinguishing infra)
  - updates the version log entry for Core v1 to highlight the
    dynamical-systems diagnostics + state-isolation rather than the
    pinning that's no longer there
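
One way to record which base image a `:latest` build actually resolved to is to capture the image ID and repo digests reported by `docker image inspect`. The sketch below is illustrative only: the helper names are hypothetical, and using the digest as a stand-in for the OpenClaw version is an assumption, not what the Quick Start necessarily does.

```python
import json
import subprocess


def resolved_base(inspect_json):
    """Extract identifying fields from `docker image inspect` JSON output.

    `docker image inspect` prints a JSON array with one object per image;
    only the first entry is read here.
    """
    images = json.loads(inspect_json)
    if not images:
        raise ValueError("no image in inspect output")
    image = images[0]
    return {
        "id": image.get("Id"),
        "repo_tags": image.get("RepoTags", []),
        "repo_digests": image.get("RepoDigests", []),
    }


def inspect_image(image_ref):
    """Run `docker image inspect` for a locally available image (hypothetical helper)."""
    result = subprocess.run(
        ["docker", "image", "inspect", image_ref],
        capture_output=True, text=True, check=True,
    )
    return resolved_base(result.stdout)
```

Storing the returned dictionary next to each sweep's results would let two runs be compared only when their `repo_digests` match, which is one reading of the "build both sides from the same release" caveat.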

tasks-public/README.md:
  - drops the 8-row "Established ranking" table per request
  - replaces it with a "Selection criteria" section that explains how
    the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
    publishing version-dependent scores
  - reframes the build instructions to track :latest with a comment
    about platform-version drift

tasks-public/MANIFEST.yaml:
  - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
    a hard requirement)
  - drops the `established_ranking` block
  - replaces it with `selection_basis`, which documents the methodology
    and explicitly states why scores are intentionally omitted

Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:24:42 -07:00
scoootscooob
030e9968bd
docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.

Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires re-
running the reference sweep.

tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:09:49 -07:00
scoootscooob
50959fa670
tasks: add Core v1 public task set (19 tasks)
Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:

  (a) mean run_score across these 19 tasks reproduces the established
      8-model ranking with zero inversions and min adjacent-rank gap
      of 0.0049 (well above the ~0.002 seed-noise floor);
  (b) coverage is preserved across tiers 1-5 and across the tools,
      coding, repo, browser, multi_tool, and adversarial families;
  (c) tasks with broken verifiers or near-zero cross-model SNR are
      dropped (21 tasks retained as private holdout, not published).
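
Criterion (a) can be illustrated with a minimal sketch of greedy elimination. Everything here is an assumption made for illustration: the data layout (model -> task -> run_score), the function names, and the greedy rule itself (at each step, drop the task whose removal leaves zero inversions and the widest minimum adjacent gap). It is not the project's actual selection code.

```python
from itertools import combinations


def mean_scores(scores, tasks):
    """Mean run_score per model over the given tasks."""
    return {model: sum(per_task[t] for t in tasks) / len(tasks)
            for model, per_task in scores.items()}


def inversions(reference, means):
    """Count model pairs whose order under `means` contradicts `reference`.

    `reference` lists models from best to worst; a tie counts as an inversion.
    """
    return sum(1 for hi, lo in combinations(reference, 2)
               if means[hi] <= means[lo])


def min_adjacent_gap(reference, means):
    """Smallest score gap between neighbours in the reference ranking."""
    return min(means[a] - means[b] for a, b in zip(reference, reference[1:]))


def greedy_eliminate(scores, reference, tasks, target_size):
    """Drop one task at a time while the reference ranking is preserved."""
    kept = list(tasks)
    while len(kept) > target_size:
        best = None
        for task in kept:
            trial = [t for t in kept if t != task]
            means = mean_scores(scores, trial)
            if inversions(reference, means):
                continue  # removing this task would break the ranking
            gap = min_adjacent_gap(reference, means)
            if best is None or gap > best[0]:
                best = (gap, task)
        if best is None:
            break  # no single removal preserves the ranking
        kept.remove(best[1])
    return kept
```

A subset returned this way would then be checked against a noise floor (the message cites roughly 0.002) by comparing it with `min_adjacent_gap` on the kept tasks.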

Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):

  1. Claude Opus 4.6         0.8137
  2. Claude Opus 4.7         0.7824
  3. GPT 5.4                 0.7647
  4. Claude Sonnet 4.6       0.7597
  5. MiniMax M2.7            0.7475
  6. Gemini 3.1 Pro          0.7408
  7. Qwen 3.6 Plus           0.7030
  8. Kimi K2.5               0.6800

Deliverables:
  tasks-public/MANIFEST.yaml   — machine-readable task list + metadata
  tasks-public/README.md       — rationale, usage, reproducibility notes
  tasks-public/tier{1..5}/*.yaml  — 19 task definitions
  tasks-public/assets/*/       — 19 asset packs (verifiers + fixtures)

The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published (21 tasks):
  - 9 ceiling tasks (all frontier models score >0.85)
  - 9 noise tasks (cross-model SNR < 0.5)
  - 3 ranking-breaker tasks (e.g. t2-node-search-patch,
    t5-contradictory-requirements)

Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:06:36 -07:00