76 Commits

Author SHA1 Message Date
scoootscooob
595cdc910c Add public domain scaffold and adapter diagnostics 2026-04-23 12:40:23 -07:00
scoootscooob
df32a5f073
Merge pull request #7 from HaoLi111/feat/dynamics-analysis
Add archive dynamics pipeline and audience-based model presets
2026-04-22 13:11:32 -07:00
scoootscooob
11d943f21c fix: preserve preset submission settings and lazy-load plots
2026-04-22 12:03:16 -07:00
pllm-uci
c209612d46 Add archive dynamics pipeline and audience-based model presets 2026-04-22 12:03:13 -07:00
scoootscooob
5b50814dfc
Merge pull request #8 from gchlebus/gchlebus/fix-connect-timeout
fix(client): raise default connect_timeout to 30s and make it env-overridable
2026-04-22 09:47:06 -07:00
scoootscooob
79b2253bfc fix(ci): restore public task fallback 2026-04-22 09:46:33 -07:00
scoootscooob
e4ca2bef8e fix(client): reject invalid timeout env values
2026-04-22 09:41:44 -07:00
Grzegorz Chlebus
547ee160ad fix(client): raise default connect_timeout to 30s and make it env-overridable
The default connect_timeout=15.0 is shorter than the
observed first-session setup time against a freshly started
OpenClaw gateway (we've measured phase0_session_setup
~20-25s during containerised benchmark runs), which creates a
race where the client gives up before the gateway is ready for
the first turn.  Downstream the adapter then surfaces this as
an ``empty_response`` with zero transcript steps, which looks
like a model failure when it's really an environment timing
issue.

Concrete repro from a 19-task public_dev run:
    task:      t4-life-trip-plan
    failed:    reward=0, failure_category=empty_response,
               duration_ms=0, total_ms=16352, response hash
               = SHA256 of empty string
    rerun:     score=0.927 standalone, phase0_session_setup=21.2s

Change:

* GatewayConfig.connect_timeout default 15.0 -> 30.0
* GatewayConfig.request_timeout default kept at 60.0 but
  now explicitly documented and overridable for symmetry
* Both are now overridable via environment variables
  CLAWBENCH_CONNECT_TIMEOUT / CLAWBENCH_REQUEST_TIMEOUT
  so ops can tune further without a code change.
* Invalid env values are logged and fall back to the default
  rather than blowing up benchmark runs.
* Adds three unit tests covering default, env override, and
  invalid-env fallback behaviour.
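
The env-override-with-fallback behaviour listed above could look roughly
like the sketch below. The helper name `env_float` and the logger name are
hypothetical; only the variable names CLAWBENCH_CONNECT_TIMEOUT /
CLAWBENCH_REQUEST_TIMEOUT and the 30.0 / 60.0 defaults come from this
commit message. Treating non-positive and non-finite values as invalid is
an assumption.

```python
import logging
import math
import os

logger = logging.getLogger("clawbench.client")  # hypothetical logger name


def env_float(name: str, default: float) -> float:
    """Return a positive finite float from the environment, else the default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        logger.warning("Ignoring invalid %s=%r; using default %s", name, raw, default)
        return default
    # Assumption: zero, negative, NaN and infinite timeouts count as invalid.
    if not math.isfinite(value) or value <= 0:
        logger.warning("Ignoring out-of-range %s=%r; using default %s", name, raw, default)
        return default
    return value


# Defaults taken from the commit message above.
connect_timeout = env_float("CLAWBENCH_CONNECT_TIMEOUT", 30.0)
request_timeout = env_float("CLAWBENCH_REQUEST_TIMEOUT", 60.0)
```

Logging and falling back, rather than raising, keeps a typo in an ops
environment from aborting a long benchmark run.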

Reported-by: Grzegorz Chlebus <gchlebus@nvidia.com>
2026-04-22 10:19:20 +02:00
scoootscooob
8447ab1ca6 docker: revert OpenClaw base pin; remove reference scores
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.

Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.

README.md:
  - drops the "Docker base pinning" row from the "What's new" table;
    replaced with "Reproducibility-first infrastructure" framing
  - drops the "pinned" badge; added a "Diagnostics" badge instead
  - updates "Reproducibility caveats" to recommend "build both sides
    of any comparison from the same OpenClaw release" rather than
    "pin to 2026.4.15-beta.1"
  - updates Quick Start to record (not assume) the OpenClaw version
    the build resolved to
  - drops the pinned-base row from the comparison table; replaced
    with "State-isolation per run" (the actually distinguishing infra)
  - updates the version log entry for Core v1 to highlight the
    dynamical-systems diagnostics + state-isolation rather than the
    pinning that's no longer there

tasks-public/README.md:
  - drops the 8-row "Established ranking" table per request
  - replaced with a "Selection criteria" section that explains how
    the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
    publishing version-dependent scores
  - reframes the build instructions to track :latest with a comment
    about platform-version drift

tasks-public/MANIFEST.yaml:
  - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
    a hard requirement)
  - drops the `established_ranking` block
  - replaced with `selection_basis` that documents the methodology
    and explicitly states why scores are intentionally omitted

Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:24:42 -07:00
scoootscooob
0e250e3fe1 fix(ci): tasks-public fallback + leaderboard removed from README
README.md: removed the inline reference leaderboard per user request.
The Core v1 manifest still carries the established ranking, the
README still documents methodology + dynamical-systems diagnostics.

clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/
fallback layer (resolver step 5). Local dev with the private tasks/
present is unchanged; CI without tasks/ now falls back to the public
Core v1 set instead of returning an empty corpus. CI has been broken
since deb3d5d (the "stop tracking current task set" commit); this
restores green CI now that tasks-public/ is available.
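
A minimal sketch of the private-then-public fallback described above. It
models only the last two layers; the earlier resolver steps are not
described in this message and are omitted. The function name
`resolve_tasks_dir` (without the leading underscore) and its signature are
hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_tasks_dir(repo_root: Path) -> Optional[Path]:
    """Prefer the private tasks/ directory, else fall back to tasks-public/."""
    # Order matters: local dev with tasks/ present must stay unchanged.
    for name in ("tasks", "tasks-public"):
        candidate = repo_root / name
        if candidate.is_dir():
            return candidate
    # Neither directory exists: the caller decides how to handle an empty corpus.
    return None
```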

tests/test_tasks.py: three updates so tests pass against either the
private 40-task set OR the public 19-task set:
  - test_load_all_tasks_returns_full_corpus: threshold lowered from
    >= 20 to >= 19 (Core v1 size)
  - test_workspace_setup_preserves_nested_asset_paths: switched from
    t1-architecture-brief (private) to t4-browser-research-and-code
    (public) which exercises the same flat+nested asset behaviour
  - test_selected_tasks_include_judge_rubrics: replaced 3 task IDs
    not in the public Core release (t1-architecture-brief,
    t5-contradictory-requirements, t5-impossible-graceful-fail) with
    public-set equivalents (t1-bugfix-discount, t3-feature-export)

Verified locally with both branches:
  - private tasks/ present:    156 passed, 1 skipped
  - private tasks/ hidden:     152 passed, 5 skipped (CI-equivalent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:32:26 -07:00
scoootscooob
f95e838d99 docs: rewrite README around Core v1 + dynamical-systems diagnostics
Updates the front-door README to reflect the Core v1 release and the
methodology innovations we shipped this cycle. Key additions:

- "What's new in Core v1" table highlighting the five methodology
  layers most agent benchmarks lack (signal-curated task set,
  variance decomposition, dynamical-systems diagnostics, Constraint
  Index, Docker base pinning).

- Reference leaderboard — 8-model ranking on the Core-19 set from the
  v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility
  and the OpenRouter routing issue.

- "What makes ClawBench different" expanded with variance
  decomposition (52.7% capability / 47.3% seed noise) and a new
  section (#3) on dynamical-systems diagnostics, including the four
  concrete signals (C(q), regime, survival, SNR-weighted ranking).

- New "Reproducibility caveats" section — what reproduces (audit,
  diagnostics, top-cluster ranking) vs what drifts (absolute scores,
  OpenRouter models, OpenClaw platform upgrades). Documents the
  pinning we did.

- Updated Quick Start with `docker build -t clawbench:core-v1`
  verification flow and a full analysis-pipeline walkthrough using
  the new scripts (rejudge_all, compute_constraint_index, etc).

- Repository layout updated to include tasks-public/ (public) and
  scripts/ with brief descriptions of all 11 reproducibility +
  analysis scripts.

- Comparison table extended with new columns: variance decomposition,
  dynamical regime, SNR-weighted alternative, Docker base pinning,
  provider-routing caveats — all areas where SWE-bench / HumanEval /
  LLM-judge leaderboards are silent.

- Version log + planned Core v2 roadmap (Tier 6 long-horizon,
  paraphrased prompt pairs, creative-synthesis, human baseline).

Headline shifts from "the agent benchmark that measures what users
actually experience" to "Rigorous agent evaluation. Signal-curated
tasks. Dynamical-systems diagnostics." — foregrounds the
methodological contributions that separate Core v1 from prior art.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:15:18 -07:00
scoootscooob
030e9968bd docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.

Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires re-
running the reference sweep.

tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:09:49 -07:00
scoootscooob
50959fa670 tasks: add Core v1 public task set (19 tasks)
Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:

  (a) mean run_score across these 19 tasks reproduces the established
      8-model ranking with zero inversions and min adjacent-rank gap
      of 0.0049 (well above the ~0.002 seed-noise floor);
  (b) coverage is preserved across tiers 1-5 and across the tools,
      coding, repo, browser, multi_tool, and adversarial families;
  (c) tasks with broken verifiers or near-zero cross-model SNR are
      dropped (21 tasks retained as private holdout, not published).

Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):

  1. Claude Opus 4.6         0.8137
  2. Claude Opus 4.7         0.7824
  3. GPT 5.4                 0.7647
  4. Claude Sonnet 4.6       0.7597
  5. MiniMax M2.7            0.7475
  6. Gemini 3.1 Pro          0.7408
  7. Qwen 3.6 Plus           0.7030
  8. Kimi K2.5               0.6800

Deliverables:
  tasks-public/MANIFEST.yaml   — machine-readable task list + metadata
  tasks-public/README.md       — rationale, usage, reproducibility notes
  tasks-public/tier{1..5}/*.yaml  — 19 task definitions
  tasks-public/assets/*/       — 19 asset packs (verifiers + fixtures)

The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published:
  - 9 ceiling tasks (all frontier models score >0.85)
  - 9 noise tasks (cross-model SNR < 0.5)
  - 3 ranking-breaker tasks (e.g. t2-node-search-patch,
    t5-contradictory-requirements)

Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:06:36 -07:00
scoootscooob
b6f07d9a87 analysis: dynamical-systems diagnostics for agent runs
Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.

Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.

scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is
participation ratio of response covariance, entropy is eigenvalue
entropy, and BOPS is inter-run cosine (predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
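
As a sketch of the combination step only: given one PR, entropy, and BOPS
value per task (computed elsewhere), the formula above z-scores each
quantity across tasks and combines them with the stated signs. The
function names are hypothetical, and the use of the population standard
deviation is an assumption; the script may normalise differently.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise values across tasks; a constant column maps to all zeros."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def constraint_index(pr, entropy, bops):
    """C(q) = -z(PR) - z(entropy) + z(BOPS), one value per task."""
    z_pr, z_ent, z_bops = zscores(pr), zscores(entropy), zscores(bops)
    return [-a - b + c for a, b, c in zip(z_pr, z_ent, z_bops)]


# Two illustrative tasks: the first is tightly constrained (low PR, low
# entropy, high inter-run cosine), the second is open-ended.
c_q = constraint_index(pr=[1.0, 3.0], entropy=[0.5, 1.5], bops=[0.9, 0.1])
```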

scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.

scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
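
The per-task decomposition can be sketched as follows. The function name
and input shape are hypothetical, and two choices here are assumptions
rather than facts about the script: seed variance is averaged across
models, and population variance is used throughout.

```python
from statistics import mean, pvariance


def task_snr(scores_by_model):
    """Split one task's run_score variance into seed and capability parts.

    scores_by_model maps a model name to its run scores on a single task.
    Returns (seed_var, cap_var, snr) with snr = cap_var / seed_var.
    """
    per_model_runs = list(scores_by_model.values())
    # Seed variance: spread among repeated runs of the same model.
    seed_var = mean([pvariance(runs) for runs in per_model_runs])
    # Capability variance: spread among the per-model means.
    cap_var = pvariance([mean(runs) for runs in per_model_runs])
    if seed_var == 0:
        snr = float("inf") if cap_var > 0 else 0.0
    else:
        snr = cap_var / seed_var
    return seed_var, cap_var, snr


seed_var, cap_var, snr = task_snr({
    "model_a": [0.8, 0.9, 1.0],
    "model_b": [0.2, 0.3, 0.4],
})
```

A task with SNR below 1 has more run-to-run noise than between-model
signal, which is why such tasks rank models almost at random.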

scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where assistant emits empty response
or run ends in failure. Reveals long-horizon capability that flat
scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at
60% rate.
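
A sketch of the empirical survival and hazard computation, assuming each
run is summarised by its failure turn T_F, or None if it never failed.
Runs without a failure are treated as surviving every turn up to
`max_turn`; that is a simplification with no censoring adjustment, and the
function name is hypothetical.

```python
def survival_curve(failure_turns, max_turn):
    """Return [(t, S(t), h(t))] for t = 1..max_turn.

    S(t) is the fraction of runs still alive after turn t.
    h(t) is the fraction of runs alive before turn t that fail at turn t.
    """
    n = len(failure_turns)
    if n == 0:
        return []
    at_risk = n
    curve = []
    for t in range(1, max_turn + 1):
        failed_now = sum(1 for f in failure_turns if f == t)
        hazard = failed_now / at_risk if at_risk else 0.0
        at_risk -= failed_now
        curve.append((t, at_risk / n, hazard))
    return curve


# Four illustrative runs: one dies at turn 1, one at turn 3, two never fail.
curve = survival_curve([1, 3, None, None], max_turn=3)
```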

scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.

scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:49:05 -07:00
scoootscooob
afb14c3982 analysis: fair-comparison audit and rejudge pipeline
Tools for auditing archive coverage, rejudging judge-infra failures
via direct Anthropic API (bypasses the gateway path that sometimes
returns "Gateway is restarting" / empty judge results), and producing
fair multi-model comparison reports.

scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs
and archive JSONs side-by-side. Reports coverage %, clean mean,
coverage-normalized score, infra-zero count, judge-infra remaining
vs rejudged.

scripts/audit_per_run.py: per-run cross-model audit. Flags tasks
where all models score zero (broken task/verifier), verifier
rejects-valid-outputs (C=0 but agent produced text), harness-error
clusters, model-specific pathologies.

scripts/rejudge_all.py: re-runs judge scoring on archive runs where
the gateway judge failed. Uses direct anthropic SDK against
claude-sonnet-4-6, rewrites judge_result fields in place, recomputes
run_score per the C+T+B+J weighting.

scripts/generate_fair_report.py: produces an 8/9-model comparison
markdown report. Supports --exclude to drop specific models and
headlines the "clean" score (mean across 120 archived runs). Reports
per-tier scores, C=1.0 task pass counts, and coverage parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:43 -07:00
scoootscooob
01a31e55fb sweep: per-container state isolation + qwen model-id fix
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.

profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus
(no dash). The dashed slug is unknown to OpenRouter and silently fails;
the no-dash version is the canonical slug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:30 -07:00
scoootscooob
deb3d5d85d tasks: stop tracking current task set; fix t2 integration test for emptyNote
Context:
  The current 40-task set is being split into a private holdout set plus a
  new public set. The public repo will ship a different task set that
  doesn't give away the holdout; in the meantime, stop tracking the current
  tasks/ directory so benchmarking can continue locally without exposing
  the set externally.

Changes:
  - .gitignore: add tasks/ and lab-pr68627/ (vendored PR content, also
    moving out of the public repo).
  - git rm --cached tasks/: remove from tracking (files remain on disk
    locally).
  - tests/test_integration_checks.py:
    * Module-level pytest.mark.skipif that skips the whole file when
      tasks/ is absent — so CI against the public repo (no tasks)
      stays green once the private set moves out.
    * Update the t2-node-search-patch fixture to also define emptyNote()
      since the task was hardened with that distractor. Without this, the
      integration test asserts score==1.0 but gets 0.0 (the new
      "emptyNote stays empty" test fails against a fixture that never
      defines emptyNote).

Follow-up (separate work):
  Public task set lands in a subsequent commit. Holdout access path
  (encrypted-in-repo or private-repo) gets wired into the harness's
  private_tasks_root / hidden_tasks_dir plumbing.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-19 12:29:52 -07:00
scoootscooob
95b226dfed tasks: harden 5 ceiling-bound tasks for better model differentiation
All 5 of these tasks were clearing at 0.85-1.00 across the frontier-4
on v4.14 — narrow spread means they don't differentiate models. Each
now has a specific trap that catches naive approaches:

- t1-refactor-csv-loader: introduces divergent normalization requirements
  between load_rows (lowercase) and summarize_inventory (preserve first-
  seen case). Naive "lowercase everywhere in parse_inventory_row" fails
  2 of 3 tests. Proper refactor returns original case in the helper.

- t3-node-multifile-refactor: adds a 3rd caller (audit.js) requiring
  preserved userId case + minute-precision timestamp, diverging from
  auth.js and report.js. Single-function extraction fails 2 of 4 tests;
  agent must handle two normalization modes.

- t4-browser-research-and-code: docs rewritten with distractors —
  v1/v2/v3 versions, required/optional/cross-endpoint headers, rate
  limits, payload limits. Tests check 6 facts including negative-match
  for X-Admin-Token distractor (scoped to /v2/admin only).

- t2-node-search-patch: adds emptyNote() factory in render.js with
  legitimate empty body: "" that MUST NOT be patched. Naive grep-replace
  of `body: ""` now fails the emptyNote test. Also adds whitespace-
  trimming test for filterNotes.

- t4-memory-recall-continuation: requires storing 3 SEPARATE memory
  entries (beta-regions, retry-budget, apac-gating) instead of one.
  Release notes include operational-notes distractors that must NOT
  be codified. flags.py gains APAC_GATED_UNTIL field. Handoff verifier
  added to check all 3 facts in the handoff artifact.

All 5 tasks verified: properly-implemented starter patches pass all
tests, the new traps specifically fail naive implementations.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-19 12:24:25 -07:00
scoootscooob
cb48ca72e8 tasks: drop strict completion.files checks on 19 tasks
Every one of these tasks has an execution_check script (verify_*.py) that
already does a recursive workspace search — it greps for required content
across every agent-written .md/.txt/.csv regardless of filename. The
completion.files block was redundant and actively penalized models that
wrote to reasonable alternate paths (analysis.md vs budget_report.md).

Before: total=1 (file) + N (exec) → if file path didn't match, score was
capped at N/(N+1). On t3-fin-budget-monthly, 14 of 15 prior sweep runs
failed specifically on "FILE budget_report.md: File does not exist".

After: total=N. Verifier is the source of truth. Judge rubric already
tells graders "don't penalize non-standard paths" — this aligns completion
scoring with that stated policy.
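
To make the cap concrete, here is a toy version of the before/after
arithmetic. `completion_score` is a hypothetical stand-in for the real
scorer: it only divides passed checks by total checks.

```python
def completion_score(passed: int, total: int) -> float:
    """Fraction of completion checks that passed (0.0 when there are none)."""
    return passed / total if total else 0.0


n_exec = 4  # illustrative: four execution checks, all passing

# Before: one extra file-path check that fails caps the score at N/(N+1).
before = completion_score(passed=n_exec, total=n_exec + 1)

# After: only the N execution checks count.
after = completion_score(passed=n_exec, total=n_exec)
```

With N = 4 the cap is 4/5 = 0.8 before the change and 4/4 = 1.0 after it.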

Fixed tasks (all had recursive verifiers):
  t1-fs-quick-note, t1-life-translate, t2-ctx-pronoun-resolve,
  t2-err-instruction-ambig, t2-fs-cleanup-downloads, t2-fs-find-that-thing,
  t2-msg-summarize-thread, t2-priv-redact-doc, t2-skill-excel-rollup,
  t2-sys-memory-roundtrip, t2-web-quick-fact, t3-cal-reschedule-cascade,
  t3-data-sql-query, t3-fin-budget-monthly, t3-msg-inbox-triage,
  t3-social-bill-split, t3-web-research-and-cite, t4-ctx-long-recall,
  t4-life-trip-plan

Spot-checked that each verifier's required-content set already covered
the content_contains constraints that were also dropped.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 13:16:34 -07:00
scoootscooob
8a5be9c686 clawbench: per-sweep cache archiving + generic sweep templates
- scripts/_archive_cache.sh: snapshot run_cache/<model>/ to
  run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json.
  Sourced by sweep scripts so transcripts survive the next sweep's
  cache wipe and stay available for audits.
- scripts/container_sweep_single.sh: base multi-model sweep.
  Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so
  their caches are force-cleared at sweep start. Calls archive helper
  on exit.
- scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast
  fix validation (~20 min) instead of full 3-run sweep (~60 min).
- Dockerfile.main: parametrized clawbench-on-openclaw image with
  ARG BASE for pinning to any openclaw tag.
- scripts/git_checkpoint.py + README: documented checkpoint workflow
  for tagging known-good states during risky work.
- .gitignore: un-ignore scripts/, keep targeted ignores for
  __pycache__, .tmp, .local.py.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 12:46:45 -07:00
scoootscooob
fe8fef7795 Merge branch 'pr-4' into codex/merge-pr4 2026-04-16 19:50:11 -07:00
scoootscooob
ee8ff79347 docs: fix ollama profile guidance 2026-04-16 19:49:04 -07:00
scoootscooob
9d802d6c53 fix: classify find_replace-style tools as edits 2026-04-16 19:37:01 -07:00
pllm-uci
517f2207b0 Refine local Ollama profile documentation for clarity and usability 2026-04-15 11:45:57 -07:00
pllm-uci
e2d82b34c3 Add local Ollama model support and configuration guidance to README and profiles 2026-04-15 11:45:12 -07:00
HeYan
a2757e6bd9 fix: classify str_replace and insert tools as mutating edits
classify_tool_call matched tool names against a fixed set of verb
patterns. The pattern for the "edit" family was:

    r"write|edit|patch|apply|create|delete|rename"

This omitted "replace" and "insert", so tools like str_replace,
replace_in_file, insert_text, and insert_at_line all fell through
every check and were returned as ("unknown", False) – classified as
non-mutating with unknown family.

Consequences for any agent that edits via str_replace:
- distinct_mutation_targets stayed empty → min_distinct_mutation_targets
  requirement always failed
- read_before_write_ratio was 1.0 for the wrong reason (no mutations
  detected, so denominator collapsed to 1)
- "edit" never appeared in distinct_families → required_families check
  always reported it as missing

Fix: extend the edit pattern with "replace" and "insert".
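
The effect of the pattern change can be reproduced in isolation. The
sketch below is a simplified stand-in that models only the "edit" family;
the real classify_tool_call checks other families as well, and the
`pattern` parameter is added here purely for the comparison.

```python
import re

# Pattern quoted in the commit message above.
EDIT_PATTERN_OLD = re.compile(r"write|edit|patch|apply|create|delete|rename")
# Same pattern extended with "replace" and "insert".
EDIT_PATTERN_NEW = re.compile(
    r"write|edit|patch|apply|create|delete|rename|replace|insert"
)


def classify_tool_call(tool_name, pattern=EDIT_PATTERN_NEW):
    """Return (family, is_mutating) for a tool name (edit family only)."""
    if pattern.search(tool_name.lower()):
        return ("edit", True)
    return ("unknown", False)


old_result = classify_tool_call("str_replace", EDIT_PATTERN_OLD)
new_result = classify_tool_call("str_replace")
```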

Tests added: unit test for classify_tool_call directly and an end-to-end
trajectory test using a str_replace-based edit transcript.
2026-04-14 01:00:13 -07:00
scoootscooob
eb879adf9b Remove reports/ reference from README repo layout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:52:17 -07:00
scoootscooob
6ab3004d63 Remove reports and scripts from repo, add to gitignore
Reports and eval scripts contain internal benchmark data that
should not be public.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:51:50 -07:00
scoootscooob
0d07aa4d08 Re-judge GPT 5.4: resolve judge auth caveat, full coverage
Re-ran Sonnet 4.6 judge on all 60 GPT 5.4 runs that had auth errors
during the original sweep. Called the Anthropic API directly using
cached transcripts. Results:

- judge_task_coverage: 0.6 -> 1.0 (all 40 tasks fully judged)
- judge_error_count: 60 -> 0
- overall_judge_score: 0.438 -> 0.239 (was inflated by excluding errors)
- overall_score: 0.456 -> 0.457 (essentially unchanged; judge gated on
  C >= 0.9999)

No judge caveat remains. All 6 models now have complete, unbiased
judge coverage across all 720 runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:36:27 -07:00
scoootscooob
952decadcf Rewrite README, harden worker, add benchmark reports
Rewrote README to focus on why trace-based scoring matters for
user-perceived agent capabilities, how ablation works, and the 13
failure modes. Removed results pending finalization.

Worker changes: re-inject host env/plugins into lane configs after
gateway restart (fixes judge auth stripping), increase control-plane
probe tolerance for slow gateway startups.

Added 6-model leaderboard and sweep reports from April 11-12 runs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:11:34 -07:00
scoootscooob
44bef14f4d Add partner trace submission spec 2026-04-11 15:36:54 -07:00
scoootscooob
b4620d10ca worker: harden gateway runtime and resume behavior 2026-04-11 15:27:14 -07:00
scoootscooob
380c6b4815 bench: audit contamination and harden HF leaderboard loading 2026-04-11 07:14:32 -07:00
scoootscooob
99803642b0 bench: add trace ingestion and template promotion pipeline 2026-04-11 06:45:27 -07:00
Codex
02573d565d bench: add hidden release scaffolding and CI push coverage 2026-04-11 06:28:43 -07:00
Codex
29c1cd90e4 worker: fail-fast on hung sessions.create, retry control-plane probe
Queue submissions were failing intermittently with "Parallel lane N
failed for tasks [...]" after a ~5 minute stall. Root cause traced
to the interaction of three things:

1. GatewayClient.config.request_timeout defaulted to 300 seconds,
   meaning every WebSocket RPC would block for 5 full minutes
   before raising TimeoutError.

2. worker._assert_gateway_control_plane calls sessions.create on
   a freshly-started lane gateway as a readiness probe. A transient
   plugin-load race can leave the new gateway accepting /health HTTP
   requests but hanging on WebSocket RPCs, so the probe would block
   for the full 300s default timeout.

3. _ensure_parallel_gateway called the probe inside the same for-
   loop that was polling /health, with `except Exception: pass`
   silently swallowing probe failures. Each probe attempt consumed
   30-300 seconds of the 60-iteration budget, so effectively only
   1-2 probe retries fit before the outer loop gave up and the
   whole lane batch was marked failed.

Fixes (all in clawbench/client.py + clawbench/worker.py):

- Drop `GatewayConfig.request_timeout` default from 300.0 to 60.0.
  60 seconds is still generous for what are sub-second calls in
  steady state (agents.create, sessions.create, etc.); a healthy
  gateway responds in milliseconds. Long-running calls like
  send_and_wait already pass an explicit per-call timeout.

- Add a `timeout=None` kwarg to `GatewayClient._rpc()` so callers
  can override the default when they need tighter or looser
  per-call bounds. Error message now includes the effective
  timeout so debugging is clearer.

- `_assert_gateway_control_plane` now constructs a dedicated
  GatewayConfig with request_timeout=15s and wraps the whole probe
  (including WebSocket connect + session create + session delete)
  in `asyncio.wait_for(..., timeout=30.0)` as a belt-and-suspenders
  ceiling. A probe hang now fails in 30s instead of 300s.

- Split `_ensure_parallel_gateway` and `_ensure_gateway` into two
  explicit phases:
    Phase A: poll /health over HTTP until 200, up to 60s (fast)
    Phase B: call _assert_gateway_control_plane up to 3 times,
             sleeping 2s between attempts, before raising
  The probe-retry loop lets transient plugin warmup races
  self-recover without killing the whole lane batch.

- Added 2s HTTP client timeout to the /health polls so a single
  wedged HTTP request can't swallow 30s of the budget.
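
The two-phase readiness flow above could be sketched like this.
`check_health` and `probe` are hypothetical async callables standing in
for the /health poll and the sessions.create probe; the 60s, 30s, 3
attempts, and 2s values come from this message, while the poll interval
is an assumption.

```python
import asyncio


async def ensure_gateway(
    check_health,             # hypothetical: async () -> bool, True on HTTP 200
    probe,                    # hypothetical: async () -> None, raises on failure
    *,
    health_budget_s=60.0,
    poll_interval_s=1.0,      # assumption: not stated in the commit message
    probe_attempts=3,
    probe_sleep_s=2.0,
    probe_ceiling_s=30.0,
):
    """Return the probe attempt number that succeeded, or raise."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + health_budget_s

    # Phase A: cheap HTTP health polling with its own budget.
    while not await check_health():
        if loop.time() >= deadline:
            raise TimeoutError("gateway /health did not become ready")
        await asyncio.sleep(poll_interval_s)

    # Phase B: bounded control-plane probe with a small retry loop.
    last_error = None
    for attempt in range(1, probe_attempts + 1):
        try:
            # Hard ceiling so a hung RPC cannot consume the whole budget.
            await asyncio.wait_for(probe(), timeout=probe_ceiling_s)
            return attempt
        except Exception as exc:
            last_error = exc
            if attempt < probe_attempts:
                await asyncio.sleep(probe_sleep_s)
    raise RuntimeError(
        f"control-plane probe failed after {probe_attempts} attempts"
    ) from last_error
```

Keeping the probe out of the health-polling loop means a slow probe can
no longer eat iterations meant for the cheap HTTP check.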

Also clears the 11 historical failed jobs from the queue dataset at
ScoootScooob/clawbench-results so the leaderboard starts clean on
the next Space rebuild.

All 20 tests in test_worker, test_client, and test_parallel_harness
pass with the new code paths.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 05:30:49 -07:00
Codex
ab69af31be upload: read-then-append instead of overwriting submissions split
Root cause: datasets.Dataset.push_to_hub(repo, split="submissions")
writes a single parquet shard to
data/submissions-00000-of-00001.parquet, REPLACING whatever was
there. Every call to upload_result() was clobbering the previous
submission. After 7 sequential uploads the dataset contained only
the last row; the HF Space leaderboard therefore displayed only 1
model entry despite 7 having been pushed.

Fix: upload_result() now loads the existing submissions split
first, appends the new row, dedupes by submission_id (so retried
uploads of the same run don't double-count), and pushes the
combined row list as a fresh parquet shard.
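
The merge step can be illustrated on plain row dictionaries, leaving the
Hugging Face load and push calls aside. `merge_submissions` is a
hypothetical helper; letting a retried upload replace the earlier row
with the same submission_id is an assumption about the dedupe policy.

```python
def merge_submissions(existing_rows, new_row):
    """Append new_row to existing_rows, keeping one row per submission_id."""
    by_id = {row["submission_id"]: row for row in existing_rows}
    # Assumption: a retry with the same id replaces the earlier row.
    by_id[new_row["submission_id"]] = new_row
    # Dicts preserve insertion order, so existing rows keep their positions.
    return list(by_id.values())


rows = [{"submission_id": "a", "model": "model-1"}]
rows = merge_submissions(rows, {"submission_id": "b", "model": "model-2"})
rows = merge_submissions(rows, {"submission_id": "b", "model": "model-2"})
```

After the retried upload of "b", `rows` still holds two entries rather
than three.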

This works at ClawBench's current submission rate (1-2 concurrent
jobs). If cross-worker concurrency ever becomes material, we should
move to a genuinely append-only layout where each submission writes
its own parquet shard under
  data/submission-<submission_id>-of-NNNNN.parquet
and readers scan the whole data/ directory. File a follow-up when
submission volume warrants it.

Also backfilled the 6 frontier bake-off rows that had been lost
during the earlier overwrite sequence. The dataset at
https://huggingface.co/datasets/ScoootScooob/clawbench-results now
holds all 7 rows: Kimi K2.5, Opus 4.6, GPT-5.4, Gemini 3.1 Pro,
GLM-5.1, Qwen3.6-Plus, MiniMax M2.7.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 00:22:24 -07:00
Codex
78d844364f ci: trigger HF Space sync (secrets added)
Empty commit to re-queue .github/workflows/sync-to-hf-space.yml
now that HF_TOKEN and HF_USERNAME repository secrets are in place.

GitHub Actions secrets are snapshotted at workflow queue time, so
previously-failed runs cannot retroactively see secrets added after
their queue timestamp. This commit triggers a fresh run that will
pick up the current secrets state.
2026-04-11 00:17:55 -07:00
Codex
f55b990476 docs: add .github/workflows/README for HF sync setup
Documents the one-time secrets setup (HF_TOKEN + HF_USERNAME) that
the sync-to-hf-space.yml workflow needs before it can mirror GitHub
main to the HF Space. Also explains the --force semantics and the
"GitHub is the single source of truth" contract.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 00:02:45 -07:00
Codex
19e4750b69 ci: auto-mirror main to HF Space on every push
Adds .github/workflows/sync-to-hf-space.yml which force-pushes main
to the HF Space git remote whenever a commit lands on GitHub main.

This eliminates the dual-push friction: GitHub becomes the single
source of truth, and the HF Space deployed at
https://huggingface.co/spaces/ScoootScooob/clawbench always tracks
the latest GitHub main without manual `git push hf main` calls.

Requires two repository secrets (Settings -> Secrets -> Actions):
  HF_TOKEN     — write-scoped HF token
                 (https://huggingface.co/settings/tokens)
  HF_USERNAME  — HF account username that owns the Space

Optional repo variable:
  HF_SPACE_ID  — defaults to "ScoootScooob/clawbench", override if
                 mirroring to a different Space.

Uses --force to replace any Space-side commits that were created by
editing files in the HF UI. This is intentional — the workflow's
contract is that GitHub is authoritative.

Guarded by concurrency group so two simultaneous pushes serialize
instead of racing into a non-fast-forward rejection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 00:01:20 -07:00
Codex
07a20c3f18 HF Space: dynamic stats + fix leaderboard environment parsing
Two fixes for the HF Space UI:

1. Leaderboard crashed with "'str' object has no attribute 'get'"
   because upload_result() serializes BenchmarkResult.environment as
   str(result.environment) when pushing to the HF Dataset, but
   _flatten_result called .get() on it as if it were a dict.
   Defensive parse: accept dict, stringified dict, or JSON object.
2. Stats ribbon (Tasks/Tiers/Browser/Judge counts) was hardcoded to
   the v0.3 values (20/5/2/6). Replaced with _compute_stats() which
   calls load_all_tasks() at startup and derives the numbers from
   the live task corpus, so the ribbon stays in sync with the
   tasks/ directory without manual edits.
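
The defensive parse in fix 1 can be sketched roughly as follows. This is a minimal illustration, not the actual `_flatten_result` code: the helper name `coerce_environment` is hypothetical, and it assumes the stringified form is either JSON or a Python dict repr.

```python
import ast
import json
from typing import Any


def coerce_environment(raw: Any) -> dict:
    """Return a dict for dict, JSON-object, or str(dict) input; {} otherwise.

    Hypothetical helper: the name and the empty-dict fallback are assumptions.
    """
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        return {}
    text = raw.strip()
    # Try JSON first, then a Python literal such as "{'os': 'linux'}".
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(parsed, dict):
            return parsed
    return {}
```

Using `ast.literal_eval` rather than `eval` keeps the parse safe for data read back from a public dataset, since it only accepts Python literals.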

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:55:37 -07:00
Codex
c24d982110 HF Space: fix container eval — pytest in runtime deps, TASKS_DIR resolver, timeouts
Found and fixed three blockers preventing the HF Space Docker container
from running the eval suite end-to-end, verified by building the image
locally with Docker Desktop and running a tier-1 task against Qwen3-32B
through the HF Inference API inside the container.

1. pytest was in [project.optional-dependencies].dev, not [project].
   The Dockerfile does `pip install .` which only installs runtime
   deps, so every task whose completion verifier runs `pytest -q`
   would fail with exit 127 (command not found). Moved pytest +
   pytest-asyncio into the base dependencies so the container gets
   them by default. The [dev] extra is kept as an alias for
   existing `pip install .[dev]` invocations.

2. clawbench/tasks.py resolved TASKS_DIR via
   `Path(__file__).parent.parent / "tasks"`, which works only for
   source checkouts. When pip installs the package into
   /usr/local/lib/python3.11/dist-packages/clawbench, the sibling
   `tasks/` directory no longer exists at that path, so
   `load_all_tasks()` returned empty and `clawbench run` died with
   "No tasks to run". Added a fallback resolver that tries, in
   order: $CLAWBENCH_TASKS_DIR env var, sibling-of-source,
   Path.cwd() / "tasks", and known Docker layout candidates
   (/home/node/app/tasks, /home/user/app/tasks, /app/tasks).
   Verified inside the container that `TASKS_DIR` now resolves to
   /home/node/app/tasks and load_all_tasks() returns 40 tasks.

3. Tier-1 task timeouts were at 180s, which is enough for Qwen3-32B
   (52.9s wall time) but causes Llama-3.3-70B to time out on
   t1-bugfix-discount. Raised tier-1 timeouts to 360s so slower HF
   models can complete tasks within the deterministic timeout and
   produce a capability signal instead of an infrastructure
   timeout signal.
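
The fallback order described in item 2 might look something like the sketch below. Only the lookup order and the candidate paths come from the commit message; the function name `resolve_tasks_dir`, its signature, and the `None` return when nothing matches are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Docker layout candidates listed in the commit message.
DOCKER_CANDIDATES = ("/home/node/app/tasks", "/home/user/app/tasks", "/app/tasks")


def resolve_tasks_dir(
    package_file: str, env: Optional[Mapping[str, str]] = None
) -> Optional[Path]:
    """Return the first existing tasks directory, or None if none exists.

    Hypothetical helper; `package_file` stands in for the module's __file__.
    """
    env = os.environ if env is None else env
    candidates = []
    override = env.get("CLAWBENCH_TASKS_DIR")
    if override:
        candidates.append(Path(override))  # 1. explicit env override
    # 2. sibling of the source tree (works for source checkouts)
    candidates.append(Path(package_file).resolve().parent.parent / "tasks")
    candidates.append(Path.cwd() / "tasks")  # 3. current working directory
    candidates.extend(Path(p) for p in DOCKER_CANDIDATES)  # 4. Docker layouts
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Checking `is_dir()` on each candidate, rather than trusting the first one, is what lets a pip-installed package skip the missing sibling directory and fall through to the container paths.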

Also fixed a pre-existing stale test (tests/test_tasks.py expected
20 tasks, but the corpus has held 40 since the v0.5 expansion) that
was failing on every test run.

Verified inside the container image:
- `clawbench list-tasks` returns all 40 tasks
- sessions.create passes for all 11 preset models
  (9 huggingface/* + 2 anthropic/*)
- `clawbench run --model huggingface/Qwen/Qwen3-32B
    --task t1-bugfix-discount --runs 1`
  scored 1.000 / C=1.000 T=1.000 B=1.000 in 52.9s
  with 279,702 tokens captured.

Remaining architectural note (not a blocker): the CLI path
`clawbench run` assumes the gateway is already running. Only the
queue/worker path (`app.py` → `EvalWorker._ensure_gateway`) spawns
its own gateway. For HF Space deployment this is fine because all
user submissions go through the Gradio UI → queue → worker path;
local CLI invocations inside the container need to start the
gateway manually first.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:03:15 -07:00
Codex
e9ff163217 baselines: merge provenance docs into BASELINE_SOURCES.md
Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single markdown document that
captures every empirical source informing ClawBench's task design.

baselines/BASELINE_SOURCES.md covers:

1. The 24 public Hugging Face datasets tagged format:agent-traces,
   with owner/name, row counts, cluster classification (Pi sessions,
   custom agent traces, Claude Code, demo), and how each cluster
   maps onto ClawBench's tier/family/trajectory design decisions.
   Aggregate ~3,049 rows, ~1,168 unique sessions after mirror
   deduplication.

2. The Hermes agent reasoning trace aggregate (14,701 sessions,
   24.3 avg turns, category distribution) with the direct mapping
   from observed categories to ClawBench task families.

3. The internal personal-agent use-case corpus (72 queries, 12
   primary scenarios, 139 atomic capabilities) that contributes
   the scenario_weight_defaults in query_catalog.py. The source
   is not a public dataset and is only referred to as "the internal
   personal-agent use-case corpus" — no filename reference.

4. A full source-to-design-decision mapping table showing which
   design choice (tier ladder, family mix, tool diversity,
   recovery expectations, browser task count, scenario weights,
   difficulty tags, adversarial tier-5) is driven by which source.

Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.

No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:36:18 -07:00
Codex
3cdade49ce README: rewrite for v0.5 with architecture, numbers, and positioning
Replaces the 192-line v0.3 README with a 724-line v0.5 writeup that
documents:

- The three-layer scoring model (deterministic first, process second,
  semantic residue last) with full Mermaid diagram of the harness ->
  verifier -> scorer -> v0.5 diagnostic pipeline
- The v0.5 Configuration Diagnostic positioning: first agent benchmark
  that measures the configuration, not just the model, with a 5-stage
  flow diagram (profile -> fingerprint -> predict -> validate -> explain)
- All three mathematical pillars (k-NN composite Jaccard, fANOVA with
  Random Forest surrogate + lite fallback, Taguchi S/N) with formulas
  and the rationale for why each technique is in the stack
- The 7-model frontier baseline results with ASCII bar chart,
  per-bucket Taguchi S/N, calibration MAE, and honest caveats about
  tier-1 coding tasks being too easy at n=1 to separate frontier models
- Task suite breakdown (40 tasks, 5 tiers, 14 capability tags, 4 pools)
- Full scoring model including the gated judge weighting invariant
- CLI reference for `clawbench run` + `clawbench diagnose`
- Repository layout with line counts for all 18 package modules
- Comparison table vs SWE-bench / HumanEval / pass-rate leaderboards
- Testing section (107 tests, key files with line counts)
- Historical data + HF Dataset integration
- Contributing guide stub with task YAML shape
- Citation block
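
For reference, the Taguchi S/N pillar named above has a standard textbook larger-is-better form, SN = -10 * log10(mean(1 / y_i^2)). The sketch below implements that textbook formula only; the function name is hypothetical, and how ClawBench's stats.py treats zero or negative scores is not stated here, so rejecting them is an assumption.

```python
import math
from typing import Sequence


def taguchi_sn_larger_is_better(scores: Sequence[float]) -> float:
    """Standard Taguchi larger-is-better S/N ratio in decibels.

    Assumption: scores must be strictly positive (1 / y^2 is undefined at 0).
    """
    if not scores:
        raise ValueError("need at least one score")
    if any(y <= 0 for y in scores):
        raise ValueError("scores must be strictly positive")
    mean_inverse_square = sum(1.0 / (y * y) for y in scores) / len(scores)
    return -10.0 * math.log10(mean_inverse_square)
```

Because low scores dominate the mean of inverse squares, two runs at 0.8 rate higher than one run at 1.0 and one at 0.6, even though both pairs have the same mean. That is the sense in which the ratio rewards consistency.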

Preserves the HF Space frontmatter (title, emoji, colorFrom, colorTo,
sdk, app_port, pinned, license) so the Space rendering still works.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:21:48 -07:00
Codex
4744a6ae7e ClawBench: 7-model frontier baseline + bake-off tooling
Profiles (profiles/):
- frontier_opus_4_6.yaml      (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml       (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml  (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml       (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml      (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml   (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml      (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)

All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.

Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile
  through the harness and generates a comparison table. Wraps
  `clawbench run --profile` via an inline Click entry (the package
  has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer — per-bucket
  mean/Taguchi S/N/per-task win rate/calibration/fANOVA. Classifies
  OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/
  Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
  scale_timeouts.py, seed_historical_db.py: task-corpus tooling.

Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run
  (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6
  scored 63.9% with real token streaming (174K tok, $0.18 cost).
  The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too
  easy to separate frontier models at n=1. Documents
  infrastructure findings around gateway plugin allowlist behavior,
  token streaming gaps for non-Anthropic providers, and hot-reload
  cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning

Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
  (committed snapshot for reproducibility; runtime results still
  go to results/ which remains gitignored)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:14:11 -07:00
Codex
4aa017838a ClawBench v0.5: tests + task corpus expansion
Tests:
- tests/test_v05_framework.py (646 lines): end-to-end synthetic ecosystem
  covering profile parsing, fingerprint computation, k-NN prediction,
  surprise detection, factor analysis, diagnostic rendering
- tests/test_v05_extensions.py (552 lines): unit tests for Taguchi S/N
  robustness profile, plugin utilization audit, manifest-vs-reality gap,
  calibration tracking, surprise cause attribution, recommendations
  generator, insights publishing, end-to-end diagnostic with all sections
- tests/test_scorer.py: judge gating tests (judge cannot rescue failed
  deterministic completion; judge capped at 10% when deterministic
  verifier exists and floor met; judge dominates at 50% on semantic-
  only tasks)
- tests/test_e2e_significance.py, test_parallel_harness.py:
  additional coverage for harness behavior

Task corpus expansion:
- 20 new task YAMLs across tier1-4 covering fs, web, calendar,
  messaging, data processing, social coordination, life assistance,
  context continuation, error boundary, skill calling, privacy
  redaction scenarios
- Fresh asset packs for each new task (test fixtures + reference
  inputs/outputs)
- Lower tier-1 coding task timeouts from 360s to 180s to avoid
  final-state wait waste (the gateway emits no chat.state:final event,
  so the wait is pure overhead; 180s is plenty for any tier-1 task)
- Modify tier2-5 task YAMLs for verifier robustness and judge rubric
  updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:13:37 -07:00
Codex
cf04a17fea ClawBench v0.5: configuration-space diagnostic framework
Add the v0.5 plugin-profile diagnostic system on top of v0.4:

- profile.py: PluginProfile, PluginManifest, RegistrationTrace,
  ProfileFingerprint, fingerprint_similarity (Jaccard composite over
  capability coverage, hook footprint, tool family surface, tags, slots,
  base model)
- prediction.py: HistoricalDatabase with JSON persistence, k-NN cold-start
  prediction with confidence bands, calibration metrics (MAE/RMSE/bias),
  surprise cause attribution
- factor_analysis.py: fANOVA with Random Forest surrogate when sklearn
  is available, fANOVA-lite fallback that decomposes variance via
  SSB/SST with pairwise interaction residuals
- diagnostic.py / diagnose_cli.py: Configuration Diagnostic Report
  ties profile -> fingerprint -> prediction -> run -> surprises -> insights
- utilization.py: plugin utilization audit (dead-weight detection) +
  manifest-vs-reality gap per plugin
- recommendations.py: evidence-backed profile change generator
  (add_plugin, remove_plugin, fill_slot, add_capability) with
  confidence scaled by sample size
- insights.py: publishes plugin leaderboard, factor importance,
  interactions, capability gaps, calibration history to JSON files
- stats.py: Taguchi larger-is-better signal-to-noise ratio and
  RobustnessProfile with per-tier means (the third mathematical
  pillar of v0.5 alongside k-NN and fANOVA)
- scorer.py: fix judge weighting per spec. Judge now capped at 10%
  when the task has a deterministic completion verifier and only
  contributes when the deterministic floor (completion >= 0.9999)
  is met. When no deterministic verifier exists, judge dominates
  at 50% (semantic-only regime). This enforces CLAWBENCH_V0_4_SPEC.md
  "Disallowed Primary Verifiers" and "Judge Gating" sections.
- cli.py: wire --profile flag into clawbench run; add clawbench diagnose
  subcommand
- harness.py: pass has_deterministic_verifier to combine_run_score
- CLAWBENCH_V0_4_SPEC.md: add v0.5 Direction section
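
The judge gating rule from the scorer.py entry can be illustrated with a small sketch. The thresholds (0.9999 floor, 10% cap, 50% semantic-only weight) are taken from the description above; the function name `judge_weight` and the constant names are hypothetical, and the real `combine_run_score` presumably also weights other components that are not shown here.

```python
DETERMINISTIC_FLOOR = 0.9999       # completion needed before the judge counts
JUDGE_WEIGHT_GATED = 0.10          # cap when a deterministic verifier exists
JUDGE_WEIGHT_SEMANTIC_ONLY = 0.50  # weight when no deterministic verifier exists


def judge_weight(has_deterministic_verifier: bool, completion: float) -> float:
    """Return the maximum weight the judge score may carry for one run.

    Hypothetical helper illustrating the gating rule only.
    """
    if not has_deterministic_verifier:
        return JUDGE_WEIGHT_SEMANTIC_ONLY
    if completion >= DETERMINISTIC_FLOOR:
        return JUDGE_WEIGHT_GATED
    # Floor not met: the judge cannot rescue a failed deterministic completion.
    return 0.0
```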

.gitignore: exclude .clawbench/ runtime state and .DS_Store

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:13:02 -07:00
Codex
b6e82d6afe Space: harden theme construction against unsupported kwargs
The previous redesign passed radius_size/block_radius/shadow_drop/shadow_spread
to themes.Base().set(), but those kwargs are either constructor-only or
version-specific, so the HF Space hit a runtime error at startup. Drop those
kwargs and wrap the whole theme build in a try/except that falls back to plain
Base() so any future unknown kwarg degrades gracefully instead of crashing the Space.
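
The fallback pattern described here can be sketched without importing Gradio. In this illustration `base_factory` is a hypothetical stand-in for `themes.Base`, and `build_theme` is an assumed name rather than the function in the Space code.

```python
from typing import Any, Callable


def build_theme(base_factory: Callable[[], Any], overrides: dict) -> Any:
    """Apply theme overrides, falling back to an unstyled theme on any error.

    Hypothetical helper: base_factory is expected to return an object with a
    .set(**kwargs) method that returns the theme itself.
    """
    try:
        return base_factory().set(**overrides)
    except Exception:
        # An unsupported kwarg (or any other failure in the theme build)
        # degrades to the plain base theme instead of crashing at startup.
        return base_factory()
```

Catching broadly is a deliberate trade-off in this sketch: a cosmetic regression is preferred over a Space that fails to boot.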

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 22:20:57 -07:00
scoootscooob
42cd0768ef Space: sync A10 defaults 2026-04-09 22:14:06 -07:00
Codex
aed7c37207 Space: redesign UI to match OpenClaw WebUI + ClawHub design system
Apply the shared OpenClaw aesthetic — dark backgrounds (#0e1015 layered),
signature red accent (#ff5c5c), Inter + JetBrains Mono typography,
whisper-thin color-mix borders, pill tab switcher, and rise animations.
Replaces the default Gradio Base theme with custom CSS and theme tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 22:10:27 -07:00