clawbench

Author	SHA1	Message	Date
scoootscooob	595cdc910c	Add public domain scaffold and adapter diagnostics	2026-04-23 12:40:23 -07:00
scoootscooob	df32a5f073	Merge pull request #7 from HaoLi111/feat/dynamics-analysis Add archive dynamics pipeline and audience-based model presets	2026-04-22 13:11:32 -07:00
scoootscooob	11d943f21c	fix: preserve preset submission settings and lazy-load plots	2026-04-22 12:03:16 -07:00
pllm-uci	c209612d46	Add archive dynamics pipeline and audience-based model presets	2026-04-22 12:03:13 -07:00
scoootscooob	5b50814dfc	Merge pull request #8 from gchlebus/gchlebus/fix-connect-timeout fix(client): raise default connect_timeout to 30s and make it env-overridable	2026-04-22 09:47:06 -07:00
scoootscooob	79b2253bfc	fix(ci): restore public task fallback	2026-04-22 09:46:33 -07:00
scoootscooob	e4ca2bef8e	fix(client): reject invalid timeout env values	2026-04-22 09:41:44 -07:00
Grzegorz Chlebus	547ee160ad	fix(client): raise default connect_timeout to 30s and make it env-overridable The default connect_timeout=15.0 is shorter than the observed first-session setup time against a freshly started OpenClaw gateway (we've measured phase0_session_setup ~20-25s during containerised benchmark runs), which creates a race where the client gives up before the gateway is ready for the first turn. Downstream the adapter then surfaces this as an ``empty_response`` with zero transcript steps, which looks like a model failure when it's really an environment timing issue. Concrete repro from a 19-task public_dev run: task: t4-life-trip-plan failed: reward=0, failure_category=empty_response, duration_ms=0, total_ms=16352, response hash = SHA256 of empty string rerun: score=0.927 standalone, phase0_session_setup=21.2s Change: * GatewayConfig.connect_timeout default 15.0 -> 30.0 * GatewayConfig.request_timeout default kept at 60.0 but now explicitly documented and overridable for symmetry * Both are now overridable via environment variables CLAWBENCH_CONNECT_TIMEOUT / CLAWBENCH_REQUEST_TIMEOUT so ops can tune further without a code change. * Invalid env values are logged and fall back to the default rather than blowing up benchmark runs. * Adds three unit tests covering default, env override, and invalid-env fallback behaviour. Reported-by: Grzegorz Chlebus <gchlebus@nvidia.com>	2026-04-22 10:19:20 +02:00
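The environment override with fallback described in commit 547ee160ad can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `GatewayConfigSketch` and `_timeout_from_env` are hypothetical names, and treating non-positive or non-finite values as invalid is an assumption. Only the variable names `CLAWBENCH_CONNECT_TIMEOUT` / `CLAWBENCH_REQUEST_TIMEOUT` and the defaults 30.0 / 60.0 come from the commit message.

```python
import logging
import math
import os
from dataclasses import dataclass, field

logger = logging.getLogger("clawbench.sketch")


def _timeout_from_env(name: str, default: float) -> float:
    """Return a timeout read from the environment, or the default if unset or invalid."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        logger.warning("Ignoring invalid %s=%r; using default %.1f", name, raw, default)
        return default
    # Assumption: zero, negative, NaN and infinite timeouts count as invalid.
    if not math.isfinite(value) or value <= 0:
        logger.warning("Ignoring out-of-range %s=%r; using default %.1f", name, raw, default)
        return default
    return value


@dataclass
class GatewayConfigSketch:
    """Hypothetical stand-in for GatewayConfig; only the two timeouts are modelled."""

    connect_timeout: float = field(
        default_factory=lambda: _timeout_from_env("CLAWBENCH_CONNECT_TIMEOUT", 30.0)
    )
    request_timeout: float = field(
        default_factory=lambda: _timeout_from_env("CLAWBENCH_REQUEST_TIMEOUT", 60.0)
    )
```

Reading the environment in a `default_factory` means each new config object sees the current environment, which keeps the three cases (default, override, invalid fallback) easy to unit-test.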
scoootscooob	8447ab1ca6	docker: revert OpenClaw base pin; remove reference scores Per request: drop the Docker-base-pinning approach and the inline reference scores. Treat published numbers as version-, provider-, and seed-dependent. Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1 back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the current OpenClaw release. The state-isolation patch + rejudge pipeline (the actually load-bearing reproducibility infra) stay in place; only the pinned-version approach is reverted. README.md: - drops the "Docker base pinning" row from the "What's new" table; replaced with "Reproducibility-first infrastructure" framing - drops the "pinned" badge; added a "Diagnostics" badge instead - updates "Reproducibility caveats" to recommend "build both sides of any comparison from the same OpenClaw release" rather than "pin to 2026.4.15-beta.1" - updates Quick Start to record (not assume) the OpenClaw version the build resolved to - drops the pinned-base row from the comparison table; replaced with "State-isolation per run" (the actually distinguishing infra) - updates the version log entry for Core v1 to highlight the dynamical-systems diagnostics + state-isolation rather than the pinning that's no longer there tasks-public/README.md: - drops the 8-row "Established ranking" table per request - replaced with a "Selection criteria" section that explains how the 19 tasks were chosen (0 inversions, min-gap 0.0049) without publishing version-dependent scores - reframes the build instructions to track :latest with a comment about platform-version drift tasks-public/MANIFEST.yaml: - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as a hard requirement) - drops the `established_ranking` block - replaced with `selection_basis` that documents the methodology and explicitly states why scores are intentionally omitted Test suite still green: 156 passed locally, 152 passed in the CI-equivalent (no private 
tasks/) configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:24:42 -07:00
scoootscooob	0e250e3fe1	fix(ci): tasks-public fallback + leaderboard removed from README README.md: removed the inline reference leaderboard per user request. The Core v1 manifest still carries the established ranking, the README still documents methodology + dynamical-systems diagnostics. clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/ fallback layer (resolver step 5). Local dev with the private tasks/ present is unchanged; CI without tasks/ now falls back to the public Core v1 set instead of returning an empty corpus. Has been broken since `deb3d5d` (the "stop tracking current task set" commit) — this restores green CI now that tasks-public/ is available. tests/test_tasks.py: three updates so tests pass against either the private 40-task set OR the public 19-task set: - test_load_all_tasks_returns_full_corpus: threshold lowered from >= 20 to >= 19 (Core v1 size) - test_workspace_setup_preserves_nested_asset_paths: switched from t1-architecture-brief (private) to t4-browser-research-and-code (public) which exercises the same flat+nested asset behaviour - test_selected_tasks_include_judge_rubrics: replaced 3 task IDs not in the public Core release (t1-architecture-brief, t5-contradictory-requirements, t5-impossible-graceful-fail) with public-set equivalents (t1-bugfix-discount, t3-feature-export) Verified locally with both branches: - private tasks/ present: 156 passed, 1 skipped - private tasks/ hidden: 152 passed, 5 skipped (CI-equivalent) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:32:26 -07:00
scoootscooob	f95e838d99	docs: rewrite README around Core v1 + dynamical-systems diagnostics Updates the front-door README to reflect the Core v1 release and the methodology innovations we shipped this cycle. Key additions: - "What's new in Core v1" table highlighting the five methodology layers most agent benchmarks lack (signal-curated task set, variance decomposition, dynamical-systems diagnostics, Constraint Index, Docker base pinning). - Reference leaderboard — 8-model ranking on the Core-19 set from the v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility and the OpenRouter routing issue. - "What makes ClawBench different" expanded with variance decomposition (52.7% capability / 47.3% seed noise) and a new section (#3) on dynamical-systems diagnostics, including the four concrete signals (C(q), regime, survival, SNR-weighted ranking). - New "Reproducibility caveats" section — what reproduces (audit, diagnostics, top-cluster ranking) vs what drifts (absolute scores, OpenRouter models, OpenClaw platform upgrades). Documents the pinning we did. - Updated Quick Start with `docker build -t clawbench:core-v1` verification flow and a full analysis-pipeline walkthrough using the new scripts (rejudge_all, compute_constraint_index, etc). - Repository layout updated to include tasks-public/ (public) and scripts/ with brief descriptions of all 11 reproducibility + analysis scripts. - Comparison table extended with new columns: variance decomposition, dynamical regime, SNR-weighted alternative, Docker base pinning, provider-routing caveats — all areas where SWE-bench / HumanEval / LLM-judge leaderboards are silent. - Version log + planned Core v2 roadmap (Tier 6 long-horizon, paraphrased prompt pairs, creative-synthesis, human baseline). Headline shifts from "the agent benchmark that measures what users actually experience" to "Rigorous agent evaluation. Signal-curated tasks. Dynamical-systems diagnostics." 
— foregrounds the methodological contributions that separate Core v1 from prior art. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:15:18 -07:00
scoootscooob	030e9968bd	docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility The ClawBench Core v1 reference numbers were measured against ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c). Using the moving ":latest" tag caused observable drift in our sweeps (platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by +0.13 to +0.29), so unpinned builds produce non-reproducible rankings. Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added an explanatory comment noting that bumping the base requires re- running the reference sweep. tasks-public/README.md: added build + verification commands so users can confirm they have the right OpenClaw version before running Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:09:49 -07:00
scoootscooob	50959fa670	tasks: add Core v1 public task set (19 tasks) Stages a curated 19-task subset of the internal 40-task dev pool as the public ClawBench release. Selected via greedy task elimination from the v2026-4-19-full sweep archive so that: (a) mean run_score across these 19 tasks reproduces the established 8-model ranking with zero inversions and min adjacent-rank gap of 0.0049 (well above the ~0.002 seed-noise floor); (b) coverage is preserved across tiers 1-5 and across the tools, coding, repo, browser, multi_tool, and adversarial families; (c) tasks with broken verifiers or near-zero cross-model SNR are dropped (21 tasks retained as private holdout, not published). Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs per task, C+T+B+J weighted score): 1. Claude Opus 4.6 0.8137 2. Claude Opus 4.7 0.7824 3. GPT 5.4 0.7647 4. Claude Sonnet 4.6 0.7597 5. MiniMax M2.7 0.7475 6. Gemini 3.1 Pro 0.7408 7. Qwen 3.6 Plus 0.7030 8. Kimi K2.5 0.6800 Deliverables: tasks-public/MANIFEST.yaml — machine-readable task list + metadata tasks-public/README.md — rationale, usage, reproducibility notes tasks-public/tier{1..5}/.yaml — 19 task definitions tasks-public/assets// — 19 asset packs (verifiers + fixtures) The internal dev set remains in tasks/ (gitignored) and retains 40 tasks for future expansion. Not published: - 9 ceiling tasks (all frontier models score >0.85) - 9 noise tasks (cross-model SNR < 0.5) - 3 ranking-breaker tasks (e.g. t2-node-search-patch, t5-contradictory-requirements) Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs for perturbation-sensitivity measurement, and creative-synthesis tasks — all currently absent from Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:06:36 -07:00
scoootscooob	b6f07d9a87	analysis: dynamical-systems diagnostics for agent runs Treats agent runs as stochastic trajectories in semantic state space and extracts signal that flat run_score averages away. Inspired by the "When LLMs Are Dreaming, Where Do They Go?" framework: task constraint characterization, per-run regime classification, seed-vs- capability variance decomposition, per-turn survival, SNR-weighted ranking. Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external model dependencies) as the semantic state proxy since sentence embeddings would require torch. Crude but sufficient for the signals the paper calls out. scripts/compute_constraint_index.py: computes C(q) per task from archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is participation ratio of response covariance, entropy is eigenvalue entropy, and BOPS is inter-run cosine (predictability proxy). High C(q) = tasks where models converge to similar answers; low C(q) = open-ended tasks where models diverge for style reasons. scripts/classify_regimes.py: per-run regime classifier. Computes drift_mean, from_start, recurrence, vol_log over turn trajectories. Quartile-based thresholds label each run as too_short / trapped / limit_cycle / diffusive / mixed. Reveals per-model tendencies: Gemini traps frequently (one-shot answer without iteration), GPT loops tool patterns, GLM is most balanced. scripts/variance_decomp.py: decomposes run_score variance per task into seed variance (3 runs of same model) vs capability variance (across model means). SNR = cap_var / seed_var. Exposes that 47% of benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and give essentially random rankings. scripts/survival_analysis.py: per-turn empirical survival S(t) and hazard h(t). T_F = first turn where assistant emits empty response or run ends in failure. Reveals long-horizon capability that flat scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at 60% rate. 
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with winsorization at p95 to prevent single-task dominance). Headline metric that weights discriminating + signal-rich tasks more than noisy or consensus tasks. Also emits SNR-only and flat variants for comparison. scripts/generate_dynamical_report.py: assembles all four diagnostic JSONs into a single markdown report with per-model regime tables, SNR tiers, survival curves, and integrated interpretation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:49:05 -07:00
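The seed-versus-capability decomposition described for scripts/variance_decomp.py in commit b6f07d9a87 might look something like the sketch below for a single task. The function name `task_snr` and the input layout are hypothetical. The use of population variance, and of the mean within-model variance as the seed term, are assumptions; the commit message only states that SNR = cap_var / seed_var.

```python
from statistics import mean, pvariance
from typing import Dict, List, Tuple


def task_snr(scores_by_model: Dict[str, List[float]]) -> Tuple[float, float, float]:
    """Return (seed_var, cap_var, snr) for one task.

    scores_by_model maps a model name to its run_score values
    (for example, 3 runs of the same model on this task).
    """
    # Seed variance: spread between repeated runs of the same model,
    # averaged over models (assumption: simple unweighted average).
    seed_var = mean([pvariance(runs) for runs in scores_by_model.values()])
    # Capability variance: spread between the per-model mean scores.
    model_means = [mean(runs) for runs in scores_by_model.values()]
    cap_var = pvariance(model_means)
    # Guard against division by zero when every model is perfectly repeatable.
    snr = cap_var / seed_var if seed_var > 0 else float("inf")
    return seed_var, cap_var, snr
```

Under this reading, a task with SNR below 1 has more run-to-run noise than between-model signal, which matches the commit's remark that such tasks give essentially random rankings.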
scoootscooob	afb14c3982	analysis: fair-comparison audit and rejudge pipeline Tools for auditing archive coverage, rejudging judge-infra failures via direct Anthropic API (bypasses the gateway path that sometimes returns "Gateway is restarting" / empty judge results), and producing fair multi-model comparison reports. scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs and archive JSONs side-by-side. Reports coverage %, clean mean, coverage-normalized score, infra-zero count, judge-infra remaining vs rejudged. scripts/audit_per_run.py: per-run cross-model audit. Flags tasks where all models score zero (broken task/verifier), verifier rejects-valid-outputs (C=0 but agent produced text), harness-error clusters, model-specific pathologies. scripts/rejudge_all.py: re-runs judge scoring on archive runs where the gateway judge failed. Uses direct anthropic SDK against claude-sonnet-4-6, rewrites judge_result fields in place, recomputes run_score per the C+T+B+J weighting. scripts/generate_fair_report.py: produces an 8/9-model comparison markdown report. Supports --exclude to drop specific models, headlines "clean" (mean across 120 archived runs). Reports per-tier scores, C=1.0 task pass counts, and coverage parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:48:43 -07:00
scoootscooob	01a31e55fb	sweep: per-container state isolation + qwen model-id fix scripts/container_sweep_single.sh: clone pristine OpenClaw state to /tmp/ per sweep before starting the gateway. Carries over config (openclaw.json, identity/, devices/, exec-approvals.json, tasks/, subagents/, flows/, cron/) but leaves runtime dirs (agents/, workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR to the isolated dir so the gateway writes to /tmp instead of the shared host mount. Fixes the cascading "RPC agents.create timed out after 60s" failures caused by 4k+ stale agents accumulating across sequential sweeps. profiles/frontier_qwen_3_6.yaml: fix base_model from openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus (no dash). The dashed slug is unknown to OpenRouter and silently fails; the no-dash version is the real canonical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:48:30 -07:00
scoootscooob	deb3d5d85d	tasks: stop tracking current task set; fix t2 integration test for emptyNote Context: The current 40-task set is being split into a private holdout set plus a new public set. The public repo will ship a different task set that doesn't give away the holdout; in the meantime, stop tracking the current tasks/ directory so benchmarking can continue locally without exposing the set externally. Changes: - .gitignore: add tasks/ and lab-pr68627/ (vendored PR content, also moving out of the public repo). - git rm --cached tasks/: remove from tracking (files remain on disk locally). - tests/test_integration_checks.py: * Module-level pytest.mark.skipif that skips the whole file when tasks/ is absent — so CI against the public repo (no tasks) stays green once the private set moves out. * Update the t2-node-search-patch fixture to also define emptyNote() since the task was hardened with that distractor. Without this, the integration test asserts score==1.0 but gets 0.0 (the new "emptyNote stays empty" test fails against a fixture that never defines emptyNote). Follow-up (separate work): Public task set lands in a subsequent commit. Holdout access path (encrypted-in-repo or private-repo) gets wired into the harness's private_tasks_root / hidden_tasks_dir plumbing. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-19 12:29:52 -07:00
scoootscooob	95b226dfed	tasks: harden 5 ceiling-bound tasks for better model differentiation All 5 of these tasks were clearing at 0.85-1.00 across the frontier-4 on v4.14 — narrow spread means they don't differentiate models. Each now has a specific trap that catches naive approaches: - t1-refactor-csv-loader: introduces divergent normalization requirements between load_rows (lowercase) and summarize_inventory (preserve first- seen case). Naive "lowercase everywhere in parse_inventory_row" fails 2 of 3 tests. Proper refactor returns original case in the helper. - t3-node-multifile-refactor: adds a 3rd caller (audit.js) requiring preserved userId case + minute-precision timestamp, diverging from auth.js and report.js. Single-function extraction fails 2 of 4 tests; agent must handle two normalization modes. - t4-browser-research-and-code: docs rewritten with distractors — v1/v2/v3 versions, required/optional/cross-endpoint headers, rate limits, payload limits. Tests check 6 facts including negative-match for X-Admin-Token distractor (scoped to /v2/admin only). - t2-node-search-patch: adds emptyNote() factory in render.js with legitimate empty body: "" that MUST NOT be patched. Naive grep-replace of `body: ""` now fails the emptyNote test. Also adds whitespace- trimming test for filterNotes. - t4-memory-recall-continuation: requires storing 3 SEPARATE memory entries (beta-regions, retry-budget, apac-gating) instead of one. Release notes include operational-notes distractors that must NOT be codified. flags.py gains APAC_GATED_UNTIL field. Handoff verifier added to check all 3 facts in the handoff artifact. All 5 tasks verified: properly-implemented starter patches pass all tests, the new traps specifically fail naive implementations. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-19 12:24:25 -07:00
scoootscooob	cb48ca72e8	tasks: drop strict completion.files checks on 19 tasks Every one of these tasks has an execution_check script (verify_*.py) that already does a recursive workspace search — it greps for required content across every agent-written .md/.txt/.csv regardless of filename. The completion.files block was redundant and actively penalized models that wrote to reasonable alternate paths (analysis.md vs budget_report.md). Before: total=1 (file) + N (exec) → if file path didn't match, score was capped at N/(N+1). On t3-fin-budget-monthly, 14 of 15 prior sweep runs failed specifically on "FILE budget_report.md: File does not exist". After: total=N. Verifier is the source of truth. Judge rubric already tells graders "don't penalize non-standard paths" — this aligns completion scoring with that stated policy. Fixed tasks (all had recursive verifiers): t1-fs-quick-note, t1-life-translate, t2-ctx-pronoun-resolve, t2-err-instruction-ambig, t2-fs-cleanup-downloads, t2-fs-find-that-thing, t2-msg-summarize-thread, t2-priv-redact-doc, t2-skill-excel-rollup, t2-sys-memory-roundtrip, t2-web-quick-fact, t3-cal-reschedule-cascade, t3-data-sql-query, t3-fin-budget-monthly, t3-msg-inbox-triage, t3-social-bill-split, t3-web-research-and-cite, t4-ctx-long-recall, t4-life-trip-plan Spot-checked that each verifier's required-content set already covered the content_contains constraints that were also dropped. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 13:16:34 -07:00
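The scoring arithmetic in commit cb48ca72e8 (the cap at N/(N+1) when the file check fails) can be illustrated with a small sketch. `completion_score` is a hypothetical helper, not the harness's scoring function, and it models only the count of passed checks over total checks.

```python
from typing import Optional


def completion_score(
    exec_passed: int, exec_total: int, file_found: Optional[bool] = None
) -> float:
    """Fraction of completion checks passed.

    file_found=None models the 'after' behaviour (no completion.files check);
    True/False models the 'before' behaviour with one extra file check.
    """
    passed, total = exec_passed, exec_total
    if file_found is not None:
        total += 1
        passed += int(file_found)
    return passed / total
```

With N = 4 execution checks all passing, a missed file path gives `completion_score(4, 4, False)` = 4/5 = 0.8, while dropping the file check gives `completion_score(4, 4)` = 1.0.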
scoootscooob	8a5be9c686	clawbench: per-sweep cache archiving + generic sweep templates - scripts/_archive_cache.sh: snapshot run_cache/<model>/ to run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json. Sourced by sweep scripts so transcripts survive the next sweep's cache wipe and stay available for audits. - scripts/container_sweep_single.sh: base multi-model sweep. Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so their caches are force-cleared at sweep start. Calls archive helper on exit. - scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast fix validation (~20 min) instead of full 3-run sweep (~60 min). - Dockerfile.main: parametrized clawbench-on-openclaw image with ARG BASE for pinning to any openclaw tag. - scripts/git_checkpoint.py + README: documented checkpoint workflow for tagging known-good states during risky work. - .gitignore: un-ignore scripts/, keep targeted ignores for __pycache__, .tmp, .local.py. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 12:46:45 -07:00
scoootscooob	fe8fef7795	Merge branch 'pr-4' into codex/merge-pr4	2026-04-16 19:50:11 -07:00
scoootscooob	ee8ff79347	docs: fix ollama profile guidance	2026-04-16 19:49:04 -07:00
scoootscooob	9d802d6c53	fix: classify find_replace-style tools as edits	2026-04-16 19:37:01 -07:00
pllm-uci	517f2207b0	Refine local Ollama profile documentation for clarity and usability	2026-04-15 11:45:57 -07:00
pllm-uci	e2d82b34c3	Add local Ollama model support and configuration guidance to README and profiles	2026-04-15 11:45:12 -07:00
HeYan	a2757e6bd9	fix: classify str_replace and insert tools as mutating edits classify_tool_call matched tool names against a fixed set of verb patterns. The pattern for the "edit" family was: r"write|edit|patch|apply|create|delete|rename" This omitted "replace" and "insert", so tools like str_replace, replace_in_file, insert_text, and insert_at_line all fell through every check and were returned as ("unknown", False) – classified as non-mutating with unknown family. Consequences for any agent that edits via str_replace: - distinct_mutation_targets stayed empty → min_distinct_mutation_targets requirement always failed - read_before_write_ratio was 1.0 for the wrong reason (no mutations detected, so denominator collapsed to 1) - "edit" never appeared in distinct_families → required_families check always reported it as missing Fix: extend the edit pattern with "replace" and "insert". Tests added: unit test for classify_tool_call directly and an end-to-end trajectory test using a str_replace-based edit transcript.	2026-04-14 01:00:13 -07:00
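The pattern change in commit a2757e6bd9 can be sketched as below. `classify_tool_call_sketch` is a hypothetical, reduced version that models only the "edit" family and the ("unknown", False) fall-through; the real classify_tool_call checks other families that are not shown in the commit message, and the use of `re.search` on a lowercased name is an assumption.

```python
import re
from typing import Tuple

# Old pattern from the commit message, which missed "replace" and "insert".
OLD_EDIT_PATTERN = re.compile(r"write|edit|patch|apply|create|delete|rename")
# Extended pattern described in the fix.
NEW_EDIT_PATTERN = re.compile(
    r"write|edit|patch|apply|create|delete|rename|replace|insert"
)


def classify_tool_call_sketch(
    tool_name: str, pattern: "re.Pattern[str]" = NEW_EDIT_PATTERN
) -> Tuple[str, bool]:
    """Return (family, is_mutating) for a tool name; only the edit family is modelled."""
    if pattern.search(tool_name.lower()):
        return ("edit", True)
    return ("unknown", False)
```

Passing `OLD_EDIT_PATTERN` reproduces the reported behaviour: `str_replace` and `insert_at_line` come back as ("unknown", False), so no mutation targets are ever recorded for agents that edit through those tools.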
scoootscooob	eb879adf9b	Remove reports/ reference from README repo layout Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:52:17 -07:00
scoootscooob	6ab3004d63	Remove reports and scripts from repo, add to gitignore Reports and eval scripts contain internal benchmark data that should not be public. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:51:50 -07:00
scoootscooob	0d07aa4d08	Re-judge GPT 5.4: resolve judge auth caveat, full coverage Re-ran Sonnet 4.6 judge on all 60 GPT 5.4 runs that had auth errors during the original sweep. Called the Anthropic API directly using cached transcripts. Results: - judge_task_coverage: 0.6 -> 1.0 (all 40 tasks fully judged) - judge_error_count: 60 -> 0 - overall_judge_score: 0.438 -> 0.239 (was inflated by excluding errors) - overall_score: 0.456 -> 0.457 (unchanged, judge gated on C >= 0.9999) No judge caveat remains. All 6 models now have complete, unbiased judge coverage across all 720 runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:36:27 -07:00
scoootscooob	952decadcf	Rewrite README, harden worker, add benchmark reports Rewrote README to focus on why trace-based scoring matters for user-perceived agent capabilities, how ablation works, and the 13 failure modes. Removed results pending finalization. Worker changes: re-inject host env/plugins into lane configs after gateway restart (fixes judge auth stripping), increase control-plane probe tolerance for slow gateway startups. Added 6-model leaderboard and sweep reports from April 11-12 runs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:11:34 -07:00
scoootscooob	44bef14f4d	Add partner trace submission spec	2026-04-11 15:36:54 -07:00
scoootscooob	b4620d10ca	worker: harden gateway runtime and resume behavior	2026-04-11 15:27:14 -07:00
scoootscooob	380c6b4815	bench: audit contamination and harden HF leaderboard loading	2026-04-11 07:14:32 -07:00
scoootscooob	99803642b0	bench: add trace ingestion and template promotion pipeline	2026-04-11 06:45:27 -07:00
Codex	02573d565d	bench: add hidden release scaffolding and CI push coverage	2026-04-11 06:28:43 -07:00
Codex	29c1cd90e4	worker: fail-fast on hung sessions.create, retry control-plane probe Queue submissions were failing intermittently with "Parallel lane N failed for tasks [...]" after a ~5 minute stall. Root cause traced to the interaction of three things: 1. GatewayClient.config.request_timeout defaulted to 300 seconds, meaning every WebSocket RPC would block for 5 full minutes before raising TimeoutError. 2. worker._assert_gateway_control_plane calls sessions.create on a freshly-started lane gateway as a readiness probe. A transient plugin-load race can leave the new gateway accepting /health HTTP requests but hanging on WebSocket RPCs, so the probe would block for the full 300s default timeout. 3. _ensure_parallel_gateway called the probe inside the same for- loop that was polling /health, with `except Exception: pass` silently swallowing probe failures. Each probe attempt consumed 30-300 seconds of the 60-iteration budget, so effectively only 1-2 probe retries fit before the outer loop gave up and the whole lane batch was marked failed. Fixes (all in clawbench/client.py + clawbench/worker.py): - Drop `GatewayConfig.request_timeout` default from 300.0 to 60.0. 60 seconds is still generous for what are sub-second calls in steady state (agents.create, sessions.create, etc.); a healthy gateway responds in milliseconds. Long-running calls like send_and_wait already pass an explicit per-call timeout. - Add a `timeout=None` kwarg to `GatewayClient._rpc()` so callers can override the default when they need tighter or looser per-call bounds. Error message now includes the effective timeout so debugging is clearer. - `_assert_gateway_control_plane` now constructs a dedicated GatewayConfig with request_timeout=15s and wraps the whole probe (including WebSocket connect + session create + session delete) in `asyncio.wait_for(..., timeout=30.0)` as a belt-and-suspenders ceiling. A probe hang now fails in 30s instead of 300s. 
- Split `_ensure_parallel_gateway` and `_ensure_gateway` into two explicit phases: Phase A: poll /health over HTTP until 200, up to 60s (fast) Phase B: call _assert_gateway_control_plane up to 3 times, sleeping 2s between attempts, before raising The probe-retry loop lets transient plugin warmup races self-recover without killing the whole lane batch. - Added 2s HTTP client timeout to the /health polls so a single wedged HTTP request can't swallow 30s of the budget. Also clears the 11 historical failed jobs from the queue dataset at ScoootScooob/clawbench-results so the leaderboard starts clean on the next Space rebuild. All 20 tests in test_worker, test_client, and test_parallel_harness pass with the new code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 05:30:49 -07:00
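The two-phase readiness check described in commit 29c1cd90e4 could be structured along the following lines. This is a sketch under stated assumptions, not the worker's code: `wait_until_ready`, `check_health` and `probe_control_plane` are hypothetical names, and the health check and probe are injected as callables so the example needs no gateway. The budgets (60s for /health, 2s per poll, 3 probe attempts, 2s sleep, 30s probe ceiling) are the figures from the commit message; the 1s poll interval is an assumption.

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


async def wait_until_ready(
    check_health: Callable[[], Awaitable[bool]],
    probe_control_plane: Callable[[], Awaitable[None]],
    *,
    health_budget_s: float = 60.0,
    health_request_timeout_s: float = 2.0,
    health_poll_interval_s: float = 1.0,  # assumption, not stated in the commit
    probe_attempts: int = 3,
    probe_sleep_s: float = 2.0,
    probe_ceiling_s: float = 30.0,
) -> None:
    # Phase A: poll the cheap health check until it reports ready.
    deadline = time.monotonic() + health_budget_s
    while True:
        try:
            if await asyncio.wait_for(check_health(), timeout=health_request_timeout_s):
                break
        except Exception:
            pass  # a wedged or refused health request only costs one poll
        if time.monotonic() >= deadline:
            raise TimeoutError("health check did not pass within the budget")
        await asyncio.sleep(health_poll_interval_s)

    # Phase B: retry the control-plane probe a bounded number of times.
    last_error: Optional[BaseException] = None
    for attempt in range(1, probe_attempts + 1):
        try:
            await asyncio.wait_for(probe_control_plane(), timeout=probe_ceiling_s)
            return
        except Exception as exc:
            last_error = exc
            if attempt < probe_attempts:
                await asyncio.sleep(probe_sleep_s)
    raise RuntimeError(
        f"control-plane probe failed after {probe_attempts} attempts"
    ) from last_error
```

Separating the phases keeps slow probe attempts from consuming the health-poll budget, which is the failure mode the commit message describes for the original single loop.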
Codex	ab69af31be	upload: read-then-append instead of overwriting submissions split Root cause: datasets.Dataset.push_to_hub(repo, split="submissions") writes a single parquet shard to data/submissions-00000-of-00001.parquet, REPLACING whatever was there. Every call to upload_result() was clobbering the previous submission. After 7 sequential uploads the dataset contained only the last row; the HF Space leaderboard therefore displayed only 1 model entry despite 7 having been pushed. Fix: upload_result() now loads the existing submissions split first, appends the new row, dedupes by submission_id (so retried uploads of the same run don't double-count), and pushes the combined row list as a fresh parquet shard. This works at ClawBench's current submission rate (1-2 concurrent jobs). If cross-worker concurrency ever becomes material, we should move to a genuinely append-only layout where each submission writes its own parquet shard under data/submission-<submission_id>-of-NNNNN.parquet and readers scan the whole data/ directory. File a follow-up when submission volume warrants it. Also backfilled the 6 frontier bake-off rows that had been lost during the earlier overwrite sequence. The dataset at https://huggingface.co/datasets/ScoootScooob/clawbench-results now holds all 7 rows: Kimi K2.5, Opus 4.6, GPT-5.4, Gemini 3.1 Pro, GLM-5.1, Qwen3.6-Plus, MiniMax M2.7. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:22:24 -07:00
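The read-then-append step in commit ab69af31be reduces to merging the existing rows with the new one and deduplicating by `submission_id`. The sketch below works on plain dictionaries and leaves out the Hugging Face dataset load and push; `merge_submissions` is a hypothetical name, and letting the newer row win on a duplicate id is an assumption, since the commit message does not say which copy is kept.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_submissions(existing: List[Row], new_row: Row) -> List[Row]:
    """Append new_row to existing rows, keeping one row per submission_id."""
    merged: Dict[str, Row] = {}
    for row in [*existing, new_row]:
        # Later rows overwrite earlier ones with the same id (assumption),
        # while the dict keeps the position of the first occurrence.
        merged[row["submission_id"]] = row
    return list(merged.values())
```

The combined list would then be pushed as a fresh shard, so a retried upload of the same run replaces its earlier row instead of adding a second one.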
Codex	78d844364f	ci: trigger HF Space sync (secrets added) Empty commit to re-queue .github/workflows/sync-to-hf-space.yml now that HF_TOKEN and HF_USERNAME repository secrets are in place. GitHub Actions secrets are snapshotted at workflow queue time, so previously-failed runs cannot retroactively see secrets added after their queue timestamp. This commit triggers a fresh run that will pick up the current secrets state.	2026-04-11 00:17:55 -07:00
Codex	f55b990476	docs: add .github/workflows/README for HF sync setup Documents the one-time secrets setup (HF_TOKEN + HF_USERNAME) that the sync-to-hf-space.yml workflow needs before it can mirror GitHub main to the HF Space. Also explains the --force semantics and the "GitHub is the single source of truth" contract. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:02:45 -07:00
Codex	19e4750b69	ci: auto-mirror main to HF Space on every push Adds .github/workflows/sync-to-hf-space.yml which force-pushes main to the HF Space git remote whenever a commit lands on GitHub main. This eliminates the dual-push friction: GitHub becomes the single source of truth, and the HF Space deployed at https://huggingface.co/spaces/ScoootScooob/clawbench always tracks the latest GitHub main without manual `git push hf main` calls. Requires two repository secrets (Settings -> Secrets -> Actions): HF_TOKEN — write-scoped HF token (https://huggingface.co/settings/tokens) HF_USERNAME — HF account username that owns the Space Optional repo variable: HF_SPACE_ID — defaults to "ScoootScooob/clawbench", override if mirroring to a different Space. Uses --force to replace any Space-side commits that were created by editing files in the HF UI. This is intentional — the workflow's contract is that GitHub is authoritative. Guarded by concurrency group so two simultaneous pushes serialize instead of racing into a non-fast-forward rejection. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 00:01:20 -07:00
Codex	07a20c3f18	HF Space: dynamic stats + fix leaderboard environment parsing Two fixes for the HF Space UI: 1. Leaderboard crashed with "'str' object has no attribute 'get'" because upload_result() serializes BenchmarkResult.environment as str(result.environment) when pushing to the HF Dataset, but _flatten_result called .get() on it as if it were a dict. Defensive parse: accept dict, stringified dict, or JSON object. 2. Stats ribbon (Tasks/Tiers/Browser/Judge counts) was hardcoded to the v0.3 values (20/5/2/6). Replaced with _compute_stats() which calls load_all_tasks() at startup and derives the numbers from the live task corpus, so the ribbon stays in sync with the tasks/ directory without manual edits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 23:55:37 -07:00
Codex	c24d982110	HF Space: fix container eval — pytest in runtime deps, TASKS_DIR resolver, timeouts Found and fixed three blockers preventing the HF Space Docker container from running the eval suite end-to-end, verified by building the image locally with Docker Desktop and running a tier-1 task against Qwen3-32B through the HF Inference API inside the container. 1. pytest was in [project.optional-dependencies].dev, not [project]. The Dockerfile does `pip install .` which only installs runtime deps, so every task whose completion verifier runs `pytest -q` would fail with exit 127 (command not found). Moved pytest + pytest-asyncio into the base dependencies so the container gets them by default. The [dev] extra is kept as an alias for existing `pip install .[dev]` invocations. 2. clawbench/tasks.py resolved TASKS_DIR via `Path(__file__).parent.parent / "tasks"`, which works only for source checkouts. When pip installs the package into /usr/local/lib/python3.11/dist-packages/clawbench, the sibling `tasks/` directory no longer exists at that path, so `load_all_tasks()` returned empty and `clawbench run` died with "No tasks to run". Added a fallback resolver that tries, in order: $CLAWBENCH_TASKS_DIR env var, sibling-of-source, Path.cwd() / "tasks", and known Docker layout candidates (/home/node/app/tasks, /home/user/app/tasks, /app/tasks). Verified inside the container that `TASKS_DIR` now resolves to /home/node/app/tasks and load_all_tasks() returns 40 tasks. 3. Tier-1 task timeouts were at 180s, which is enough for Qwen3-32B (52.9s wall time) but causes Llama-3.3-70B to hit the wall on t1-bugfix-discount. Raised tier-1 timeouts to 360s so slower HF models can complete tasks within the deterministic timeout and produce a capability signal instead of an infrastructure timeout signal. Also fixed a pre-existing stale test (tests/test_tasks.py expected 20 tasks, we have 40 since v0.5 corpus expansion) that was failing on every test run. 
Verified inside the container image: - `clawbench list-tasks` returns all 40 tasks - sessions.create passes for all 11 preset models (9 huggingface/* + 2 anthropic/*) - `clawbench run --model huggingface/Qwen/Qwen3-32B --task t1-bugfix-discount --runs 1` scored 1.000 / C=1.000 T=1.000 B=1.000 in 52.9s with 279,702 tokens captured. Remaining architectural note (not a blocker): the CLI path `clawbench run` assumes the gateway is already running. Only the queue/worker path (`app.py` → `EvalWorker._ensure_gateway`) spawns its own gateway. For HF Space deployment this is fine because all user submissions go through the Gradio UI → queue → worker path; local CLI invocations inside the container need to start the gateway manually first. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 23:03:15 -07:00
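The fallback order listed in c24d982110 for `TASKS_DIR` can be sketched as follows. The function names, the `env`/`cwd` parameters, and the behaviour of returning `None` when nothing matches are assumptions made for illustration; only the candidate order and the paths come from the commit message.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

# Docker layout candidates named in the commit message.
DOCKER_CANDIDATES = (
    Path("/home/node/app/tasks"),
    Path("/home/user/app/tasks"),
    Path("/app/tasks"),
)


def candidate_task_dirs(
    source_file: Path,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
) -> List[Path]:
    """Return candidate task directories in priority order (hypothetical helper)."""
    env = os.environ if env is None else env
    candidates: List[Path] = []
    override = env.get("CLAWBENCH_TASKS_DIR")
    if override:
        candidates.append(Path(override))                   # 1. explicit env override
    candidates.append(source_file.parent.parent / "tasks")  # 2. sibling of the source tree
    candidates.append((cwd or Path.cwd()) / "tasks")        # 3. current working directory
    candidates.extend(DOCKER_CANDIDATES)                    # 4. known container layouts
    return candidates


def resolve_tasks_dir(
    source_file: Path,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidate_task_dirs(source_file, env=env, cwd=cwd):
        if candidate.is_dir():
            return candidate
    return None
```

Keeping the candidate list separate from the existence check makes the priority order testable without touching the filesystem.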
Codex	e9ff163217	baselines: merge provenance docs into BASELINE_SOURCES.md Replace the two separate JSON files (hermes_trace_summary.json and basic_usage_query_summary.json) with a single markdown document that captures every empirical source informing ClawBench's task design. baselines/BASELINE_SOURCES.md covers: 1. The 24 public Hugging Face datasets tagged format:agent-traces, with owner/name, row counts, cluster classification (Pi sessions, custom agent traces, Claude Code, demo), and how each cluster maps onto ClawBench's tier/family/trajectory design decisions. Aggregate ~3,049 rows, ~1,168 unique sessions after mirror deduplication. 2. The Hermes agent reasoning trace aggregate (14,701 sessions, 24.3 avg turns, category distribution) with the direct mapping from observed categories to ClawBench task families. 3. The internal personal-agent use-case corpus (72 queries, 12 primary scenarios, 139 atomic capabilities) that contributes the scenario_weight_defaults in query_catalog.py. The source is not a public dataset and is only referred to as "the internal personal-agent use-case corpus" — no filename reference. 4. A full source-to-design-decision mapping table showing which design choice (tier ladder, family mix, tool diversity, recovery expectations, browser task count, scenario weights, difficulty tags, adversarial tier-5) is driven by which source. Also scrub two remaining references to the Chinese filename in reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md, replacing them with pointers to baselines/BASELINE_SOURCES.md. No runtime code paths read the baselines/ directory; these files are provenance artifacts for the design decisions baked into tasks/ and clawbench/query_catalog.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 20:36:18 -07:00
Codex	3cdade49ce	README: rewrite for v0.5 with architecture, numbers, and positioning Replaces the 192-line v0.3 README with a 724-line v0.5 writeup that documents: - The three-layer scoring model (deterministic first, process second, semantic residue last) with full Mermaid diagram of the harness -> verifier -> scorer -> v0.5 diagnostic pipeline - The v0.5 Configuration Diagnostic positioning: first agent benchmark that measures the configuration, not just the model, with a 5-stage flow diagram (profile -> fingerprint -> predict -> validate -> explain) - All three mathematical pillars (k-NN composite Jaccard, fANOVA with Random Forest surrogate + lite fallback, Taguchi S/N) with formulas and the rationale for why each technique is in the stack - The 7-model frontier baseline results with ASCII bar chart, per-bucket Taguchi S/N, calibration MAE, and honest caveats about tier-1 coding tasks being too easy at n=1 to separate frontier models - Task suite breakdown (40 tasks, 5 tiers, 14 capability tags, 4 pools) - Full scoring model including the gated judge weighting invariant - CLI reference for `clawbench run` + `clawbench diagnose` - Repository layout with line counts for all 18 package modules - Comparison table vs SWE-bench / HumanEval / pass-rate leaderboards - Testing section (107 tests, key files with line counts) - Historical data + HF Dataset integration - Contributing guide stub with task YAML shape - Citation block Preserves the HF Space frontmatter (title, emoji, colorFrom, colorTo, sdk, app_port, pinned, license) so the Space rendering still works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:21:48 -07:00
Codex	4744a6ae7e	ClawBench: 7-model frontier baseline + bake-off tooling Profiles (profiles/): - frontier_opus_4_6.yaml (Anthropic Claude Opus 4.6 — closed) - frontier_gpt_5_4.yaml (OpenAI GPT-5.4 — closed) - frontier_gemini_3_pro.yaml (Google Gemini 3.1 Pro — closed) - frontier_glm_5_1.yaml (Zhipu AI GLM-5.1 via OpenRouter — open) - frontier_qwen_3_6.yaml (Alibaba Qwen3.6-Plus via OpenRouter — open) - frontier_minimax_m27.yaml (MiniMax M2.7 via OpenRouter — open) - frontier_kimi_k25.yaml (Moonshot Kimi K2.5 via OpenRouter — open) - example_research_stack.yaml (example for docs) All seven profiles share an identical plugin stack (anthropic + memory-lancedb + browser-playwright) so base_model is the only structural variable across the bake-off. Scripts (scripts/): - run_open_vs_closed_bakeoff.py: driver that runs each profile through the harness and generates a comparison table. Wraps `clawbench run --profile` via an inline Click entry (the package has no __main__.py so `python -m clawbench.cli` is a no-op). - analyze_open_vs_closed.py: historical DB analyzer — per-bucket mean/Taguchi S/N/per-task win rate/calibration/fANOVA. Classifies OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/ Moonshot land in the open bucket. - ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py, scale_timeouts.py, seed_historical_db.py: task-corpus tooling. Reports (reports/): - FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6 scored 63.9% with real token streaming (174K tok, $0.18 cost). The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too easy to separate frontier models at n=1. Documents infrastructure findings around gateway plugin allowlist behavior, token streaming gaps for non-Anthropic providers, and hot-reload cascade when config changes mid-run. 
- open_vs_closed_bakeoff_summary.md: auto-generated headline table - FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run - REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run - PARALLEL_HARNESS_REPORT.md: concurrency validation writeup - V05_DELIVERY_REPORT.md: v0.5 framework delivery notes - CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning Artifacts (reports/artifacts/): - frontier_*.json: the 7 BenchmarkResult files from the bake-off (committed snapshot for reproducibility; runtime results still go to results/ which remains gitignored) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:14:11 -07:00
Codex	4aa017838a	ClawBench v0.5: tests + task corpus expansion Tests: - tests/test_v05_framework.py (646 lines): end-to-end synthetic ecosystem covering profile parsing, fingerprint computation, k-NN prediction, surprise detection, factor analysis, diagnostic rendering - tests/test_v05_extensions.py (552 lines): unit tests for Taguchi S/N robustness profile, plugin utilization audit, manifest-vs-reality gap, calibration tracking, surprise cause attribution, recommendations generator, insights publishing, end-to-end diagnostic with all sections - tests/test_scorer.py: judge gating tests (judge cannot rescue failed deterministic completion; judge capped at 10% when deterministic verifier exists and floor met; judge dominates at 50% on semantic- only tasks) - tests/test_e2e_significance.py, test_parallel_harness.py: additional coverage for harness behavior Task corpus expansion: - 20 new task YAMLs across tier1-4 covering fs, web, calendar, messaging, data processing, social coordination, life assistance, context continuation, error boundary, skill calling, privacy redaction scenarios - Fresh asset packs for each new task (test fixtures + reference inputs/outputs) - Lower tier-1 coding task timeouts from 360s to 180s to avoid final-state wait waste (the gateway emits no chat.state:final event, so the wait is pure overhead; 180s is plenty for any tier-1 task) - Modify tier2-5 task YAMLs for verifier robustness and judge rubric updates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:13:37 -07:00
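The judge gating behaviour exercised by tests/test_scorer.py in 4aa017838a can be illustrated with a small sketch. The 10% and 50% weights are stated in that commit, and the 0.9999 floor is stated in cf04a17fea; the function names and the linear blend are assumptions, and the actual `combine_run_score` in scorer.py may combine components differently.

```python
DETERMINISTIC_FLOOR = 0.9999      # completion threshold stated in cf04a17fea
JUDGE_WEIGHT_WITH_VERIFIER = 0.10  # cap when a deterministic verifier exists
JUDGE_WEIGHT_SEMANTIC_ONLY = 0.50  # weight when no deterministic verifier exists


def judge_weight(has_deterministic_verifier: bool, completion: float) -> float:
    """Return the weight the judge score may carry (hypothetical helper)."""
    if not has_deterministic_verifier:
        return JUDGE_WEIGHT_SEMANTIC_ONLY
    if completion >= DETERMINISTIC_FLOOR:
        return JUDGE_WEIGHT_WITH_VERIFIER
    # Floor not met: the judge cannot rescue a failed deterministic completion.
    return 0.0


def blend_score(
    base_score: float,
    judge_score: float,
    has_deterministic_verifier: bool,
    completion: float,
) -> float:
    """Linearly blend the non-judge score with the judge score (assumed form)."""
    w = judge_weight(has_deterministic_verifier, completion)
    return (1.0 - w) * base_score + w * judge_score
```

For example, `blend_score(0.0, 1.0, True, 0.0)` stays at 0.0 under this sketch, which matches the "judge cannot rescue" case described in the commit.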
Codex	cf04a17fea	ClawBench v0.5: configuration-space diagnostic framework Add the v0.5 plugin-profile diagnostic system on top of v0.4: - profile.py: PluginProfile, PluginManifest, RegistrationTrace, ProfileFingerprint, fingerprint_similarity (Jaccard composite over capability coverage, hook footprint, tool family surface, tags, slots, base model) - prediction.py: HistoricalDatabase with JSON persistence, k-NN cold-start prediction with confidence bands, calibration metrics (MAE/RMSE/bias), surprise cause attribution - factor_analysis.py: fANOVA with Random Forest surrogate when sklearn is available, fANOVA-lite fallback that decomposes variance via SSB/SST with pairwise interaction residuals - diagnostic.py / diagnose_cli.py: Configuration Diagnostic Report ties profile -> fingerprint -> prediction -> run -> surprises -> insights - utilization.py: plugin utilization audit (dead-weight detection) + manifest-vs-reality gap per plugin - recommendations.py: evidence-backed profile change generator (add_plugin, remove_plugin, fill_slot, add_capability) with confidence scaled by sample size - insights.py: publishes plugin leaderboard, factor importance, interactions, capability gaps, calibration history to JSON files - stats.py: Taguchi larger-is-better signal-to-noise ratio and RobustnessProfile with per-tier means (the third mathematical pillar of v0.5 alongside k-NN and fANOVA) - scorer.py: fix judge weighting per spec. Judge now capped at 10% when the task has a deterministic completion verifier and only contributes when the deterministic floor (completion >= 0.9999) is met. When no deterministic verifier exists, judge dominates at 50% (semantic-only regime). This enforces CLAWBENCH_V0_4_SPEC.md "Disallowed Primary Verifiers" and "Judge Gating" sections. 
- cli.py: wire --profile flag into clawbench run; add clawbench diagnose subcommand - harness.py: pass has_deterministic_verifier to combine_run_score - CLAWBENCH_V0_4_SPEC.md: add v0.5 Direction section .gitignore: exclude .clawbench/ runtime state and .DS_Store Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:13:02 -07:00
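The Taguchi larger-is-better signal-to-noise ratio mentioned for stats.py in cf04a17fea has the standard form SN = -10 * log10(mean(1 / y_i^2)). A minimal sketch is shown below; the function name and the `eps` clipping for zero scores are assumptions, and the repository's stats.py may handle edge cases differently.

```python
import math
from typing import Sequence


def taguchi_sn_larger_is_better(scores: Sequence[float], eps: float = 1e-6) -> float:
    """Taguchi larger-is-better S/N ratio in decibels.

    `eps` is an assumed guard so that a score of 0 does not divide by zero.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    clipped = [max(float(s), eps) for s in scores]
    mean_inverse_square = sum(1.0 / (s * s) for s in clipped) / len(clipped)
    return -10.0 * math.log10(mean_inverse_square)
```

Two score sets with the same mean get different ratios: `[0.9, 0.9]` scores higher than `[1.0, 0.8]`, because the ratio penalizes variance as well as low values, which is why it serves as a robustness measure.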
Codex	b6e82d6afe	Space: harden theme construction against unsupported kwargs The previous redesign passed radius_size/block_radius/shadow_drop/shadow_spread to themes.Base().set() which are either constructor-only or version-specific, causing the HF Space to runtime-error at startup. Drop those kwargs and wrap the whole theme build in a try/except that falls back to plain Base() so any future unknown kwarg degrades gracefully instead of crashing the Space. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 22:20:57 -07:00
scoootscooob	42cd0768ef	Space: sync A10 defaults	2026-04-09 22:14:06 -07:00
Codex	aed7c37207	Space: redesign UI to match OpenClaw WebUI + ClawHub design system Apply the shared OpenClaw aesthetic — dark backgrounds (#0e1015 layered), signature red accent (#ff5c5c), Inter + JetBrains Mono typography, whisper-thin color-mix borders, pill tab switcher, and rise animations. Replaces the default Gradio Base theme with custom CSS and theme tokens. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 22:10:27 -07:00

76 Commits