clawbench

Author	SHA1	Message	Date
Vincent Koc	82bcfc1891	fix(worker): harden runtime result writes	2026-04-29 13:16:40 -07:00
Vincent Koc	ea17c715b3	fix(client): clean pending rpc on send failure	2026-04-29 00:09:27 -07:00
Vincent Koc	88ab0f5564	test: cover environment verifier success paths	2026-04-28 23:27:38 -07:00
Vincent Koc	8172fad70e	test: cover judge score gate propagation	2026-04-28 23:08:58 -07:00
Vincent Koc	fb486a1ed3	fix(scoring): gate judge-weighted scores	2026-04-28 22:52:12 -07:00
Vincent Koc	ed9adf8d84	fix(runtime): harden benchmark cache and task paths	2026-04-28 22:40:46 -07:00
Aaron Zhu	e120e86601	fix: flag credential file access in dangerous shell patterns (#6) * fix: flag credential file access in dangerous shell patterns * fix: avoid quoted credential false positives * fix: reduce credential detector merge conflicts * test: avoid credential detector import conflicts * test: place credential detector coverage after baseline tests --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:17:11 -07:00
Aaron Zhu	dddfc0a175	fix: flag git push --force variants as dangerous shell commands (#5) * fix: flag git push --force variants as dangerous shell commands * fix: avoid quoted force-push false positives * fix: reduce force-push detector merge conflicts * test: avoid force-push detector import conflicts --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:17:01 -07:00
HeYan	c72e41687d	chore: add open-source contribution scaffolding (#3) * chore: add open-source contribution scaffolding New files --------- LICENSE The README already references this file and the pyproject.toml already declares `license = "MIT"`, but no actual LICENSE file existed in the repo. The badge link was pointing at a 404. CONTRIBUTING.md Setup instructions, guidance on which contributions are welcome (bug fixes, new tasks, scoring changes, docs), branch naming convention, commit style, and a note on adding new tasks with deterministic completion checks. .github/ISSUE_TEMPLATE/bug_report.md .github/ISSUE_TEMPLATE/feature_request.md Structured templates so bug reports arrive with reproduction steps and environment info, and feature requests arrive with motivation and alternatives considered. .github/PULL_REQUEST_TEMPLATE.md Lightweight checklist (what / why / changes / tests) that matches the style of the two bug-fix PRs already merged. pyproject.toml Added [project.urls] with Homepage, Repository, and Bug Tracker so the links appear correctly on PyPI if the package is ever published there. * docs: align contribution scaffolding --------- Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:16:52 -07:00
HeYan	d21648ad3d	fix: strip quoted strings before checking for shell redirect operators (#2) is_mutating_shell_command scanned the raw command string against MUTATING_SHELL_PATTERNS, which includes the bare pattern r">". This caused any command with a > character inside a quoted argument to be classified as a file-writing mutation: grep "count > 5" logs.txt → ("edit", True) # wrong; python -c "print(1 > 0)" → ("edit", True) # wrong. In classify_shell_command, a mutating=True result suppresses both the READ_ONLY and EXECUTION branches, so these read-only commands fell through to `return "edit", True` instead of "search" or "execute". Fix: strip the contents of quoted strings (both double and single quotes) before scanning for mutation patterns. The redirect operators that actually matter (`>`, `>>`, `2>`, etc.) always appear outside quotes in real shell commands, so stripping quote bodies removes the false positives while preserving all true redirects. Tests added: read-only commands containing > inside quotes must not be flagged, and real redirect commands must still be detected. Co-authored-by: Vincent Koc <vincentkoc@ieee.org>	2026-04-28 13:16:42 -07:00
Vincent Koc	0625ab7159	fix(runtime): harden queue and gateway lifecycle	2026-04-28 11:34:53 -07:00
Vincent Koc	dd92f8884c	chore(dev): add lint guardrails	2026-04-28 10:50:07 -07:00
Vincent Koc	38a2a0ff91	perf(app): cache leaderboard loads	2026-04-28 10:49:52 -07:00
Vincent Koc	509f21bb95	fix(cli): sync scenario filters	2026-04-28 10:49:38 -07:00
scoootscooob	b5538e0927	Copy all package data in HF Docker build	2026-04-28 02:35:09 -07:00
scoootscooob	425daa4fc8	Copy partner spec in HF Docker build	2026-04-28 02:31:26 -07:00
scoootscooob	d069bcfe3a	Fix HF Docker package build	2026-04-28 02:26:39 -07:00
Vincent Koc	4ad2f1f417	fix(ci): ensure hugging face space before sync	2026-04-28 01:50:26 -07:00
Vincent Koc	fc86dd6155	ci: add blacksmith testbox setup	2026-04-28 01:45:35 -07:00
Vincent Koc	f373e4a710	fix: harden packaging and submissions	2026-04-28 01:17:43 -07:00
scoootscooob	fb029437be	Add MIT license file	2026-04-28 00:05:38 -07:00
scoootscooob	4b7a9ee31c	Fix public Docker task copies	2026-04-27 22:57:10 -07:00
scoootscooob	595cdc910c	Add public domain scaffold and adapter diagnostics	2026-04-23 12:40:23 -07:00
scoootscooob	df32a5f073	Merge pull request #7 from HaoLi111/feat/dynamics-analysis Add archive dynamics pipeline and audience-based model presets	2026-04-22 13:11:32 -07:00
scoootscooob	11d943f21c	fix: preserve preset submission settings and lazy-load plots	2026-04-22 12:03:16 -07:00
pllm-uci	c209612d46	Add archive dynamics pipeline and audience-based model presets	2026-04-22 12:03:13 -07:00
scoootscooob	5b50814dfc	Merge pull request #8 from gchlebus/gchlebus/fix-connect-timeout fix(client): raise default connect_timeout to 30s and make it env-overridable	2026-04-22 09:47:06 -07:00
scoootscooob	79b2253bfc	fix(ci): restore public task fallback	2026-04-22 09:46:33 -07:00
scoootscooob	e4ca2bef8e	fix(client): reject invalid timeout env values	2026-04-22 09:41:44 -07:00
Grzegorz Chlebus	547ee160ad	fix(client): raise default connect_timeout to 30s and make it env-overridable The default connect_timeout=15.0 is shorter than the observed first-session setup time against a freshly started OpenClaw gateway (we've measured phase0_session_setup ~20-25s during containerised benchmark runs), which creates a race where the client gives up before the gateway is ready for the first turn. Downstream the adapter then surfaces this as an ``empty_response`` with zero transcript steps, which looks like a model failure when it's really an environment timing issue. Concrete repro from a 19-task public_dev run: task: t4-life-trip-plan failed: reward=0, failure_category=empty_response, duration_ms=0, total_ms=16352, response hash = SHA256 of empty string rerun: score=0.927 standalone, phase0_session_setup=21.2s Change: * GatewayConfig.connect_timeout default 15.0 -> 30.0 * GatewayConfig.request_timeout default kept at 60.0 but now explicitly documented and overridable for symmetry * Both are now overridable via environment variables CLAWBENCH_CONNECT_TIMEOUT / CLAWBENCH_REQUEST_TIMEOUT so ops can tune further without a code change. * Invalid env values are logged and fall back to the default rather than blowing up benchmark runs. * Adds three unit tests covering default, env override, and invalid-env fallback behaviour. Reported-by: Grzegorz Chlebus <gchlebus@nvidia.com>	2026-04-22 10:19:20 +02:00
scoootscooob	8447ab1ca6	docker: revert OpenClaw base pin; remove reference scores Per request: drop the Docker-base-pinning approach and the inline reference scores. Treat published numbers as version-, provider-, and seed-dependent. Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1 back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the current OpenClaw release. The state-isolation patch + rejudge pipeline (the actually load-bearing reproducibility infra) stay in place; only the pinned-version approach is reverted. README.md: - drops the "Docker base pinning" row from the "What's new" table; replaced with "Reproducibility-first infrastructure" framing - drops the "pinned" badge; added a "Diagnostics" badge instead - updates "Reproducibility caveats" to recommend "build both sides of any comparison from the same OpenClaw release" rather than "pin to 2026.4.15-beta.1" - updates Quick Start to record (not assume) the OpenClaw version the build resolved to - drops the pinned-base row from the comparison table; replaced with "State-isolation per run" (the actually distinguishing infra) - updates the version log entry for Core v1 to highlight the dynamical-systems diagnostics + state-isolation rather than the pinning that's no longer there tasks-public/README.md: - drops the 8-row "Established ranking" table per request - replaced with a "Selection criteria" section that explains how the 19 tasks were chosen (0 inversions, min-gap 0.0049) without publishing version-dependent scores - reframes the build instructions to track :latest with a comment about platform-version drift tasks-public/MANIFEST.yaml: - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as a hard requirement) - drops the `established_ranking` block - replaced with `selection_basis` that documents the methodology and explicitly states why scores are intentionally omitted Test suite still green: 156 passed locally, 152 passed in the CI-equivalent (no private 
tasks/) configuration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 21:24:42 -07:00
scoootscooob	0e250e3fe1	fix(ci): tasks-public fallback + leaderboard removed from README README.md: removed the inline reference leaderboard per user request. The Core v1 manifest still carries the established ranking, the README still documents methodology + dynamical-systems diagnostics. clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/ fallback layer (resolver step 5). Local dev with the private tasks/ present is unchanged; CI without tasks/ now falls back to the public Core v1 set instead of returning an empty corpus. Has been broken since `deb3d5d` (the "stop tracking current task set" commit) — this restores green CI now that tasks-public/ is available. tests/test_tasks.py: three updates so tests pass against either the private 40-task set OR the public 19-task set: - test_load_all_tasks_returns_full_corpus: threshold lowered from >= 20 to >= 19 (Core v1 size) - test_workspace_setup_preserves_nested_asset_paths: switched from t1-architecture-brief (private) to t4-browser-research-and-code (public) which exercises the same flat+nested asset behaviour - test_selected_tasks_include_judge_rubrics: replaced 3 task IDs not in the public Core release (t1-architecture-brief, t5-contradictory-requirements, t5-impossible-graceful-fail) with public-set equivalents (t1-bugfix-discount, t3-feature-export) Verified locally with both branches: - private tasks/ present: 156 passed, 1 skipped - private tasks/ hidden: 152 passed, 5 skipped (CI-equivalent) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:32:26 -07:00
scoootscooob	f95e838d99	docs: rewrite README around Core v1 + dynamical-systems diagnostics Updates the front-door README to reflect the Core v1 release and the methodology innovations we shipped this cycle. Key additions: - "What's new in Core v1" table highlighting the five methodology layers most agent benchmarks lack (signal-curated task set, variance decomposition, dynamical-systems diagnostics, Constraint Index, Docker base pinning). - Reference leaderboard — 8-model ranking on the Core-19 set from the v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility and the OpenRouter routing issue. - "What makes ClawBench different" expanded with variance decomposition (52.7% capability / 47.3% seed noise) and a new section (#3) on dynamical-systems diagnostics, including the four concrete signals (C(q), regime, survival, SNR-weighted ranking). - New "Reproducibility caveats" section — what reproduces (audit, diagnostics, top-cluster ranking) vs what drifts (absolute scores, OpenRouter models, OpenClaw platform upgrades). Documents the pinning we did. - Updated Quick Start with `docker build -t clawbench:core-v1` verification flow and a full analysis-pipeline walkthrough using the new scripts (rejudge_all, compute_constraint_index, etc). - Repository layout updated to include tasks-public/ (public) and scripts/ with brief descriptions of all 11 reproducibility + analysis scripts. - Comparison table extended with new columns: variance decomposition, dynamical regime, SNR-weighted alternative, Docker base pinning, provider-routing caveats — all areas where SWE-bench / HumanEval / LLM-judge leaderboards are silent. - Version log + planned Core v2 roadmap (Tier 6 long-horizon, paraphrased prompt pairs, creative-synthesis, human baseline). Headline shifts from "the agent benchmark that measures what users actually experience" to "Rigorous agent evaluation. Signal-curated tasks. Dynamical-systems diagnostics." 
— foregrounds the methodological contributions that separate Core v1 from prior art. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:15:18 -07:00
scoootscooob	030e9968bd	docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility The ClawBench Core v1 reference numbers were measured against ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c). Using the moving ":latest" tag caused observable drift in our sweeps (platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by +0.13 to +0.29), so unpinned builds produce non-reproducible rankings. Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added an explanatory comment noting that bumping the base requires re- running the reference sweep. tasks-public/README.md: added build + verification commands so users can confirm they have the right OpenClaw version before running Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:09:49 -07:00
scoootscooob	50959fa670	tasks: add Core v1 public task set (19 tasks) Stages a curated 19-task subset of the internal 40-task dev pool as the public ClawBench release. Selected via greedy task elimination from the v2026-4-19-full sweep archive so that: (a) mean run_score across these 19 tasks reproduces the established 8-model ranking with zero inversions and min adjacent-rank gap of 0.0049 (well above the ~0.002 seed-noise floor); (b) coverage is preserved across tiers 1-5 and across the tools, coding, repo, browser, multi_tool, and adversarial families; (c) tasks with broken verifiers or near-zero cross-model SNR are dropped (21 tasks retained as private holdout, not published). Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs per task, C+T+B+J weighted score): 1. Claude Opus 4.6 0.8137 2. Claude Opus 4.7 0.7824 3. GPT 5.4 0.7647 4. Claude Sonnet 4.6 0.7597 5. MiniMax M2.7 0.7475 6. Gemini 3.1 Pro 0.7408 7. Qwen 3.6 Plus 0.7030 8. Kimi K2.5 0.6800 Deliverables: tasks-public/MANIFEST.yaml — machine-readable task list + metadata tasks-public/README.md — rationale, usage, reproducibility notes tasks-public/tier{1..5}/.yaml — 19 task definitions tasks-public/assets// — 19 asset packs (verifiers + fixtures) The internal dev set remains in tasks/ (gitignored) and retains 40 tasks for future expansion. Not published: - 9 ceiling tasks (all frontier models score >0.85) - 9 noise tasks (cross-model SNR < 0.5) - 3 ranking-breaker tasks (e.g. t2-node-search-patch, t5-contradictory-requirements) Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs for perturbation-sensitivity measurement, and creative-synthesis tasks — all currently absent from Core v1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 20:06:36 -07:00
scoootscooob	b6f07d9a87	analysis: dynamical-systems diagnostics for agent runs Treats agent runs as stochastic trajectories in semantic state space and extracts signal that flat run_score averages away. Inspired by the "When LLMs Are Dreaming, Where Do They Go?" framework: task constraint characterization, per-run regime classification, seed-vs- capability variance decomposition, per-turn survival, SNR-weighted ranking. Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external model dependencies) as the semantic state proxy since sentence embeddings would require torch. Crude but sufficient for the signals the paper calls out. scripts/compute_constraint_index.py: computes C(q) per task from archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is participation ratio of response covariance, entropy is eigenvalue entropy, and BOPS is inter-run cosine (predictability proxy). High C(q) = tasks where models converge to similar answers; low C(q) = open-ended tasks where models diverge for style reasons. scripts/classify_regimes.py: per-run regime classifier. Computes drift_mean, from_start, recurrence, vol_log over turn trajectories. Quartile-based thresholds label each run as too_short / trapped / limit_cycle / diffusive / mixed. Reveals per-model tendencies: Gemini traps frequently (one-shot answer without iteration), GPT loops tool patterns, GLM is most balanced. scripts/variance_decomp.py: decomposes run_score variance per task into seed variance (3 runs of same model) vs capability variance (across model means). SNR = cap_var / seed_var. Exposes that 47% of benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and give essentially random rankings. scripts/survival_analysis.py: per-turn empirical survival S(t) and hazard h(t). T_F = first turn where assistant emits empty response or run ends in failure. Reveals long-horizon capability that flat scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at 60% rate. 
scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with winsorization at p95 to prevent single-task dominance). Headline metric that weights discriminating + signal-rich tasks more than noisy or consensus tasks. Also emits SNR-only and flat variants for comparison. scripts/generate_dynamical_report.py: assembles all four diagnostic JSONs into a single markdown report with per-model regime tables, SNR tiers, survival curves, and integrated interpretation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:49:05 -07:00
scoootscooob	afb14c3982	analysis: fair-comparison audit and rejudge pipeline Tools for auditing archive coverage, rejudging judge-infra failures via direct Anthropic API (bypasses the gateway path that sometimes returns "Gateway is restarting" / empty judge results), and producing fair multi-model comparison reports. scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs and archive JSONs side-by-side. Reports coverage %, clean mean, coverage-normalized score, infra-zero count, judge-infra remaining vs rejudged. scripts/audit_per_run.py: per-run cross-model audit. Flags tasks where all models score zero (broken task/verifier), verifier rejects-valid-outputs (C=0 but agent produced text), harness-error clusters, model-specific pathologies. scripts/rejudge_all.py: re-runs judge scoring on archive runs where the gateway judge failed. Uses direct anthropic SDK against claude-sonnet-4-6, rewrites judge_result fields in place, recomputes run_score per the C+T+B+J weighting. scripts/generate_fair_report.py: produces an 8/9-model comparison markdown report. Supports --exclude to drop specific models, headlines "clean" (mean across 120 archived runs). Reports per-tier scores, C=1.0 task pass counts, and coverage parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:48:43 -07:00
scoootscooob	01a31e55fb	sweep: per-container state isolation + qwen model-id fix scripts/container_sweep_single.sh: clone pristine OpenClaw state to /tmp/ per sweep before starting the gateway. Carries over config (openclaw.json, identity/, devices/, exec-approvals.json, tasks/, subagents/, flows/, cron/) but leaves runtime dirs (agents/, workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR to the isolated dir so the gateway writes to /tmp instead of the shared host mount. Fixes the cascading "RPC agents.create timed out after 60s" failures caused by 4k+ stale agents accumulating across sequential sweeps. profiles/frontier_qwen_3_6.yaml: fix base_model from openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus (no dash). The dashed slug is unknown to OpenRouter and silently fails; the no-dash version is the real canonical. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 19:48:30 -07:00
scoootscooob	deb3d5d85d	tasks: stop tracking current task set; fix t2 integration test for emptyNote Context: The current 40-task set is being split into a private holdout set plus a new public set. The public repo will ship a different task set that doesn't give away the holdout; in the meantime, stop tracking the current tasks/ directory so benchmarking can continue locally without exposing the set externally. Changes: - .gitignore: add tasks/ and lab-pr68627/ (vendored PR content, also moving out of the public repo). - git rm --cached tasks/: remove from tracking (files remain on disk locally). - tests/test_integration_checks.py: * Module-level pytest.mark.skipif that skips the whole file when tasks/ is absent — so CI against the public repo (no tasks) stays green once the private set moves out. * Update the t2-node-search-patch fixture to also define emptyNote() since the task was hardened with that distractor. Without this, the integration test asserts score==1.0 but gets 0.0 (the new "emptyNote stays empty" test fails against a fixture that never defines emptyNote). Follow-up (separate work): Public task set lands in a subsequent commit. Holdout access path (encrypted-in-repo or private-repo) gets wired into the harness's private_tasks_root / hidden_tasks_dir plumbing. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-19 12:29:52 -07:00
scoootscooob	95b226dfed	tasks: harden 5 ceiling-bound tasks for better model differentiation All 5 of these tasks were clearing at 0.85-1.00 across the frontier-4 on v4.14 — narrow spread means they don't differentiate models. Each now has a specific trap that catches naive approaches: - t1-refactor-csv-loader: introduces divergent normalization requirements between load_rows (lowercase) and summarize_inventory (preserve first- seen case). Naive "lowercase everywhere in parse_inventory_row" fails 2 of 3 tests. Proper refactor returns original case in the helper. - t3-node-multifile-refactor: adds a 3rd caller (audit.js) requiring preserved userId case + minute-precision timestamp, diverging from auth.js and report.js. Single-function extraction fails 2 of 4 tests; agent must handle two normalization modes. - t4-browser-research-and-code: docs rewritten with distractors — v1/v2/v3 versions, required/optional/cross-endpoint headers, rate limits, payload limits. Tests check 6 facts including negative-match for X-Admin-Token distractor (scoped to /v2/admin only). - t2-node-search-patch: adds emptyNote() factory in render.js with legitimate empty body: "" that MUST NOT be patched. Naive grep-replace of `body: ""` now fails the emptyNote test. Also adds whitespace- trimming test for filterNotes. - t4-memory-recall-continuation: requires storing 3 SEPARATE memory entries (beta-regions, retry-budget, apac-gating) instead of one. Release notes include operational-notes distractors that must NOT be codified. flags.py gains APAC_GATED_UNTIL field. Handoff verifier added to check all 3 facts in the handoff artifact. All 5 tasks verified: properly-implemented starter patches pass all tests, the new traps specifically fail naive implementations. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-19 12:24:25 -07:00
scoootscooob	cb48ca72e8	tasks: drop strict completion.files checks on 19 tasks Every one of these tasks has an execution_check script (verify_*.py) that already does a recursive workspace search — it greps for required content across every agent-written .md/.txt/.csv regardless of filename. The completion.files block was redundant and actively penalized models that wrote to reasonable alternate paths (analysis.md vs budget_report.md). Before: total=1 (file) + N (exec) → if file path didn't match, score was capped at N/(N+1). On t3-fin-budget-monthly, 14 of 15 prior sweep runs failed specifically on "FILE budget_report.md: File does not exist". After: total=N. Verifier is the source of truth. Judge rubric already tells graders "don't penalize non-standard paths" — this aligns completion scoring with that stated policy. Fixed tasks (all had recursive verifiers): t1-fs-quick-note, t1-life-translate, t2-ctx-pronoun-resolve, t2-err-instruction-ambig, t2-fs-cleanup-downloads, t2-fs-find-that-thing, t2-msg-summarize-thread, t2-priv-redact-doc, t2-skill-excel-rollup, t2-sys-memory-roundtrip, t2-web-quick-fact, t3-cal-reschedule-cascade, t3-data-sql-query, t3-fin-budget-monthly, t3-msg-inbox-triage, t3-social-bill-split, t3-web-research-and-cite, t4-ctx-long-recall, t4-life-trip-plan Spot-checked that each verifier's required-content set already covered the content_contains constraints that were also dropped. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 13:16:34 -07:00
scoootscooob	8a5be9c686	clawbench: per-sweep cache archiving + generic sweep templates - scripts/_archive_cache.sh: snapshot run_cache/<model>/ to run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json. Sourced by sweep scripts so transcripts survive the next sweep's cache wipe and stay available for audits. - scripts/container_sweep_single.sh: base multi-model sweep. Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so their caches are force-cleared at sweep start. Calls archive helper on exit. - scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast fix validation (~20 min) instead of full 3-run sweep (~60 min). - Dockerfile.main: parametrized clawbench-on-openclaw image with ARG BASE for pinning to any openclaw tag. - scripts/git_checkpoint.py + README: documented checkpoint workflow for tagging known-good states during risky work. - .gitignore: un-ignore scripts/, keep targeted ignores for __pycache__, .tmp, .local.py. Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>	2026-04-18 12:46:45 -07:00
scoootscooob	fe8fef7795	Merge branch 'pr-4' into codex/merge-pr4	2026-04-16 19:50:11 -07:00
scoootscooob	ee8ff79347	docs: fix ollama profile guidance	2026-04-16 19:49:04 -07:00
scoootscooob	9d802d6c53	fix: classify find_replace-style tools as edits	2026-04-16 19:37:01 -07:00
pllm-uci	517f2207b0	Refine local Ollama profile documentation for clarity and usability	2026-04-15 11:45:57 -07:00
pllm-uci	e2d82b34c3	Add local Ollama model support and configuration guidance to README and profiles	2026-04-15 11:45:12 -07:00
HeYan	a2757e6bd9	fix: classify str_replace and insert tools as mutating edits classify_tool_call matched tool names against a fixed set of verb patterns. The pattern for the "edit" family was: r"write|edit|patch|apply|create|delete|rename" This omitted "replace" and "insert", so tools like str_replace, replace_in_file, insert_text, and insert_at_line all fell through every check and were returned as ("unknown", False) – classified as non-mutating with unknown family. Consequences for any agent that edits via str_replace: - distinct_mutation_targets stayed empty → min_distinct_mutation_targets requirement always failed - read_before_write_ratio was 1.0 for the wrong reason (no mutations detected, so denominator collapsed to 1) - "edit" never appeared in distinct_families → required_families check always reported it as missing Fix: extend the edit pattern with "replace" and "insert". Tests added: unit test for classify_tool_call directly and an end-to-end trajectory test using a str_replace-based edit transcript.	2026-04-14 01:00:13 -07:00
scoootscooob	eb879adf9b	Remove reports/ reference from README repo layout Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:52:17 -07:00
scoootscooob	6ab3004d63	Remove reports and scripts from repo, add to gitignore Reports and eval scripts contain internal benchmark data that should not be public. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-04-14 00:51:50 -07:00
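
Commit d21648ad3d above describes stripping the contents of quoted strings before scanning a shell command for redirect operators. The following is a minimal sketch of that idea, not the repository's code: `REDIRECT_PATTERN`, `strip_quoted`, and `has_redirect` are hypothetical names, and only the bare `>` pattern mentioned in the commit message is modeled.

```python
import re

# Hypothetical stand-in: the commit mentions a bare r">" entry in
# MUTATING_SHELL_PATTERNS; the real pattern list is not shown in this log.
REDIRECT_PATTERN = re.compile(r">")

# Double-quoted strings (allowing backslash escapes) or single-quoted strings.
DOUBLE_QUOTED = r'"(?:\\.|[^"\\])*"'
SINGLE_QUOTED = r"'[^']*'"
QUOTED = re.compile(DOUBLE_QUOTED + "|" + SINGLE_QUOTED)


def strip_quoted(command: str) -> str:
    """Replace every quoted string with an empty pair of double quotes."""
    return QUOTED.sub('""', command)


def has_redirect(command: str) -> bool:
    """Report whether a redirect operator appears outside quoted strings."""
    return REDIRECT_PATTERN.search(strip_quoted(command)) is not None
```

With this sketch, `has_redirect('grep "count > 5" logs.txt')` is false because the `>` sits inside quotes, while `has_redirect('echo hi > out.txt')` is true. It does not attempt to handle unbalanced quotes or heredocs.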

98 Commits
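
Commit 547ee160ad above makes the client timeouts overridable through `CLAWBENCH_CONNECT_TIMEOUT` and `CLAWBENCH_REQUEST_TIMEOUT`, with invalid values logged and replaced by the default. A rough sketch of that fallback behaviour might look like the following. The helper name `timeout_from_env` is hypothetical, and treating non-positive or non-finite numbers as invalid is an assumption; the log does not say exactly which values the real code rejects.

```python
import logging
import math
import os

logger = logging.getLogger(__name__)


def timeout_from_env(name: str, default: float) -> float:
    """Read a timeout in seconds from the environment, falling back to default."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        logger.warning("Ignoring non-numeric %s=%r; using %s", name, raw, default)
        return default
    # Assumption: zero, negative, NaN, and infinite timeouts count as invalid.
    if not math.isfinite(value) or value <= 0:
        logger.warning("Ignoring out-of-range %s=%r; using %s", name, raw, default)
        return default
    return value


# Defaults taken from the commit message: 30.0 s connect, 60.0 s request.
connect_timeout = timeout_from_env("CLAWBENCH_CONNECT_TIMEOUT", 30.0)
request_timeout = timeout_from_env("CLAWBENCH_REQUEST_TIMEOUT", 60.0)
```

Falling back instead of raising keeps a mistyped environment variable from aborting a long benchmark run, which matches the rationale given in the commit message.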