Commit Graph

98 Commits

Author SHA1 Message Date
Vincent Koc
82bcfc1891
fix(worker): harden runtime result writes
2026-04-29 13:16:40 -07:00
Vincent Koc
ea17c715b3
fix(client): clean pending rpc on send failure
2026-04-29 00:09:27 -07:00
Vincent Koc
88ab0f5564
test: cover environment verifier success paths 2026-04-28 23:27:38 -07:00
Vincent Koc
8172fad70e
test: cover judge score gate propagation 2026-04-28 23:08:58 -07:00
Vincent Koc
fb486a1ed3
fix(scoring): gate judge-weighted scores 2026-04-28 22:52:12 -07:00
Vincent Koc
ed9adf8d84
fix(runtime): harden benchmark cache and task paths 2026-04-28 22:40:46 -07:00
Aaron Zhu
e120e86601
fix: flag credential file access in dangerous shell patterns (#6)
* fix: flag credential file access in dangerous shell patterns

* fix: avoid quoted credential false positives

* fix: reduce credential detector merge conflicts

* test: avoid credential detector import conflicts

* test: place credential detector coverage after baseline tests

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-04-28 13:17:11 -07:00
Aaron Zhu
dddfc0a175
fix: flag git push --force variants as dangerous shell commands (#5)
* fix: flag git push --force variants as dangerous shell commands

* fix: avoid quoted force-push false positives

* fix: reduce force-push detector merge conflicts

* test: avoid force-push detector import conflicts

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-04-28 13:17:01 -07:00
HeYan
c72e41687d
chore: add open-source contribution scaffolding (#3)
* chore: add open-source contribution scaffolding

New files
---------
LICENSE
  The README already references this file and the pyproject.toml already
  declares `license = "MIT"`, but no actual LICENSE file existed in the
  repo. The badge link was pointing at a 404.

CONTRIBUTING.md
  Setup instructions, guidance on which contributions are welcome (bug
  fixes, new tasks, scoring changes, docs), branch naming convention,
  commit style, and a note on adding new tasks with deterministic
  completion checks.

.github/ISSUE_TEMPLATE/bug_report.md
.github/ISSUE_TEMPLATE/feature_request.md
  Structured templates so bug reports arrive with reproduction steps and
  environment info, and feature requests arrive with motivation and
  alternatives considered.

.github/PULL_REQUEST_TEMPLATE.md
  Lightweight checklist (what / why / changes / tests) that matches the
  style of the two bug-fix PRs already merged.

pyproject.toml
  Added [project.urls] with Homepage, Repository, and Bug Tracker so the
  links appear correctly on PyPI if the package is ever published there.

* docs: align contribution scaffolding

---------

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-04-28 13:16:52 -07:00
HeYan
d21648ad3d
fix: strip quoted strings before checking for shell redirect operators (#2)
is_mutating_shell_command scanned the raw command string against
MUTATING_SHELL_PATTERNS, which includes the bare pattern r">".  This
caused any command with a > character inside a quoted argument to be
classified as a file-writing mutation:

    grep "count > 5" logs.txt   →  ("edit", True)   # wrong
    python -c "print(1 > 0)"    →  ("edit", True)   # wrong

In classify_shell_command, a mutating=True result suppresses both the
READ_ONLY and EXECUTION branches, so these read-only commands fell
through to `return "edit", True` instead of "search" or "execute".

Fix: strip the contents of quoted strings (both double and single
quotes) before scanning for mutation patterns.  The redirect operators
that actually matter — `>`, `>>`, `2>`, etc. — always appear outside
quotes in real shell commands, so stripping quote bodies removes the
false positives while preserving all true redirects.

Tests added: read-only commands containing > inside quotes must not be
flagged, and real redirect commands must still be detected.
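
The quote-stripping step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the one-element `MUTATING_SHELL_PATTERNS` list, the `strip_quoted` helper, and the simplified boolean return value are assumptions, and escaped quotes inside strings are not handled.

```python
import re

# Hypothetical stand-in: the commit message only confirms the bare r">" entry.
MUTATING_SHELL_PATTERNS = [r">"]

# Matches a double- or single-quoted string (no handling of escaped quotes).
_QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')


def strip_quoted(command: str) -> str:
    """Replace each quoted string with an empty pair of the same quotes."""
    return _QUOTED.sub(lambda m: m.group(0)[0] * 2, command)


def is_mutating_shell_command(command: str) -> bool:
    """Return True if a mutation pattern appears outside quoted strings."""
    unquoted = strip_quoted(command)
    return any(re.search(p, unquoted) for p in MUTATING_SHELL_PATTERNS)
```

With this sketch, `grep "count > 5" logs.txt` is no longer flagged, while `echo hi > out.txt` still is.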

Co-authored-by: Vincent Koc <vincentkoc@ieee.org>
2026-04-28 13:16:42 -07:00
Vincent Koc
0625ab7159
fix(runtime): harden queue and gateway lifecycle 2026-04-28 11:34:53 -07:00
Vincent Koc
dd92f8884c
chore(dev): add lint guardrails 2026-04-28 10:50:07 -07:00
Vincent Koc
38a2a0ff91
perf(app): cache leaderboard loads 2026-04-28 10:49:52 -07:00
Vincent Koc
509f21bb95
fix(cli): sync scenario filters 2026-04-28 10:49:38 -07:00
scoootscooob
b5538e0927 Copy all package data in HF Docker build 2026-04-28 02:35:09 -07:00
scoootscooob
425daa4fc8 Copy partner spec in HF Docker build 2026-04-28 02:31:26 -07:00
scoootscooob
d069bcfe3a Fix HF Docker package build 2026-04-28 02:26:39 -07:00
Vincent Koc
4ad2f1f417
fix(ci): ensure hugging face space before sync 2026-04-28 01:50:26 -07:00
Vincent Koc
fc86dd6155
ci: add blacksmith testbox setup 2026-04-28 01:45:35 -07:00
Vincent Koc
f373e4a710
fix: harden packaging and submissions 2026-04-28 01:17:43 -07:00
scoootscooob
fb029437be Add MIT license file 2026-04-28 00:05:38 -07:00
scoootscooob
4b7a9ee31c Fix public Docker task copies 2026-04-27 22:57:10 -07:00
scoootscooob
595cdc910c Add public domain scaffold and adapter diagnostics 2026-04-23 12:40:23 -07:00
scoootscooob
df32a5f073
Merge pull request #7 from HaoLi111/feat/dynamics-analysis
Add archive dynamics pipeline and audience-based model presets
2026-04-22 13:11:32 -07:00
scoootscooob
11d943f21c fix: preserve preset submission settings and lazy-load plots
2026-04-22 12:03:16 -07:00
pllm-uci
c209612d46 Add archive dynamics pipeline and audience-based model presets 2026-04-22 12:03:13 -07:00
scoootscooob
5b50814dfc
Merge pull request #8 from gchlebus/gchlebus/fix-connect-timeout
fix(client): raise default connect_timeout to 30s and make it env-overridable
2026-04-22 09:47:06 -07:00
scoootscooob
79b2253bfc fix(ci): restore public task fallback 2026-04-22 09:46:33 -07:00
scoootscooob
e4ca2bef8e fix(client): reject invalid timeout env values
2026-04-22 09:41:44 -07:00
Grzegorz Chlebus
547ee160ad fix(client): raise default connect_timeout to 30s and make it env-overridable
The default connect_timeout=15.0 is shorter than the
observed first-session setup time against a freshly started
OpenClaw gateway (we've measured phase0_session_setup
~20-25s during containerised benchmark runs), which creates a
race where the client gives up before the gateway is ready for
the first turn.  Downstream the adapter then surfaces this as
an ``empty_response`` with zero transcript steps, which looks
like a model failure when it's really an environment timing
issue.

Concrete repro from a 19-task public_dev run:
    task:      t4-life-trip-plan
    failed:    reward=0, failure_category=empty_response,
               duration_ms=0, total_ms=16352, response hash
               = SHA256 of empty string
    rerun:     score=0.927 standalone, phase0_session_setup=21.2s

Change:

* GatewayConfig.connect_timeout default 15.0 -> 30.0
* GatewayConfig.request_timeout default kept at 60.0 but
  now explicitly documented and overridable for symmetry
* Both are now overridable via environment variables
  CLAWBENCH_CONNECT_TIMEOUT / CLAWBENCH_REQUEST_TIMEOUT
  so ops can tune further without a code change.
* Invalid env values are logged and fall back to the default
  rather than blowing up benchmark runs.
* Adds three unit tests covering default, env override, and
  invalid-env fallback behaviour.
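
A rough sketch of the env-override behaviour listed above, under stated assumptions: the `_timeout_from_env` helper is hypothetical, the real `GatewayConfig` very likely has more fields, and treating non-positive or non-finite values as invalid is a guess at what "invalid" covers.

```python
import logging
import os
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


def _timeout_from_env(name: str, default: float) -> float:
    """Read a timeout from the environment; fall back to the default if invalid."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        value = -1.0  # force the invalid branch below
    # Assumption: non-positive, NaN, and infinite values count as invalid.
    if not (0 < value < float("inf")):
        logger.warning("Ignoring invalid %s=%r; using default %.1f", name, raw, default)
        return default
    return value


@dataclass
class GatewayConfig:
    # Defaults are resolved at construction time so env changes are picked up.
    connect_timeout: float = field(
        default_factory=lambda: _timeout_from_env("CLAWBENCH_CONNECT_TIMEOUT", 30.0)
    )
    request_timeout: float = field(
        default_factory=lambda: _timeout_from_env("CLAWBENCH_REQUEST_TIMEOUT", 60.0)
    )
```

Logging and falling back, rather than raising, keeps a mistyped variable from aborting a long benchmark run.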

Reported-by: Grzegorz Chlebus <gchlebus@nvidia.com>
2026-04-22 10:19:20 +02:00
scoootscooob
8447ab1ca6 docker: revert OpenClaw base pin; remove reference scores
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.

Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch + rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.

README.md:
  - drops the "Docker base pinning" row from the "What's new" table;
    replaced with "Reproducibility-first infrastructure" framing
  - drops the "pinned" badge; added a "Diagnostics" badge instead
  - updates "Reproducibility caveats" to recommend "build both sides
    of any comparison from the same OpenClaw release" rather than
    "pin to 2026.4.15-beta.1"
  - updates Quick Start to record (not assume) the OpenClaw version
    the build resolved to
  - drops the pinned-base row from the comparison table; replaced
    with "State-isolation per run" (the actually distinguishing infra)
  - updates the version log entry for Core v1 to highlight the
    dynamical-systems diagnostics + state-isolation rather than the
    pinning that's no longer there

tasks-public/README.md:
  - drops the 8-row "Established ranking" table per request
  - replaced with a "Selection criteria" section that explains how
    the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
    publishing version-dependent scores
  - reframes the build instructions to track :latest with a comment
    about platform-version drift

tasks-public/MANIFEST.yaml:
  - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
    a hard requirement)
  - drops the `established_ranking` block
  - replaced with `selection_basis` that documents the methodology
    and explicitly states why scores are intentionally omitted

Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:24:42 -07:00
scoootscooob
0e250e3fe1 fix(ci): tasks-public fallback + leaderboard removed from README
README.md: removed the inline reference leaderboard per user request.
The Core v1 manifest still carries the established ranking, the
README still documents methodology + dynamical-systems diagnostics.

clawbench/tasks.py: extend _resolve_tasks_dir() with a tasks-public/
fallback layer (resolver step 5). Local dev with the private tasks/
present is unchanged; CI without tasks/ now falls back to the public
Core v1 set instead of returning an empty corpus. CI has been broken
since deb3d5d (the "stop tracking current task set" commit); this
restores green CI now that tasks-public/ is available.

tests/test_tasks.py: three updates so tests pass against either the
private 40-task set OR the public 19-task set:
  - test_load_all_tasks_returns_full_corpus: threshold lowered from
    >= 20 to >= 19 (Core v1 size)
  - test_workspace_setup_preserves_nested_asset_paths: switched from
    t1-architecture-brief (private) to t4-browser-research-and-code
    (public) which exercises the same flat+nested asset behaviour
  - test_selected_tasks_include_judge_rubrics: replaced 3 task IDs
    not in the public Core release (t1-architecture-brief,
    t5-contradictory-requirements, t5-impossible-graceful-fail) with
    public-set equivalents (t1-bugfix-discount, t3-feature-export)

Verified locally with both branches:
  - private tasks/ present:    156 passed, 1 skipped
  - private tasks/ hidden:     152 passed, 5 skipped (CI-equivalent)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:32:26 -07:00
scoootscooob
f95e838d99 docs: rewrite README around Core v1 + dynamical-systems diagnostics
Updates the front-door README to reflect the Core v1 release and the
methodology innovations we shipped this cycle. Key additions:

- "What's new in Core v1" table highlighting the five methodology
  layers most agent benchmarks lack (signal-curated task set,
  variance decomposition, dynamical-systems diagnostics, Constraint
  Index, Docker base pinning).

- Reference leaderboard — 8-model ranking on the Core-19 set from the
  v2026-4-19-full sweep. Honest about GLM 5.1's non-reproducibility
  and the OpenRouter routing issue.

- "What makes ClawBench different" expanded with variance
  decomposition (52.7% capability / 47.3% seed noise) and a new
  section (#3) on dynamical-systems diagnostics, including the four
  concrete signals (C(q), regime, survival, SNR-weighted ranking).

- New "Reproducibility caveats" section — what reproduces (audit,
  diagnostics, top-cluster ranking) vs what drifts (absolute scores,
  OpenRouter models, OpenClaw platform upgrades). Documents the
  pinning we did.

- Updated Quick Start with `docker build -t clawbench:core-v1`
  verification flow and a full analysis-pipeline walkthrough using
  the new scripts (rejudge_all, compute_constraint_index, etc).

- Repository layout updated to include tasks-public/ (public) and
  scripts/ with brief descriptions of all 11 reproducibility +
  analysis scripts.

- Comparison table extended with new columns: variance decomposition,
  dynamical regime, SNR-weighted alternative, Docker base pinning,
  provider-routing caveats — all areas where SWE-bench / HumanEval /
  LLM-judge leaderboards are silent.

- Version log + planned Core v2 roadmap (Tier 6 long-horizon,
  paraphrased prompt pairs, creative-synthesis, human baseline).

Headline shifts from "the agent benchmark that measures what users
actually experience" to "Rigorous agent evaluation. Signal-curated
tasks. Dynamical-systems diagnostics." — foregrounds the
methodological contributions that separate Core v1 from prior art.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:15:18 -07:00
scoootscooob
030e9968bd docker: pin OpenClaw base to 2026.4.15-beta.1 for Core v1 reproducibility
The ClawBench Core v1 reference numbers were measured against
ghcr.io/openclaw/openclaw:2026.4.15-beta.1 (SHA 869e5e0ec27099573c54c).
Using the moving ":latest" tag caused observable drift in our sweeps
(platform upgrade from 4.9 to 4.15-beta.1 shifted all-model scores by
+0.13 to +0.29), so unpinned builds produce non-reproducible rankings.

Dockerfile: swap FROM ...:latest -> FROM ...:2026.4.15-beta.1. Added
an explanatory comment noting that bumping the base requires re-
running the reference sweep.

tasks-public/README.md: added build + verification commands so users
can confirm they have the right OpenClaw version before running Core
v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:09:49 -07:00
scoootscooob
50959fa670 tasks: add Core v1 public task set (19 tasks)
Stages a curated 19-task subset of the internal 40-task dev pool as
the public ClawBench release. Selected via greedy task elimination
from the v2026-4-19-full sweep archive so that:

  (a) mean run_score across these 19 tasks reproduces the established
      8-model ranking with zero inversions and min adjacent-rank gap
      of 0.0049 (well above the ~0.002 seed-noise floor);
  (b) coverage is preserved across tiers 1-5 and across the tools,
      coding, repo, browser, multi_tool, and adversarial families;
  (c) tasks with broken verifiers or near-zero cross-model SNR are
      dropped (21 tasks retained as private holdout, not published).
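
Criterion (a) can be checked for any candidate subset with a helper along these lines. The function name and data layout are hypothetical, and the greedy elimination loop itself is not shown; it would presumably call a check like this after tentatively dropping each task.

```python
from itertools import combinations
from statistics import mean


def ranking_quality(scores, tasks, reference):
    """Count ranking inversions and the smallest adjacent-rank gap.

    scores: {model: {task: run_score}}; tasks: the candidate subset;
    reference: model names in the established order, best first.
    """
    means = [mean([scores[m][t] for t in tasks]) for m in reference]
    # An inversion is a pair where the higher-ranked model has the lower mean.
    inversions = sum(1 for a, b in combinations(means, 2) if a < b)
    ordered = sorted(means, reverse=True)
    min_gap = min(a - b for a, b in zip(ordered, ordered[1:]))
    return inversions, min_gap
```

A subset is acceptable under (a) when `inversions` is 0 and `min_gap` stays above the seed-noise floor.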

Established ranking (v4-19-full, OpenClaw 2026.4.15-beta.1, 3 runs
per task, C+T+B+J weighted score):

  1. Claude Opus 4.6         0.8137
  2. Claude Opus 4.7         0.7824
  3. GPT 5.4                 0.7647
  4. Claude Sonnet 4.6       0.7597
  5. MiniMax M2.7            0.7475
  6. Gemini 3.1 Pro          0.7408
  7. Qwen 3.6 Plus           0.7030
  8. Kimi K2.5               0.6800

Deliverables:
  tasks-public/MANIFEST.yaml   — machine-readable task list + metadata
  tasks-public/README.md       — rationale, usage, reproducibility notes
  tasks-public/tier{1..5}/*.yaml  — 19 task definitions
  tasks-public/assets/*/       — 19 asset packs (verifiers + fixtures)

The internal dev set remains in tasks/ (gitignored) and retains 40
tasks for future expansion. Not published:
  - 9 ceiling tasks (all frontier models score >0.85)
  - 9 noise tasks (cross-model SNR < 0.5)
  - 3 ranking-breaker tasks (e.g. t2-node-search-patch,
    t5-contradictory-requirements)

Core v2 will add Tier 6 long-horizon tasks, paraphrased prompt pairs
for perturbation-sensitivity measurement, and creative-synthesis
tasks — all currently absent from Core v1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 20:06:36 -07:00
scoootscooob
b6f07d9a87 analysis: dynamical-systems diagnostics for agent runs
Treats agent runs as stochastic trajectories in semantic state space
and extracts signal that flat run_score averages away. Inspired by
the "When LLMs Are Dreaming, Where Do They Go?" framework: task
constraint characterization, per-run regime classification, seed-vs-
capability variance decomposition, per-turn survival, SNR-weighted
ranking.

Uses TF-IDF bag-of-words embeddings (numpy + scipy only; no external
model dependencies) as the semantic state proxy since sentence
embeddings would require torch. Crude but sufficient for the signals
the paper calls out.

scripts/compute_constraint_index.py: computes C(q) per task from
archive responses. C(q) = -z(PR) - z(entropy) + z(BOPS) where PR is
participation ratio of response covariance, entropy is eigenvalue
entropy, and BOPS is inter-run cosine (predictability proxy). High
C(q) = tasks where models converge to similar answers; low C(q) =
open-ended tasks where models diverge for style reasons.
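
As a sketch of the C(q) formula above: given per-task PR, entropy, and BOPS values that have already been computed, the index is a signed sum of z-scores across tasks. The function names are hypothetical, and the use of population standard deviation is an assumption; the script may standardize differently.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values across tasks; a constant column maps to all zeros."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def constraint_index(pr, entropy, bops):
    """C(q) = -z(PR) - z(entropy) + z(BOPS), one value per task."""
    z_pr, z_ent, z_bops = zscores(pr), zscores(entropy), zscores(bops)
    return [-a - b + c for a, b, c in zip(z_pr, z_ent, z_bops)]
```

A task with low PR, low entropy, and high inter-run cosine therefore gets a high C(q), matching the "models converge" reading.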

scripts/classify_regimes.py: per-run regime classifier. Computes
drift_mean, from_start, recurrence, vol_log over turn trajectories.
Quartile-based thresholds label each run as too_short / trapped /
limit_cycle / diffusive / mixed. Reveals per-model tendencies:
Gemini traps frequently (one-shot answer without iteration), GPT
loops tool patterns, GLM is most balanced.

scripts/variance_decomp.py: decomposes run_score variance per task
into seed variance (3 runs of same model) vs capability variance
(across model means). SNR = cap_var / seed_var. Exposes that 47% of
benchmark variance is seed noise; 21 of 40 tasks have SNR < 1 and
give essentially random rankings.
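
The per-task decomposition can be sketched as below. This assumes seed variance is the average within-model variance and capability variance is the variance of per-model means, both as population variances; the function name and the zero-variance handling are illustrative choices, not taken from the script.

```python
from statistics import mean, pvariance


def task_snr(scores_by_model):
    """Return (seed_var, cap_var, snr) for one task.

    scores_by_model maps a model name to its list of run_score values.
    """
    runs = list(scores_by_model.values())
    # Seed variance: average spread across repeated runs of the same model.
    seed_var = mean([pvariance(r) for r in runs])
    # Capability variance: spread of the per-model mean scores.
    cap_var = pvariance([mean(r) for r in runs])
    if seed_var == 0:
        snr = float("inf") if cap_var > 0 else 0.0
    else:
        snr = cap_var / seed_var
    return seed_var, cap_var, snr
```

A task with SNR below 1 has more run-to-run noise than between-model signal, which is why its ranking is close to random.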

scripts/survival_analysis.py: per-turn empirical survival S(t) and
hazard h(t). T_F = first turn where assistant emits empty response
or run ends in failure. Reveals long-horizon capability that flat
scores hide: Kimi dies at median turn 3, GPT survives to turn 8 at
60% rate.
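
One way to compute per-turn survival and hazard is a discrete product-limit estimate, sketched here. Treating runs that end without failure as censored at their last turn is an assumption, as are the function name and the input layout; the script may define the risk set differently.

```python
def survival_curve(runs, max_turn):
    """Return {turn: (S, h)} from a list of (last_turn, failed) pairs.

    failed=True means the run's failure turn T_F equals last_turn;
    failed=False means the run ended without failure (censored).
    """
    curve, surv = {}, 1.0
    for t in range(1, max_turn + 1):
        at_risk = sum(1 for last, _ in runs if last >= t)
        deaths = sum(1 for last, failed in runs if failed and last == t)
        hazard = deaths / at_risk if at_risk else 0.0
        surv *= 1.0 - hazard  # S(t) = product of (1 - h(s)) for s <= t
        curve[t] = (surv, hazard)
    return curve
```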

scripts/snr_weighted_ranking.py: SNR × |C(q)|-weighted ranking (with
winsorization at p95 to prevent single-task dominance). Headline
metric that weights discriminating + signal-rich tasks more than
noisy or consensus tasks. Also emits SNR-only and flat variants for
comparison.
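
The weighting scheme might look like the following sketch. It assumes the winsorization cap is applied to the product SNR × |C(q)| rather than to SNR alone, that every model has a score for every task, and that at least one weight is positive; the helper names and the interpolated percentile are illustrative.

```python
def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def snr_weighted_ranking(task_scores, snr, cq, cap_pct=95):
    """Rank models by a weighted mean of per-task scores.

    task_scores: {model: {task: mean run_score}}; snr, cq: {task: value}.
    """
    raw = {t: snr[t] * abs(cq[t]) for t in snr}
    cap = percentile(list(raw.values()), cap_pct)
    # Winsorize the upper tail so a single task cannot dominate.
    weights = {t: min(w, cap) for t, w in raw.items()}
    total = sum(weights.values())
    ranked = {
        model: sum(weights[t] * s for t, s in scores.items()) / total
        for model, scores in task_scores.items()
    }
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)
```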

scripts/generate_dynamical_report.py: assembles all four diagnostic
JSONs into a single markdown report with per-model regime tables,
SNR tiers, survival curves, and integrated interpretation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:49:05 -07:00
scoootscooob
afb14c3982 analysis: fair-comparison audit and rejudge pipeline
Tools for auditing archive coverage, rejudging judge-infra failures
via direct Anthropic API (bypasses the gateway path that sometimes
returns "Gateway is restarting" / empty judge results), and producing
fair multi-model comparison reports.

scripts/audit_runs.py: aggregate per-model audit. Parses sweep logs
and archive JSONs side-by-side. Reports coverage %, clean mean,
coverage-normalized score, infra-zero count, judge-infra remaining
vs rejudged.

scripts/audit_per_run.py: per-run cross-model audit. Flags tasks
where all models score zero (broken task/verifier), verifier
rejects-valid-outputs (C=0 but agent produced text), harness-error
clusters, model-specific pathologies.

scripts/rejudge_all.py: re-runs judge scoring on archive runs where
the gateway judge failed. Uses direct anthropic SDK against
claude-sonnet-4-6, rewrites judge_result fields in place, recomputes
run_score per the C+T+B+J weighting.

scripts/generate_fair_report.py: produces an 8/9-model comparison
markdown report. Supports --exclude to drop specific models and headlines
the "clean" score (mean across 120 archived runs). Reports per-tier scores,
C=1.0 task pass counts, and coverage parity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:43 -07:00
scoootscooob
01a31e55fb sweep: per-container state isolation + qwen model-id fix
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.

profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus
(no dash). The dashed slug is unknown to OpenRouter and silently fails;
the no-dash slug is the canonical one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:30 -07:00
scoootscooob
deb3d5d85d tasks: stop tracking current task set; fix t2 integration test for emptyNote
Context:
  The current 40-task set is being split into a private holdout set plus a
  new public set. The public repo will ship a different task set that
  doesn't give away the holdout; in the meantime, stop tracking the current
  tasks/ directory so benchmarking can continue locally without exposing
  the set externally.

Changes:
  - .gitignore: add tasks/ and lab-pr68627/ (vendored PR content, also
    moving out of the public repo).
  - git rm --cached tasks/: remove from tracking (files remain on disk
    locally).
  - tests/test_integration_checks.py:
    * Module-level pytest.mark.skipif that skips the whole file when
      tasks/ is absent — so CI against the public repo (no tasks)
      stays green once the private set moves out.
    * Update the t2-node-search-patch fixture to also define emptyNote()
      since the task was hardened with that distractor. Without this, the
      integration test asserts score==1.0 but gets 0.0 (the new
      "emptyNote stays empty" test fails against a fixture that never
      defines emptyNote).

Follow-up (separate work):
  Public task set lands in a subsequent commit. Holdout access path
  (encrypted-in-repo or private-repo) gets wired into the harness's
  private_tasks_root / hidden_tasks_dir plumbing.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-19 12:29:52 -07:00
scoootscooob
95b226dfed tasks: harden 5 ceiling-bound tasks for better model differentiation
All 5 of these tasks were clearing at 0.85-1.00 across the frontier-4
on v4.14 — narrow spread means they don't differentiate models. Each
now has a specific trap that catches naive approaches:

- t1-refactor-csv-loader: introduces divergent normalization requirements
  between load_rows (lowercase) and summarize_inventory (preserve first-
  seen case). Naive "lowercase everywhere in parse_inventory_row" fails
  2 of 3 tests. Proper refactor returns original case in the helper.

- t3-node-multifile-refactor: adds a 3rd caller (audit.js) requiring
  preserved userId case + minute-precision timestamp, diverging from
  auth.js and report.js. Single-function extraction fails 2 of 4 tests;
  agent must handle two normalization modes.

- t4-browser-research-and-code: docs rewritten with distractors —
  v1/v2/v3 versions, required/optional/cross-endpoint headers, rate
  limits, payload limits. Tests check 6 facts including negative-match
  for X-Admin-Token distractor (scoped to /v2/admin only).

- t2-node-search-patch: adds emptyNote() factory in render.js with
  legitimate empty body: "" that MUST NOT be patched. Naive grep-replace
  of `body: ""` now fails the emptyNote test. Also adds whitespace-
  trimming test for filterNotes.

- t4-memory-recall-continuation: requires storing 3 SEPARATE memory
  entries (beta-regions, retry-budget, apac-gating) instead of one.
  Release notes include operational-notes distractors that must NOT
  be codified. flags.py gains APAC_GATED_UNTIL field. Handoff verifier
  added to check all 3 facts in the handoff artifact.

All 5 tasks verified: properly-implemented starter patches pass all
tests, the new traps specifically fail naive implementations.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-19 12:24:25 -07:00
scoootscooob
cb48ca72e8 tasks: drop strict completion.files checks on 19 tasks
Every one of these tasks has an execution_check script (verify_*.py) that
already does a recursive workspace search — it greps for required content
across every agent-written .md/.txt/.csv regardless of filename. The
completion.files block was redundant and actively penalized models that
wrote to reasonable alternate paths (analysis.md vs budget_report.md).

Before: total=1 (file) + N (exec) → if file path didn't match, score was
capped at N/(N+1). On t3-fin-budget-monthly, 14 of 15 prior sweep runs
failed specifically on "FILE budget_report.md: File does not exist".

After: total=N. Verifier is the source of truth. Judge rubric already
tells graders "don't penalize non-standard paths" — this aligns completion
scoring with that stated policy.
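
The before/after arithmetic can be illustrated with a small sketch. The function and its `file_ok` parameter are hypothetical, and equal weighting of every check is assumed from the "total=1 (file) + N (exec)" description.

```python
def completion_score(exec_passed, exec_total, file_ok=None):
    """Fraction of completion checks passed.

    file_ok=None models the new behaviour (no completion.files check);
    True/False models the old behaviour with one extra file check.
    """
    passed, total = exec_passed, exec_total
    if file_ok is not None:
        passed += int(file_ok)
        total += 1
    return passed / total if total else 0.0
```

With N = 4 passing verifier checks, a mismatched filename used to cap the score at 4/5; without the file check the same run scores 4/4.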

Fixed tasks (all had recursive verifiers):
  t1-fs-quick-note, t1-life-translate, t2-ctx-pronoun-resolve,
  t2-err-instruction-ambig, t2-fs-cleanup-downloads, t2-fs-find-that-thing,
  t2-msg-summarize-thread, t2-priv-redact-doc, t2-skill-excel-rollup,
  t2-sys-memory-roundtrip, t2-web-quick-fact, t3-cal-reschedule-cascade,
  t3-data-sql-query, t3-fin-budget-monthly, t3-msg-inbox-triage,
  t3-social-bill-split, t3-web-research-and-cite, t4-ctx-long-recall,
  t4-life-trip-plan

Spot-checked that each verifier's required-content set already covered
the content_contains constraints that were also dropped.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 13:16:34 -07:00
scoootscooob
8a5be9c686 clawbench: per-sweep cache archiving + generic sweep templates
- scripts/_archive_cache.sh: snapshot run_cache/<model>/ to
  run_cache_archive/<sweep_tag>/ at sweep exit with metadata.json.
  Sourced by sweep scripts so transcripts survive the next sweep's
  cache wipe and stay available for audits.
- scripts/container_sweep_single.sh: base multi-model sweep.
  Adds CACHE_SUB entries for claude-opus-4-7 / claude-sonnet-4-7 so
  their caches are force-cleared at sweep start. Calls archive helper
  on exit.
- scripts/container_sweep_minimal.sh: 1-run-per-task variant for fast
  fix validation (~20 min) instead of full 3-run sweep (~60 min).
- Dockerfile.main: parametrized clawbench-on-openclaw image with
  ARG BASE for pinning to any openclaw tag.
- scripts/git_checkpoint.py + README: documented checkpoint workflow
  for tagging known-good states during risky work.
- .gitignore: un-ignore scripts/, keep targeted ignores for
  __pycache__, .tmp, .local.py.

Co-Authored-By: Claude Opus 4 (1M context) <noreply@anthropic.com>
2026-04-18 12:46:45 -07:00
scoootscooob
fe8fef7795 Merge branch 'pr-4' into codex/merge-pr4 2026-04-16 19:50:11 -07:00
scoootscooob
ee8ff79347 docs: fix ollama profile guidance 2026-04-16 19:49:04 -07:00
scoootscooob
9d802d6c53 fix: classify find_replace-style tools as edits 2026-04-16 19:37:01 -07:00
pllm-uci
517f2207b0 Refine local Ollama profile documentation for clarity and usability 2026-04-15 11:45:57 -07:00
pllm-uci
e2d82b34c3 Add local Ollama model support and configuration guidance to README and profiles 2026-04-15 11:45:12 -07:00
HeYan
a2757e6bd9 fix: classify str_replace and insert tools as mutating edits
classify_tool_call matched tool names against a fixed set of verb
patterns. The pattern for the "edit" family was:

    r"write|edit|patch|apply|create|delete|rename"

This omitted "replace" and "insert", so tools like str_replace,
replace_in_file, insert_text, and insert_at_line all fell through
every check and were returned as ("unknown", False) – classified as
non-mutating with unknown family.

Consequences for any agent that edits via str_replace:
- distinct_mutation_targets stayed empty → min_distinct_mutation_targets
  requirement always failed
- read_before_write_ratio was 1.0 for the wrong reason (no mutations
  detected, so denominator collapsed to 1)
- "edit" never appeared in distinct_families → required_families check
  always reported it as missing

Fix: extend the edit pattern with "replace" and "insert".

Tests added: unit test for classify_tool_call directly and an end-to-end
trajectory test using a str_replace-based edit transcript.
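
A minimal sketch of the extended edit pattern, modelling only the edit family. The real `classify_tool_call` checks other families as well; the constant name and the lower-casing of the tool name are assumptions.

```python
import re

# Edit-family pattern from the commit message, extended with "replace" and "insert".
EDIT_PATTERN = re.compile(r"write|edit|patch|apply|create|delete|rename|replace|insert")


def classify_tool_call(tool_name: str):
    """Return (family, mutating); only the edit family is modelled here."""
    if EDIT_PATTERN.search(tool_name.lower()):
        return "edit", True
    return "unknown", False
```

Tool names such as `str_replace` and `insert_at_line` now land in the edit family instead of falling through to `("unknown", False)`.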
2026-04-14 01:00:13 -07:00
scoootscooob
eb879adf9b Remove reports/ reference from README repo layout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:52:17 -07:00
scoootscooob
6ab3004d63 Remove reports and scripts from repo, add to gitignore
Reports and eval scripts contain internal benchmark data that
should not be public.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-14 00:51:50 -07:00