From 4744a6ae7e48546c78da0fccf6dfbf1846cacc22 Mon Sep 17 00:00:00 2001
From: Codex
Date: Fri, 10 Apr 2026 19:14:11 -0700
Subject: [PATCH] ClawBench: 7-model frontier baseline + bake-off tooling
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Profiles (profiles/):
- frontier_opus_4_6.yaml (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)

All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.

Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile through
  the harness and generates a comparison table. Wraps
  `clawbench run --profile` via an inline Click entry (the package has
  no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer — per-bucket
  mean/Taguchi S/N/per-task win rate/calibration/fANOVA. Classifies
  OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/
  Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
  scale_timeouts.py, seed_historical_db.py: task-corpus tooling.

Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run (3 tier-1
  coding tasks, 1 run each, concurrency 3). Opus 4.6 scored 63.9% with
  real token streaming (174K tok, $0.18 cost). The other 6 clustered at
  33.8%-41.6% — tier-1 tasks are too easy to separate frontier models
  at n=1.
  Documents infrastructure findings around gateway plugin allowlist
  behavior, token streaming gaps for non-Anthropic providers, and
  hot-reload cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning

Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
  (committed snapshot for reproducibility; runtime results still go to
  results/ which remains gitignored)

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 profiles/example_research_stack.yaml | 28 +
 profiles/frontier_gemini_3_pro.yaml | 26 +
 profiles/frontier_glm_5_1.yaml | 26 +
 profiles/frontier_gpt_5_4.yaml | 26 +
 profiles/frontier_kimi_k25.yaml | 26 +
 profiles/frontier_minimax_m27.yaml | 26 +
 profiles/frontier_opus_4_6.yaml | 26 +
 profiles/frontier_qwen_3_6.yaml | 26 +
 reports/CLAWBENCH_100_TASK_PLAN.md | 261 +++++++
 reports/CONTRIBUTING_TASKS.md | 79 +++
 reports/FRONTIER_7MODEL_BASELINE.md | 155 +++++
 reports/FULL_BENCHMARK_REPORT.md | 121 ++++
 reports/PARALLEL_HARNESS_REPORT.md | 183 +++++
 reports/REAL_BENCHMARK_RESULTS.md | 154 +++++
 reports/V05_DELIVERY_REPORT.md | 207 ++++++
 reports/artifacts/frontier_gemini_3_pro.json | 636 +++++++++++++++++
 reports/artifacts/frontier_glm_5_1.json | 636 +++++++++++++++++
 reports/artifacts/frontier_gpt_5_4.json | 637 +++++++++++++++++
 reports/artifacts/frontier_kimi_k25.json | 636 +++++++++++++++++
 reports/artifacts/frontier_minimax_m27.json | 637 +++++++++++++++++
 reports/artifacts/frontier_opus_4_6.json | 630 +++++++++++++++++
 reports/artifacts/frontier_qwen_3_6.json | 636 +++++++++++++++++
 reports/open_vs_closed_bakeoff_summary.md | 26 +
 scripts/analyze_open_vs_closed.py | 189
++++++
 scripts/ingest_real_run.py | 139 ++++
 scripts/inject_judge_rubrics.py | 110 +++
 scripts/refactor_verifiers.py | 680 +++++++++++++++++++
 scripts/run_open_vs_closed_bakeoff.py | 467 +++++++++++++
 scripts/scale_timeouts.py | 47 ++
 scripts/seed_historical_db.py | 32 +
 30 files changed, 7508 insertions(+)
 create mode 100644 profiles/example_research_stack.yaml
 create mode 100644 profiles/frontier_gemini_3_pro.yaml
 create mode 100644 profiles/frontier_glm_5_1.yaml
 create mode 100644 profiles/frontier_gpt_5_4.yaml
 create mode 100644 profiles/frontier_kimi_k25.yaml
 create mode 100644 profiles/frontier_minimax_m27.yaml
 create mode 100644 profiles/frontier_opus_4_6.yaml
 create mode 100644 profiles/frontier_qwen_3_6.yaml
 create mode 100644 reports/CLAWBENCH_100_TASK_PLAN.md
 create mode 100644 reports/CONTRIBUTING_TASKS.md
 create mode 100644 reports/FRONTIER_7MODEL_BASELINE.md
 create mode 100644 reports/FULL_BENCHMARK_REPORT.md
 create mode 100644 reports/PARALLEL_HARNESS_REPORT.md
 create mode 100644 reports/REAL_BENCHMARK_RESULTS.md
 create mode 100644 reports/V05_DELIVERY_REPORT.md
 create mode 100644 reports/artifacts/frontier_gemini_3_pro.json
 create mode 100644 reports/artifacts/frontier_glm_5_1.json
 create mode 100644 reports/artifacts/frontier_gpt_5_4.json
 create mode 100644 reports/artifacts/frontier_kimi_k25.json
 create mode 100644 reports/artifacts/frontier_minimax_m27.json
 create mode 100644 reports/artifacts/frontier_opus_4_6.json
 create mode 100644 reports/artifacts/frontier_qwen_3_6.json
 create mode 100644 reports/open_vs_closed_bakeoff_summary.md
 create mode 100755 scripts/analyze_open_vs_closed.py
 create mode 100644 scripts/ingest_real_run.py
 create mode 100644 scripts/inject_judge_rubrics.py
 create mode 100644 scripts/refactor_verifiers.py
 create mode 100755 scripts/run_open_vs_closed_bakeoff.py
 create mode 100644 scripts/scale_timeouts.py
 create mode 100644 scripts/seed_historical_db.py

diff --git a/profiles/example_research_stack.yaml
b/profiles/example_research_stack.yaml new file mode 100644 index 0000000..4490d17 --- /dev/null +++ b/profiles/example_research_stack.yaml @@ -0,0 +1,28 @@ +profile: + name: example-research-stack + base_model: claude-sonnet-4 + notes: | + A typical research-oriented configuration: anthropic provider plus + memory + browser tooling. Used as the example in CLI documentation. + plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + - clawhub:rag-pinecone@1.2.0 + - local:./plugins/code-reviewer + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write + - pinecone_query + - review_file diff --git a/profiles/frontier_gemini_3_pro.yaml b/profiles/frontier_gemini_3_pro.yaml new file mode 100644 index 0000000..95d8aca --- /dev/null +++ b/profiles/frontier_gemini_3_pro.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-gemini-3-pro + base_model: google/gemini-3.1-pro-preview + notes: | + Frontier agentic coding model comparison: Gemini 3.1 Pro (closed). + Google flagship. Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. + plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/profiles/frontier_glm_5_1.yaml b/profiles/frontier_glm_5_1.yaml new file mode 100644 index 0000000..55d372f --- /dev/null +++ b/profiles/frontier_glm_5_1.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-glm-5-1 + base_model: openrouter/z-ai/glm-5.1 + notes: | + Frontier agentic coding model comparison: GLM-5.1 (open). + Zhipu AI open-weights. 
Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. + plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/profiles/frontier_gpt_5_4.yaml b/profiles/frontier_gpt_5_4.yaml new file mode 100644 index 0000000..282c875 --- /dev/null +++ b/profiles/frontier_gpt_5_4.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-gpt-5-4 + base_model: openai/gpt-5.4 + notes: | + Frontier agentic coding model comparison: GPT-5.4 (closed). + OpenAI flagship. Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. + plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/profiles/frontier_kimi_k25.yaml b/profiles/frontier_kimi_k25.yaml new file mode 100644 index 0000000..0749f4a --- /dev/null +++ b/profiles/frontier_kimi_k25.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-kimi-k25 + base_model: openrouter/moonshotai/kimi-k2.5 + notes: | + Frontier agentic coding model comparison: Kimi K2.5 (open). + Moonshot open-weights. Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. 
+ plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/profiles/frontier_minimax_m27.yaml b/profiles/frontier_minimax_m27.yaml new file mode 100644 index 0000000..b899040 --- /dev/null +++ b/profiles/frontier_minimax_m27.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-minimax-m27 + base_model: openrouter/minimax/minimax-m2.7 + notes: | + Frontier agentic coding model comparison: MiniMax M2.7 (open). + MiniMax open-weights. Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. + plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/profiles/frontier_opus_4_6.yaml b/profiles/frontier_opus_4_6.yaml new file mode 100644 index 0000000..7f8942c --- /dev/null +++ b/profiles/frontier_opus_4_6.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-opus-4-6 + base_model: anthropic/claude-opus-4-6 + notes: | + Frontier agentic coding model comparison: Claude Opus 4.6 (closed). + Anthropic flagship. Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. 
+ plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/profiles/frontier_qwen_3_6.yaml b/profiles/frontier_qwen_3_6.yaml new file mode 100644 index 0000000..686caea --- /dev/null +++ b/profiles/frontier_qwen_3_6.yaml @@ -0,0 +1,26 @@ +profile: + name: frontier-qwen-3-6 + base_model: openrouter/qwen/qwen-3.6-plus + notes: | + Frontier agentic coding model comparison: Qwen3.6-Plus (open). + Alibaba open-weights. Plugin stack IDENTICAL across all 7 profiles so the base + model is the only structural variable. Any score delta is attributable + to the model, not the scaffold. + plugins: + enabled: + - anthropic + - id: memory-lancedb + config: + dimensions: 1536 + - browser-playwright + slots: + memory: memory-lancedb + contextEngine: builtin + tools_allow: + - bash + - file_read + - file_edit + - browser_navigate + - browser_click + - memory_read + - memory_write diff --git a/reports/CLAWBENCH_100_TASK_PLAN.md b/reports/CLAWBENCH_100_TASK_PLAN.md new file mode 100644 index 0000000..4ee51a0 --- /dev/null +++ b/reports/CLAWBENCH_100_TASK_PLAN.md @@ -0,0 +1,261 @@ +# ClawBench 100-Task Expansion Plan + +## Goal + +Expand ClawBench from 20 tasks to 100 tasks. Cover all 72 queries from the +基础使用场景测试集 sheet at least loosely. Add new high-frequency personal-agent +scenarios that the sheet does not capture. Make every task vague-prompted, +multi-step, and verifiable through deterministic execution checks. + +## Core Authoring Rules (apply to every new task) + +1. **Vague user prompt.** The user message should sound like a real human at + the end of a long day, not a labeled rubric. No numbered steps. No + parameter lists. No "do all of the following". The agent must discover + structure from the workspace. +2. 
**Hidden requirements.** All structure (file names, output schemas, time
+   windows, priority rules) lives in the workspace, not the prompt.
+3. **Multi-stage.** Every new task is at minimum 4 distinct phases:
+   discover → plan → act → verify. Tier 4 tasks add a recovery or
+   reconciliation phase.
+4. **Frontier separators.** Every task must have at least one design element
+   that bunches weak agents and separates strong ones: dedupe, timezone math,
+   corrupt input, mutually exclusive constraints, ambiguity that requires
+   asking or grounding, or cross-stage state passing.
+5. **Sandboxed.** No real external sends. Email/chat/calendar/cron live in
+   workspace files or the OpenClaw test gateway.
+6. **Verifiable.** Every task ships with execution_checks scripts that
+   deterministically pass or fail. No LLM judges in the primary path.
+7. **No fabrication tolerance.** Where the agent could hallucinate, the
+   verifier explicitly checks grounding (e.g., summary cites real event_ids,
+   prices match real source data, contacts resolved from real records).
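Rules 6 and 7 above can be illustrated with a small grounding verifier. The following is a minimal sketch, not the repository's actual verifier: the file names `events.json` and `summary.json`, the `cited_event_ids` field, and the `verify` and `find_ungrounded_ids` helpers are all hypothetical. It only shows the general shape of a deterministic check that fails when a summary cites an `event_id` that is absent from the source data.

```python
"""Hypothetical grounding verifier sketch (all names are assumptions)."""
import json
from pathlib import Path


def find_ungrounded_ids(cited_ids, known_ids):
    """Return the cited IDs that do not appear in the source data, sorted."""
    return sorted(set(cited_ids) - set(known_ids))


def verify(workspace: Path) -> int:
    """Return 0 on pass and 1 on fail, so the result can serve as an exit code."""
    events_path = workspace / "events.json"    # assumed source-of-truth file
    summary_path = workspace / "summary.json"  # assumed agent output file
    if not summary_path.exists():
        print("FAIL: summary.json was not produced")
        return 1
    try:
        events = json.loads(events_path.read_text(encoding="utf-8"))
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
        known = [event["event_id"] for event in events]
        cited = list(summary.get("cited_event_ids", []))
    except (OSError, ValueError, KeyError, TypeError, AttributeError) as exc:
        # Unreadable or malformed inputs count as a failure, never a crash.
        print(f"FAIL: could not read inputs: {exc}")
        return 1
    if not cited:
        print("FAIL: summary cites no events")
        return 1
    ungrounded = find_ungrounded_ids(cited, known)
    if ungrounded:
        print(f"FAIL: event_ids not found in source data: {ungrounded}")
        return 1
    print("PASS")
    return 0
```

A script wrapper would pass the return value of `verify` to `sys.exit`, so that a failed check yields a non-zero exit code and no LLM judge is involved in the decision.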
+
+## Task Distribution Across 100 Tasks
+
+| Scenario | Tasks | Existing | New |
+|---|---:|---:|---:|
+| `file_system_ops` | 8 | 0 | 8 |
+| `web_info_ops` | 7 | 2 | 5 |
+| `calendar_reminders` | 6 | 0 | 6 |
+| `communication_messaging` | 8 | 0 | 8 |
+| `data_processing_analysis` | 9 | 2 | 7 |
+| `coding_dev_assist` | 9 | 9 | 0 |
+| `personal_life_assistant` | 7 | 0 | 7 |
+| `multi_step_compound` | 8 | 3 | 5 |
+| `context_continuation` | 7 | 1 | 6 |
+| `error_boundary_cases` | 7 | 2 | 5 |
+| `skill_calling` | 7 | 0 | 7 |
+| `system_capabilities` | 5 | 1 | 4 |
+| `privacy_pii_handling` (NEW) | 4 | 0 | 4 |
+| `personal_financial_hygiene` (NEW) | 3 | 0 | 3 |
+| `travel_logistics_under_uncertainty` (NEW) | 3 | 0 | 3 |
+| `social_coordination` (NEW) | 2 | 0 | 2 |
+| **Total** | **100** | **20** | **80** |
+
+## Tier Distribution
+
+| Tier | Existing | Target | Rationale |
+|---|---:|---:|---|
+| Tier 1 (single capability, easy) | 3 | 12 | Calibration floor |
+| Tier 2 (intermediate, 2-3 capabilities) | 5 | 28 | Bulk of personal-agent surface |
+| Tier 3 (multi-stage, 4+ capabilities) | 5 | 32 | Where most differentiation lives |
+| Tier 4 (frontier, multi-phase, recovery) | 4 | 20 | Premium frontier signal |
+| Tier 5 (adversarial, edge cases) | 3 | 8 | Safety and robustness |
+
+## Why Add 4 New Scenarios Beyond the Test Sheet
+
+The test sheet's 12 scenarios cover canonical personal-agent surface, but
+omit four classes of high-frequency real-world tasks that production
+personal agents must handle:
+
+1. **`privacy_pii_handling`** — redacting personal info from documents
+   before sharing, identifying sensitive data leakage in screenshots and
+   uploads, sandboxing credentials. Personal agents touch PII constantly.
+
+2. **`personal_financial_hygiene`** — budget tracking, expense categorization,
+   subscription auditing, receipt parsing. Not investment advice (prohibited)
+   but everyday personal-finance hygiene that agents are routinely asked
+   to help with.
+
+3. 
**`travel_logistics_under_uncertainty`** — flight delays, replanning
+   under cancellations, multi-leg booking constraints, time-zone aware
+   reminders. The "uncertainty" axis (things going wrong mid-trip) is
+   missing from the test sheet's calendar/reminder coverage.
+
+4. **`social_coordination`** — splitting bills, scheduling with multiple
+   humans, RSVPing on behalf of user, group decisions. These require
+   careful constraint-satisfaction and tactful drafting.
+
+Each new scenario contributes a small but non-trivial weight (1–4%).
+
+## The 100 Tasks (by scenario)
+
+Naming convention: `{tier}-{scenario_short}-{descriptor}.yaml`.
+"V" marks tasks already authored or in progress at time of writing.
+
+### file_system_ops (8 tasks)
+
+- t1-fs-quick-note L1 — vague "jot down what I just said" with formatting inferred from context
+- t2-fs-find-that-thing L2 — fuzzy file recall ("the spreadsheet I worked on last month, with the budget stuff")
+- t2-fs-cleanup-downloads L2 — vague "tidy up my downloads" with hidden retention rules
+- t2-fs-photo-rename L2 — batch rename with EXIF date extraction and conflict handling
+- t3-fs-incident-bundle L3 — V — incident assembly with dedupe, DST, corrupt skip
+- t3-fs-archive-rotation L3 — vague "archive last quarter and free up space" with retention policy
+- t4-fs-recovery-from-mess L4 — partial-failure recovery: previous agent left workspace half-organized
+- t4-fs-cross-volume-sync L4 — sync state across two simulated drives with conflict resolution
+
+### web_info_ops (7 tasks)
+
+- t2-web-quick-fact L2 — V — Q-WEB-02 style "what's the weather and the dollar today"
+- t2-web-research-note L2 — V — t4 already covers research-and-code; this is research-only
+- t2-web-table-extract L2 — table extraction to CSV with header inference and unit normalization
+- t3-web-price-compare L3 — multi-source price comparison with seller reputation weighting
+- t3-web-form-debug L3 — V — t2-browser-form-fix
+- t3-web-research-and-cite L3 — 
research with mandatory citation and grounding check
+- t4-web-deep-dive L4 — multi-hop research with contradicting sources reconciliation
+
+### calendar_reminders (6 tasks)
+
+- t1-cal-quick-reminder L1 — vague "remind me later" with implicit time inference
+- t2-cal-create-event L2 — natural-language event creation with attendee resolution
+- t2-cal-recurring-routine L2 — recurring rule from natural-language description
+- t3-cal-conflict-resolver L3 — V — priority-based conflict resolution with DST and eviction trace
+- t3-cal-reschedule-cascade L3 — one cancellation triggers reschedule cascade across linked events
+- t4-cal-multi-tz-coord L4 — multi-timezone meeting coordination with constraint solver
+
+### communication_messaging (8 tasks)
+
+- t2-msg-send-update L2 — vague "let the team know I'm running late" with channel and contact resolution
+- t2-msg-summarize-thread L2 — summarize a long thread with action-item extraction
+- t2-msg-write-email L2 — formal email from sparse bullet points
+- t3-msg-inbox-triage L3 — classify, prioritize, draft replies for urgent items
+- t3-msg-followup-loop L3 — track unanswered messages and draft follow-ups with context
+- t3-msg-newsletter-purge L3 — bulk unsubscribe planner with allowlist exceptions
+- t4-msg-multilingual-thread L4 — thread spanning EN/中文 with consistent tone preservation
+- t4-msg-conflict-mediation L4 — drafting a tactful response to a tense thread
+
+### data_processing_analysis (9 tasks)
+
+- t2-data-monthly-aggregate L2 — Excel-style monthly rollup with structured output
+- t2-data-format-convert L2 — JSON↔CSV↔YAML with type preservation
+- t2-data-clean-and-dedupe L2 — clean dirty data with audit log of changes
+- t3-data-pipeline-report L3 — V — existing
+- t3-data-multifile-merge L3 — merge N CSVs with schema reconciliation
+- t3-data-pivot-and-chart L3 — pivot table generation and chart export
+- t3-data-sql-query L3 — natural-language to SQL with result verification
+- 
t4-data-anomaly-investigate L4 — detect, explain, and remediate anomalies in time-series data
+- t4-data-cross-source-recon L4 — reconcile discrepancies between two sources of truth
+
+### coding_dev_assist (9 tasks — keep existing)
+
+All existing t1/t2/t3 coding tasks remain. Reframing 1-2 of them to be more
+user-facing (e.g. PNG→JPG batch script) is a future iteration.
+
+### personal_life_assistant (7 tasks)
+
+- t1-life-translate L1 — translation with tone preservation
+- t2-life-recipe-from-fridge L2 — constraint-based recipe selection (dietary, ingredients, time)
+- t2-life-package-tracker L2 — track multiple packages and produce a digest
+- t2-life-unit-convert L2 — multi-unit conversion with currency lookup
+- t3-life-personal-shopper L3 — shopping list build from sparse goals + budget
+- t3-life-letter-draft L3 — formal letter from emotional bullet points
+- t4-life-trip-plan L4 — multi-day trip plan with constraints and grounding
+
+### multi_step_compound (8 tasks)
+
+- t3-multi-research-to-md L3 — research → structured markdown report
+- t3-multi-scrape-analyze L3 — scrape → analyze → chart pipeline
+- t3-multi-email-cal-reply L3 — read inbox → create calendar entry → reply
+- t3-multi-download-summarize L3 — download → summarize → forward
+- t3-feature-export L3 — V — existing
+- t3-data-pipeline-report L3 — V — existing
+- t3-monitoring-automation L3 — V — existing
+- t4-multi-conditional-branch L4 — task with conditional branches based on file existence
+
+### context_continuation (7 tasks)
+
+- t2-ctx-pronoun-resolve L2 — multi-turn with pronouns and ellipsis
+- t2-ctx-preference-recall L2 — recall stated preferences in later turn
+- t3-ctx-task-resume L3 — resume yesterday's half-finished work from memory
+- t3-ctx-correction-chain L3 — multi-turn corrections to a single output
+- t3-ctx-multitask-switch L3 — interrupt current task, do another, return
+- t4-ctx-long-recall L4 — recall fact from 20 turns earlier
+- t4-memory-recall-continuation L4 
— V — existing
+
+### error_boundary_cases (7 tasks)
+
+- t1-err-resource-missing L1 — graceful handling of missing file/URL
+- t2-err-permission-denied L2 — graceful refusal on protected paths
+- t2-err-instruction-ambig L2 — ask vs guess on ambiguous request
+- t3-err-tool-failure L3 — primary tool fails, agent must use fallback
+- t3-err-mid-task-interrupt L3 — recover from simulated interruption
+- t5-impossible-graceful-fail L5 — V — existing
+- t5-hallucination-resistant-evidence L5 — V — existing
+
+### skill_calling (7 tasks)
+
+- t2-skill-excel-rollup L2 — Excel skill: read sheet, compute, write new sheet
+- t2-skill-pdf-merge L2 — PDF skill: merge, extract pages, page count
+- t2-skill-word-memo L2 — Word skill: structured memo with formatting
+- t3-skill-ppt-from-md L3 — PPT skill: generate deck from markdown brief
+- t3-skill-pdf-extract-table L3 — PDF skill: extract tabular data into CSV
+- t4-skill-quarterly-bundle L4 — orchestrate Excel + PPT + PDF + Word for one report
+- t4-skill-cross-format L4 — convert between formats with structure preservation
+
+### system_capabilities (5 tasks)
+
+- t2-sys-memory-roundtrip L2 — write to memory, recall in next session
+- t2-sys-image-generate L2 — image generation with constraint adherence
+- t3-sys-html-preview L3 — generate HTML dashboard, preview, verify rendering
+- t3-sys-automation-set L3 — create cron + verify execution
+- t4-sys-multi-skill-orchestrate L4 — orchestrate memory + image + automation
+
+### privacy_pii_handling (NEW — 4 tasks)
+
+- t2-priv-redact-doc L2 — redact PII from a document before sharing
+- t3-priv-screenshot-scan L3 — scan screenshots for sensitive info, produce report
+- t3-priv-credential-isolate L3 — detect and isolate credentials accidentally pasted in notes
+- t4-priv-leakage-audit L4 — audit a workspace for PII exposure across many files
+
+### personal_financial_hygiene (NEW — 3 tasks)
+
+- t2-fin-receipt-parse L2 — parse receipts from photos/PDFs into expense log
+- 
t3-fin-subscription-audit L3 — find unused subscriptions in transaction history
+- t3-fin-budget-monthly L3 — compute monthly budget vs actual with category drill-down
+
+### travel_logistics_under_uncertainty (NEW — 3 tasks)
+
+- t3-travel-replan-delay L3 — replan an itinerary after a flight delay
+- t3-travel-multi-leg L3 — multi-leg trip with timezone-aware reminders
+- t4-travel-recovery L4 — full recovery from a major cancellation event
+
+### social_coordination (NEW — 2 tasks)
+
+- t3-social-bill-split L3 — bill split with itemized contributions and edge cases
+- t4-social-group-meet L4 — coordinate a meeting time across N people with constraints
+
+## Implementation Phasing
+
+### Phase 1 (current PR): Foundation
+- Add new scenario domains to schema (DONE)
+- Update scenario weights (DONE)
+- Author this plan (DONE)
+- Author 20 high-quality YAML files spanning all new scenarios
+
+### Phase 2: Asset packs
+- Build asset packs for the 20 Phase 1 tasks
+- Build verifier scripts for each task
+
+### Phase 3: Bulk authoring
+- Author the remaining 60 task YAML files following the templates
+- Build remaining asset packs and verifiers
+- Update query_catalog.py with metadata for all 100 tasks
+
+### Phase 4: Calibration
+- Run 5 frontier models against the 100-task suite
+- Identify tasks with zero discrimination (all models pass or all fail) and rewrite
+- Tune scenario weights based on observed score variance
+
+### Phase 5: Lock and rotate
+- Move 30% of tasks to `official_hidden` pool
+- Set up rotation schedule for hidden variants
diff --git a/reports/CONTRIBUTING_TASKS.md b/reports/CONTRIBUTING_TASKS.md
new file mode 100644
index 0000000..3e7b282
--- /dev/null
+++ b/reports/CONTRIBUTING_TASKS.md
@@ -0,0 +1,79 @@
+# Contributing Tasks to ClawBench
+
+This guide explains how to add a new task to the ClawBench suite. Every
+task is a triple of:
+
+1. A YAML definition under `tasks/tier{1..5}/`
+2. An asset pack under `tasks/assets//`
+3. 
One or more verifier scripts inside the asset pack
+
+The 100-task plan in `CLAWBENCH_100_TASK_PLAN.md` lists every task slot.
+The reference implementations to pattern-match against are:
+
+- `tasks/tier1/t1-fs-quick-note.yaml` + `tasks/assets/t1_fs_quick_note/`
+- `tasks/tier2/t2-fs-cleanup-downloads.yaml` + `tasks/assets/t2_fs_cleanup_downloads/`
+- `tasks/tier2/t2-sys-memory-roundtrip.yaml` + `tasks/assets/t2_sys_memory_roundtrip/`
+
+## Authoring rules (non-negotiable)
+
+1. **Vague user prompt.** Real-human voice. No numbered steps. No
+   parameter lists. No "do all of the following".
+2. **Hidden requirements.** All structure (file names, schemas, time
+   windows, priority rules) lives in workspace files, not the prompt.
+3. **Multi-stage.** Discover → plan → act → verify. Tier 4 adds recovery.
+4. **Frontier separators.** At least one design element that bunches
+   weak agents and separates strong ones (dedupe, timezone math, corrupt
+   input, mutually exclusive constraints, ambiguity, no-fabrication).
+5. **Sandboxed.** No real external sends. Email/cal/cron in workspace.
+6. **Verifiable.** Every assertion runs as a Python verifier with a
+   non-zero exit code on failure. No LLM judges in the primary path.
+7. **No fabrication tolerance.** Where the agent could hallucinate, the
+   verifier explicitly checks grounding.
+
+## Verifier conventions
+
+- One verifier script per `execution_check` in the YAML
+- Script lives next to its asset pack: `tasks/assets//