manifest_version: 1
release: clawbench-core-v1
release_date: 2026-04-20
benchmark_version: 0.4.0.dev1
task_count: 19

description: |
  ClawBench Core v1: a curated subset of 19 tasks from the internal
  40-task ClawBench dev pool. Tasks were selected so that:
    (a) all 8 measured frontier models produce the established ranking
        order in the v4-19-full sweep,
    (b) coverage is preserved across tiers (1–5) and task families
        (tools, coding, repo, browser, multi_tool, adversarial),
    (c) tasks with broken verifiers or near-zero cross-model SNR are
        dropped.

  Verification: mean run_score across these 19 tasks reproduces the
  reference ranking with 0 inversions and a minimum adjacent-rank gap
  of 0.0049 (well above the ~0.002 seed-noise floor).

selection_basis:
  description: |
    The 19 tasks below were chosen via greedy task selection from the
    v2026-4-19-full archive so that the cross-model mean reproduces
    the reference 8-model ordering with 0 inversions and a minimum
    adjacent-rank gap of 0.0049 (~2.5x the seed-noise floor).
  reference_models:
    - anthropic/claude-opus-4-6
    - anthropic/claude-opus-4-7
    - openai/gpt-5.4
    - anthropic/claude-sonnet-4-6
    - openrouter/minimax/minimax-m2.7
    - google/gemini-3.1-pro-preview
    - openrouter/qwen/qwen3.6-plus
    - openrouter/moonshotai/kimi-k2.5
  notes: |
    Numerical scores are intentionally omitted from this manifest. They
    are openclaw-version-, provider-routing-, and seed-dependent;
    publishing them would mislead anyone treating them as a stable
    reference. Run the bench against your own configuration to
    establish your own baseline.

coverage:
  tiers:
    tier1: 2
    tier2: 6
    tier3: 5
    tier4: 5
    tier5: 1
  families:
    tools: 8
    coding: 2
    repo: 3
    browser: 2
    multi_tool: 3
    adversarial: 1
  # In tiers 3 and 4 some families overlap; see the per-task manifest below.

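# Consistency check (derived from the task list below): the tier counts sum to
# 2 + 6 + 5 + 5 + 1 = 19 and the family counts sum to
# 8 + 2 + 3 + 2 + 3 + 1 = 19, both matching task_count.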
tasks:
  - id: t1-bugfix-discount
    tier: tier1
    family: coding
    capabilities: [bugfix]
    path: tier1/t1-bugfix-discount.yaml
    asset_pack: t1_bugfix_discount

  - id: t1-fs-quick-note
    tier: tier1
    family: tools
    capabilities: [structured_output]
    path: tier1/t1-fs-quick-note.yaml
    asset_pack: t1_fs_quick_note

  - id: t2-add-tests-normalizer
    tier: tier2
    family: coding
    capabilities: [test_authoring]
    path: tier2/t2-add-tests-normalizer.yaml
    asset_pack: t2_add_tests_normalizer

  - id: t2-browser-form-fix
    tier: tier2
    family: browser
    capabilities: [browser_debugging, bugfix]
    path: tier2/t2-browser-form-fix.yaml
    asset_pack: t2_browser_form_fix

  - id: t2-config-loader
    tier: tier2
    family: repo
    capabilities: [bugfix, multifile_reasoning]
    path: tier2/t2-config-loader.yaml
    asset_pack: t2_config_loader

  - id: t2-fs-find-that-thing
    tier: tier2
    family: tools
    capabilities: [structured_output]
    path: tier2/t2-fs-find-that-thing.yaml
    asset_pack: t2_fs_find_that_thing

  - id: t2-msg-summarize-thread
    tier: tier2
    family: tools
    capabilities: [research_synthesis, structured_output]
    path: tier2/t2-msg-summarize-thread.yaml
    asset_pack: t2_msg_summarize_thread

  - id: t2-priv-redact-doc
    tier: tier2
    family: tools
    capabilities: [structured_output, graceful_refusal]
    path: tier2/t2-priv-redact-doc.yaml
    asset_pack: t2_priv_redact_doc

  - id: t3-data-pipeline-report
    tier: tier3
    family: multi_tool
    capabilities: [structured_output, multifile_reasoning]
    path: tier3/t3-data-pipeline-report.yaml
    asset_pack: t3_data_pipeline_report

  - id: t3-data-sql-query
    tier: tier3
    family: tools
    capabilities: [structured_output]
    path: tier3/t3-data-sql-query.yaml
    asset_pack: t3_data_sql_query

  - id: t3-feature-export
    tier: tier3
    family: repo
    capabilities: [multifile_reasoning, structured_output]
    path: tier3/t3-feature-export.yaml
    asset_pack: t3_feature_export

  - id: t3-msg-inbox-triage
    tier: tier3
    family: tools
    capabilities: [structured_output, multifile_reasoning]
    path: tier3/t3-msg-inbox-triage.yaml
    asset_pack: t3_msg_inbox_triage

  - id: t3-web-research-and-cite
    tier: tier3
    family: tools
    capabilities: [research_synthesis]
    path: tier3/t3-web-research-and-cite.yaml
    asset_pack: t3_web_research_and_cite

  - id: t4-browser-research-and-code
    tier: tier4
    family: browser
    capabilities: [browser_debugging, research_synthesis]
    path: tier4/t4-browser-research-and-code.yaml
    asset_pack: t4_browser_research_and_code

  - id: t4-cross-repo-migration
    tier: tier4
    family: repo
    capabilities: [cross_repo_change, multifile_reasoning]
    path: tier4/t4-cross-repo-migration.yaml
    asset_pack: t4_cross_repo_migration

  - id: t4-delegation-repair
    tier: tier4
    family: multi_tool
    capabilities: [delegation, bugfix]
    path: tier4/t4-delegation-repair.yaml
    asset_pack: t4_delegation_repair

  - id: t4-life-trip-plan
    tier: tier4
    family: tools
    capabilities: [research_synthesis, structured_output]
    path: tier4/t4-life-trip-plan.yaml
    asset_pack: t4_life_trip_plan

  - id: t4-memory-recall-continuation
    tier: tier4
    family: multi_tool
    capabilities: [memory_continuation, multifile_reasoning]
    path: tier4/t4-memory-recall-continuation.yaml
    asset_pack: t4_memory_recall_continuation

  - id: t5-hallucination-resistant-evidence
    tier: tier5
    family: adversarial
    capabilities: [research_synthesis, tool_composition]
    path: tier5/t5-hallucination-resistant-evidence.yaml
    asset_pack: t5_hallucination_resistant_evidence

notes: |
  - The full private dev set (tasks/) contains 40 tasks. This Core-19
    subset is the signal-rich, ranking-consistent public release.
  - The remaining 21 tasks are retained as a private holdout for
    contamination-resistant measurement of future models.
  - Task families "creative" and "long-horizon (Tier 6)" are absent
    from Core v1; they are planned for a future release.
  - Known caveats: t4-memory-recall-continuation has a verifier that
    penalizes agents that respond in conversation rather than via file
    artifacts. All models face the same verifier, so the comparison is
    internally fair, but absolute scores understate capability.
  - t5-hallucination-resistant-evidence has low cross-model SNR (about
    0.25) in v4-19-full; it is included for adversarial-family coverage
    despite this. Consider upgrading the verifier in a future release.