clawbench/tasks-public/MANIFEST.yaml

manifest_version: 1
release: clawbench-core-v1
release_date: 2026-04-20
benchmark_version: 0.4.0.dev1
task_count: 19
description: |
  ClawBench Core v1: a curated subset of 19 tasks from the internal
  40-task ClawBench dev pool. Selected so that:
    (a) all 8 measured frontier models produce the established ranking
        order in the v4-19-full sweep,
    (b) coverage is preserved across tiers (1-5) and task families
        (tools, coding, repo, browser, multi_tool, adversarial),
    (c) tasks with broken verifiers or near-zero cross-model SNR are
        dropped.
  Verification: mean run_score across these 19 tasks reproduces the
  reference ranking with 0 inversions and min adjacent-rank gap of
  0.0049 (well above the ~0.002 seed-noise floor).
selection_basis:
  description: |
    The 19 tasks below were chosen via greedy task selection from the
    v2026-4-19-full archive so that the cross-model mean reproduces
    the reference 8-model ordering with 0 inversions and a min
    adjacent-rank gap of 0.0049 (~2.5x the seed-noise floor).
  reference_models:
    - anthropic/claude-opus-4-6
    - anthropic/claude-opus-4-7
    - openai/gpt-5.4
    - anthropic/claude-sonnet-4-6
    - openrouter/minimax/minimax-m2.7
    - google/gemini-3.1-pro-preview
    - openrouter/qwen/qwen3.6-plus
    - openrouter/moonshotai/kimi-k2.5
  notes: |
    Numerical scores intentionally omitted from this manifest. They
    are openclaw-version-, provider-routing-, and seed-dependent;
    publishing them would mislead anyone treating them as a stable
    reference. Run the bench against your own configuration to
    establish your own baseline.
coverage:
  tiers:
    tier1: 2
    tier2: 6
    tier3: 5
    tier4: 5
    tier5: 1
  families:
    tools: 8
    coding: 2
    repo: 3
    browser: 2
    multi_tool: 3
    adversarial: 1
  # In tiers 3 and 4, some families overlap; see the per-task entries below.
tasks:
  - id: t1-bugfix-discount
    tier: tier1
    family: coding
    capabilities: [bugfix]
    path: tier1/t1-bugfix-discount.yaml
    asset_pack: t1_bugfix_discount
  - id: t1-fs-quick-note
    tier: tier1
    family: tools
    capabilities: [structured_output]
    path: tier1/t1-fs-quick-note.yaml
    asset_pack: t1_fs_quick_note
  - id: t2-add-tests-normalizer
    tier: tier2
    family: coding
    capabilities: [test_authoring]
    path: tier2/t2-add-tests-normalizer.yaml
    asset_pack: t2_add_tests_normalizer
  - id: t2-browser-form-fix
    tier: tier2
    family: browser
    capabilities: [browser_debugging, bugfix]
    path: tier2/t2-browser-form-fix.yaml
    asset_pack: t2_browser_form_fix
  - id: t2-config-loader
    tier: tier2
    family: repo
    capabilities: [bugfix, multifile_reasoning]
    path: tier2/t2-config-loader.yaml
    asset_pack: t2_config_loader
  - id: t2-fs-find-that-thing
    tier: tier2
    family: tools
    capabilities: [structured_output]
    path: tier2/t2-fs-find-that-thing.yaml
    asset_pack: t2_fs_find_that_thing
  - id: t2-msg-summarize-thread
    tier: tier2
    family: tools
    capabilities: [research_synthesis, structured_output]
    path: tier2/t2-msg-summarize-thread.yaml
    asset_pack: t2_msg_summarize_thread
  - id: t2-priv-redact-doc
    tier: tier2
    family: tools
    capabilities: [structured_output, graceful_refusal]
    path: tier2/t2-priv-redact-doc.yaml
    asset_pack: t2_priv_redact_doc
  - id: t3-data-pipeline-report
    tier: tier3
    family: multi_tool
    capabilities: [structured_output, multifile_reasoning]
    path: tier3/t3-data-pipeline-report.yaml
    asset_pack: t3_data_pipeline_report
  - id: t3-data-sql-query
    tier: tier3
    family: tools
    capabilities: [structured_output]
    path: tier3/t3-data-sql-query.yaml
    asset_pack: t3_data_sql_query
  - id: t3-feature-export
    tier: tier3
    family: repo
    capabilities: [multifile_reasoning, structured_output]
    path: tier3/t3-feature-export.yaml
    asset_pack: t3_feature_export
  - id: t3-msg-inbox-triage
    tier: tier3
    family: tools
    capabilities: [structured_output, multifile_reasoning]
    path: tier3/t3-msg-inbox-triage.yaml
    asset_pack: t3_msg_inbox_triage
  - id: t3-web-research-and-cite
    tier: tier3
    family: tools
    capabilities: [research_synthesis]
    path: tier3/t3-web-research-and-cite.yaml
    asset_pack: t3_web_research_and_cite
  - id: t4-browser-research-and-code
    tier: tier4
    family: browser
    capabilities: [browser_debugging, research_synthesis]
    path: tier4/t4-browser-research-and-code.yaml
    asset_pack: t4_browser_research_and_code
  - id: t4-cross-repo-migration
    tier: tier4
    family: repo
    capabilities: [cross_repo_change, multifile_reasoning]
    path: tier4/t4-cross-repo-migration.yaml
    asset_pack: t4_cross_repo_migration
  - id: t4-delegation-repair
    tier: tier4
    family: multi_tool
    capabilities: [delegation, bugfix]
    path: tier4/t4-delegation-repair.yaml
    asset_pack: t4_delegation_repair
  - id: t4-life-trip-plan
    tier: tier4
    family: tools
    capabilities: [research_synthesis, structured_output]
    path: tier4/t4-life-trip-plan.yaml
    asset_pack: t4_life_trip_plan
  - id: t4-memory-recall-continuation
    tier: tier4
    family: multi_tool
    capabilities: [memory_continuation, multifile_reasoning]
    path: tier4/t4-memory-recall-continuation.yaml
    asset_pack: t4_memory_recall_continuation
  - id: t5-hallucination-resistant-evidence
    tier: tier5
    family: adversarial
    capabilities: [research_synthesis, tool_composition]
    path: tier5/t5-hallucination-resistant-evidence.yaml
    asset_pack: t5_hallucination_resistant_evidence
notes: |
  - The full private dev set (tasks/) contains 40 tasks. This Core-19
    subset is the signal-rich, ranking-consistent public release.
  - An additional 21 tasks are retained as a private holdout for
    contamination-resistant measurement of future models.
  - Task families "creative" and "long-horizon (Tier 6)" are absent
    from Core v1; they are planned for a future release.
  - Known caveats: t4-memory-recall-continuation has a verifier that
    penalizes agents that respond in conversation rather than via file
    artifacts. All models face the same verifier, so the comparison is
    internally fair, but absolute scores understate capability.
  - t5-hallucination-resistant-evidence has low cross-model SNR (about
    0.25) in v4-19-full; it is included for adversarial-family coverage
    despite this. Consider upgrading the verifier in a future release.