baselines: merge provenance docs into BASELINE_SOURCES.md

Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single markdown document that
captures every empirical source informing ClawBench's task design.

baselines/BASELINE_SOURCES.md covers:

1. The 24 public Hugging Face datasets tagged format:agent-traces,
   with owner/name, row counts, cluster classification (Pi sessions,
   custom agent traces, Claude Code, demo), and how each cluster
   maps onto ClawBench's tier/family/trajectory design decisions.
   Aggregate ~3,049 rows, ~1,168 unique sessions after mirror
   deduplication.

2. The Hermes agent reasoning trace aggregate (14,701 sessions,
   24.3 avg turns, category distribution) with the direct mapping
   from observed categories to ClawBench task families.

3. The internal personal-agent use-case corpus (72 queries, 12
   primary scenarios, 139 atomic capabilities) that contributes
   the scenario_weight_defaults in query_catalog.py. The source
   is not a public dataset and is only referred to as "the internal
   personal-agent use-case corpus" — no filename reference.

4. A full source-to-design-decision mapping table showing which
   design choice (tier ladder, family mix, tool diversity,
   recovery expectations, browser task count, scenario weights,
   difficulty tags, adversarial tier-5) is driven by which source.

Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.

No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
Codex 2026-04-10 20:36:18 -07:00
parent 3cdade49ce
commit e9ff163217
5 changed files with 237 additions and 165 deletions

View File

@ -0,0 +1,230 @@
# ClawBench Baseline Sources
This document records the empirical sources that informed ClawBench's task
design. ClawBench's tier structure, task families, trajectory-length targets,
tool-family mix, scenario weights, and difficulty bands are **designed
referencing these sources**. None of them are loaded at runtime — they are
provenance artifacts for the design decisions baked into `tasks/` and
[`clawbench/query_catalog.py`](../clawbench/query_catalog.py).
Contents:
1. [Public Hugging Face agent-trace datasets](#1-public-hugging-face-agent-trace-datasets)
2. [Hermes agent reasoning traces (aggregate)](#2-hermes-agent-reasoning-traces-aggregate)
3. [Internal personal-agent use-case corpus](#3-internal-personal-agent-use-case-corpus)
4. [How the sources map onto ClawBench's design](#4-how-the-sources-map-onto-clawbenchs-design)
---
## 1. Public Hugging Face agent-trace datasets
The 24 datasets on Hugging Face tagged
[`format:agent-traces`](https://huggingface.co/datasets?format=format:agent-traces&sort=trending)
inform the task-family mix and the trajectory-shape targets (turn counts,
tool diversity, recovery patterns) used throughout the ClawBench task corpus.
| # | Dataset | Rows | Cluster | Notes |
|---:|---|---:|---|---|
| 1 | [`badlogicgames/pi-mono`](https://huggingface.co/datasets/badlogicgames/pi-mono) | 627 | Pi sessions (root) | Primary source; ~627 unique sessions |
| 2 | [`cfahlgren1/pi-mono-fresh`](https://huggingface.co/datasets/cfahlgren1/pi-mono-fresh) | 627 | Pi sessions (mirror) | Mirror of root |
| 3 | [`JohnBeanerson/pi-mono-test`](https://huggingface.co/datasets/JohnBeanerson/pi-mono-test) | 627 | Pi sessions (mirror) | Mirror of root |
| 4 | [`karkowww/pi-mono`](https://huggingface.co/datasets/karkowww/pi-mono) | 627 | Pi sessions (mirror) | Mirror of root |
| 5 | [`vinhnx90/vtcode-sessions`](https://huggingface.co/datasets/vinhnx90/vtcode-sessions) | 172 | Custom agent traces | Coding agent sessions |
| 6 | [`thomasmustier/pi-for-excel-sessions`](https://huggingface.co/datasets/thomasmustier/pi-for-excel-sessions) | 140 | Pi sessions (domain) | Excel-specific agent work |
| 7 | [`0xSero/pi-sessions`](https://huggingface.co/datasets/0xSero/pi-sessions) | 96 | Pi sessions (domain) | Mixed-domain sessions |
| 8 | [`championswimmer/pi-coding-sessions`](https://huggingface.co/datasets/championswimmer/pi-coding-sessions) | 27 | Pi sessions (domain) | Coding-focused subset |
| 9 | [`jedisct1/agent-traces-swival`](https://huggingface.co/datasets/jedisct1/agent-traces-swival) | 20 | Custom format | Experimental traces |
| 10 | [`LarsEckart/approvaltests-java-sessions`](https://huggingface.co/datasets/LarsEckart/approvaltests-java-sessions) | 15 | Custom agent traces | Java / approval-test workflows |
| 11 | [`cfahlgren1/agent-sessions-list`](https://huggingface.co/datasets/cfahlgren1/agent-sessions-list) | 12 | Index dataset | Metadata for other trace repos |
| 12 | [`thomasmustier/pi-nes-sessions`](https://huggingface.co/datasets/thomasmustier/pi-nes-sessions) | 12 | Pi sessions (domain) | NES-related sessions |
| 13 | [`moikapy/0xKobolds`](https://huggingface.co/datasets/moikapy/0xKobolds) | 11 | Custom agent traces | Kobold-specific traces |
| 14 | [`badlogicgames/pi-diff-review`](https://huggingface.co/datasets/badlogicgames/pi-diff-review) | 7 | Pi sessions (review) | Diff-review traces |
| 15 | [`cfahlgren1/pi-diff-review`](https://huggingface.co/datasets/cfahlgren1/pi-diff-review) | 6 | Pi sessions (review) | Mirror of #14 |
| 16 | [`lhoestq/agent-traces-example`](https://huggingface.co/datasets/lhoestq/agent-traces-example) | 4 | Demo / example | Canonical example format |
| 17 | [`DreamyDetective/trace-demo`](https://huggingface.co/datasets/DreamyDetective/trace-demo) | 3 | Demo / example | Schema demo |
| 18 | [`davanstrien/pi-trace-parser-sessions`](https://huggingface.co/datasets/davanstrien/pi-trace-parser-sessions) | 3 | Parser experiments | Parser validation |
| 19 | [`davanstrien/pi-traces`](https://huggingface.co/datasets/davanstrien/pi-traces) | 2 | Pi sessions (sample) | Small sample |
| 20 | [`dongxx1104/Baseline_featbeach`](https://huggingface.co/datasets/dongxx1104/Baseline_featbeach) | 2 | Custom | Project-specific |
| 21 | [`victor/claude-code-sessions`](https://huggingface.co/datasets/victor/claude-code-sessions) | 1 | Claude Code | Single reference session |
| 22 | [`JohnBeanerson/claude-code-sessions-test`](https://huggingface.co/datasets/JohnBeanerson/claude-code-sessions-test) | 1 | Claude Code | Test sample |
| 23 | [`lukawskikacper/openai-agent-traces`](https://huggingface.co/datasets/lukawskikacper/openai-agent-traces) | 1 | OpenAI agents | Single reference session |
| 24 | [`mishig/traces`](https://huggingface.co/datasets/mishig/traces) | 1 | Demo / example | Single-row demo |
**Aggregate**: ~3,049 rows across 24 repos. Deduplicated (removing the three
`pi-mono` mirrors that are exact copies of `badlogicgames/pi-mono`), the
unique-source count is roughly **~1,168 sessions** across ~15 distinct agent
workflows.
### Dataset clusters
| Cluster | Approx unique sessions | What it contributes to ClawBench design |
|---|---:|---|
| **Pi sessions** (`badlogicgames/pi-mono` + domain spinoffs) | ~920 | Dominant format. Drives the turn-count distribution (`t1-t4` trajectory length targets), tool diversity expectations, and the "multi-tool" and "repo" task family definitions. |
| **Custom agent traces** (`vtcode-sessions`, `approvaltests-java-sessions`, `0xKobolds`) | ~198 | Informs the tier-2/3 coding task design: cross-file reasoning, test-driven workflows, and language-specific failure modes (`t2-node-search-patch`, `t3-node-multifile-refactor`). |
| **Claude Code / OpenAI agents** (`victor/claude-code-sessions`, `lukawskikacper/openai-agent-traces`, `JohnBeanerson/claude-code-sessions-test`) | ~3 | Single-row reference sessions; validate trace-shape assumptions but too few for quantitative weighting. |
| **Index and demo** (`agent-sessions-list`, `agent-traces-example`, `trace-demo`, `mishig/traces`, `DreamyDetective/trace-demo`) | ~22 | Schema and format validation. Used to verify that ClawBench's internal `Transcript` shape can represent what the ecosystem is publishing. |
---
## 2. Hermes agent reasoning traces (aggregate)
Source: `lambda/hermes-agent-reasoning-traces`
Aggregate statistics from a separate trace corpus of **14,701 real agent
sessions**, used to calibrate the tier-to-family mapping and tier trajectory
targets:
```json
{
"sessions_analyzed": 14701,
"observed_complexity": {
"avg_turns": 24.3,
"avg_tool_calls": 13.9,
"max_turns": 54,
"tool_diversity_per_task": "3-6"
}
}
```
### Observed category distribution
| Category | Session count | % of total | ClawBench family |
|---|---:|---:|---|
| `agent_tools` | 4,249 | 28.9% | `tools` |
| `terminal_coding` | 4,247 | 28.9% | `coding` |
| `repository_tasks` | 2,131 | 14.5% | `repo` |
| `browser_automation` | 1,687 | 11.5% | `browser` |
| `file_operations` | 891 | 6.1% | `tools` |
| `multi_tool` | 859 | 5.8% | `multi_tool` |
| `scheduling` | 308 | 2.1% | `tools` |
| `planning` | 293 | 2.0% | `tools` |
| `conversational` | 36 | 0.2% | — (excluded) |
### Tier-to-family mapping (derived)
The observed Hermes distribution directly informs the tier → family mapping
in [`clawbench/tasks.py`](../clawbench/tasks.py):
```
tier1 → coding, tools
tier2 → coding, repo, browser
tier3 → repo, multi_tool, tools
tier4 → repo, multi_tool, browser
tier5 → adversarial
```
### Design notes
- The benchmark keeps **only aggregate statistics** from this source for
reproducibility; raw traces and large processed samples are intentionally
excluded from the repo.
- Task design emphasizes **longer trajectories** (≥10 tool calls per task),
**explicit recovery** after failed tool calls, and **multi-tool behavior**
(≥3 distinct tool families per tier-3+ task) — all three properties are
direct consequences of the Hermes trajectory-shape distribution.
---
## 3. Internal personal-agent use-case corpus
In addition to the public HF agent-trace datasets and the Hermes aggregate,
ClawBench sources a proprietary scenario corpus of **72 queries** across
**12 primary scenarios** and **139 atomic capabilities**. This corpus is not
a public dataset; we only reproduce the derived scenario weights and
difficulty bands here.
### Corpus summary
```
query_total 72
primary_scenarios 12
secondary_scenarios 55
atomic_capabilities 139
difficulty_distribution
l1 (simple, single-tool) 22 queries (30.6%)
l2 (multi-step, typed tools) 39 queries (54.2%)
l3 (open-ended, recovery) 11 queries (15.3%)
```
### Design principles
1. **MECE atomic capabilities** — each query exercises a non-overlapping
subset of the 139-capability taxonomy.
2. **Parameterized case expansion** — each base query has clear and
ambiguous prompt variants.
3. **Dual-channel delivery judging** — pass/partial/fail outcomes tracked
separately from run-level scores.
### Scenario catalog (with ClawBench query weights)
These scenario names and weights are the source of truth for the
`SCENARIO_WEIGHT_DEFAULTS` table in
[`clawbench/query_catalog.py`](../clawbench/query_catalog.py):
| Scenario | Query count | Weight | Difficulty (l1 / l2 / l3) |
|---|---:|---:|---|
| `file_system_ops` | 8 | 0.13 | 4 / 4 / 0 |
| `multi_step_compound` | 7 | 0.12 | 0 / 0 / 7 |
| `data_processing_analysis` | 8 | 0.11 | 2 / 6 / 0 |
| `web_info_ops` | 6 | 0.10 | 2 / 3 / 1 |
| `coding_dev_assist` | 7 | 0.09 | 3 / 4 / 0 |
| `communication_messaging` | 5 | 0.09 | 0 / 5 / 0 |
| `calendar_reminders` | 5 | 0.08 | 3 / 2 / 0 |
| `skill_calling` | 4 | 0.07 | 0 / 4 / 0 |
| `personal_life_assistant` | 5 | 0.06 | 4 / 1 / 0 |
| `context_continuation` | 7 | 0.05 | 0 / 5 / 2 |
| `error_boundary_cases` | 6 | 0.05 | 3 / 2 / 1 |
| `system_capabilities` | 4 | 0.05 | 1 / 3 / 0 |
### v0.5 additions (beyond the internal corpus)
For v0.5, ClawBench adds eight additional high-frequency personal-agent
scenarios that are not in the original sourced corpus. These are defined
directly in `query_catalog.SCENARIO_WEIGHT_DEFAULTS`:
```
privacy_pii_handling 0.04
personal_financial_hygiene 0.03
travel_logistics_under_uncertainty 0.03
social_coordination 0.02
personal_knowledge_base 0.02
health_wellness_tracking 0.01
account_security_hygiene 0.01
multimodal_understanding 0.00 (placeholder, not yet in corpus)
```
---
## 4. How the sources map onto ClawBench's design
| Design decision | Driven by |
|---|---|
| **Tier 1 → 5 difficulty ladder** | Hermes avg/max turn distribution (24.3 avg, 54 max) and the internal corpus difficulty bands (l1/l2/l3) |
| **Task family mix** (coding / repo / browser / tools / multi_tool / adversarial) | Hermes category distribution (see §2 table) |
| **Minimum tool diversity per task** (3+ families for tier-3+) | Hermes `tool_diversity_per_task: 3-6` observation |
| **Multi-turn task design** (≥2 user turns for tier-2+) | Hermes `avg_turns: 24.3` (implies sustained multi-turn dialogue) |
| **Explicit recovery expectations** (trajectory axis rewards recovery) | Pi sessions show frequent failed-tool-call → retry patterns |
| **Browser task count** (2 tasks in the public suite) | Hermes `browser_automation` at 11.5% of all sessions |
| **`SCENARIO_WEIGHT_DEFAULTS`** (query weights in `query_catalog.py`) | Internal corpus §3 weights, verified against Hermes category frequencies |
| **Query difficulty tags** (`l1` / `l2` / `l3` on each task) | Internal corpus difficulty bands |
| **Clear vs ambiguous prompt variants** | Internal corpus §3 design principle: "parameterized case expansion" |
| **Adversarial tier-5 tasks** (contradictory requirements, hallucination resistance, graceful refusal) | Edge cases observed in Pi sessions + Hermes failure-mode patterns, not present in the internal corpus |
---
## Provenance and reproducibility notes
- **HF agent-trace datasets** are enumerated from the public
[`format:agent-traces` filter](https://huggingface.co/datasets?format=format:agent-traces)
as of 2026-04-10. The row counts above are a point-in-time snapshot; run
the filter yourself for the current state.
- **Hermes aggregate statistics** are summary numbers only. Raw trace data
is not redistributed in this repo.
- **Internal corpus** is not a public dataset. Only the derived scenario
catalog, weights, and difficulty bands are reproduced, because those are
what directly inform the ClawBench scoring layer.
- **No runtime code path** reads the files in this directory. Everything
here is design rationale, not data dependency. Deleting this folder will
not break the harness, scorer, or analyzer — it will only remove the
audit trail for why the task suite looks the way it does.

View File

@ -1,126 +0,0 @@
{
"source_dataset": "基础使用场景测试集.xlsx",
"source_version": "1.0",
"summary": {
"query_total": 72,
"primary_scene_total": 12,
"secondary_scene_total": 55,
"atomic_capability_total": 139,
"difficulty_distribution": {
"l1": 22,
"l2": 39,
"l3": 11
},
"design_principles": [
"mece_atomic_capabilities",
"parameterized_case_expansion",
"clear_and_ambiguous_query_variants",
"dual_channel_delivery_judging"
]
},
"scenario_catalog": [
{
"scenario": "file_system_ops",
"source_label_zh": "文件与系统操作",
"query_count": 8,
"weight": 0.13,
"difficulty_distribution": {"l1": 4, "l2": 4, "l3": 0}
},
{
"scenario": "web_info_ops",
"source_label_zh": "信息查询与网页操作",
"query_count": 6,
"weight": 0.1,
"difficulty_distribution": {"l1": 2, "l2": 3, "l3": 1}
},
{
"scenario": "calendar_reminders",
"source_label_zh": "日程与提醒",
"query_count": 5,
"weight": 0.08,
"difficulty_distribution": {"l1": 3, "l2": 2, "l3": 0}
},
{
"scenario": "communication_messaging",
"source_label_zh": "通讯与消息",
"query_count": 5,
"weight": 0.09,
"difficulty_distribution": {"l1": 0, "l2": 5, "l3": 0}
},
{
"scenario": "data_processing_analysis",
"source_label_zh": "数据处理与分析",
"query_count": 8,
"weight": 0.11,
"difficulty_distribution": {"l1": 2, "l2": 6, "l3": 0}
},
{
"scenario": "coding_dev_assist",
"source_label_zh": "编程与开发辅助",
"query_count": 7,
"weight": 0.09,
"difficulty_distribution": {"l1": 3, "l2": 4, "l3": 0}
},
{
"scenario": "personal_life_assistant",
"source_label_zh": "个人生活助理",
"query_count": 5,
"weight": 0.06,
"difficulty_distribution": {"l1": 4, "l2": 1, "l3": 0}
},
{
"scenario": "multi_step_compound",
"source_label_zh": "多步骤复合任务",
"query_count": 7,
"weight": 0.12,
"difficulty_distribution": {"l1": 0, "l2": 0, "l3": 7}
},
{
"scenario": "context_continuation",
"source_label_zh": "上下文理解与连续对话",
"query_count": 7,
"weight": 0.05,
"difficulty_distribution": {"l1": 0, "l2": 5, "l3": 2}
},
{
"scenario": "error_boundary_cases",
"source_label_zh": "错误处理与边界情况",
"query_count": 6,
"weight": 0.05,
"difficulty_distribution": {"l1": 3, "l2": 2, "l3": 1}
},
{
"scenario": "skill_calling",
"source_label_zh": "Skill调用",
"query_count": 4,
"weight": 0.07,
"difficulty_distribution": {"l1": 0, "l2": 4, "l3": 0}
},
{
"scenario": "system_capabilities",
"source_label_zh": "系统能力",
"query_count": 4,
"weight": 0.05,
"difficulty_distribution": {"l1": 1, "l2": 3, "l3": 0}
}
],
"current_corpus_alignment": {
"mapped_task_total": 20,
"covered_scenarios": {
"coding_dev_assist": 9,
"data_processing_analysis": 2,
"web_info_ops": 2,
"multi_step_compound": 3,
"context_continuation": 1,
"error_boundary_cases": 2,
"system_capabilities": 1
},
"missing_scenarios": [
"file_system_ops",
"calendar_reminders",
"communication_messaging",
"personal_life_assistant",
"skill_calling"
]
}
}

View File

@ -1,34 +0,0 @@
{
"source": "lambda/hermes-agent-reasoning-traces",
"sessions_analyzed": 14701,
"summary_version": "2026-04-08",
"observed_complexity": {
"avg_turns": 24.3,
"avg_tool_calls": 13.9,
"max_turns": 54,
"tool_diversity_per_task": "3-6"
},
"observed_categories": [
{"name": "terminal_coding", "count": 4247},
{"name": "agent_tools", "count": 4249},
{"name": "repository_tasks", "count": 2131},
{"name": "browser_automation", "count": 1687},
{"name": "file_operations", "count": 891},
{"name": "multi_tool", "count": 859},
{"name": "scheduling", "count": 308},
{"name": "planning", "count": 293},
{"name": "conversational", "count": 36}
],
"task_family_mapping": {
"tier1": ["coding", "tools"],
"tier2": ["coding", "repo", "browser"],
"tier3": ["repo", "multi_tool", "tools"],
"tier4": ["repo", "multi_tool", "browser"],
"tier5": ["adversarial"]
},
"design_notes": [
"The benchmark keeps only aggregate Hermes-derived statistics for reproducibility.",
"Raw traces and large processed samples are intentionally excluded from the repo.",
"Task design emphasizes longer trajectories, explicit recovery, and multi-tool behavior."
]
}

View File

@ -3,9 +3,10 @@
## Goal
Expand ClawBench from 20 tasks to 100 tasks. Cover all 72 queries from the
基础使用场景测试集 sheet at least loosely. Add new high-frequency personal-agent
scenarios that the sheet does not capture. Make every task vague-prompted,
multi-step, and verifiable through deterministic execution checks.
internal personal-agent use-case corpus (see `baselines/BASELINE_SOURCES.md`)
at least loosely. Add new high-frequency personal-agent scenarios that the
corpus does not capture. Make every task vague-prompted, multi-step, and
verifiable through deterministic execution checks.
## Core Authoring Rules (apply to every new task)

View File

@ -33,8 +33,9 @@ Across 16 scenarios spanning tier 1 to tier 5:
Every new task follows the v0.5 authoring rules: vague prompt, hidden
requirements in workspace files, multi-stage execution, deterministic
verifiers, no-fabrication grading. The 72 queries from
`基础使用场景测试集.xlsx` are all loosely covered by at least one task.
verifiers, no-fabrication grading. The 72 queries from the internal
personal-agent use-case corpus (see `baselines/BASELINE_SOURCES.md`)
are all loosely covered by at least one task.
### 2. v0.5 Framework Code (4 modules, ~1,000 LOC)