diff --git a/baselines/BASELINE_SOURCES.md b/baselines/BASELINE_SOURCES.md
new file mode 100644
index 0000000..583ef9e
--- /dev/null
+++ b/baselines/BASELINE_SOURCES.md
@@ -0,0 +1,230 @@
+# ClawBench Baseline Sources
+
+This document records the empirical sources that informed ClawBench's task
+design. ClawBench's tier structure, task families, trajectory-length targets,
+tool-family mix, scenario weights, and difficulty bands are **designed
+referencing these sources**. None of them are loaded at runtime — they are
+provenance artifacts for the design decisions baked into `tasks/` and
+[`clawbench/query_catalog.py`](../clawbench/query_catalog.py).
+
+Contents:
+
+1. [Public Hugging Face agent-trace datasets](#1-public-hugging-face-agent-trace-datasets)
+2. [Hermes agent reasoning traces (aggregate)](#2-hermes-agent-reasoning-traces-aggregate)
+3. [Internal personal-agent use-case corpus](#3-internal-personal-agent-use-case-corpus)
+4. [How the sources map onto ClawBench's design](#4-how-the-sources-map-onto-clawbenchs-design)
+
+---
+
+## 1. Public Hugging Face agent-trace datasets
+
+The 24 datasets on Hugging Face tagged
+[`format:agent-traces`](https://huggingface.co/datasets?format=format:agent-traces&sort=trending)
+inform the task-family mix and the trajectory-shape targets (turn counts,
+tool diversity, recovery patterns) used throughout the ClawBench task corpus.
+
+| # | Dataset | Rows | Cluster | Notes |
+|---:|---|---:|---|---|
+| 1 | [`badlogicgames/pi-mono`](https://huggingface.co/datasets/badlogicgames/pi-mono) | 627 | Pi sessions (root) | Primary source; ~627 unique sessions |
+| 2 | [`cfahlgren1/pi-mono-fresh`](https://huggingface.co/datasets/cfahlgren1/pi-mono-fresh) | 627 | Pi sessions (mirror) | Mirror of root |
+| 3 | [`JohnBeanerson/pi-mono-test`](https://huggingface.co/datasets/JohnBeanerson/pi-mono-test) | 627 | Pi sessions (mirror) | Mirror of root |
+| 4 | [`karkowww/pi-mono`](https://huggingface.co/datasets/karkowww/pi-mono) | 627 | Pi sessions (mirror) | Mirror of root |
+| 5 | [`vinhnx90/vtcode-sessions`](https://huggingface.co/datasets/vinhnx90/vtcode-sessions) | 172 | Custom agent traces | Coding agent sessions |
+| 6 | [`thomasmustier/pi-for-excel-sessions`](https://huggingface.co/datasets/thomasmustier/pi-for-excel-sessions) | 140 | Pi sessions (domain) | Excel-specific agent work |
+| 7 | [`0xSero/pi-sessions`](https://huggingface.co/datasets/0xSero/pi-sessions) | 96 | Pi sessions (domain) | Mixed-domain sessions |
+| 8 | [`championswimmer/pi-coding-sessions`](https://huggingface.co/datasets/championswimmer/pi-coding-sessions) | 27 | Pi sessions (domain) | Coding-focused subset |
+| 9 | [`jedisct1/agent-traces-swival`](https://huggingface.co/datasets/jedisct1/agent-traces-swival) | 20 | Custom format | Experimental traces |
+| 10 | [`LarsEckart/approvaltests-java-sessions`](https://huggingface.co/datasets/LarsEckart/approvaltests-java-sessions) | 15 | Custom agent traces | Java / approval-test workflows |
+| 11 | [`cfahlgren1/agent-sessions-list`](https://huggingface.co/datasets/cfahlgren1/agent-sessions-list) | 12 | Index dataset | Metadata for other trace repos |
+| 12 | [`thomasmustier/pi-nes-sessions`](https://huggingface.co/datasets/thomasmustier/pi-nes-sessions) | 12 | Pi sessions (domain) | NES-related sessions |
+| 13 | [`moikapy/0xKobolds`](https://huggingface.co/datasets/moikapy/0xKobolds) | 11 | Custom agent traces | Kobold-specific traces |
+| 14 | [`badlogicgames/pi-diff-review`](https://huggingface.co/datasets/badlogicgames/pi-diff-review) | 7 | Pi sessions (review) | Diff-review traces |
+| 15 | [`cfahlgren1/pi-diff-review`](https://huggingface.co/datasets/cfahlgren1/pi-diff-review) | 6 | Pi sessions (review) | Mirror of #14 |
+| 16 | [`lhoestq/agent-traces-example`](https://huggingface.co/datasets/lhoestq/agent-traces-example) | 4 | Demo / example | Canonical example format |
+| 17 | [`DreamyDetective/trace-demo`](https://huggingface.co/datasets/DreamyDetective/trace-demo) | 3 | Demo / example | Schema demo |
+| 18 | [`davanstrien/pi-trace-parser-sessions`](https://huggingface.co/datasets/davanstrien/pi-trace-parser-sessions) | 3 | Parser experiments | Parser validation |
+| 19 | [`davanstrien/pi-traces`](https://huggingface.co/datasets/davanstrien/pi-traces) | 2 | Pi sessions (sample) | Small sample |
+| 20 | [`dongxx1104/Baseline_featbeach`](https://huggingface.co/datasets/dongxx1104/Baseline_featbeach) | 2 | Custom | Project-specific |
+| 21 | [`victor/claude-code-sessions`](https://huggingface.co/datasets/victor/claude-code-sessions) | 1 | Claude Code | Single reference session |
+| 22 | [`JohnBeanerson/claude-code-sessions-test`](https://huggingface.co/datasets/JohnBeanerson/claude-code-sessions-test) | 1 | Claude Code | Test sample |
+| 23 | [`lukawskikacper/openai-agent-traces`](https://huggingface.co/datasets/lukawskikacper/openai-agent-traces) | 1 | OpenAI agents | Single reference session |
+| 24 | [`mishig/traces`](https://huggingface.co/datasets/mishig/traces) | 1 | Demo / example | Single-row demo |
+
+**Aggregate**: ~3,049 rows across 24 repos. Deduplicated (removing the three
+`pi-mono` mirrors that are exact copies of `badlogicgames/pi-mono`), the
+unique-source count is roughly **~1,168 sessions** across ~15 distinct agent
+workflows.
+
+### Dataset clusters
+
+| Cluster | Approx unique sessions | What it contributes to ClawBench design |
+|---|---:|---|
+| **Pi sessions** (`badlogicgames/pi-mono` + domain spinoffs) | ~920 | Dominant format. Drives the turn-count distribution (`t1-t4` trajectory length targets), tool diversity expectations, and the "multi-tool" and "repo" task family definitions. |
+| **Custom agent traces** (`vtcode-sessions`, `approvaltests-java-sessions`, `0xKobolds`) | ~198 | Informs the tier-2/3 coding task design: cross-file reasoning, test-driven workflows, and language-specific failure modes (`t2-node-search-patch`, `t3-node-multifile-refactor`). |
+| **Claude Code / OpenAI agents** (`victor/claude-code-sessions`, `lukawskikacper/openai-agent-traces`, `JohnBeanerson/claude-code-sessions-test`) | ~3 | Single-row reference sessions; validate trace-shape assumptions but too few for quantitative weighting. |
+| **Index and demo** (`agent-sessions-list`, `agent-traces-example`, `trace-demo`, `mishig/traces`, `DreamyDetective/trace-demo`) | ~22 | Schema and format validation. Used to verify that ClawBench's internal `Transcript` shape can represent what the ecosystem is publishing. |
+
+---
+
+## 2. Hermes agent reasoning traces (aggregate)
+
+Source: `lambda/hermes-agent-reasoning-traces`
+
+Aggregate statistics from a separate trace corpus of **14,701 real agent
+sessions**, used to calibrate the tier-to-family mapping and tier trajectory
+targets:
+
+```json
+{
+  "sessions_analyzed": 14701,
+  "observed_complexity": {
+    "avg_turns": 24.3,
+    "avg_tool_calls": 13.9,
+    "max_turns": 54,
+    "tool_diversity_per_task": "3-6"
+  }
+}
+```
+
+### Observed category distribution
+
+| Category | Session count | % of total | ClawBench family |
+|---|---:|---:|---|
+| `agent_tools` | 4,249 | 28.9% | `tools` |
+| `terminal_coding` | 4,247 | 28.9% | `coding` |
+| `repository_tasks` | 2,131 | 14.5% | `repo` |
+| `browser_automation` | 1,687 | 11.5% | `browser` |
+| `file_operations` | 891 | 6.1% | `tools` |
+| `multi_tool` | 859 | 5.8% | `multi_tool` |
+| `scheduling` | 308 | 2.1% | `tools` |
+| `planning` | 293 | 2.0% | `tools` |
+| `conversational` | 36 | 0.2% | — (excluded) |
+
+### Tier-to-family mapping (derived)
+
+The observed Hermes distribution directly informs the tier → family mapping
+in [`clawbench/tasks.py`](../clawbench/tasks.py):
+
+```
+tier1 → coding, tools
+tier2 → coding, repo, browser
+tier3 → repo, multi_tool, tools
+tier4 → repo, multi_tool, browser
+tier5 → adversarial
+```
+
+### Design notes
+
+- The benchmark keeps **only aggregate statistics** from this source for
+  reproducibility; raw traces and large processed samples are intentionally
+  excluded from the repo.
+- Task design emphasizes **longer trajectories** (≥10 tool calls per task),
+  **explicit recovery** after failed tool calls, and **multi-tool behavior**
+  (≥3 distinct tool families per tier-3+ task) — all three properties are
+  direct consequences of the Hermes trajectory-shape distribution.
+
+---
+
+## 3. Internal personal-agent use-case corpus
+
+In addition to the public HF agent-trace datasets and the Hermes aggregate,
+ClawBench sources a proprietary scenario corpus of **72 queries** across
+**12 primary scenarios** and **139 atomic capabilities**. This corpus is not
+a public dataset; we only reproduce the derived scenario weights and
+difficulty bands here.
+
+### Corpus summary
+
+```
+query_total              72
+primary_scenarios        12
+secondary_scenarios      55
+atomic_capabilities      139
+
+difficulty_distribution
+  l1 (simple, single-tool)        22 queries  (30.6%)
+  l2 (multi-step, typed tools)    39 queries  (54.2%)
+  l3 (open-ended, recovery)       11 queries  (15.3%)
+```
+
+### Design principles
+
+1. **MECE atomic capabilities** — each query exercises a non-overlapping
+   subset of the 139-capability taxonomy.
+2. **Parameterized case expansion** — each base query has clear and
+   ambiguous prompt variants.
+3. **Dual-channel delivery judging** — pass/partial/fail outcomes tracked
+   separately from run-level scores.
+
+### Scenario catalog (with ClawBench query weights)
+
+These scenario names and weights are the source of truth for the
+`SCENARIO_WEIGHT_DEFAULTS` table in
+[`clawbench/query_catalog.py`](../clawbench/query_catalog.py):
+
+| Scenario | Query count | Weight | Difficulty (l1 / l2 / l3) |
+|---|---:|---:|---|
+| `file_system_ops` | 8 | 0.13 | 4 / 4 / 0 |
+| `multi_step_compound` | 7 | 0.12 | 0 / 0 / 7 |
+| `data_processing_analysis` | 8 | 0.11 | 2 / 6 / 0 |
+| `web_info_ops` | 6 | 0.10 | 2 / 3 / 1 |
+| `coding_dev_assist` | 7 | 0.09 | 3 / 4 / 0 |
+| `communication_messaging` | 5 | 0.09 | 0 / 5 / 0 |
+| `calendar_reminders` | 5 | 0.08 | 3 / 2 / 0 |
+| `skill_calling` | 4 | 0.07 | 0 / 4 / 0 |
+| `personal_life_assistant` | 5 | 0.06 | 4 / 1 / 0 |
+| `context_continuation` | 7 | 0.05 | 0 / 5 / 2 |
+| `error_boundary_cases` | 6 | 0.05 | 3 / 2 / 1 |
+| `system_capabilities` | 4 | 0.05 | 1 / 3 / 0 |
+
+### v0.5 additions (beyond the internal corpus)
+
+For v0.5, ClawBench adds eight additional high-frequency personal-agent
+scenarios that are not in the original sourced corpus. These are defined
+directly in `query_catalog.SCENARIO_WEIGHT_DEFAULTS`:
+
+```
+privacy_pii_handling                 0.04
+personal_financial_hygiene           0.03
+travel_logistics_under_uncertainty   0.03
+social_coordination                  0.02
+personal_knowledge_base              0.02
+health_wellness_tracking             0.01
+account_security_hygiene             0.01
+multimodal_understanding             0.00   (placeholder, not yet in corpus)
+```
+
+---
+
+## 4. How the sources map onto ClawBench's design
+
+| Design decision | Driven by |
+|---|---|
+| **Tier 1 → 5 difficulty ladder** | Hermes avg/max turn distribution (24.3 avg, 54 max) and the internal corpus difficulty bands (l1/l2/l3) |
+| **Task family mix** (coding / repo / browser / tools / multi_tool / adversarial) | Hermes category distribution (see §2 table) |
+| **Minimum tool diversity per task** (3+ families for tier-3+) | Hermes `tool_diversity_per_task: 3-6` observation |
+| **Multi-turn task design** (≥2 user turns for tier-2+) | Hermes `avg_turns: 24.3` (implies sustained multi-turn dialogue) |
+| **Explicit recovery expectations** (trajectory axis rewards recovery) | Pi sessions show frequent failed-tool-call → retry patterns |
+| **Browser task count** (2 tasks in the public suite) | Hermes `browser_automation` at 11.5% of all sessions |
+| **`SCENARIO_WEIGHT_DEFAULTS`** (query weights in `query_catalog.py`) | Internal corpus §3 weights, verified against Hermes category frequencies |
+| **Query difficulty tags** (`l1` / `l2` / `l3` on each task) | Internal corpus difficulty bands |
+| **Clear vs ambiguous prompt variants** | Internal corpus §3 design principle: "parameterized case expansion" |
+| **Adversarial tier-5 tasks** (contradictory requirements, hallucination resistance, graceful refusal) | Edge cases observed in Pi sessions + Hermes failure-mode patterns, not present in the internal corpus |
+
+---
+
+## Provenance and reproducibility notes
+
+- **HF agent-trace datasets** are enumerated from the public
+  [`format:agent-traces` filter](https://huggingface.co/datasets?format=format:agent-traces)
+  as of 2026-04-10. The row counts above are a point-in-time snapshot; run
+  the filter yourself for the current state.
+- **Hermes aggregate statistics** are summary numbers only. Raw trace data
+  is not redistributed in this repo.
+- **Internal corpus** is not a public dataset. Only the derived scenario
+  catalog, weights, and difficulty bands are reproduced, because those are
+  what directly inform the ClawBench scoring layer.
+- **No runtime code path** reads the files in this directory. Everything
+  here is design rationale, not data dependency. Deleting this folder will
+  not break the harness, scorer, or analyzer — it will only remove the
+  audit trail for why the task suite looks the way it does.
diff --git a/baselines/basic_usage_query_summary.json b/baselines/basic_usage_query_summary.json
deleted file mode 100644
index b814292..0000000
--- a/baselines/basic_usage_query_summary.json
+++ /dev/null
@@ -1,126 +0,0 @@
-{
-  "source_dataset": "基础使用场景测试集.xlsx",
-  "source_version": "1.0",
-  "summary": {
-    "query_total": 72,
-    "primary_scene_total": 12,
-    "secondary_scene_total": 55,
-    "atomic_capability_total": 139,
-    "difficulty_distribution": {
-      "l1": 22,
-      "l2": 39,
-      "l3": 11
-    },
-    "design_principles": [
-      "mece_atomic_capabilities",
-      "parameterized_case_expansion",
-      "clear_and_ambiguous_query_variants",
-      "dual_channel_delivery_judging"
-    ]
-  },
-  "scenario_catalog": [
-    {
-      "scenario": "file_system_ops",
-      "source_label_zh": "文件与系统操作",
-      "query_count": 8,
-      "weight": 0.13,
-      "difficulty_distribution": {"l1": 4, "l2": 4, "l3": 0}
-    },
-    {
-      "scenario": "web_info_ops",
-      "source_label_zh": "信息查询与网页操作",
-      "query_count": 6,
-      "weight": 0.1,
-      "difficulty_distribution": {"l1": 2, "l2": 3, "l3": 1}
-    },
-    {
-      "scenario": "calendar_reminders",
-      "source_label_zh": "日程与提醒",
-      "query_count": 5,
-      "weight": 0.08,
-      "difficulty_distribution": {"l1": 3, "l2": 2, "l3": 0}
-    },
-    {
-      "scenario": "communication_messaging",
-      "source_label_zh": "通讯与消息",
-      "query_count": 5,
-      "weight": 0.09,
-      "difficulty_distribution": {"l1": 0, "l2": 5, "l3": 0}
-    },
-    {
-      "scenario": "data_processing_analysis",
-      "source_label_zh": "数据处理与分析",
-      "query_count": 8,
-      "weight": 0.11,
-      "difficulty_distribution": {"l1": 2, "l2": 6, "l3": 0}
-    },
-    {
-      "scenario": "coding_dev_assist",
-      "source_label_zh": "编程与开发辅助",
-      "query_count": 7,
-      "weight": 0.09,
-      "difficulty_distribution": {"l1": 3, "l2": 4, "l3": 0}
-    },
-    {
-      "scenario": "personal_life_assistant",
-      "source_label_zh": "个人生活助理",
-      "query_count": 5,
-      "weight": 0.06,
-      "difficulty_distribution": {"l1": 4, "l2": 1, "l3": 0}
-    },
-    {
-      "scenario": "multi_step_compound",
-      "source_label_zh": "多步骤复合任务",
-      "query_count": 7,
-      "weight": 0.12,
-      "difficulty_distribution": {"l1": 0, "l2": 0, "l3": 7}
-    },
-    {
-      "scenario": "context_continuation",
-      "source_label_zh": "上下文理解与连续对话",
-      "query_count": 7,
-      "weight": 0.05,
-      "difficulty_distribution": {"l1": 0, "l2": 5, "l3": 2}
-    },
-    {
-      "scenario": "error_boundary_cases",
-      "source_label_zh": "错误处理与边界情况",
-      "query_count": 6,
-      "weight": 0.05,
-      "difficulty_distribution": {"l1": 3, "l2": 2, "l3": 1}
-    },
-    {
-      "scenario": "skill_calling",
-      "source_label_zh": "Skill调用",
-      "query_count": 4,
-      "weight": 0.07,
-      "difficulty_distribution": {"l1": 0, "l2": 4, "l3": 0}
-    },
-    {
-      "scenario": "system_capabilities",
-      "source_label_zh": "系统能力",
-      "query_count": 4,
-      "weight": 0.05,
-      "difficulty_distribution": {"l1": 1, "l2": 3, "l3": 0}
-    }
-  ],
-  "current_corpus_alignment": {
-    "mapped_task_total": 20,
-    "covered_scenarios": {
-      "coding_dev_assist": 9,
-      "data_processing_analysis": 2,
-      "web_info_ops": 2,
-      "multi_step_compound": 3,
-      "context_continuation": 1,
-      "error_boundary_cases": 2,
-      "system_capabilities": 1
-    },
-    "missing_scenarios": [
-      "file_system_ops",
-      "calendar_reminders",
-      "communication_messaging",
-      "personal_life_assistant",
-      "skill_calling"
-    ]
-  }
-}
diff --git a/baselines/hermes_trace_summary.json b/baselines/hermes_trace_summary.json
deleted file mode 100644
index e2e3db5..0000000
--- a/baselines/hermes_trace_summary.json
+++ /dev/null
@@ -1,34 +0,0 @@
-{
-  "source": "lambda/hermes-agent-reasoning-traces",
-  "sessions_analyzed": 14701,
-  "summary_version": "2026-04-08",
-  "observed_complexity": {
-    "avg_turns": 24.3,
-    "avg_tool_calls": 13.9,
-    "max_turns": 54,
-    "tool_diversity_per_task": "3-6"
-  },
-  "observed_categories": [
-    {"name": "terminal_coding", "count": 4247},
-    {"name": "agent_tools", "count": 4249},
-    {"name": "repository_tasks", "count": 2131},
-    {"name": "browser_automation", "count": 1687},
-    {"name": "file_operations", "count": 891},
-    {"name": "multi_tool", "count": 859},
-    {"name": "scheduling", "count": 308},
-    {"name": "planning", "count": 293},
-    {"name": "conversational", "count": 36}
-  ],
-  "task_family_mapping": {
-    "tier1": ["coding", "tools"],
-    "tier2": ["coding", "repo", "browser"],
-    "tier3": ["repo", "multi_tool", "tools"],
-    "tier4": ["repo", "multi_tool", "browser"],
-    "tier5": ["adversarial"]
-  },
-  "design_notes": [
-    "The benchmark keeps only aggregate Hermes-derived statistics for reproducibility.",
-    "Raw traces and large processed samples are intentionally excluded from the repo.",
-    "Task design emphasizes longer trajectories, explicit recovery, and multi-tool behavior."
-  ]
-}
diff --git a/reports/CLAWBENCH_100_TASK_PLAN.md b/reports/CLAWBENCH_100_TASK_PLAN.md
index 4ee51a0..397e480 100644
--- a/reports/CLAWBENCH_100_TASK_PLAN.md
+++ b/reports/CLAWBENCH_100_TASK_PLAN.md
@@ -3,9 +3,10 @@
 ## Goal
 
 Expand ClawBench from 20 tasks to 100 tasks. Cover all 72 queries from the
-基础使用场景测试集 sheet at least loosely. Add new high-frequency personal-agent
-scenarios that the sheet does not capture. Make every task vague-prompted,
-multi-step, and verifiable through deterministic execution checks.
+internal personal-agent use-case corpus (see `baselines/BASELINE_SOURCES.md`)
+at least loosely. Add new high-frequency personal-agent scenarios that the
+corpus does not capture. Make every task vague-prompted, multi-step, and
+verifiable through deterministic execution checks.
 
 ## Core Authoring Rules (apply to every new task)
 
diff --git a/reports/V05_DELIVERY_REPORT.md b/reports/V05_DELIVERY_REPORT.md
index 7051b8a..0e9dbb3 100644
--- a/reports/V05_DELIVERY_REPORT.md
+++ b/reports/V05_DELIVERY_REPORT.md
@@ -33,8 +33,9 @@ Across 16 scenarios spanning tier 1 to tier 5:
 
 Every new task follows the v0.5 authoring rules: vague prompt, hidden
 requirements in workspace files, multi-stage execution, deterministic
-verifiers, no-fabrication grading. The 72 queries from
-`基础使用场景测试集.xlsx` are all loosely covered by at least one task.
+verifiers, no-fabrication grading. The 72 queries from the internal
+personal-agent use-case corpus (see `baselines/BASELINE_SOURCES.md`)
+are all loosely covered by at least one task.
 
 ### 2. v0.5 Framework Code (4 modules, ~1,000 LOC)