Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single markdown document that
captures every empirical source informing ClawBench's task design.

baselines/BASELINE_SOURCES.md covers:
1. The 24 public Hugging Face datasets tagged format:agent-traces,
with owner/name, row counts, cluster classification (Pi sessions,
custom agent traces, Claude Code, demo), and how each cluster
maps onto ClawBench's tier/family/trajectory design decisions.
In aggregate, ~3,049 rows and ~1,168 unique sessions after mirror
deduplication.
2. The Hermes agent reasoning trace aggregate (14,701 sessions,
24.3 avg turns, category distribution) with the direct mapping
from observed categories to ClawBench task families.
3. The internal personal-agent use-case corpus (72 queries, 12
primary scenarios, 139 atomic capabilities) that contributes
the scenario_weight_defaults in clawbench/query_catalog.py. The
source is not a public dataset, so it is referred to only as "the
internal personal-agent use-case corpus", with no filename reference.
4. A full source-to-design-decision mapping table showing which
design choice (tier ladder, family mix, tool diversity,
recovery expectations, browser task count, scenario weights,
difficulty tags, adversarial tier-5) is driven by which source.

Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.

No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>