Commit Graph

2 Commits

Author SHA1 Message Date
Codex
e9ff163217 baselines: merge provenance docs into BASELINE_SOURCES.md
Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single Markdown document that
captures every empirical source informing ClawBench's task design.

baselines/BASELINE_SOURCES.md covers:

1. The 24 public Hugging Face datasets tagged format:agent-traces,
   with owner/name, row counts, cluster classification (Pi sessions,
   custom agent traces, Claude Code, demo), and how each cluster
   maps onto ClawBench's tier/family/trajectory design decisions.
   In aggregate: ~3,049 rows and ~1,168 unique sessions after
   mirror deduplication.

2. The Hermes agent reasoning trace aggregate (14,701 sessions,
   24.3 avg turns, category distribution) with the direct mapping
   from observed categories to ClawBench task families.

3. The internal personal-agent use-case corpus (72 queries, 12
   primary scenarios, 139 atomic capabilities) that contributes
   the scenario_weight_defaults in query_catalog.py. The source
   is not a public dataset, so it is referred to only as "the
   internal personal-agent use-case corpus", with no filename
   reference.

4. A full source-to-design-decision mapping table showing which
   design choice (tier ladder, family mix, tool diversity,
   recovery expectations, browser task count, scenario weights,
   difficulty tags, adversarial tier-5) is driven by which source.

Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.

No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.
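
The "no runtime code paths read baselines/" claim can be spot-checked
with a small scan. This is a minimal sketch, not part of ClawBench: the
function name find_baseline_references is hypothetical, and it assumes
the runtime code is Python and that any read of the directory would
mention the literal string "baselines/" in source.

```python
from pathlib import Path


def find_baseline_references(root, needle="baselines/"):
    """Return (path, line_number, line) for every .py line containing `needle`.

    Hypothetical helper: a literal-substring scan only. It will not catch
    paths that are assembled dynamically at runtime.
    """
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits


if __name__ == "__main__":
    # Scan the current checkout and report any literal references.
    for path, number, line in find_baseline_references("."):
        print(f"{path}:{number}: {line}")
```

An empty result is consistent with the claim but does not prove it,
since a path built from separate components would go unnoticed.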

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 20:36:18 -07:00
scoootscooob
2e39d5ccb2 Bench: redesign v0.4 benchmark and HF runtime
2026-04-09 11:15:30 -07:00