clawbench


BASELINE_SOURCES.md	baselines: merge provenance docs into BASELINE_SOURCES.md	2026-04-10 20:36:18 -07:00

Codex e9ff163217 baselines: merge provenance docs into BASELINE_SOURCES.md

Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single markdown document that
captures every empirical source informing ClawBench's task design.

baselines/BASELINE_SOURCES.md covers:

1. The 24 public Hugging Face datasets tagged format:agent-traces,
   with owner/name, row counts, cluster classification (Pi sessions,
   custom agent traces, Claude Code, demo), and how each cluster
   maps onto ClawBench's tier/family/trajectory design decisions.
   In aggregate: ~3,049 rows and ~1,168 unique sessions after mirror
   deduplication.

2. The Hermes agent reasoning trace aggregate (14,701 sessions,
   24.3 avg turns, category distribution) with the direct mapping
   from observed categories to ClawBench task families.

3. The internal personal-agent use-case corpus (72 queries, 12
   primary scenarios, 139 atomic capabilities) that contributes
   the scenario_weight_defaults in query_catalog.py. The source
   is not a public dataset and is only referred to as "the internal
   personal-agent use-case corpus" — no filename reference.

4. A full source-to-design-decision mapping table showing which
   design choice (tier ladder, family mix, tool diversity,
   recovery expectations, browser task count, scenario weights,
   difficulty tags, adversarial tier-5) is driven by which source.

Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.

No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.
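
The claim that no runtime code reads baselines/ can be spot-checked with a
plain-text scan. The following is a minimal sketch, not part of the
repository: the function names find_references and demo are hypothetical,
and it assumes runtime code lives in .py files and would mention the
directory literally as "baselines/". It matches substrings only, so a
comment that mentions the path is reported as well.

```python
import tempfile
from pathlib import Path


def find_references(root: Path, needle: str = "baselines/") -> list[tuple[str, int]]:
    """Return (relative path, line number) for each .py line containing needle.

    Hypothetical helper: a plain substring scan, not an import or AST analysis.
    """
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), lineno))
    return hits


def demo() -> list[tuple[str, int]]:
    """Run the scan on a throwaway tree with one clean and one offending file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "clean.py").write_text("x = 1\n", encoding="utf-8")
        (root / "reader.py").write_text(
            "import json\nopen('baselines/x.json')\n", encoding="utf-8"
        )
        return find_references(root)


if __name__ == "__main__":
    # An empty list for the real source tree would support the claim above.
    print(demo())
```

Pointing find_references at the package directory (assumed here to be
clawbench/) and getting an empty list would be consistent with the
statement; a non-empty list shows where to look.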

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-10 20:36:18 -07:00
