openclaw/clawbench

Commit Graph

Author	SHA1	Message	Date

Codex

e9ff163217

baselines: merge provenance docs into BASELINE_SOURCES.md

Replace the two separate JSON files (hermes_trace_summary.json and
basic_usage_query_summary.json) with a single markdown document that
captures every empirical source informing ClawBench's task design.

baselines/BASELINE_SOURCES.md covers:

1. The 24 public Hugging Face datasets tagged format:agent-traces,
   with owner/name, row counts, cluster classification (Pi sessions,
   custom agent traces, Claude Code, demo), and how each cluster
   maps onto ClawBench's tier/family/trajectory design decisions.
   Aggregate ~3,049 rows, ~1,168 unique sessions after mirror
   deduplication.

2. The Hermes agent reasoning trace aggregate (14,701 sessions,
   24.3 avg turns, category distribution) with the direct mapping
   from observed categories to ClawBench task families.

3. The internal personal-agent use-case corpus (72 queries, 12
   primary scenarios, 139 atomic capabilities) that contributes
   the scenario_weight_defaults in query_catalog.py. The source
   is not a public dataset and is referred to only as "the internal
   personal-agent use-case corpus", with no filename reference.

4. A full source-to-design-decision mapping table showing which
   design choice (tier ladder, family mix, tool diversity,
   recovery expectations, browser task count, scenario weights,
   difficulty tags, adversarial tier-5) is driven by which source.
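
The gap in item 1 between ~3,049 aggregate rows and ~1,168 unique sessions comes from mirror deduplication. The sketch below illustrates that idea only. The dataset names, the `session_id` key, and the `summarize` helper are hypothetical and are not taken from BASELINE_SOURCES.md; it assumes a mirrored session keeps the same identifier in every dataset that republishes it.

```python
# Hypothetical (dataset, session_id) rows. Mirror datasets republish
# sessions under a different owner/name but keep the same session_id.
rows = [
    ("owner-a/pi-sessions", "s1"),
    ("owner-a/pi-sessions", "s1"),          # same session, second row
    ("owner-a/pi-sessions", "s2"),
    ("owner-b/pi-sessions-mirror", "s1"),   # mirror of s1
    ("owner-b/pi-sessions-mirror", "s2"),   # mirror of s2
    ("owner-c/claude-code-traces", "s3"),
]


def summarize(rows):
    """Return (total_rows, unique_sessions), deduplicating across datasets."""
    total_rows = len(rows)
    # Ignoring the dataset name collapses mirrored copies of a session.
    unique_sessions = len({session_id for _, session_id in rows})
    return total_rows, unique_sessions


total, unique = summarize(rows)
print(total, unique)
```

If mirrors rewrote session identifiers, a content hash of each trace would be needed as the deduplication key instead.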

Also scrub two remaining references to the Chinese filename in
reports/CLAWBENCH_100_TASK_PLAN.md and reports/V05_DELIVERY_REPORT.md,
replacing them with pointers to baselines/BASELINE_SOURCES.md.

No runtime code paths read the baselines/ directory; these files are
provenance artifacts for the design decisions baked into tasks/ and
clawbench/query_catalog.py.
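
One way to spot-check the claim that no runtime code reads baselines/ is to scan the source tree for the directory name. This is a minimal sketch, not part of the commit: `files_referencing` is a hypothetical helper, and a plain substring match is only a heuristic, since it misses paths assembled at runtime (for example with `os.path.join("baselines", ...)`).

```python
from pathlib import Path


def files_referencing(root, needle="baselines/", suffixes=(".py",)):
    """Return sorted relative paths of files under root whose text contains needle."""
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        # Only inspect regular files with one of the requested suffixes.
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            if needle in text:
                hits.append(path.relative_to(root).as_posix())
    return sorted(hits)
```

Run from the repository root, `files_referencing("clawbench")` would be expected to return an empty list if the statement above holds for the Python package.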

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-04-10 20:36:18 -07:00

scoootscooob

2e39d5ccb2

Bench: redesign v0.4 benchmark and HF runtime

2026-04-09 11:15:30 -07:00

2 Commits