clawbench

Codex 4aa017838a ClawBench v0.5: tests + task corpus expansion		2026-04-10 19:13:37 -07:00

Tests:
- tests/test_v05_framework.py (646 lines): end-to-end synthetic ecosystem
  covering profile parsing, fingerprint computation, k-NN prediction,
  surprise detection, factor analysis, diagnostic rendering
- tests/test_v05_extensions.py (552 lines): unit tests for Taguchi S/N
  robustness profile, plugin utilization audit, manifest-vs-reality gap,
  calibration tracking, surprise cause attribution, recommendations
  generator, insights publishing, end-to-end diagnostic with all sections
- tests/test_scorer.py: judge gating tests (judge cannot rescue failed
  deterministic completion; judge capped at 10% when deterministic
  verifier exists and floor met; judge dominates at 50% on semantic-only
  tasks)
- tests/test_e2e_significance.py, test_parallel_harness.py: additional
  coverage for harness behavior

Task corpus expansion:
- 20 new task YAMLs across tier1-4 covering fs, web, calendar, messaging,
  data processing, social coordination, life assistance, context
  continuation, error boundary, skill calling, privacy redaction scenarios
- Fresh asset packs for each new task (test fixtures + reference
  inputs/outputs)
- Lower tier-1 coding task timeouts from 360s to 180s to avoid final-state
  wait waste (the gateway emits no chat.state:final event, so the wait is
  pure overhead; 180s is plenty for any tier-1 task)
- Modify tier2-5 task YAMLs for verifier robustness and judge rubric
  updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
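The judge gating rules named for tests/test_scorer.py can be read as a small weighting function. The sketch below is only one possible interpretation of the commit message, not the scorer's actual code: the function name `judge_weight`, its parameters, and the default `floor` of 0.5 are all hypothetical. Only the 0%, 10%, and 50% outcomes come from the commit text.

```python
from typing import Optional


def judge_weight(
    has_deterministic_verifier: bool,
    deterministic_score: Optional[float],
    floor: float = 0.5,  # assumed threshold; the commit does not state the real floor
) -> float:
    """Return the share of the final score that the judge may contribute.

    Hypothetical helper illustrating the three gating cases from the commit.
    """
    if not has_deterministic_verifier:
        # Semantic-only task: the judge carries 50% of the score.
        return 0.50
    if deterministic_score is None or deterministic_score < floor:
        # Deterministic completion failed: the judge cannot rescue the run.
        return 0.0
    # Deterministic verifier exists and the floor is met: judge capped at 10%.
    return 0.10
```

Under these assumptions, `judge_weight(True, 0.2)` gives the judge no influence, while `judge_weight(True, 0.9)` allows it at most a tenth of the final score.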
test_client.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_e2e_significance.py	ClawBench v0.5: tests + task corpus expansion	2026-04-10 19:13:37 -07:00
test_environment.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_harness.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_integration_checks.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_judge.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_parallel_harness.py	ClawBench v0.5: tests + task corpus expansion	2026-04-10 19:13:37 -07:00
test_queue.py	Queue: heartbeat and reclaim stale jobs	2026-04-09 15:45:17 -07:00
test_scorer.py	ClawBench v0.5: tests + task corpus expansion	2026-04-10 19:13:37 -07:00
test_services.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_session_labels.py	Gateway: use unique benchmark session labels	2026-04-09 18:32:41 -07:00
test_simulated_user.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_stats.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_tasks.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_trajectory.py	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
test_v05_extensions.py	ClawBench v0.5: tests + task corpus expansion	2026-04-10 19:13:37 -07:00
test_v05_framework.py	ClawBench v0.5: tests + task corpus expansion	2026-04-10 19:13:37 -07:00
test_worker.py	Queue: heartbeat and reclaim stale jobs	2026-04-09 15:45:17 -07:00