clawbench/tests
Codex 4aa017838a ClawBench v0.5: tests + task corpus expansion
Tests:
- tests/test_v05_framework.py (646 lines): end-to-end synthetic ecosystem
  covering profile parsing, fingerprint computation, k-NN prediction,
  surprise detection, factor analysis, diagnostic rendering
- tests/test_v05_extensions.py (552 lines): unit tests for Taguchi S/N
  robustness profile, plugin utilization audit, manifest-vs-reality gap,
  calibration tracking, surprise cause attribution, recommendations
  generator, insights publishing, end-to-end diagnostic with all sections
- tests/test_scorer.py: judge gating tests (judge cannot rescue failed
  deterministic completion; judge capped at 10% when deterministic
  verifier exists and floor met; judge dominates at 50% on semantic-
  only tasks)
- tests/test_e2e_significance.py, tests/test_parallel_harness.py:
  additional coverage for harness behavior
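
The judge gating rules exercised by tests/test_scorer.py can be
illustrated with a minimal sketch. This is not the ClawBench scorer: the
function name blend_score, the `other` component, the default floor of
0.5, and the assumption that scores lie in [0, 1] are all hypothetical.
Only the 10% and 50% judge weights and the "judge cannot rescue a failed
deterministic completion" rule come from the notes above.

```python
from typing import Optional

# Judge weights quoted in the commit notes; everything else here is assumed.
JUDGE_WEIGHT_WITH_VERIFIER = 0.10
JUDGE_WEIGHT_SEMANTIC_ONLY = 0.50


def blend_score(
    judge: float,
    deterministic: Optional[float],
    other: float = 0.0,   # hypothetical non-judge component for semantic-only tasks
    floor: float = 0.5,   # hypothetical deterministic floor
) -> float:
    """Combine a judge score with a deterministic verifier score (both in [0, 1])."""
    if deterministic is None:
        # Semantic-only task: no deterministic verifier, judge carries 50%.
        w = JUDGE_WEIGHT_SEMANTIC_ONLY
        return w * judge + (1.0 - w) * other

    w = JUDGE_WEIGHT_WITH_VERIFIER
    base = (1.0 - w) * deterministic
    if deterministic < floor:
        # Floor not met: the judge contributes nothing, so it cannot
        # rescue a failed deterministic completion.
        return base
    # Floor met: the judge adds at most 10% of the total.
    return base + w * judge
```

Under these assumptions, a perfect judge score leaves a run with a
failed deterministic check at 0, while the same judge score on a
semantic-only task contributes exactly half of the total.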

Task corpus expansion:
- 20 new task YAMLs across tier1-4 covering fs, web, calendar,
  messaging, data processing, social coordination, life assistance,
  context continuation, error boundary, skill calling, privacy
  redaction scenarios
- Fresh asset packs for each new task (test fixtures + reference
  inputs/outputs)
- Lower tier-1 coding task timeouts from 360s to 180s to cut time
  wasted waiting for a final state (the gateway emits no
  chat.state:final event, so the wait is pure overhead; 180s is ample
  for any tier-1 task)
- Update tier2-5 task YAMLs to make verifiers more robust and to
  refresh judge rubrics

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:13:37 -07:00
test_client.py              Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_e2e_significance.py    ClawBench v0.5: tests + task corpus expansion  2026-04-10 19:13:37 -07:00
test_environment.py         Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_harness.py             Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_integration_checks.py  Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_judge.py               Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_parallel_harness.py    ClawBench v0.5: tests + task corpus expansion  2026-04-10 19:13:37 -07:00
test_queue.py               Queue: heartbeat and reclaim stale jobs        2026-04-09 15:45:17 -07:00
test_scorer.py              ClawBench v0.5: tests + task corpus expansion  2026-04-10 19:13:37 -07:00
test_services.py            Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_session_labels.py      Gateway: use unique benchmark session labels   2026-04-09 18:32:41 -07:00
test_simulated_user.py      Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_stats.py               Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_tasks.py               Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_trajectory.py          Bench: redesign v0.4 benchmark and HF runtime  2026-04-09 11:15:30 -07:00
test_v05_extensions.py      ClawBench v0.5: tests + task corpus expansion  2026-04-10 19:13:37 -07:00
test_v05_framework.py       ClawBench v0.5: tests + task corpus expansion  2026-04-10 19:13:37 -07:00
test_worker.py              Queue: heartbeat and reclaim stale jobs        2026-04-09 15:45:17 -07:00