Tests:
- tests/test_v05_framework.py (646 lines): end-to-end synthetic ecosystem
covering profile parsing, fingerprint computation, k-NN prediction,
surprise detection, factor analysis, diagnostic rendering
- tests/test_v05_extensions.py (552 lines): unit tests for Taguchi S/N
robustness profile, plugin utilization audit, manifest-vs-reality gap,
calibration tracking, surprise cause attribution, recommendations
generator, insights publishing, end-to-end diagnostic with all sections
- tests/test_scorer.py: judge gating tests (the judge cannot rescue a
failed deterministic completion; the judge is capped at 10% when a
deterministic verifier exists and the floor is met; the judge dominates
at 50% on semantic-only tasks)
- tests/test_e2e_significance.py, test_parallel_harness.py:
additional coverage for harness behavior
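
The judge gating rules exercised by tests/test_scorer.py can be illustrated
with a minimal sketch. This is not the scorer's actual code: the function
name combine_scores, the `other` component, the default floor of 0.5, and
the choice to zero the judge term below the floor are all assumptions made
for illustration. Only the 10% and 50% weights come from the notes above.

```python
from typing import Optional

# Weights taken from the commit notes; everything else is hypothetical.
JUDGE_WEIGHT_WITH_VERIFIER = 0.10
JUDGE_WEIGHT_SEMANTIC_ONLY = 0.50


def combine_scores(
    deterministic: Optional[float],
    judge: float,
    other: float = 0.0,
    floor: float = 0.5,
) -> float:
    """Blend a deterministic verifier score with an LLM judge score.

    deterministic: verifier score in [0, 1], or None for semantic-only tasks.
    judge: judge score in [0, 1].
    other: hypothetical non-judge component used on semantic-only tasks.
    floor: hypothetical minimum deterministic score before the judge counts.
    """
    if deterministic is None:
        # Semantic-only task: the judge carries half of the total weight.
        w = JUDGE_WEIGHT_SEMANTIC_ONLY
        return w * judge + (1.0 - w) * other

    w = JUDGE_WEIGHT_WITH_VERIFIER
    if deterministic < floor:
        # Floor not met: the judge term is dropped, so a high judge score
        # cannot rescue a failed deterministic completion.
        return (1.0 - w) * deterministic

    # Floor met: the judge contributes at most 10% of the final score.
    return (1.0 - w) * deterministic + w * judge
```

Under these assumptions, combine_scores(0.0, 1.0) stays at 0.0 no matter
how generous the judge is, while combine_scores(1.0, 0.0) loses only the
judge's 10% share.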

Task corpus expansion:
- 20 new task YAMLs across tier1-4 covering fs, web, calendar,
messaging, data processing, social coordination, life assistance,
context continuation, error boundary, skill calling, privacy
redaction scenarios
- Fresh asset packs for each new task (test fixtures + reference
inputs/outputs)
- Lower tier-1 coding task timeouts from 360s to 180s to cut time wasted
waiting for a final state (the gateway emits no chat.state:final event,
so the wait is pure overhead; 180s is plenty for any tier-1 task)
- Modify tier2-5 task YAMLs for verifier robustness and judge rubric
updates

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>