Profiles (profiles/):
- frontier_opus_4_6.yaml (Anthropic Claude Opus 4.6, closed)
- frontier_gpt_5_4.yaml (OpenAI GPT-5.4, closed)
- frontier_gemini_3_pro.yaml (Google Gemini 3.1 Pro, closed)
- frontier_glm_5_1.yaml (Zhipu AI GLM-5.1 via OpenRouter, open)
- frontier_qwen_3_6.yaml (Alibaba Qwen3.6-Plus via OpenRouter, open)
- frontier_minimax_m27.yaml (MiniMax M2.7 via OpenRouter, open)
- frontier_kimi_k25.yaml (Moonshot Kimi K2.5 via OpenRouter, open)
- example_research_stack.yaml (example for docs)

All seven frontier profiles share an identical plugin stack (anthropic + memory-lancedb + browser-playwright), so base_model is the only structural variable across the bake-off.

Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile through the harness and generates a comparison table. Wraps `clawbench run --profile` via an inline Click entry (the package has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer (per-bucket mean, Taguchi S/N, per-task win rate, calibration, fANOVA). Classifies OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py, scale_timeouts.py, seed_historical_db.py: task-corpus tooling.

Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6 scored 63.9% with real token streaming (174K tok, $0.18 cost). The other 6 clustered at 33.8%-41.6%: tier-1 tasks are too easy to separate frontier models at n=1. Documents infrastructure findings around gateway plugin allowlist behavior, token streaming gaps for non-Anthropic providers, and the hot-reload cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning

Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off (committed snapshot for reproducibility; runtime results still go to results/, which remains gitignored)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
48 lines
1.5 KiB
Python
"""Scale every task's timeout_seconds by a factor.

Opus is ~3x slower per-call than Sonnet. When we run Opus on timeouts
that were sized for Sonnet, every task gets cut off mid-run and scored
as if it failed. Scaling timeouts up lets us measure Opus's actual
capability instead of its unluckiness with our 240s defaults.

Scaling is applied in place to whatever values the task files currently
hold, so repeated runs compound (3.0 twice gives 9x).

Usage:
    python scripts/scale_timeouts.py 3.0   # triple all timeouts
    python scripts/scale_timeouts.py 1.0   # no-op: leaves current values unchanged
"""

from __future__ import annotations

import re
import sys
from pathlib import Path

TASKS_DIR = Path(__file__).resolve().parents[1] / "tasks"

# Matches task-level and phase-level (indented) timeout_seconds lines.
# [ \t] rather than \s keeps the match on one line, so a blank line that
# follows a timeout is not swallowed by the substitution.
TIMEOUT_RE = re.compile(r"^([ \t]*timeout_seconds):[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def main() -> None:
    if len(sys.argv) != 2:
        print("usage: python scripts/scale_timeouts.py <scale>")
        sys.exit(1)
    try:
        scale = float(sys.argv[1])
    except ValueError:
        sys.exit(f"scale must be a number, got {sys.argv[1]!r}")
    if scale <= 0:
        sys.exit("scale must be positive")

    def repl(m: re.Match) -> str:
        # Keep the key (with its indentation) and clamp to at least 1 second.
        scaled = max(1, round(int(m.group(2)) * scale))
        return f"{m.group(1)}: {scaled}"

    touched = 0
    for yml in TASKS_DIR.rglob("t*.yaml"):
        raw = yml.read_text(encoding="utf-8")
        new = TIMEOUT_RE.sub(repl, raw)
        if new != raw:
            # Only rewrite files whose timeouts actually changed.
            yml.write_text(new, encoding="utf-8")
            touched += 1
    print(f"scaled timeouts in {touched} task files by {scale}x")


if __name__ == "__main__":
    main()
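The substitution step can be tried on an in-memory string before touching any task files. The following is a minimal sketch, not part of the script: the helper name `scale_timeouts_in_text` and the `SAMPLE` task contents are hypothetical. The single indentation-agnostic pattern is an assumption about the intent, since the script's own patterns target specific indentation levels.

```python
import re

# Assumed pattern: a whole `timeout_seconds: <int>` line at any indentation.
# [ \t] is used instead of \s so the match never consumes a newline.
TIMEOUT_RE = re.compile(r"^([ \t]*timeout_seconds):[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def scale_timeouts_in_text(raw: str, scale: float) -> str:
    """Return `raw` with every integer timeout_seconds value multiplied by `scale`."""

    def repl(m: re.Match) -> str:
        # Preserve the key and its indentation; never go below 1 second.
        scaled = max(1, round(int(m.group(2)) * scale))
        return f"{m.group(1)}: {scaled}"

    return TIMEOUT_RE.sub(repl, raw)


# Hypothetical task file contents, for illustration only.
SAMPLE = (
    "id: t1-example\n"
    "timeout_seconds: 240\n"
    "\n"
    "phases:\n"
    "  - name: setup\n"
    "    timeout_seconds: 60\n"
)

if __name__ == "__main__":
    print(scale_timeouts_in_text(SAMPLE, 3.0))
```

Keeping the transformation as a pure string-to-string function makes it easy to check that a scale of 1.0 leaves the text byte-for-byte unchanged and that blank lines after a timeout survive, before running the real script against `tasks/`.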