clawbench/scripts/run_open_vs_closed_bakeoff.py
Codex 4744a6ae7e ClawBench: 7-model frontier baseline + bake-off tooling
Profiles (profiles/):
- frontier_opus_4_6.yaml      (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml       (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml  (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml       (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml      (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml   (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml      (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)

All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.

Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile
  through the harness and generates a comparison table. Wraps
  `clawbench run --profile` via an inline Click entry (the package
  has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer — per-bucket
  mean/Taguchi S/N/per-task win rate/calibration/fANOVA. Classifies
  OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/
  Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
  scale_timeouts.py, seed_historical_db.py: task-corpus tooling.

Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run
  (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6
  scored 63.9% with real token streaming (174K tok, $0.18 cost).
  The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too
  easy to separate frontier models at n=1. Documents
  infrastructure findings around gateway plugin allowlist behavior,
  token streaming gaps for non-Anthropic providers, and hot-reload
  cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning

Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
  (committed snapshot for reproducibility; runtime results still
  go to results/ which remains gitignored)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:14:11 -07:00

468 lines
15 KiB
Python
Executable File
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

#!/usr/bin/env python3
"""Driver for the ClawBench open-source vs closed-source bake-off.
Runs four model profiles against the full 40-task suite with the judge
enabled, records each run through the v0.5 Configuration Diagnostic
pipeline, and publishes ecosystem insights at the end.
Usage:
python scripts/run_open_vs_closed_bakeoff.py \
[--runs 3] \
[--concurrency 6] \
[--judge-model anthropic/claude-sonnet-4-6] \
[--gateway-token $OPENCLAW_GATEWAY_TOKEN] \
[--dry-run]
The four profiles (bundled in profiles/):
bakeoff_sonnet_4_6.yaml anthropic/claude-sonnet-4-6 (closed)
bakeoff_opus_4_6.yaml anthropic/claude-opus-4-6 (closed)
bakeoff_qwen3_32b.yaml huggingface/Qwen/Qwen3-32B (open)
bakeoff_deepseek_v3.yaml huggingface/deepseek-ai/DeepSeek-V3 (open)
All four profiles use an identical plugin stack so the base model is
the only structural variable. The v0.5 fingerprint will reflect this.
Each run invokes `clawbench run --profile` which:
1. Runs the full 40-task suite at --runs per task
2. Records the run in .clawbench/historical/profile_runs.json
3. Publishes ecosystem insights to .clawbench/insights/
4. Writes a Configuration Diagnostic Report per submission
After all four runs complete, this script writes a comparison table
to results/open_vs_closed_bakeoff_summary.md so you have a single file
to publish or post.
"""
from __future__ import annotations
import argparse
import json
import os
import subprocess
import sys
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
REPO_ROOT = Path(__file__).resolve().parents[1]
PROFILES_DIR = REPO_ROOT / "profiles"
RESULTS_DIR = REPO_ROOT / "results"
HISTORICAL_DB = REPO_ROOT / ".clawbench" / "historical" / "profile_runs.json"
@dataclass
class BakeoffProfile:
profile_path: Path
model: str
category: str # "closed" or "open"
display_name: str
BAKEOFF: list[BakeoffProfile] = [
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_opus_4_6.yaml",
model="anthropic/claude-opus-4-6",
category="closed",
display_name="Claude Opus 4.6",
),
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_gpt_5_4.yaml",
model="openai/gpt-5.4",
category="closed",
display_name="GPT-5.4",
),
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_gemini_3_pro.yaml",
model="google/gemini-3.1-pro-preview",
category="closed",
display_name="Gemini 3.1 Pro",
),
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_glm_5_1.yaml",
model="openrouter/z-ai/glm-5.1",
category="open",
display_name="GLM-5.1",
),
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_qwen_3_6.yaml",
model="openrouter/qwen/qwen-3.6-plus",
category="open",
display_name="Qwen3.6-Plus",
),
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_minimax_m27.yaml",
model="openrouter/minimax/minimax-m2.7",
category="open",
display_name="MiniMax M2.7",
),
BakeoffProfile(
profile_path=PROFILES_DIR / "frontier_kimi_k25.yaml",
model="openrouter/moonshotai/kimi-k2.5",
category="open",
display_name="Kimi K2.5",
),
]
def run_one(
profile: BakeoffProfile,
*,
runs: int,
concurrency: int,
judge_model: str,
gateway_token: str,
python_bin: str,
dry_run: bool,
tasks: list[str] | None = None,
) -> Path:
"""Invoke `clawbench run --profile` for one model.
The clawbench package does not ship a `__main__.py`, so `python -m
clawbench.cli` is a no-op (defines `main` but never calls it). We
invoke the CLI via an inline `-c` that drives the Click group
directly — this is the same path `pyproject.toml` uses for the
installed `clawbench` script entry point.
"""
output = RESULTS_DIR / f"{profile.profile_path.stem}.json"
args = [
"run",
"--model",
profile.model,
"--runs",
str(runs),
"--concurrency",
str(concurrency),
"--browser-concurrency",
"1",
"--judge-model",
judge_model,
"--gateway-token",
gateway_token,
"--profile",
str(profile.profile_path),
"--output",
str(output),
]
for task_id in (tasks or []):
args.extend(["--task", task_id])
cmd = [
python_bin,
"-c",
f"from clawbench.cli import cli; cli({args!r}, standalone_mode=False)",
]
print(
f"\n{'' * 70}\n [{profile.category.upper():6}] "
f"{profile.display_name} ({profile.model})\n{'' * 70}"
)
print("", " ".join(cmd))
if dry_run:
print(" (dry run — not executing)")
return output
env = os.environ.copy()
if gateway_token:
env["OPENCLAW_GATEWAY_TOKEN"] = gateway_token
proc = subprocess.run(cmd, cwd=REPO_ROOT, env=env)
if proc.returncode != 0:
print(
f" ! run for {profile.display_name} exited with code "
f"{proc.returncode}",
file=sys.stderr,
)
return output
def extract_summary(result_path: Path) -> dict:
"""Pull the headline fields we need for the comparison table."""
if not result_path.exists():
return {"error": "result file missing", "path": str(result_path)}
try:
data = json.loads(result_path.read_text(encoding="utf-8"))
except Exception as exc:
return {"error": f"parse error: {exc}", "path": str(result_path)}
return {
"model": data.get("model", ""),
"overall_score": data.get("overall_score"),
"overall_completion": data.get("overall_completion"),
"overall_trajectory": data.get("overall_trajectory"),
"overall_behavior": data.get("overall_behavior"),
"overall_reliability": data.get("overall_reliability"),
"overall_pass_hat_k": data.get("overall_pass_hat_k"),
"overall_judge_score": data.get("overall_judge_score"),
"judge_task_coverage": data.get("judge_task_coverage"),
"overall_median_latency_ms": data.get("overall_median_latency_ms"),
"overall_tokens_per_pass": data.get("overall_tokens_per_pass"),
"overall_cost_per_pass": data.get("overall_cost_per_pass"),
"hard_subset_score": data.get("hard_subset_score"),
"consensus_subset_score": data.get("consensus_subset_score"),
"n_tasks": len(data.get("task_results", [])),
}
def fmt(v, digits: int = 3) -> str:
if v is None:
return ""
try:
return f"{float(v):.{digits}f}"
except (TypeError, ValueError):
return str(v)
def fmt_pct(v) -> str:
if v is None:
return ""
try:
return f"{float(v) * 100:.1f}%"
except (TypeError, ValueError):
return str(v)
def fmt_dollar(v) -> str:
if v is None:
return ""
try:
return f"${float(v):.4f}"
except (TypeError, ValueError):
return str(v)
def fmt_int(v) -> str:
if v is None:
return ""
try:
return f"{int(round(float(v))):,}"
except (TypeError, ValueError):
return str(v)
def write_comparison_table(
profiles: Iterable[BakeoffProfile],
summaries: dict[str, dict],
output_path: Path,
) -> None:
"""Render the four-model open-vs-closed comparison as a markdown file."""
profiles = list(profiles)
lines: list[str] = []
lines.append("# ClawBench Open-Source vs Closed-Source Bake-off")
lines.append("")
lines.append(
"All four profiles share an **identical plugin stack** "
"(`anthropic` + `memory-lancedb` + `browser-playwright`) "
"so the base model is the only structural variable."
)
lines.append("")
lines.append("## Headline")
lines.append("")
header = (
"| Metric | "
+ " | ".join(f"{p.display_name}<br/>*{p.category}*" for p in profiles)
+ " |"
)
lines.append(header)
lines.append("|---" + "|---:" * len(profiles) + "|")
rows = [
("Overall score", "overall_score", fmt),
("Completion (deterministic)", "overall_completion", fmt),
("Trajectory (deterministic)", "overall_trajectory", fmt),
("Behavior (deterministic)", "overall_behavior", fmt),
("Reliability", "overall_reliability", fmt),
("pass^k", "overall_pass_hat_k", fmt_pct),
("Judge score", "overall_judge_score", fmt),
("Judge coverage", "judge_task_coverage", fmt_pct),
("Hard subset", "hard_subset_score", fmt),
("Consensus subset", "consensus_subset_score", fmt),
("Median latency (ms)", "overall_median_latency_ms", fmt_int),
("Tokens / pass", "overall_tokens_per_pass", fmt_int),
("Cost / pass", "overall_cost_per_pass", fmt_dollar),
]
for label, key, formatter in rows:
values = [formatter(summaries[p.display_name].get(key)) for p in profiles]
lines.append(f"| {label} | " + " | ".join(values) + " |")
lines.append("")
lines.append("## Category aggregates")
lines.append("")
closed = [
s for p in profiles if p.category == "closed"
for s in [summaries[p.display_name]]
if s.get("overall_score") is not None
]
open_ = [
s for p in profiles if p.category == "open"
for s in [summaries[p.display_name]]
if s.get("overall_score") is not None
]
def mean(seq, key):
vals = [s[key] for s in seq if s.get(key) is not None]
return sum(vals) / len(vals) if vals else None
lines.append("| | Closed (mean) | Open (mean) | Gap (closed open) |")
lines.append("|---|---:|---:|---:|")
for label, key, formatter in [
("Overall score", "overall_score", fmt),
("Completion", "overall_completion", fmt),
("Reliability", "overall_reliability", fmt),
("Cost / pass", "overall_cost_per_pass", fmt_dollar),
]:
c = mean(closed, key)
o = mean(open_, key)
gap = (c - o) if (c is not None and o is not None) else None
lines.append(
f"| {label} | {formatter(c)} | {formatter(o)} | "
f"{('+' + formatter(gap)) if gap is not None and gap >= 0 else formatter(gap)} |"
)
lines.append("")
lines.append("## Sources")
lines.append("")
for p in profiles:
result_path = RESULTS_DIR / f"bakeoff_{p.profile_path.stem}.json"
lines.append(
f"- **{p.display_name}** ({p.category}): `{result_path.relative_to(REPO_ROOT)}`"
)
lines.append("")
lines.append("## v0.5 Diagnostic")
lines.append("")
lines.append(
"Each run was recorded through the v0.5 Configuration Diagnostic "
"pipeline. See `.clawbench/historical/profile_runs.json` for the "
"fingerprint database and `.clawbench/insights/` for the "
"ecosystem-level plugin leaderboard, factor importance, and "
"calibration metrics refreshed after every submission."
)
lines.append("")
output_path.write_text("\n".join(lines), encoding="utf-8")
print(f"\n✓ wrote comparison table → {output_path.relative_to(REPO_ROOT)}")
def main() -> None:
parser = argparse.ArgumentParser(
description="ClawBench open-source vs closed-source bake-off driver"
)
parser.add_argument(
"--runs",
type=int,
default=3,
help="Runs per task. v0.4 spec §'Official Run Policy' mandates ≥3.",
)
parser.add_argument(
"--concurrency",
type=int,
default=6,
help="Parallel (task, run) workers against the gateway.",
)
parser.add_argument(
"--judge-model",
default="anthropic/claude-sonnet-4-6",
help="LLM judge model (same for all four runs so the judge side is held constant).",
)
parser.add_argument(
"--gateway-token",
default=os.environ.get("OPENCLAW_GATEWAY_TOKEN", ""),
help="Gateway auth token (defaults to $OPENCLAW_GATEWAY_TOKEN).",
)
parser.add_argument(
"--python-bin",
default=str(REPO_ROOT / ".venv" / "bin" / "python"),
help="Python interpreter used to invoke clawbench.cli.",
)
parser.add_argument(
"--skip",
action="append",
default=[],
help="Display name of a profile to skip (e.g. 'Opus 4.6'). May be repeated.",
)
parser.add_argument(
"--only",
action="append",
default=[],
help="Run only the named profile(s). May be repeated.",
)
parser.add_argument(
"--dry-run",
action="store_true",
help="Print the command for each run but do not execute.",
)
parser.add_argument(
"--summary-only",
action="store_true",
help="Skip running; re-read existing result files and regenerate the comparison table.",
)
parser.add_argument(
"--task",
action="append",
default=[],
help="Run only these task IDs (may be repeated). Defaults to the full suite.",
)
args = parser.parse_args()
RESULTS_DIR.mkdir(parents=True, exist_ok=True)
selected: list[BakeoffProfile] = []
for p in BAKEOFF:
if args.only and p.display_name not in args.only:
continue
if p.display_name in args.skip:
continue
selected.append(p)
if not selected:
print("no profiles selected; nothing to do", file=sys.stderr)
sys.exit(1)
print(
f"\nClawBench open-vs-closed bake-off\n"
f" runs/task: {args.runs}\n"
f" concurrency: {args.concurrency}\n"
f" judge: {args.judge_model}\n"
f" profiles: {len(selected)} "
f"({sum(1 for p in selected if p.category == 'closed')} closed, "
f"{sum(1 for p in selected if p.category == 'open')} open)\n"
)
result_paths: dict[str, Path] = {}
if args.summary_only:
for p in selected:
result_paths[p.display_name] = (
RESULTS_DIR / f"bakeoff_{p.profile_path.stem}.json"
)
else:
for p in selected:
result_paths[p.display_name] = run_one(
p,
runs=args.runs,
concurrency=args.concurrency,
judge_model=args.judge_model,
gateway_token=args.gateway_token,
python_bin=args.python_bin,
dry_run=args.dry_run,
tasks=args.task or None,
)
if args.dry_run:
print(
"\ndry run complete. Re-run without --dry-run to execute.\n"
"Budget estimate (3 runs × 40 tasks × 4 models × $0.05 avg/pass ≈ $24 + gateway time)."
)
return
summaries = {
p.display_name: extract_summary(result_paths[p.display_name])
for p in selected
}
summary_path = RESULTS_DIR / "open_vs_closed_bakeoff_summary.md"
write_comparison_table(selected, summaries, summary_path)
print(
"\nAll runs complete. Ecosystem insights refreshed in "
f"{(REPO_ROOT / '.clawbench' / 'insights').relative_to(REPO_ROOT)}/."
)
if __name__ == "__main__":
main()