5 Commits

Author SHA1 Message Date
scoootscooob
01a31e55fb sweep: per-container state isolation + qwen model-id fix
scripts/container_sweep_single.sh: clone pristine OpenClaw state to
/tmp/ per sweep before starting the gateway. Carries over config
(openclaw.json, identity/, devices/, exec-approvals.json, tasks/,
subagents/, flows/, cron/) but leaves runtime dirs (agents/,
workspace*/, logs/, memory/, cache/) empty. Sets OPENCLAW_STATE_DIR
to the isolated dir so the gateway writes to /tmp instead of the
shared host mount. Fixes the cascading "RPC agents.create timed out
after 60s" failures caused by 4k+ stale agents accumulating across
sequential sweeps.
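
A minimal Python sketch of the isolation step described above. It is not the contents of container_sweep_single.sh, which is a shell script. The entry names come from this message. The function names, the temp-dir prefix, and treating workspace*/ as a single "workspace" directory are assumptions made for illustration.

```python
import os
import shutil
import tempfile
from pathlib import Path

# Entry names taken from the commit message above.
CONFIG_ENTRIES = ["openclaw.json", "identity", "devices", "exec-approvals.json",
                  "tasks", "subagents", "flows", "cron"]
# Assumption: "workspace*/" is simplified to a single "workspace" directory.
RUNTIME_DIRS = ["agents", "workspace", "logs", "memory", "cache"]


def make_isolated_state(pristine: Path, tmp_root: Path = Path("/tmp")) -> Path:
    """Copy config entries from `pristine` into a fresh dir; leave runtime dirs empty."""
    isolated = Path(tempfile.mkdtemp(prefix="openclaw-state-", dir=str(tmp_root)))
    for name in CONFIG_ENTRIES:
        src = pristine / name
        if src.is_dir():
            shutil.copytree(src, isolated / name)
        elif src.is_file():
            shutil.copy2(src, isolated / name)
        # Entries missing from the pristine state are skipped.
    for name in RUNTIME_DIRS:
        # Runtime dirs are created empty, so stale agents are never carried over.
        (isolated / name).mkdir(exist_ok=True)
    return isolated


def gateway_env(isolated: Path) -> dict:
    """Environment for the gateway process, pointing it at the isolated state dir."""
    env = dict(os.environ)
    env["OPENCLAW_STATE_DIR"] = str(isolated)
    return env
```

Each sweep would call make_isolated_state once and start the gateway with gateway_env(...), so nothing written during one sweep is visible to the next.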

profiles/frontier_qwen_3_6.yaml: fix base_model from
openrouter/qwen/qwen-3.6-plus (with dash) to openrouter/qwen/qwen3.6-plus
(no dash). The dashed slug is unknown to OpenRouter and silently fails;
the no-dash slug is the canonical one.
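
Because the bad slug failed silently, a pre-flight check that fails loudly may be worth having. The sketch below is hypothetical and not part of this commit: check_base_model and the known_ids argument are illustrative names, and where the set of valid slugs comes from is left to the caller.

```python
from typing import Iterable


def check_base_model(base_model: str, known_ids: Iterable[str]) -> str:
    """Raise if an openrouter/ route names a slug that is not in `known_ids`."""
    prefix = "openrouter/"
    if not base_model.startswith(prefix):
        # Not an OpenRouter route; this check does not apply.
        return base_model
    slug = base_model[len(prefix):]
    if slug not in set(known_ids):
        raise ValueError(f"unknown OpenRouter slug: {slug!r}")
    return base_model


# Example with a hand-written set holding the slug from this commit.
known = {"qwen/qwen3.6-plus"}
check_base_model("openrouter/qwen/qwen3.6-plus", known)    # passes
# check_base_model("openrouter/qwen/qwen-3.6-plus", known) # raises ValueError
```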

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 19:48:30 -07:00
scoootscooob
ee8ff79347 docs: fix ollama profile guidance 2026-04-16 19:49:04 -07:00
pllm-uci
517f2207b0 Refine local Ollama profile documentation for clarity and usability 2026-04-15 11:45:57 -07:00
pllm-uci
e2d82b34c3 Add local Ollama model support and configuration guidance to README and profiles 2026-04-15 11:45:12 -07:00
Codex
4744a6ae7e ClawBench: 7-model frontier baseline + bake-off tooling
Profiles (profiles/):
- frontier_opus_4_6.yaml      (Anthropic Claude Opus 4.6 — closed)
- frontier_gpt_5_4.yaml       (OpenAI GPT-5.4 — closed)
- frontier_gemini_3_pro.yaml  (Google Gemini 3.1 Pro — closed)
- frontier_glm_5_1.yaml       (Zhipu AI GLM-5.1 via OpenRouter — open)
- frontier_qwen_3_6.yaml      (Alibaba Qwen3.6-Plus via OpenRouter — open)
- frontier_minimax_m27.yaml   (MiniMax M2.7 via OpenRouter — open)
- frontier_kimi_k25.yaml      (Moonshot Kimi K2.5 via OpenRouter — open)
- example_research_stack.yaml (example for docs)

All seven profiles share an identical plugin stack (anthropic +
memory-lancedb + browser-playwright) so base_model is the only
structural variable across the bake-off.

Scripts (scripts/):
- run_open_vs_closed_bakeoff.py: driver that runs each profile
  through the harness and generates a comparison table. Wraps
  `clawbench run --profile` via an inline Click entry (the package
  has no __main__.py so `python -m clawbench.cli` is a no-op).
- analyze_open_vs_closed.py: historical DB analyzer that reports per-bucket
  mean, Taguchi S/N, per-task win rate, calibration, and fANOVA. Classifies
  OpenRouter routes by inner vendor prefix so Zhipu/Qwen/MiniMax/
  Moonshot land in the open bucket.
- ingest_real_run.py, inject_judge_rubrics.py, refactor_verifiers.py,
  scale_timeouts.py, seed_historical_db.py: task-corpus tooling.
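
A rough Python sketch of two ideas from the analyzer bullet above: bucketing by inner vendor prefix and a Taguchi signal-to-noise ratio. It is not taken from analyze_open_vs_closed.py. The vendor prefixes other than "qwen" are illustrative guesses, treating every non-listed route as closed is an assumption, and the larger-is-better S/N variant is assumed because the message does not say which variant the script uses.

```python
import math
from typing import Sequence

# Assumption: only "qwen" is confirmed by the commit log; the rest are placeholders.
OPEN_VENDORS = {"qwen", "z-ai", "minimax", "moonshotai"}


def classify_route(base_model: str) -> str:
    """Return "open" or "closed" for a base_model string such as
    "openrouter/qwen/qwen3.6-plus", keyed on the vendor after "openrouter/"."""
    parts = base_model.split("/")
    if len(parts) >= 3 and parts[0] == "openrouter" and parts[1] in OPEN_VENDORS:
        return "open"
    # Assumption: direct-provider routes and unlisted vendors count as closed.
    return "closed"


def taguchi_sn_larger_is_better(scores: Sequence[float]) -> float:
    """Larger-is-better S/N in dB: -10 * log10(mean(1 / y^2))."""
    if not scores or any(y <= 0 for y in scores):
        raise ValueError("scores must be non-empty and strictly positive")
    mean_inv_sq = sum(1.0 / (y * y) for y in scores) / len(scores)
    return -10.0 * math.log10(mean_inv_sq)
```

With this S/N variant, a model that scores high consistently ranks above one with the same mean but more run-to-run variance.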

Reports (reports/):
- FRONTIER_7MODEL_BASELINE.md: full writeup of the 7-model run
  (3 tier-1 coding tasks, 1 run each, concurrency 3). Opus 4.6
  scored 63.9% with real token streaming (174K tokens, $0.18 cost).
  The other 6 clustered at 33.8%-41.6% — tier-1 tasks are too
  easy to separate frontier models at n=1. Documents
  infrastructure findings around gateway plugin allowlist behavior,
  token streaming gaps for non-Anthropic providers, and hot-reload
  cascade when config changes mid-run.
- open_vs_closed_bakeoff_summary.md: auto-generated headline table
- FULL_BENCHMARK_REPORT.md: Sonnet 4.6 vs Opus 4.6 40-task run
- REAL_BENCHMARK_RESULTS.md: earlier v0.3 3-task reliability run
- PARALLEL_HARNESS_REPORT.md: concurrency validation writeup
- V05_DELIVERY_REPORT.md: v0.5 framework delivery notes
- CLAWBENCH_100_TASK_PLAN.md, CONTRIBUTING_TASKS.md: corpus planning

Artifacts (reports/artifacts/):
- frontier_*.json: the 7 BenchmarkResult files from the bake-off
  (committed snapshot for reproducibility; runtime results still
  go to results/ which remains gitignored)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:14:11 -07:00