clawbench

Author	SHA1	Message	Date
Vincent Koc	fb486a1ed3	fix(scoring): gate judge-weighted scores	2026-04-28 22:52:12 -07:00
Vincent Koc	0625ab7159	fix(runtime): harden queue and gateway lifecycle	2026-04-28 11:34:53 -07:00
Vincent Koc	38a2a0ff91	perf(app): cache leaderboard loads	2026-04-28 10:49:52 -07:00
Vincent Koc	f373e4a710	fix: harden packaging and submissions	2026-04-28 01:17:43 -07:00
scoootscooob	11d943f21c	fix: preserve preset submission settings and lazy-load plots	2026-04-22 12:03:16 -07:00
pllm-uci	c209612d46	Add archive dynamics pipeline and audience-based model presets	2026-04-22 12:03:13 -07:00
pllm-uci	e2d82b34c3	Add local Ollama model support and configuration guidance to README and profiles	2026-04-15 11:45:12 -07:00
scoootscooob	380c6b4815	bench: audit contamination and harden HF leaderboard loading	2026-04-11 07:14:32 -07:00
Codex	07a20c3f18	HF Space: dynamic stats + fix leaderboard environment parsing. Two fixes for the HF Space UI: (1) Leaderboard crashed with "'str' object has no attribute 'get'" because upload_result() serializes BenchmarkResult.environment as str(result.environment) when pushing to the HF Dataset, but _flatten_result called .get() on it as if it were a dict. Defensive parse: accept dict, stringified dict, or JSON object. (2) Stats ribbon (Tasks/Tiers/Browser/Judge counts) was hardcoded to the v0.3 values (20/5/2/6). Replaced with _compute_stats(), which calls load_all_tasks() at startup and derives the numbers from the live task corpus, so the ribbon stays in sync with the tasks/ directory without manual edits. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 23:55:37 -07:00
Codex	b6e82d6afe	Space: harden theme construction against unsupported kwargs. The previous redesign passed radius_size/block_radius/shadow_drop/shadow_spread to themes.Base().set(), which are either constructor-only or version-specific, causing the HF Space to raise a runtime error at startup. Drop those kwargs and wrap the whole theme build in a try/except that falls back to plain Base(), so any future unknown kwarg degrades gracefully instead of crashing the Space. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 22:20:57 -07:00
Codex	aed7c37207	Space: redesign UI to match OpenClaw WebUI + ClawHub design system. Apply the shared OpenClaw aesthetic: dark backgrounds (#0e1015 layered), signature red accent (#ff5c5c), Inter + JetBrains Mono typography, whisper-thin color-mix borders, pill tab switcher, and rise animations. Replaces the default Gradio Base theme with custom CSS and theme tokens. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 22:10:27 -07:00
Codex	95027a561c	Space: default to hardware-tuned lane settings	2026-04-09 20:04:57 -07:00
Codex	81ba0cee3d	Queue: heartbeat and reclaim stale jobs	2026-04-09 15:45:17 -07:00
Codex	04fd29714a	HF: repair queue dataset persistence	2026-04-09 14:06:09 -07:00
Codex	a669cd2492	Docs: expand benchmark methodology and visuals	2026-04-09 12:20:48 -07:00
scoootscooob	2e39d5ccb2	Bench: redesign v0.4 benchmark and HF runtime	2026-04-09 11:15:30 -07:00
scoootscooob	a7ee76ed0f	Fix presets: only verified HF Inference models. Tested each via router.huggingface.co/v1/chat/completions: GLM 5.1 (754B), GLM 5, Qwen3 32B, DeepSeek R1, Kimi K2, MiniMax M2.5; Gemma 4 26B MoE, Llama 3.3/3.1 70B; Claude Sonnet/Opus 4.6 (via API key). Removed models that don't work on HF free tier.	2026-04-07 13:35:07 -07:00
scoootscooob	b675ccde98	Update presets to top 15 from OpenRouter: Claude, GPT-5.4, Gemini 3.1, Grok 4.20, Qwen 3.6, DeepSeek V3.1, Kimi K2.5, MiniMax M2.7, GLM 5.1, Gemma 4, Llama 3.3	2026-04-07 13:30:07 -07:00
scoootscooob	4765d6e5aa	Add preset models: Qwen 3.5, DeepSeek R1, Kimi K2.5, MiniMax M2.5, GLM-4/Z1, Gemma 4, Claude. Fix gateway: --allow-unconfigured + token auth for headless container; Fix client: use cli client ID/mode + full operator scopes; Add 11 preset models with Submit All button; Open-source models use HF Inference API (no extra keys needed).	2026-04-07 13:27:43 -07:00
scoootscooob	1df8c430f3	Initial ClawBench: three-axis agent harness benchmark. Environment state verification (filesystem, memory, gateway queries); Trajectory evaluation (precision/recall/F1 on tool call sequences); Simulated users (static, adaptive LLM, adversarial); pass^k reliability as primary metric; 14 tasks: 6 general, 5 OpenClaw, 3 adversarial; HF Docker Space with job queue and background eval worker; Gradio leaderboard with submission form.	2026-04-07 12:48:31 -07:00

20 Commits
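
Note on commit 07a20c3f18: the message describes a defensive parse that accepts a dict, a stringified dict, or a JSON object. The repository's actual `_flatten_result` code is not shown in this log, so the following is only a minimal sketch of that idea. The helper name `parse_environment` and the choice to return an empty dict on failure are assumptions made for illustration.

```python
import ast
import json
from typing import Any


def parse_environment(raw: Any) -> dict:
    """Best-effort conversion of a stored environment value to a dict.

    Hypothetical helper: returns {} when the value cannot be recovered.
    """
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        return {}
    text = raw.strip()
    # Try JSON first, then a Python-literal dict such as the output of
    # str(some_dict). ast.literal_eval only evaluates literals, not code.
    for loader in (json.loads, ast.literal_eval):
        try:
            value = loader(text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(value, dict):
            return value
    return {}
```

With a helper like this, a caller can use `parse_environment(row_value).get("os")` regardless of whether the stored value was a dict or its string form. Anything unparseable degrades to an empty dict instead of raising.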