Commit Graph

20 Commits

Author SHA1 Message Date
Vincent Koc
fb486a1ed3 fix(scoring): gate judge-weighted scores 2026-04-28 22:52:12 -07:00
Vincent Koc
0625ab7159 fix(runtime): harden queue and gateway lifecycle 2026-04-28 11:34:53 -07:00
Vincent Koc
38a2a0ff91 perf(app): cache leaderboard loads 2026-04-28 10:49:52 -07:00
Vincent Koc
f373e4a710 fix: harden packaging and submissions 2026-04-28 01:17:43 -07:00
scoootscooob
11d943f21c fix: preserve preset submission settings and lazy-load plots
2026-04-22 12:03:16 -07:00
pllm-uci
c209612d46 Add archive dynamics pipeline and audience-based model presets 2026-04-22 12:03:13 -07:00
pllm-uci
e2d82b34c3 Add local Ollama model support and configuration guidance to README and profiles 2026-04-15 11:45:12 -07:00
scoootscooob
380c6b4815 bench: audit contamination and harden HF leaderboard loading 2026-04-11 07:14:32 -07:00
Codex
07a20c3f18 HF Space: dynamic stats + fix leaderboard environment parsing
Two fixes for the HF Space UI:

1. Leaderboard crashed with "'str' object has no attribute 'get'"
   because upload_result() serializes BenchmarkResult.environment as
   str(result.environment) when pushing to the HF Dataset, but
   _flatten_result called .get() on it as if it were a dict.
   Defensive parse: accept dict, stringified dict, or JSON object.
2. Stats ribbon (Tasks/Tiers/Browser/Judge counts) was hardcoded to
   the v0.3 values (20/5/2/6). Replaced with _compute_stats() which
   calls load_all_tasks() at startup and derives the numbers from
   the live task corpus, so the ribbon stays in sync with the
   tasks/ directory without manual edits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:55:37 -07:00
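
The defensive parse described in item 1 of commit 07a20c3f18 could look roughly like the sketch below. The function name `parse_environment` and the choice to return an empty dict on failure are assumptions made for illustration; the actual `_flatten_result` code is not shown in this log.

```python
import ast
import json
from typing import Any


def parse_environment(raw: Any) -> dict:
    """Coerce an environment field into a dict; return {} if it cannot be parsed.

    Hypothetical helper: accepts a dict, a JSON object string, or the
    str() form of a Python dict (which uses single quotes).
    """
    if isinstance(raw, dict):
        return raw
    if not isinstance(raw, str):
        return {}
    text = raw.strip()
    # Try JSON first, then a Python literal such as the output of str(some_dict).
    for loader in (json.loads, ast.literal_eval):
        try:
            value = loader(text)
        except (ValueError, SyntaxError):
            continue
        if isinstance(value, dict):
            return value
    return {}


if __name__ == "__main__":
    print(parse_environment("{'os': 'linux', 'python': '3.12'}"))
    print(parse_environment('{"gpu": true}'))
    print(parse_environment("unparseable"))
```

`ast.literal_eval` is used rather than `eval` because it only evaluates literals, which matters when the string comes from an uploaded dataset.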
Codex
b6e82d6afe Space: harden theme construction against unsupported kwargs
The previous redesign passed radius_size/block_radius/shadow_drop/shadow_spread
to themes.Base().set(), but those kwargs are either constructor-only or
version-specific, so the HF Space raised a runtime error at startup. Drop those
kwargs and wrap the whole theme build in a try/except that falls back to plain
Base(), so any future unknown kwarg degrades gracefully instead of crashing the Space.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 22:20:57 -07:00
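
The fallback pattern described in commit b6e82d6afe can be illustrated without Gradio installed. In this sketch `StubTheme` is a stand-in class, not the Gradio API, and `build_theme` is a hypothetical name; the only point is that a failed custom build degrades to a plain default instead of raising at startup.

```python
class StubTheme:
    """Stand-in for a theme object whose setter rejects unknown kwargs."""

    ALLOWED = {"body_background_fill", "button_primary_background_fill"}

    def __init__(self) -> None:
        self.tokens: dict = {}

    def set(self, **kwargs: str) -> "StubTheme":
        unknown = set(kwargs) - self.ALLOWED
        if unknown:
            raise TypeError(f"unexpected keyword arguments: {sorted(unknown)}")
        self.tokens.update(kwargs)
        return self


def build_theme(**tokens: str) -> StubTheme:
    """Build a customised theme, falling back to the plain default on any error."""
    try:
        return StubTheme().set(**tokens)
    except Exception:
        # Any unsupported kwarg lands here; the app still gets a usable theme.
        return StubTheme()


if __name__ == "__main__":
    ok = build_theme(body_background_fill="#0e1015")
    degraded = build_theme(shadow_spread="2px")  # unsupported in the stub
    print(ok.tokens, degraded.tokens)
```

Catching a broad `Exception` is deliberate here, since the goal is to survive kwargs that differ between library versions; logging the exception before falling back would make the degradation visible.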
Codex
aed7c37207 Space: redesign UI to match OpenClaw WebUI + ClawHub design system
Apply the shared OpenClaw aesthetic — dark backgrounds (#0e1015 layered),
signature red accent (#ff5c5c), Inter + JetBrains Mono typography,
whisper-thin color-mix borders, pill tab switcher, and rise animations.
Replaces the default Gradio Base theme with custom CSS and theme tokens.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 22:10:27 -07:00
Codex
95027a561c Space: default to hardware-tuned lane settings 2026-04-09 20:04:57 -07:00
Codex
81ba0cee3d Queue: heartbeat and reclaim stale jobs 2026-04-09 15:45:17 -07:00
Codex
04fd29714a HF: repair queue dataset persistence 2026-04-09 14:06:09 -07:00
Codex
a669cd2492 Docs: expand benchmark methodology and visuals 2026-04-09 12:20:48 -07:00
scoootscooob
2e39d5ccb2 Bench: redesign v0.4 benchmark and HF runtime 2026-04-09 11:15:30 -07:00
scoootscooob
a7ee76ed0f Fix presets: only verified HF Inference models
Tested each via router.huggingface.co/v1/chat/completions:
- GLM 5.1 (754B), GLM 5, Qwen3 32B, DeepSeek R1, Kimi K2, MiniMax M2.5
- Gemma 4 26B MoE, Llama 3.3/3.1 70B
- Claude Sonnet/Opus 4.6 (via API key)
Removed models that don't work on HF free tier.
2026-04-07 13:35:07 -07:00
scoootscooob
b675ccde98 Update presets to top 15 from OpenRouter: Claude, GPT-5.4, Gemini 3.1, Grok 4.20, Qwen 3.6, DeepSeek V3.1, Kimi K2.5, MiniMax M2.7, GLM 5.1, Gemma 4, Llama 3.3 2026-04-07 13:30:07 -07:00
scoootscooob
4765d6e5aa Add preset models: Qwen 3.5, DeepSeek R1, Kimi K2.5, MiniMax M2.5, GLM-4/Z1, Gemma 4, Claude
- Fix gateway: --allow-unconfigured + token auth for headless container
- Fix client: use cli client ID/mode + full operator scopes
- Add 11 preset models with Submit All button
- Open-source models use HF Inference API (no extra keys needed)
2026-04-07 13:27:43 -07:00
scoootscooob
1df8c430f3 Initial ClawBench: three-axis agent harness benchmark
- Environment state verification (filesystem, memory, gateway queries)
- Trajectory evaluation (precision/recall/F1 on tool call sequences)
- Simulated users (static, adaptive LLM, adversarial)
- pass^k reliability as primary metric
- 14 tasks: 6 general, 5 OpenClaw, 3 adversarial
- HF Docker Space with job queue and background eval worker
- Gradio leaderboard with submission form
2026-04-07 12:48:31 -07:00
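
The trajectory and reliability bullets in commit 1df8c430f3 can be sketched as follows. Both functions are assumptions, since the log does not show ClawBench's scoring code: `trajectory_prf` matches tool names as multisets and ignores order, and `pass_hat_k` uses the combinatorial estimator C(c, k) / C(n, k) that is commonly used for pass^k, where n is the number of trials and c the number that passed.

```python
from collections import Counter
from math import comb


def trajectory_prf(expected: list[str], actual: list[str]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over tool names, matched as multisets."""
    # Counter intersection keeps the minimum count of each shared tool name.
    overlap = sum((Counter(expected) & Counter(actual)).values())
    precision = overlap / len(actual) if actual else 0.0
    recall = overlap / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that k trials drawn from n all pass, given c passes."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    # comb(c, k) is 0 when c < k, so the estimate is 0 in that case.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    expected = ["read", "edit", "test"]
    actual = ["read", "test", "deploy", "test"]
    print(trajectory_prf(expected, actual))
    print(pass_hat_k(n=4, c=3, k=2))
```

Unlike pass@k, which rewards at least one success in k attempts, this estimator drops quickly as k grows unless nearly every trial passes, which is why it suits a reliability metric.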