Found and fixed three blockers preventing the HF Space Docker container
from running the eval suite end-to-end, verified by building the image
locally with Docker Desktop and running a tier-1 task against Qwen3-32B
through the HF Inference API inside the container.
1. pytest was in [project.optional-dependencies].dev, not [project].dependencies.
The Dockerfile does `pip install .` which only installs runtime
deps, so every task whose completion verifier runs `pytest -q`
would fail with exit 127 (command not found). Moved pytest +
pytest-asyncio into the base dependencies so the container gets
them by default. The [dev] extra is kept as an alias for
existing `pip install .[dev]` invocations.
2. clawbench/tasks.py resolved TASKS_DIR via
`Path(__file__).parent.parent / "tasks"`, which works only for
source checkouts. When pip installs the package into
/usr/local/lib/python3.11/dist-packages/clawbench, the sibling
`tasks/` directory no longer exists at that path, so
`load_all_tasks()` returned empty and `clawbench run` died with
"No tasks to run". Added a fallback resolver that tries, in
order: $CLAWBENCH_TASKS_DIR env var, sibling-of-source,
Path.cwd() / "tasks", and known Docker layout candidates
(/home/node/app/tasks, /home/user/app/tasks, /app/tasks).
Verified inside the container that `TASKS_DIR` now resolves to
/home/node/app/tasks and load_all_tasks() returns 40 tasks.
3. Tier-1 task timeouts were 180s, which is enough for Qwen3-32B
(52.9s wall time) but too short for Llama-3.3-70B, which hit the
limit on t1-bugfix-discount. Raised tier-1 timeouts to 360s so
slower HF models can finish within the fixed timeout and produce a
capability signal instead of an infrastructure timeout signal.
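For reference, the lookup order described in item 2 can be sketched roughly as below. This is an illustrative sketch, not the code in clawbench/tasks.py: the function name `resolve_tasks_dir`, the `DOCKER_CANDIDATES` constant, and the `env`/`cwd` parameters are hypothetical, and only the environment variable name and the candidate paths come from this message.

```python
import os
from pathlib import Path

# Hypothetical constant: the Docker layout candidates listed in item 2.
DOCKER_CANDIDATES = (
    Path("/home/node/app/tasks"),
    Path("/home/user/app/tasks"),
    Path("/app/tasks"),
)


def resolve_tasks_dir(source_file, env=None, cwd=None):
    """Return the first existing tasks directory, or None if none exists.

    Lookup order: $CLAWBENCH_TASKS_DIR, sibling of the source tree,
    <cwd>/tasks, then the known Docker layout candidates.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)

    candidates = []
    override = env.get("CLAWBENCH_TASKS_DIR")
    if override:
        candidates.append(Path(override))
    # Works for source checkouts, where tasks/ sits next to the package.
    candidates.append(Path(source_file).resolve().parent.parent / "tasks")
    candidates.append(cwd / "tasks")
    candidates.extend(DOCKER_CANDIDATES)

    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Returning None rather than raising leaves it to the caller to decide how to report a missing directory; the real resolver may behave differently.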
Also fixed a pre-existing stale test: tests/test_tasks.py expected
20 tasks, but there have been 40 since the v0.5 corpus expansion,
so it was failing on every test run.
Verified inside the container image:
- `clawbench list-tasks` returns all 40 tasks
- sessions.create passes for all 11 preset models
(9 huggingface/* + 2 anthropic/*)
- `clawbench run --model huggingface/Qwen/Qwen3-32B
--task t1-bugfix-discount --runs 1`
scored 1.000 / C=1.000 T=1.000 B=1.000 in 52.9s
with 279,702 tokens captured.
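The checks above were run by hand. A small preflight along the following lines could catch the missing-pytest case from item 1 before a run starts; it is a sketch only, and `missing_executables` is a hypothetical helper that is not part of clawbench.

```python
import shutil


def missing_executables(names):
    """Return the names that do not resolve on PATH, in the order given."""
    # shutil.which performs the same PATH lookup a shell would, so a name
    # reported here is one that a shelled-out command would fail to find.
    return [name for name in names if shutil.which(name) is None]
```

Calling `missing_executables(["pytest"])` inside the container and failing fast on a non-empty result would surface the problem as a clear message rather than as exit 127 from each task verifier.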
Remaining architectural note (not a blocker): the CLI path
`clawbench run` assumes the gateway is already running. Only the
queue/worker path (`app.py` → `EvalWorker._ensure_gateway`) spawns
its own gateway. For HF Space deployment this is fine because all
user submissions go through the Gradio UI → queue → worker path;
local CLI invocations inside the container need to start the
gateway manually first.
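One way to make the CLI path fail early is to probe the gateway before running, sketched below. The helper name `gateway_reachable` is hypothetical, and the host and port are left as parameters because this message does not state where the gateway listens.

```python
import socket


def gateway_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is accepting connections at the address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A plain TCP probe only shows that something is listening; it does not confirm that the listener is actually the gateway.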
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
44 lines
1.1 KiB
TOML
[project]
name = "clawbench"
version = "0.4.0.dev1"
description = "Rigorous benchmark for AI models as OpenClaw agents"
readme = "README.md"
license = "MIT"
requires-python = ">=3.11"
dependencies = [
    "websockets>=13.0,<15",
    "pydantic>=2.7,<3",
    "pyyaml>=6.0,<7",
    "datasets>=3.0,<4",
    "gradio>=5.0,<6",
    "httpx>=0.27,<1",
    "numpy>=1.26,<3",
    "rich>=13.0,<14",
    "click>=8.1,<9",
    # Runtime deps for the task completion verifier. The harness shells out
    # to `pytest -q` / `pytest-asyncio` inside per-task workspaces as the
    # execution check; the container must have them in PATH.
    "pytest>=8.0,<9",
    "pytest-asyncio>=0.24,<1",
]

[project.optional-dependencies]
dev = [
    # Kept as an alias for historical `pip install .[dev]` invocations.
    # pytest + pytest-asyncio are now in the base [dependencies] since the
    # benchmark itself runs pytest in task workspaces.
    "pytest>=8.0,<9",
    "pytest-asyncio>=0.24,<1",
]

[project.scripts]
clawbench = "clawbench.cli:main"

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]