clawbench/pyproject.toml
Codex c24d982110 HF Space: fix container eval — pytest in runtime deps, TASKS_DIR resolver, timeouts
Found and fixed three blockers preventing the HF Space Docker container
from running the eval suite end-to-end, verified by building the image
locally with Docker Desktop and running a tier-1 task against Qwen3-32B
through the HF Inference API inside the container.

1. pytest was in [project.optional-dependencies].dev, not
   [project].dependencies. The Dockerfile runs `pip install .`, which
   installs only runtime deps, so every task whose completion verifier
   runs `pytest -q` failed with exit 127 (command not found). Moved
   pytest + pytest-asyncio into the base dependencies so the container
   gets them by default. The [dev] extra is kept as an alias for
   existing `pip install .[dev]` invocations.

2. clawbench/tasks.py resolved TASKS_DIR via
   `Path(__file__).parent.parent / "tasks"`, which works only for
   source checkouts. When pip installs the package into
   /usr/local/lib/python3.11/dist-packages/clawbench, the sibling
   `tasks/` directory no longer exists at that path, so
   `load_all_tasks()` returned empty and `clawbench run` died with
   "No tasks to run". Added a fallback resolver that tries, in
   order: $CLAWBENCH_TASKS_DIR env var, sibling-of-source,
   Path.cwd() / "tasks", and known Docker layout candidates
   (/home/node/app/tasks, /home/user/app/tasks, /app/tasks).
   Verified inside the container that `TASKS_DIR` now resolves to
   /home/node/app/tasks and load_all_tasks() returns 40 tasks.

3. Tier-1 task timeouts were 180s, which is enough for Qwen3-32B
   (52.9s wall time) but too short for Llama-3.3-70B, which timed out
   on t1-bugfix-discount. Raised tier-1 timeouts to 360s so slower HF
   models can finish within the timeout and produce a capability
   signal instead of an infrastructure timeout signal.

Also fixed a pre-existing stale test that was failing on every test
run: tests/test_tasks.py expected 20 tasks, but there have been 40
since the v0.5 corpus expansion.

Verified inside the container image:
- `clawbench list-tasks` returns all 40 tasks
- sessions.create passes for all 11 preset models
  (9 huggingface/* + 2 anthropic/*)
- `clawbench run --model huggingface/Qwen/Qwen3-32B
    --task t1-bugfix-discount --runs 1`
  scored 1.000 / C=1.000 T=1.000 B=1.000 in 52.9s
  with 279,702 tokens captured.

Remaining architectural note (not a blocker): the CLI path
`clawbench run` assumes the gateway is already running. Only the
queue/worker path (`app.py` → `EvalWorker._ensure_gateway`) spawns
its own gateway. For HF Space deployment this is fine because all
user submissions go through the Gradio UI → queue → worker path;
anyone invoking the CLI directly inside the container must start
the gateway manually first.
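
For local CLI use, a pre-flight check along these lines could confirm
the gateway is reachable before `clawbench run` is invoked. This is a
sketch only: it assumes the gateway accepts TCP connections, and the
helper name `wait_for_port` is hypothetical. This commit does not
state the gateway's host or port, so both are left as parameters.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connect to host:port succeeds, False on timeout.

    Hypothetical helper; it only checks that something is listening,
    not that the listener is actually the gateway.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait, then retry.
            time.sleep(interval)
    return False
```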

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 23:03:15 -07:00


[project]
name = "clawbench"
version = "0.4.0.dev1"
description = "Rigorous benchmark for AI models as OpenClaw agents"
readme = "README.md"
license = "MIT"
requires-python = ">=3.11"
dependencies = [
"websockets>=13.0,<15",
"pydantic>=2.7,<3",
"pyyaml>=6.0,<7",
"datasets>=3.0,<4",
"gradio>=5.0,<6",
"httpx>=0.27,<1",
"numpy>=1.26,<3",
"rich>=13.0,<14",
"click>=8.1,<9",
# Runtime deps for the task completion verifier. The harness shells out
# to `pytest -q` (with the pytest-asyncio plugin) inside per-task
# workspaces as the execution check, so the `pytest` executable must be
# on PATH in the container.
"pytest>=8.0,<9",
"pytest-asyncio>=0.24,<1",
]
[project.optional-dependencies]
dev = [
# Kept as an alias for historical `pip install .[dev]` invocations.
# pytest + pytest-asyncio are now in [project].dependencies since the
# benchmark itself runs pytest in task workspaces.
"pytest>=8.0,<9",
"pytest-asyncio>=0.24,<1",
]
[project.scripts]
clawbench = "clawbench.cli:main"
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.pytest.ini_options]
asyncio_mode = "auto"
testpaths = ["tests"]