Compare commits

3 Commits

Author SHA1 Message Date
Vincent Koc
69a2311681
fix: harden adapter workspace checks
Some checks failed
CI / Python ${{ matrix.python-version }} test suite (3.11) (push) Has been cancelled
CI / Python ${{ matrix.python-version }} test suite (3.12) (push) Has been cancelled
2026-04-29 13:53:44 -07:00
Vincent Koc
82eaadbc61
Merge remote-tracking branch 'origin/main' into pr17-nonrewrite
* origin/main:
  fix(worker): harden runtime result writes
  fix(client): clean pending rpc on send failure
  test: cover environment verifier success paths
  test: cover judge score gate propagation
  fix(scoring): gate judge-weighted scores
  fix(runtime): harden benchmark cache and task paths
  fix: flag credential file access in dangerous shell patterns (#6)
  fix: flag git push --force variants as dangerous shell commands (#5)
  chore: add open-source contribution scaffolding (#3)
  fix: strip quoted strings before checking for shell redirect operators (#2)
2026-04-29 13:52:41 -07:00
scoootscooob
30334cac88 feat: add adapter canonicalization layer
Some checks are pending
CI / Python ${{ matrix.python-version }} test suite (3.11) (push) Waiting to run
CI / Python ${{ matrix.python-version }} test suite (3.12) (push) Waiting to run
2026-04-29 11:15:11 -07:00
36 changed files with 92 additions and 2889 deletions

@@ -10,11 +10,6 @@ agent dotfiles, Docker, or a benchmark run that is too heavy for the local
machine. Keep normal unit-test iteration local unless the user asks for
Testbox proof.
Crabbox is the sibling lane for reusable owned-capacity proof. Use
`.agents/skills/crabbox/SKILL.md` and `.crabbox.yaml` when ClawBench needs
AWS-backed reusable boxes or Crabbox sync/log/result inspection. Keep this
skill focused on Blacksmith CI parity.
## Warmup
Run from the repository root:

@@ -1,122 +0,0 @@
---
name: crabbox
description: Use Crabbox for ClawBench remote Linux validation, warmed reusable boxes, GitHub Actions hydration, sync timing, logs, results, caches, and lease cleanup.
---
# Crabbox
Use Crabbox when ClawBench needs remote Linux proof on owned capacity, a large
runner class, reusable warm state, or a Blacksmith alternative.
## Before Running
- Run from the repo root. Crabbox sync mirrors the current checkout.
- Prefer local targeted tests for tight edit loops.
- Prefer Blacksmith Testbox when the task explicitly asks for Blacksmith or a
Blacksmith-specific CI comparison.
- Use Crabbox for broad ClawBench gates when owned AWS capacity is the right
remote lane.
- Check `.crabbox.yaml` for repo defaults before adding flags.
- Sanity-check the selected binary before remote work. Prefer the local
`openclaw/crabbox` checkout when present because the user PATH shim can be
stale: `command -v crabbox; ../crabbox/bin/crabbox --version`.
- Install with `brew install openclaw/tap/crabbox`; auth is required before use:
`crabbox login --url https://crabbox.openclaw.ai --provider aws`.
- On macOS the user config is `~/Library/Application Support/crabbox/config.yaml`;
it must include `broker.url`, `broker.token`, and usually `provider: aws`.
## ClawBench Flow
AWS/owned-capacity flow for Python tests:
```sh
crabbox warmup --class standard --idle-timeout 90m
crabbox actions hydrate --id <cbx_id-or-slug>
crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "python -m pytest -q"
```
For commands that need hydrated HF/provider credentials or agent dotfiles, use
the helper installed by the hydration workflow:
```sh
crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "clawbench-testbox-env python -m pytest -q"
crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "clawbench-testbox-env clawbench run --model anthropic/claude-sonnet-4-6 --adapter simulated"
```
Blacksmith-backed Crabbox flow can delegate setup to the existing Testbox
workflow:
```sh
crabbox run --provider blacksmith-testbox --blacksmith-org openclaw --blacksmith-workflow .github/workflows/ci-check-testbox.yml --blacksmith-job check --blacksmith-ref main --idle-timeout 90m --timing-json --shell -- "python -m pytest -q"
```
Stop boxes you created before handoff:
```sh
crabbox stop <cbx_id-or-slug>
```
## Owned AWS Capacity
When AWS capacity is under pressure, do not start with `class=beast`.
`beast` begins at 48xlarge instances and can burn 192 vCPU quota per request.
ClawBench's owned-cloud default is `standard`; escalate to `fast`, then
`large`, and only use `beast` when the work is explicitly CPU-bound and the
smaller class already failed the goal.
Keep capacity hints enabled so brokered AWS leases print selected
region/market, quota pressure, Spot fallback, and high-pressure class warnings.
The ClawBench repo config sets `capacity.hints: true`; use
`CRABBOX_CAPACITY_HINTS=0` only when debugging hint rendering itself.
Use `beast` only for exceptional lanes:
- full benchmark sweeps where wall time is dominated by CPU, not dependency
install or network;
- release/blocker validation where a maintainer explicitly asks for the largest
owned AWS class;
- performance profiling where the point is to compare high-core behavior.
Do not use `beast` for ordinary `python -m pytest -q`, docs-only work, small
task repros, Blacksmith outage triage, or focused lint/type/test checks. Those
should use `standard` first and `fast` only when the extra cores materially
help.
## Useful Commands
```sh
crabbox status --id <id-or-slug> --wait
crabbox inspect --id <id-or-slug> --json
crabbox sync-plan
crabbox history --lease <id-or-slug>
crabbox logs <run_id>
crabbox results <run_id>
crabbox cache stats --id <id-or-slug>
crabbox ssh --id <id-or-slug>
```
Use `--debug` on `run` when measuring sync timing.
Use `--timing-json` on warmup, hydrate, and run when comparing AWS and
blacksmith-testbox timings.
Use `--market spot|on-demand` on AWS warmup or one-shot run when testing quota
or capacity behavior without changing `.crabbox.yaml`.
## Hydration Boundary
`.github/workflows/crabbox-hydrate.yml` is repo-specific on purpose. It owns
ClawBench checkout, setup-python, pip install, provider/HF env hydration,
agent-dotfile restoration, ready marker, and keepalive. Crabbox owns runner
registration, workflow dispatch, SSH sync, command execution, logs/results,
local lease claims, and idle cleanup.
Do not add ClawBench-specific setup to Crabbox. Put repo setup in the hydration
workflow and generic lease/sync behavior in Crabbox.
## Cleanup
Crabbox has coordinator-owned idle expiry and local lease claims, so ClawBench
does not need a custom ledger. Default idle timeout is 30 minutes unless config
or flags set a different value. Still stop boxes you created when done.
If `crabbox list` prints `orphan=no-active-lease`, treat it as an operator
review hint; do not delete `keep=true` machines without checking provider and
coordinator state.
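
Reviewer sketch: the class-escalation rule in the removed skill text (start at `standard`, move to `fast`, then `large`, and reach `beast` only for explicitly CPU-bound work after a smaller class has failed) can be expressed as a small helper. This is an illustration only. `CLASS_LADDER` and `next_class` are hypothetical names and are not part of Crabbox or ClawBench.

```python
# Hypothetical helper; the names below are assumptions, not Crabbox APIs.
CLASS_LADDER = ["standard", "fast", "large", "beast"]


def next_class(current: str, *, cpu_bound: bool, smaller_class_failed: bool) -> str:
    """Return the next runner class to try, or `current` if escalation is not justified."""
    idx = CLASS_LADDER.index(current)
    if idx + 1 >= len(CLASS_LADDER):
        # Already on the largest class; nothing to escalate to.
        return current
    candidate = CLASS_LADDER[idx + 1]
    # `beast` is gated: the work must be CPU-bound and a smaller class must have failed.
    if candidate == "beast" and not (cpu_bound and smaller_class_failed):
        return current
    return candidate
```

The gate only applies to the last step, which matches the wording that `beast` is reserved for exceptional lanes.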

@@ -1,48 +0,0 @@
profile: clawbench-check
provider: aws
class: standard
capacity:
market: spot
strategy: most-available
fallback: on-demand-after-120s
hints: true
regions:
- eu-west-1
actions:
workflow: .github/workflows/crabbox-hydrate.yml
job: hydrate
ref: main
runnerLabels:
- crabbox
- clawbench
runnerVersion: latest
ephemeral: true
aws:
region: eu-west-1
rootGB: 400
sync:
delete: true
checksum: false
gitSeed: true
fingerprint: true
baseRef: main
exclude:
- .artifacts
- .codex
- .DS_Store
- .pytest_cache
- .ruff_cache
- .venv
- dist
- htmlcov
- playwright-report
- test-results
env:
allow:
- CI
- CLAWBENCH_*
- OPENCLAW_*
- PYTHON*
ssh:
user: crabbox
port: "2222"

.github/CODEOWNERS vendored
@@ -1 +0,0 @@
* @openclaw/openclaw-evals

@@ -29,22 +29,6 @@ It installs ClawBench, hydrates provider/HF secrets into
dotfiles from repo or org secrets, and installs
`~/.local/bin/clawbench-testbox-env` for commands that need that live auth.
## `crabbox-hydrate.yml` — Crabbox Actions hydration
This workflow exists for the Crabbox CLI from `openclaw/crabbox`:
```bash
crabbox warmup --idle-timeout 90m
crabbox actions hydrate --id <cbx_id-or-slug>
crabbox run --id <cbx_id-or-slug> --shell -- "python -m pytest -q"
```
It runs on the dynamic self-hosted runner label registered by Crabbox, installs
ClawBench, hydrates the same provider/HF secrets and agent dotfiles as the
Blacksmith Testbox workflow, writes the Crabbox ready marker under
`~/.crabbox/actions/`, and keeps the job alive for follow-up SSH sync/run
commands.
## `sync-to-hf-space.yml` — auto-mirror main to the HF Space
Mirrors every push to `main` into the HF Space git remote so

@@ -1,166 +0,0 @@
name: Crabbox Hydrate
on:
workflow_dispatch:
inputs:
crabbox_id:
description: "Crabbox lease ID"
required: true
type: string
ref:
description: "Git ref to hydrate"
required: false
type: string
crabbox_runner_label:
description: "Dynamic Crabbox runner label"
required: true
type: string
crabbox_job:
description: "Hydration job identifier expected by Crabbox"
required: false
default: "hydrate"
type: string
crabbox_keep_alive_minutes:
description: "Minutes to keep the hydrated job alive"
required: false
default: "90"
type: string
permissions:
contents: read
jobs:
hydrate:
name: hydrate
runs-on: [self-hosted, "${{ inputs.crabbox_runner_label }}"]
timeout-minutes: 120
steps:
- name: Checkout
uses: actions/checkout@v4
with:
ref: ${{ inputs.ref || github.ref }}
- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"
cache: pip
- name: Install project
run: |
python -m pip install --upgrade pip
python -m pip install -e .
- name: Prepare Crabbox shell
shell: bash
run: |
set -euo pipefail
git fetch --no-tags --depth=50 origin "+refs/heads/main:refs/remotes/origin/main"
python_dir="$(dirname "$(python -c 'import sys; print(sys.executable)')")"
sudo ln -sf "$python_dir/python" /usr/local/bin/python
sudo ln -sf "$python_dir/python" /usr/local/bin/python3
sudo ln -sf "$python_dir/pip" /usr/local/bin/pip
sudo ln -sf "$python_dir/pip" /usr/local/bin/pip3
sudo ln -sf "$python_dir/pytest" /usr/local/bin/pytest
- name: Hydrate Crabbox env helper
shell: bash
env:
HF_TOKEN: ${{ secrets.HF_TOKEN }}
HF_USERNAME: ${{ secrets.HF_USERNAME }}
CLAWBENCH_QUEUE_DATASET: ${{ vars.CLAWBENCH_QUEUE_DATASET || 'openclaw/clawbench-results' }}
CLAWBENCH_JUDGE_MODEL: ${{ vars.CLAWBENCH_JUDGE_MODEL }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
ANTHROPIC_API_KEY_OLD: ${{ secrets.ANTHROPIC_API_KEY_OLD }}
ANTHROPIC_API_TOKEN: ${{ secrets.ANTHROPIC_API_TOKEN }}
CEREBRAS_API_KEY: ${{ secrets.CEREBRAS_API_KEY }}
DEEPINFRA_API_KEY: ${{ secrets.DEEPINFRA_API_KEY }}
FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
KIMI_API_KEY: ${{ secrets.KIMI_API_KEY }}
MINIMAX_API_KEY: ${{ secrets.MINIMAX_API_KEY }}
MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
MOONSHOT_API_KEY: ${{ secrets.MOONSHOT_API_KEY }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
QWEN_API_KEY: ${{ secrets.QWEN_API_KEY }}
TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
XAI_API_KEY: ${{ secrets.XAI_API_KEY }}
ZAI_API_KEY: ${{ secrets.ZAI_API_KEY }}
Z_AI_API_KEY: ${{ secrets.Z_AI_API_KEY }}
OPENCLAW_CODEX_AUTH_JSON: ${{ secrets.OPENCLAW_CODEX_AUTH_JSON }}
OPENCLAW_CODEX_CONFIG_TOML: ${{ secrets.OPENCLAW_CODEX_CONFIG_TOML }}
OPENCLAW_CLAUDE_JSON: ${{ secrets.OPENCLAW_CLAUDE_JSON }}
OPENCLAW_CLAUDE_CREDENTIALS_JSON: ${{ secrets.OPENCLAW_CLAUDE_CREDENTIALS_JSON }}
OPENCLAW_CLAUDE_SETTINGS_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_JSON }}
OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON }}
OPENCLAW_GEMINI_SETTINGS_JSON: ${{ secrets.OPENCLAW_GEMINI_SETTINGS_JSON }}
CLAWBENCH_CODEX_AUTH_JSON: ${{ secrets.CLAWBENCH_CODEX_AUTH_JSON }}
CLAWBENCH_CODEX_CONFIG_TOML: ${{ secrets.CLAWBENCH_CODEX_CONFIG_TOML }}
CLAWBENCH_CLAUDE_JSON: ${{ secrets.CLAWBENCH_CLAUDE_JSON }}
CLAWBENCH_CLAUDE_CREDENTIALS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_CREDENTIALS_JSON }}
CLAWBENCH_CLAUDE_SETTINGS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_JSON }}
CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON }}
CLAWBENCH_GEMINI_SETTINGS_JSON: ${{ secrets.CLAWBENCH_GEMINI_SETTINGS_JSON }}
run: |
bash scripts/ci-hydrate-testbox-env.sh
sudo ln -sf "$HOME/.local/bin/clawbench-testbox-env" /usr/local/bin/clawbench-testbox-env
- name: Mark Crabbox ready
shell: bash
run: |
set -euo pipefail
job="${{ inputs.crabbox_job }}"
if [ -z "$job" ]; then job=hydrate; fi
mkdir -p "$HOME/.crabbox/actions"
state="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env"
env_file="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env.sh"
services_file="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.services"
write_export() {
key="$1"
value="${!key-}"
if [ -n "$value" ]; then
printf 'export %s=%q\n' "$key" "$value"
fi
}
{
for key in CI GITHUB_ACTIONS GITHUB_WORKSPACE GITHUB_REPOSITORY GITHUB_RUN_ID GITHUB_RUN_NUMBER GITHUB_RUN_ATTEMPT GITHUB_REF GITHUB_REF_NAME GITHUB_SHA GITHUB_EVENT_NAME GITHUB_ACTOR RUNNER_OS RUNNER_ARCH RUNNER_TEMP RUNNER_TOOL_CACHE; do
write_export "$key"
done
} > "${env_file}.tmp"
mv "${env_file}.tmp" "$env_file"
{
echo "# Docker containers visible from the hydrated runner"
docker ps --format '{{.Names}}\t{{.Image}}\t{{.Ports}}' 2>/dev/null || true
} > "${services_file}.tmp"
mv "${services_file}.tmp" "$services_file"
tmp="${state}.tmp"
{
echo "WORKSPACE=${GITHUB_WORKSPACE}"
echo "RUN_ID=${GITHUB_RUN_ID}"
echo "JOB=${job}"
echo "ENV_FILE=${env_file}"
echo "SERVICES_FILE=${services_file}"
echo "READY_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
} > "$tmp"
mv "$tmp" "$state"
- name: Keep Crabbox job alive
shell: bash
run: |
set -euo pipefail
minutes="${{ inputs.crabbox_keep_alive_minutes }}"
case "$minutes" in
''|*[!0-9]*) minutes=90 ;;
esac
stop="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.stop"
deadline=$(( $(date +%s) + minutes * 60 ))
while [ "$(date +%s)" -lt "$deadline" ]; do
if [ -f "$stop" ]; then
exit 0
fi
sleep 15
done
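
Reviewer sketch: the removed "Keep Crabbox job alive" step validates its minutes input (empty or non-numeric falls back to 90), then polls every 15 seconds for a stop file until a deadline. The same logic in Python might look like the following. The function names and the injectable `now`/`sleep` parameters are assumptions made for illustration and testing. They do not exist in the workflow.

```python
import time
from pathlib import Path


def parse_keep_alive_minutes(raw: str, default: int = 90) -> int:
    """Mirror the shell `case` guard: empty or any non-digit input uses the default."""
    if raw == "" or not (raw.isascii() and raw.isdigit()):
        return default
    return int(raw)


def keep_alive(
    stop_file: Path,
    minutes: int,
    poll_seconds: float = 15.0,
    now=time.monotonic,
    sleep=time.sleep,
) -> bool:
    """Poll until the deadline; return True if the stop file appeared first."""
    deadline = now() + minutes * 60
    while now() < deadline:
        if stop_file.is_file():
            return True
        sleep(poll_seconds)
    return False
```

Rejecting anything that is not plain ASCII digits keeps signs, decimals, and whitespace from reaching the deadline arithmetic, which is what the `''|*[!0-9]*` pattern does in the shell version.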

@@ -14,7 +14,7 @@ RUN apt-get update && \
RUN ln -s /app /openclaw
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
RUN cd /tmp && npx -y playwright@1.59.1 install --with-deps chromium && \
RUN npx -y playwright@1.59.1 install --with-deps chromium && \
CHROME_PATH="$(find /ms-playwright -path '*/chrome' -type f | sort | head -n 1)" && \
test -x "$CHROME_PATH" && \
ln -sf "$CHROME_PATH" /usr/bin/chromium
@@ -28,7 +28,6 @@ COPY --chown=node:node tasks-public/ tasks-public/
COPY --chown=node:node tasks-domain/ tasks-domain/
COPY --chown=node:node profiles/ profiles/
COPY --chown=node:node baselines/ baselines/
COPY --chown=node:node scripts/ scripts/
COPY --chown=node:node app.py .
RUN python3 -m pip install --break-system-packages --no-cache-dir .

@@ -461,26 +461,6 @@ python3 scripts/run_posterior_dynamics_pipeline.py \
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
```
### Running on Kubernetes
See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
version:
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
export CLAWBENCH_MODEL="openai/gpt-5.5"
# export MLFLOW_NAMESPACE="mlflow" # MLflow deploys in a separate namespace (default: mlflow)
./scripts/k8s/deploy.sh # deploys OpenClaw + MLflow + starts eval
./scripts/k8s/deploy.sh --logs # follow progress
./scripts/k8s/deploy.sh --teardown # tear down openclaw & eval (does not delete MLflow)
```
API keys are stored in a Kubernetes Secret created by the deploy script.
MLflow is deployed in its own namespace (default: `mlflow`, configurable via
`MLFLOW_NAMESPACE`).
---
## Partner Trace Spec

@@ -226,73 +226,14 @@ class GatewayClient:
attempt += 1
try:
remaining = max(1.0, deadline - asyncio.get_running_loop().time())
attempt_timeout = min(30.0, remaining)
self._ws = await websockets.connect(
self.config.url,
max_size=10 * 1024 * 1024,
open_timeout=attempt_timeout,
open_timeout=min(self.config.connect_timeout, remaining),
additional_headers={"Origin": host},
)
self._listen_task = asyncio.create_task(self._listener())
challenge = await self._wait_event(
"connect.challenge", timeout=attempt_timeout
)
challenge_payload = challenge.get("payload", {})
nonce = ""
if isinstance(challenge_payload, dict):
raw_nonce = challenge_payload.get("nonce", "")
if isinstance(raw_nonce, str):
nonce = raw_nonce.strip()
role = "operator"
scopes = [
"operator.admin",
"operator.read",
"operator.write",
"operator.approvals",
"operator.pairing",
]
client_info = {
"id": "openclaw-control-ui",
"version": __version__,
"platform": "linux",
"mode": "ui",
}
connect_params: dict[str, Any] = {
"minProtocol": PROTOCOL_VERSION,
"maxProtocol": PROTOCOL_VERSION,
"client": client_info,
"role": role,
"scopes": scopes,
"caps": [],
"commands": [],
"permissions": {},
"auth": {"token": self.config.token} if self.config.token else {},
}
device = _build_connect_device(
nonce=nonce,
token=self.config.token,
client_id=str(client_info["id"]),
client_mode=str(client_info["mode"]),
role=role,
scopes=scopes,
platform=str(client_info["platform"]),
)
if device:
connect_params["device"] = device
response = await self._rpc(
"connect",
connect_params,
timeout=attempt_timeout,
)
payload = response.get("payload", {})
if payload.get("type") != "hello-ok":
raise ConnectionError(f"Expected hello-ok, got: {payload}")
logger.info("Connected to gateway (protocol v%s)", payload.get("protocol", "?"))
return
break
except Exception as exc:
await self.close()
if not _is_transient_gateway_connect_error(exc):
raise
if asyncio.get_running_loop().time() >= deadline:
@@ -304,6 +245,60 @@ class GatewayClient:
delay,
)
await asyncio.sleep(delay)
self._listen_task = asyncio.create_task(self._listener())
challenge = await self._wait_event("connect.challenge", timeout=self.config.connect_timeout)
challenge_payload = challenge.get("payload", {})
nonce = ""
if isinstance(challenge_payload, dict):
raw_nonce = challenge_payload.get("nonce", "")
if isinstance(raw_nonce, str):
nonce = raw_nonce.strip()
role = "operator"
scopes = [
"operator.admin",
"operator.read",
"operator.write",
"operator.approvals",
"operator.pairing",
]
client_info = {
"id": "openclaw-control-ui",
"version": __version__,
"platform": "linux",
"mode": "ui",
}
connect_params: dict[str, Any] = {
"minProtocol": PROTOCOL_VERSION,
"maxProtocol": PROTOCOL_VERSION,
"client": client_info,
"role": role,
"scopes": scopes,
"caps": [],
"commands": [],
"permissions": {},
"auth": {"token": self.config.token} if self.config.token else {},
}
device = _build_connect_device(
nonce=nonce,
token=self.config.token,
client_id=str(client_info["id"]),
client_mode=str(client_info["mode"]),
role=role,
scopes=scopes,
platform=str(client_info["platform"]),
)
if device:
connect_params["device"] = device
response = await self._rpc(
"connect",
connect_params,
)
payload = response.get("payload", {})
if payload.get("type") != "hello-ok":
raise ConnectionError(f"Expected hello-ok, got: {payload}")
logger.info("Connected to gateway (protocol v%s)", payload.get("protocol", "?"))
async def close(self) -> None:
if self._listen_task and not self._listen_task.done():
@@ -399,15 +394,6 @@ class GatewayClient:
except Exception as exc:
logger.warning("Failed to delete session %s: %s", session_key, exc)
async def abort_session(self, session_key: str, *, run_id: str | None = None) -> None:
params: dict[str, Any] = {"key": session_key}
if run_id:
params["runId"] = run_id
try:
await self._rpc("sessions.abort", params, timeout=min(self.config.request_timeout, 10.0))
except Exception as exc:
logger.warning("Failed to abort session %s run %s: %s", session_key, run_id or "-", exc)
async def get_effective_tools(self, session_key: str) -> dict[str, Any]:
response = await self._rpc("tools.effective", {"sessionKey": session_key})
return response.get("payload", {})
@@ -427,27 +413,15 @@ class GatewayClient:
msg_queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue()
self._event_queues[chat_queue_key] = chat_queue
self._event_queues[msg_queue_key] = msg_queue
timeout_ms = max(1, min(int(timeout * 1000), 2_147_483_647))
send_response = await self._rpc(
await self._rpc(
"sessions.send",
{
"key": session_key,
"message": message,
"idempotencyKey": idempotency_key,
"timeoutMs": timeout_ms,
},
)
send_payload = send_response.get("payload", {})
run_id = idempotency_key
if isinstance(send_payload, dict):
raw_run_id = send_payload.get("runId")
if isinstance(raw_run_id, str) and raw_run_id.strip():
run_id = raw_run_id.strip()
wait_task = asyncio.create_task(
self._wait_for_agent_run(run_id, timeout_ms=timeout_ms)
)
collected_messages: list[TranscriptMessage] = []
done = False
@@ -456,31 +430,8 @@ class GatewayClient:
while not done:
remaining = deadline - asyncio.get_running_loop().time()
if remaining <= 0:
logger.warning(
"Timeout waiting for final state on session %s run %s",
session_key,
run_id,
)
logger.warning("Timeout waiting for final state on session %s", session_key)
break
if wait_task.done():
wait_payload = _task_result_or_empty(wait_task)
status = str(wait_payload.get("status", ""))
if status and status != "timeout":
logger.info(
"agent.wait observed terminal status for session %s run %s: %s",
session_key,
run_id,
status,
)
done = True
break
if status == "timeout":
logger.warning(
"agent.wait timed out for session %s run %s",
session_key,
run_id,
)
break
try:
event = await asyncio.wait_for(chat_queue.get(), timeout=min(0.5, remaining))
state = event.get("payload", {}).get("state", "")
@@ -489,9 +440,6 @@ class GatewayClient:
except asyncio.TimeoutError:
pass
if not done:
await self.abort_session(session_key, run_id=run_id)
collected_messages.extend(
await _drain_message_queue(
msg_queue,
@@ -516,30 +464,11 @@ class GatewayClient:
):
collected_messages = history_messages
finally:
if not wait_task.done():
wait_task.cancel()
try:
await wait_task
except asyncio.CancelledError:
pass
self._event_queues.pop(chat_queue_key, None)
self._event_queues.pop(msg_queue_key, None)
return _correlate_transcript(Transcript(messages=collected_messages))
async def _wait_for_agent_run(self, run_id: str, *, timeout_ms: int) -> dict[str, Any]:
try:
response = await self._rpc(
"agent.wait",
{"runId": run_id, "timeoutMs": timeout_ms},
timeout=(timeout_ms / 1000.0) + 10.0,
)
except Exception as exc:
logger.warning("agent.wait failed for run %s: %s", run_id, exc)
return {}
payload = response.get("payload", {})
return payload if isinstance(payload, dict) else {}
async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]:
try:
response = await self._rpc("sessions.get", {"key": session_key})
@@ -648,13 +577,6 @@ def _build_connect_device(
platform: str,
device_family: str | None = None,
) -> dict[str, Any] | None:
if os.environ.get("CLAWBENCH_DISABLE_GATEWAY_DEVICE_IDENTITY", "").strip().lower() in {
"1",
"true",
"yes",
"on",
}:
return None
if not nonce:
return None
@@ -724,10 +646,6 @@ def _resolve_node_executable() -> str | None:
def _is_transient_gateway_connect_error(exc: Exception) -> bool:
if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
return True
if isinstance(exc, websockets.exceptions.ConnectionClosed):
return True
if isinstance(exc, InvalidStatus):
return exc.response.status_code in {502, 503, 504}
if isinstance(exc, InvalidMessage):
@@ -743,13 +661,6 @@ def _describe_connect_error(exc: Exception) -> str:
return exc.__class__.__name__
def _task_result_or_empty(task: asyncio.Task[dict[str, Any]]) -> dict[str, Any]:
try:
return task.result()
except Exception:
return {}
def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | None:
role = message_data.get("role", "")
if not role:
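
Reviewer sketch: both versions of the connect loop above cap each attempt's timeout by the time left before an overall deadline and retry only on errors classified as transient. A minimal, library-free version of that pattern, assuming a caller-supplied `connect` coroutine function and `is_transient` predicate (both hypothetical stand-ins for `websockets.connect` and `_is_transient_gateway_connect_error`), could look like this:

```python
import asyncio


async def connect_with_deadline(
    connect,
    *,
    total_timeout: float,
    attempt_timeout: float,
    is_transient,
    delay: float = 0.5,
):
    """Retry `connect()` until it succeeds, a non-transient error occurs, or the deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_timeout
    while True:
        # Never give an attempt less than one second, as the diff's max(1.0, ...) does.
        remaining = max(1.0, deadline - loop.time())
        try:
            return await asyncio.wait_for(connect(), timeout=min(attempt_timeout, remaining))
        except Exception as exc:
            if not is_transient(exc):
                raise
            if loop.time() >= deadline:
                raise
            await asyncio.sleep(delay)
```

Taking the minimum of the per-attempt cap and the remaining budget keeps a slow final attempt from overshooting the overall deadline by a full attempt timeout.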

@@ -19,7 +19,6 @@ from rich.console import Console
from rich.table import Table
from clawbench import __version__
from clawbench.ablation import build_ablation_profile
from clawbench.client import GatewayClient, GatewayConfig
from clawbench.releases import compute_task_snapshot_fingerprint, load_active_release
from clawbench.schemas import (
@@ -87,9 +86,6 @@ class BenchmarkHarness:
browser_concurrency: int = 1,
adapter: str = "openclaw",
judge_affects_score: bool = False,
tool_profile_name: str | None = None,
enabled_toolsets: list[str] | None = None,
disabled_toolsets: list[str] | None = None,
) -> None:
self.gateway_config = gateway_config
self.model = model
@@ -115,9 +111,6 @@ class BenchmarkHarness:
self.concurrency = max(1, int(concurrency))
self.browser_concurrency = max(1, int(browser_concurrency))
self.adapter = adapter
self.tool_profile_name = tool_profile_name
self.enabled_toolsets = enabled_toolsets or []
self.disabled_toolsets = disabled_toolsets or []
self.repo_root = Path(__file__).parent.parent
self.last_task_runs: dict[str, list[TaskRunResult]] = {}
@@ -555,9 +548,6 @@ class BenchmarkHarness:
"prompt_variant": self.prompt_variant,
"judge_model": self.judge_model,
"judge_affects_score": self.judge_affects_score,
"tool_profile_name": self.tool_profile_name,
"enabled_toolsets": self.enabled_toolsets,
"disabled_toolsets": self.disabled_toolsets,
"benchmark_version": __version__,
"task_fingerprint": _task_definition_fingerprint(task),
}
@@ -763,15 +753,6 @@ class BenchmarkHarness:
for _ in range(count)
)
active_release = load_active_release()
ablation_profile = build_ablation_profile(
model=self.model,
adapter=self.adapter,
prompt_profile=self.prompt_variant,
harness_version=__version__,
tool_profile_name=self.tool_profile_name,
enabled_toolsets=self.enabled_toolsets,
disabled_toolsets=self.disabled_toolsets,
)
result = BenchmarkResult(
submission_id=str(uuid.uuid4()),
model=self.model,
@@ -789,7 +770,6 @@ class BenchmarkHarness:
"judge_model": self.judge_model,
"judge_affects_score": self.judge_affects_score,
"adapter": self.adapter,
"ablation_profile": ablation_profile.model_dump(),
"known_adapters": list(KNOWN_ADAPTERS),
"executable_adapters": sorted(EXECUTABLE_ADAPTERS),
"subsets": self.subsets,

@@ -28,14 +28,7 @@ logger = logging.getLogger(__name__)
HF_TOKEN = os.environ.get("HF_TOKEN", "")
# Local fallback when HF is unavailable
def _resolve_local_queue_dir() -> Path:
override = os.environ.get("CLAWBENCH_LOCAL_QUEUE_DIR", "").strip()
if override:
return Path(override).expanduser()
return Path("/data/queue") if Path("/data").exists() else Path("data/queue")
LOCAL_QUEUE_DIR = _resolve_local_queue_dir()
LOCAL_QUEUE_DIR = Path("/data/queue") if Path("/data").exists() else Path("data/queue")
class JobStatus(str, Enum):
@@ -57,7 +50,6 @@ class SubmissionRequest(BaseModel):
runs_per_task: int = Field(default=3, ge=1, le=10)
max_parallel_lanes: int = Field(default=1, ge=1, le=8)
tier: str | None = None # Filter to a specific tier
task_ids: list[str] = Field(default_factory=list)
scenario: str | None = None
prompt_variant: str = "clear"
submitter: str = "" # HF username
@@ -73,7 +65,6 @@ class SubmissionRequest(BaseModel):
"runs_per_task": self.runs_per_task,
"max_parallel_lanes": self.max_parallel_lanes,
"tier": self.tier or "",
"task_ids": sorted({task_id.strip() for task_id in self.task_ids if task_id.strip()}),
"scenario": self.scenario or "",
"prompt_variant": self.prompt_variant,
}
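
Reviewer sketch: the `task_ids` entry dropped from the dictionary above normalizes the list so that ordering, duplicates, and stray whitespace do not change the resulting value. Isolated as a standalone function (the name `normalize_task_ids` is hypothetical and not part of the module), the expression behaves as follows:

```python
def normalize_task_ids(task_ids: list[str]) -> list[str]:
    """Strip each ID, drop blanks, de-duplicate, and return a sorted list."""
    return sorted({task_id.strip() for task_id in task_ids if task_id.strip()})


if __name__ == "__main__":
    # Two requests that differ only in order and whitespace normalize identically.
    first = normalize_task_ids([" b ", "a", "b", ""])
    second = normalize_task_ids(["a", "b"])
    print(first == second)
```

Sorting the set gives a deterministic list, so two requests naming the same tasks produce the same dictionary value.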

@@ -34,13 +34,6 @@ STALE_EVALUATION_SECONDS = max(
JOB_HEARTBEAT_INTERVAL_SECONDS * 4,
int(os.environ.get("CLAWBENCH_STALE_EVALUATION_SECONDS", "1800")),
)
OPENCLAW_EVAL_EXEC_HOSTS = {"auto", "gateway", "sandbox", "node"}
OPENCLAW_EVAL_SYSTEM_PROMPT = (
"You are running an OpenClaw benchmark task. Complete the user's request in the current "
"workspace using the available tools when needed. For file, code, browser, shell, or memory "
"tasks, make the requested changes directly and verify them when practical. Do not ask "
"follow-up questions during the benchmark. Keep any final reply brief."
)
@dataclass
@@ -53,12 +46,6 @@ class ParallelLane:
state_dir: Path | None = None
log_path: Path | None = None
@property
def home_dir(self) -> Path | None:
if self.state_dir is None:
return None
return self.state_dir.parent / "home"
@property
def ws_url(self) -> str:
return f"ws://localhost:{self.port}"
@@ -315,7 +302,6 @@ class EvalWorker:
prompt_variant=job.request.prompt_variant,
prepare_run=prepare_run,
progress_callback=progress_callback,
tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
)
return await harness.run()
@@ -386,7 +372,6 @@ class EvalWorker:
tier=job.request.tier,
scenario=job.request.scenario,
prompt_variant=job.request.prompt_variant,
tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
)
return summary_harness.compose_result_from_task_stats(
ordered_stats,
@@ -400,8 +385,7 @@ class EvalWorker:
)
finally:
self._stop_parallel_gateways()
if os.environ.get("CLAWBENCH_KEEP_PARALLEL_LANE_ROOT", "").strip() != "1":
shutil.rmtree(job_root, ignore_errors=True)
shutil.rmtree(job_root, ignore_errors=True)
async def _run_parallel_lane(self, job, lane: ParallelLane, progress: JobProgressTracker):
gateway_cmd = self._find_gateway_cmd()
@@ -450,7 +434,6 @@ class EvalWorker:
progress_callback=progress_callback,
print_report=False,
quiet=True,
tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
)
result = await harness.run()
await self._sync_job_progress(job.job_id, progress.clear_lane(lane.index))
@@ -465,9 +448,6 @@ class EvalWorker:
return load_all_tasks(
tier=job.request.tier,
scenario=job.request.scenario,
task_ids=list(getattr(job.request, "task_ids", []) or None)
if getattr(job.request, "task_ids", None)
else None,
prompt_variant=job.request.prompt_variant,
)
@@ -527,36 +507,10 @@ class EvalWorker:
def _materialize_lane_runtime(self, lane: ParallelLane, job_root: Path) -> None:
lane_root = job_root / f"lane-{lane.index}"
lane.state_dir = lane_root / "state"
lane_home = lane.home_dir
if lane_home is not None:
(lane_home / ".config").mkdir(parents=True, exist_ok=True)
lane.log_path = lane_root / "gateway.log"
lane.port = GATEWAY_PORT + (lane.index * GATEWAY_PORT_SPACING)
self._seed_lane_state_dir(lane.state_dir)
def _run_lane_prepare_hook(self, lane: ParallelLane) -> None:
hook = os.environ.get("CLAWBENCH_LANE_PREPARE_CMD", "").strip()
if not hook:
return
if lane.state_dir is None:
raise RuntimeError(f"Lane {lane.index + 1} state dir missing before prepare hook")
lane_home = lane.home_dir
if lane_home is None:
raise RuntimeError(f"Lane {lane.index + 1} home dir missing before prepare hook")
(lane_home / ".config").mkdir(parents=True, exist_ok=True)
hook_env = {
**os.environ,
"HOME": str(lane_home),
"OPENCLAW_HOME": str(lane_home),
"OPENCLAW_STATE_DIR": str(lane.state_dir),
"OPENCLAW_CONFIG_PATH": str(lane.state_dir / "openclaw.json"),
"XDG_CONFIG_HOME": str(lane_home / ".config"),
"CLAWBENCH_LANE_INDEX": str(lane.index),
"CLAWBENCH_LANE_PORT": str(lane.port),
}
logger.info("Running lane %d prepare hook", lane.index + 1)
subprocess.run([hook], env=hook_env, check=True)
def _seed_lane_state_dir(self, target_state_dir: Path) -> None:
source_state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR", os.path.expanduser("~/.openclaw")))
shutil.rmtree(target_state_dir, ignore_errors=True)
@@ -675,19 +629,13 @@ class EvalWorker:
_set_nested(data, "browser.headless", True)
_set_nested(data, "browser.noSandbox", True)
_set_nested(data, "agents.defaults.skipBootstrap", True)
_set_nested(data, "tools.exec.host", self._openclaw_eval_exec_host())
_set_nested(data, "tools.exec.security", "full")
_set_nested(data, "tools.exec.ask", "off")
_set_nested(data, "approvals.exec.enabled", False)
if self._active_model:
_set_nested(data, "agents.defaults.model.primary", self._active_model)
_set_nested(data, "agents.defaults.subagents.model.primary", self._active_model)
self._apply_eval_model_defaults(data, self._active_model)
tmp_path = cfg_path.with_suffix(".json.tmp")
tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
tmp_path.replace(cfg_path)
self._write_eval_exec_approvals(lane_state_dir)
def _order_task_stats(self, tasks: list[TaskDefinition], combined_stats: list) -> list:
stats_by_id = {}
@ -782,7 +730,6 @@ class EvalWorker:
"token",
"--token",
gateway_token,
"--compact",
],
stdout=log_handle,
stderr=subprocess.STDOUT,
@ -821,12 +768,6 @@ class EvalWorker:
f"Gateway /health did not respond within {health_deadline_sec}s. Log:\n{self._read_gateway_log()}"
)
await self._wait_for_gateway_ready_marker(
process=self._gateway_process,
log_reader=lambda: self._read_gateway_log(limit=20_000),
description="Gateway",
)
# Phase B: control-plane probe with retries (see the parallel
# variant in _ensure_parallel_gateway for the detailed rationale).
gateway_config = GatewayConfig(url=GATEWAY_WS_URL, token=GATEWAY_TOKEN)
@ -876,30 +817,21 @@ class EvalWorker:
# Re-inject the host config's env + plugins before every restart.
if lane.state_dir is not None:
self._reinject_host_env_to_lane(lane.state_dir)
self._run_lane_prepare_hook(lane)
if lane.state_dir is None or lane.log_path is None:
raise RuntimeError(f"Lane {lane.index + 1} runtime was not materialized before gateway startup")
lane_home = lane.home_dir
if lane_home is None:
raise RuntimeError(f"Lane {lane.index + 1} home was not materialized before gateway startup")
(lane_home / ".config").mkdir(parents=True, exist_ok=True)
logger.info("Starting lane %d gateway on port %d", lane.index + 1, lane.port)
gateway_token = os.environ.get("OPENCLAW_GATEWAY_TOKEN", "clawbench-internal-token")
gateway_env = {
**os.environ,
"HOME": str(lane_home),
"OPENCLAW_HOME": str(lane_home),
"OPENCLAW_HOME": os.environ.get("OPENCLAW_HOME", os.path.expanduser("~")),
"OPENCLAW_STATE_DIR": str(lane.state_dir),
"OPENCLAW_CONFIG_PATH": str(lane.state_dir / "openclaw.json"),
"XDG_CONFIG_HOME": str(lane_home / ".config"),
"OPENCLAW_SKIP_GMAIL_WATCHER": "1",
"OPENCLAW_SKIP_CANVAS_HOST": "1",
"OPENCLAW_NO_RESPAWN": "1",
}
self._configure_browser_runtime(gateway_cmd, gateway_env)
lane.log_path.parent.mkdir(parents=True, exist_ok=True)
lane.log_path.write_text("", encoding="utf-8")
log_handle = lane.log_path.open("a", encoding="utf-8")
try:
process = subprocess.Popen(
@ -917,7 +849,6 @@ class EvalWorker:
"token",
"--token",
gateway_token,
"--compact",
],
stdout=log_handle,
stderr=subprocess.STDOUT,
@ -960,12 +891,6 @@ class EvalWorker:
f"Log:\n{self._read_parallel_gateway_log(lane)}"
)
await self._wait_for_gateway_ready_marker(
process=process,
log_reader=lambda: self._read_parallel_gateway_log(lane, limit=20_000),
description=f"Lane {lane.index + 1} gateway",
)
# Phase B: control-plane probe with explicit retries. A healthy
# /health response does not guarantee sessions.create works
# immediately — plugin registration races can leave the gateway
@ -1077,10 +1002,6 @@ class EvalWorker:
("agents.defaults.skipBootstrap", True),
("browser.headless", True),
("browser.noSandbox", True),
("tools.exec.host", self._openclaw_eval_exec_host()),
("tools.exec.security", "full"),
("tools.exec.ask", "off"),
("approvals.exec.enabled", False),
]
if self._active_model:
config_pairs.extend(
@ -1090,61 +1011,14 @@ class EvalWorker:
]
)
try:
state_dir = Path(
gateway_env.get("OPENCLAW_STATE_DIR")
or os.environ.get("OPENCLAW_STATE_DIR")
or os.path.expanduser("~/.openclaw")
)
config_path = Path(gateway_env.get("OPENCLAW_CONFIG_PATH") or (state_dir / "openclaw.json"))
self._patch_openclaw_config(config_pairs, config_path=config_path)
self._write_eval_exec_approvals(state_dir)
self._patch_openclaw_config(config_pairs)
except Exception as exc:
logger.warning("Direct openclaw.json patch failed: %s", exc)
@staticmethod
def _openclaw_eval_exec_host() -> str:
value = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
if value in OPENCLAW_EVAL_EXEC_HOSTS:
return value
logger.warning("Invalid OPENCLAW_EXEC_HOST=%r; using gateway", value)
return "gateway"
@staticmethod
def _write_eval_exec_approvals(state_dir: Path) -> None:
state_dir.mkdir(parents=True, exist_ok=True)
approvals_path = state_dir / "exec-approvals.json"
approvals = {
"version": 1,
"socket": {
"path": str(approvals_path.with_suffix(".sock")),
"token": "clawbench-eval-token",
},
"defaults": {
"security": "full",
"ask": "off",
"askFallback": "full",
},
"agents": {
"*": {
"security": "full",
"ask": "off",
"askFallback": "full",
}
},
}
tmp_path = approvals_path.with_suffix(".json.tmp")
tmp_path.write_text(json.dumps(approvals, indent=2), encoding="utf-8")
tmp_path.replace(approvals_path)
def _patch_openclaw_config(
self,
pairs: list[tuple[str, object]],
*,
config_path: Path | None = None,
) -> None:
if config_path is None:
state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR") or os.path.expanduser("~/.openclaw"))
config_path = state_dir / "openclaw.json"
def _patch_openclaw_config(pairs: list[tuple[str, object]]) -> None:
state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR") or os.path.expanduser("~/.openclaw"))
config_path = state_dir / "openclaw.json"
if not config_path.exists():
logger.warning("openclaw.json not found at %s; skipping direct patch", config_path)
return
@ -1160,50 +1034,12 @@ class EvalWorker:
if cursor.get(parts[-1]) != value:
cursor[parts[-1]] = value
changed = True
if self._active_model:
changed = self._apply_eval_model_defaults(data, self._active_model) or changed
if not changed:
return
tmp_path = config_path.with_suffix(".json.tmp")
tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
tmp_path.replace(config_path)
@staticmethod
def _apply_eval_model_defaults(data: dict, model: str) -> bool:
"""Force eval model parameters that keep benchmark turns low-latency."""
agents = data.setdefault("agents", {})
if not isinstance(agents, dict):
data["agents"] = agents = {}
defaults = agents.setdefault("defaults", {})
if not isinstance(defaults, dict):
agents["defaults"] = defaults = {}
models = defaults.setdefault("models", {})
if not isinstance(models, dict):
defaults["models"] = models = {}
entry = models.setdefault(model, {})
if not isinstance(entry, dict):
entry = {}
models[model] = entry
params = entry.setdefault("params", {})
if not isinstance(params, dict):
params = {}
entry["params"] = params
changed = False
if defaults.get("systemPromptOverride") != OPENCLAW_EVAL_SYSTEM_PROMPT:
defaults["systemPromptOverride"] = OPENCLAW_EVAL_SYSTEM_PROMPT
changed = True
if params.get("fastMode") is not True:
params["fastMode"] = True
changed = True
if model.startswith("openai/"):
if params.get("transport") != "sse":
params["transport"] = "sse"
changed = True
if params.get("openaiWsWarmup") is not False:
params["openaiWsWarmup"] = False
changed = True
return changed
def _find_gateway_cmd(self) -> list[str] | None:
import shutil
@ -1223,15 +1059,13 @@ class EvalWorker:
# Use a generous dedicated config for the probe. A healthy gateway
# usually responds to sessions.create in under a second, but plugin
# initialization (especially OpenRouter model list fetch) can add
# 10-30s after /health reports 200. On cold Docker lanes OpenClaw may
# also install provider runtime SDKs during the first sessions.create,
# so keep this bound configurable and separate from steady-state RPCs.
probe_timeout = float(os.environ.get("CLAWBENCH_GATEWAY_PROBE_TIMEOUT_SECONDS", "180"))
# 10-30s after /health reports 200. The 60s outer bound ensures we
# don't give up during a cold-start scenario.
probe_config = GatewayConfig(
url=gateway_config.url,
token=gateway_config.token,
connect_timeout=gateway_config.connect_timeout,
request_timeout=probe_timeout,
request_timeout=30.0,
)
async def _probe() -> None:
@ -1242,67 +1076,25 @@ class EvalWorker:
await client.delete_session(session_key)
try:
await asyncio.wait_for(_probe(), timeout=probe_timeout + 10.0)
await asyncio.wait_for(_probe(), timeout=60.0)
except asyncio.TimeoutError as exc:
raise RuntimeError(
f"Gateway control-plane probe timed out after {probe_timeout:.0f}s "
"Gateway control-plane probe timed out after 60s "
"(sessions.create hung on a freshly-started gateway); "
"lane will be retried by the queue."
) from exc
async def _wait_for_gateway_ready_marker(self, process: subprocess.Popen, log_reader, description: str) -> None:
# OpenClaw 2026.4.26 can answer /health before channels and sidecars
# finish startup. Probing sessions.create during that window can hold the
# session write lock for minutes. Some lane gateway modes do not emit
# the final ready marker, so wait for it briefly after sidecar startup
# and then let the bounded control-plane probe decide.
ready_deadline_sec = int(os.environ.get("CLAWBENCH_GATEWAY_READY_TIMEOUT_SECONDS", "420"))
marker_grace_sec = int(os.environ.get("CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS", "90"))
saw_sidecar_start = False
sidecar_start_elapsed: int | None = None
for elapsed in range(ready_deadline_sec):
if process.poll() is not None:
raise RuntimeError(
f"{description} exited with code {process.returncode}. Log:\n{log_reader()[-4_000:]}"
)
log_text = log_reader()
if "[gateway] ready" in log_text:
logger.info("%s ready after %ss", description, elapsed)
return
if "[gateway] starting channels and sidecars" in log_text:
saw_sidecar_start = True
if sidecar_start_elapsed is None:
sidecar_start_elapsed = elapsed
if sidecar_start_elapsed is not None and elapsed - sidecar_start_elapsed >= marker_grace_sec:
logger.info(
"%s did not emit ready marker %ss after sidecar startup; probing control plane",
description,
marker_grace_sec,
)
return
if not saw_sidecar_start and elapsed >= 15:
return
await asyncio.sleep(1)
logger.warning(
"%s did not log ready within %ss; probing control plane anyway. Log:\n%s",
description,
ready_deadline_sec,
log_reader()[-4_000:],
)
def _read_gateway_log(self, limit: int = 4_000) -> str:
def _read_gateway_log(self) -> str:
try:
return Path("/tmp/gateway.log").read_text(encoding="utf-8", errors="replace")[-limit:]
return Path("/tmp/gateway.log").read_text(encoding="utf-8", errors="replace")[-4_000:]
except Exception:
return "(no gateway log)"
def _read_parallel_gateway_log(self, lane: ParallelLane, limit: int = 4_000) -> str:
def _read_parallel_gateway_log(self, lane: ParallelLane) -> str:
if lane.log_path is None:
return "(no gateway log)"
try:
return lane.log_path.read_text(encoding="utf-8", errors="replace")[-limit:]
return lane.log_path.read_text(encoding="utf-8", errors="replace")[-4_000:]
except Exception:
return "(no gateway log)"

View File

@ -1,367 +0,0 @@
# Running ClawBench on Kubernetes
ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
connects to the gateway over loopback (`ws://localhost:18789`), runs the
19-task eval suite, and optionally logs results to MLflow.
```
┌─── OpenClaw Pod ─────────────────────────────┐
│ gateway container (ws://localhost:18789) │
│ clawbench sidecar ──► gateway via loopback │
└──────────────────────────────────────────────┘
│ │
▼ ▼
Model provider API MLflow (optional)
```
All commands use `scripts/k8s/deploy.sh`. The script has these modes:
| Flag | What it does |
|------|-------------|
| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
| `--openclaw-only` | Deploy OpenClaw gateway only |
| `--mlflow-only` | Deploy MLflow only |
| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
| `--remove-sidecar` | Remove clawbench sidecar |
| `--logs` | Tail sidecar logs |
| `--teardown` | Delete eval namespace (keeps MLflow) |
---
## Prerequisites
- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
- A container image for ClawBench (see [Building images](#building-images))
- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
For local testing with Kind:
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
---
## Environment variables
Set these **before** running `deploy.sh`.
### Required
| Variable | Purpose |
|----------|---------|
| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) |
### Optional
| Variable | Default | Purpose |
|----------|---------|---------|
| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway |
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set |
| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
| `GEMINI_API_KEY` | | Added to K8s secret if set |
| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
### Model routing
The gateway routes by provider prefix:
| Model string | Required variables |
|-------------|-------------------|
| `openai/gpt-5.5` | `OPENAI_API_KEY` |
| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
prefix for the model name:
```bash
export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
export OPENAI_API_KEY="none" # dummy value if the endpoint doesn't require auth
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
```
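The routing table above can be read as a lookup from provider prefix to required variables. The sketch below is only an illustration of that table, not the gateway's routing code: `REQUIRED_VARS` and `missing_variables` are hypothetical names, and the sketch does not check `OPENAI_API_BASE`, which only applies to OpenAI-compatible endpoints.

```python
import os
from typing import Mapping, Optional

# Hypothetical mapping copied from the routing table; not read from the gateway.
REQUIRED_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "openrouter": ["OPENROUTER_API_KEY"],
}


def missing_variables(model: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the variables the table requires for `model` that are unset or empty."""
    env = os.environ if env is None else env
    # The provider is everything before the first slash in the model string.
    provider = model.split("/", 1)[0]
    if provider not in REQUIRED_VARS:
        raise ValueError(f"no routing entry for provider prefix {provider!r}")
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]
```

Running a check like this before `deploy.sh` would surface a missing key early instead of at eval time.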
---
## Full deploy (quick start)
Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
# Export API keys before running. The script stores them in a K8s Secret
# ("clawbench-secrets") that the gateway and sidecar containers read.
export OPENAI_API_KEY="sk-..."
# Model to evaluate (default: openai/gpt-5.5)
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
./scripts/k8s/deploy.sh
```
Verify:
```bash
# Should show 2/2 containers (gateway + clawbench)
kubectl get pods -n clawbench-eval
# Follow eval progress
./scripts/k8s/deploy.sh --logs
```
When the eval finishes, copy results and clean up:
```bash
# Copy results from the sidecar
POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
# Remove the sidecar (keeps OpenClaw + MLflow running)
./scripts/k8s/deploy.sh --remove-sidecar
# Or tear down everything
./scripts/k8s/deploy.sh --teardown
```
---
## Existing cluster + existing MLflow
If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
required.
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
# API keys — export before running deploy.sh. The script creates a
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
# At least one provider key is required.
export OPENAI_API_KEY="sk-..."
# export ANTHROPIC_API_KEY="sk-ant-..."
# export OPENROUTER_API_KEY="sk-or-..."
# export GEMINI_API_KEY="..."
# Model to evaluate (default: openai/gpt-5.5)
export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
# If attaching to an existing OpenClaw gateway, this must match that gateway.
# If deploy.sh creates OpenClaw, it generates this token for you.
# export OPENCLAW_GATEWAY_TOKEN="..."
# Point to your existing MLflow
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5" # or use MLFLOW_EXPERIMENT_ID=42
# Deploy OpenClaw gateway into your cluster
./scripts/k8s/deploy.sh --openclaw-only
```
Verify OpenClaw is running:
```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx 1/1 Running
```
Then start the eval:
```bash
./scripts/k8s/deploy.sh --add-sidecar
./scripts/k8s/deploy.sh --logs
```
Because `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow
deployment and patches the experiment name/ID into the clawbench ConfigMap.
When the eval completes, `scripts/log_to_mlflow.py` logs results to your MLflow
instance under that experiment.
`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
---
## Step-by-step deploy
Use this when you want to deploy components individually or bring your own
OpenClaw/MLflow.
### Step 1: Deploy OpenClaw gateway
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..."
./scripts/k8s/deploy.sh --openclaw-only
```
Verify:
```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx 1/1 Running
```
This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
token and creates the `clawbench-secrets` Secret automatically.
**Skip this step** if you already have an OpenClaw deployment. Your existing
gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
```json
{
"browser": {
"enabled": true,
"headless": true,
"noSandbox": true,
"ssrfPolicy": {
"allowedHostnames": ["localhost", "127.0.0.1"]
}
},
"tools": {
"profile": "coding",
"alsoAllow": ["browser"]
}
}
```
Key requirements:
- `browser.enabled: true` — activates the bundled browser plugin
- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
- `browser.ssrfPolicy` — several eval tasks need localhost access
- Gateway must bind to loopback with token auth; export the matching
`OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar`
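For an existing gateway, the config requirements listed above can be checked mechanically. This is a minimal sketch under assumptions: `check_gateway_config` is a hypothetical helper, it takes the parsed gateway config as a plain dict, and it covers only the three config keys named in the list (not the loopback bind or token auth).

```python
def check_gateway_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    browser = config.get("browser", {})
    tools = config.get("tools", {})

    # browser.enabled activates the bundled browser plugin.
    if browser.get("enabled") is not True:
        problems.append("browser.enabled must be true")

    # The coding profile does not include browser unless it is listed here.
    if "browser" not in tools.get("alsoAllow", []):
        problems.append('tools.alsoAllow must include "browser"')

    # Several eval tasks need localhost access through the SSRF policy.
    allowed = browser.get("ssrfPolicy", {}).get("allowedHostnames", [])
    for host in ("localhost", "127.0.0.1"):
        if host not in allowed:
            problems.append(f"browser.ssrfPolicy.allowedHostnames is missing {host}")
    return problems
```

Feeding it the JSON shown in this step should return an empty list.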
### Step 2: Deploy MLflow
```bash
./scripts/k8s/deploy.sh --mlflow-only
```
Verify:
```bash
kubectl get pods -n mlflow
# Expect: mlflow-xxxx 1/1 Running
```
Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
namespace. The clawbench ConfigMap defaults to
`http://mlflow-service.mlflow.svc.cluster.local:5000`.
**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
```bash
export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
export MLFLOW_EXPERIMENT_ID=4 # or MLFLOW_EXPERIMENT_NAME
```
### Step 3: Run the eval
```bash
./scripts/k8s/deploy.sh --add-sidecar
```
This patches the OpenClaw deployment to inject a clawbench sidecar that:
1. Waits for the gateway (TCP check on port 18789, up to 3 min)
2. Checks MLflow connectivity if configured
3. Runs `clawbench run` with settings from the ConfigMap
4. Logs results to MLflow on success
5. Sleeps indefinitely so you can retrieve logs and results
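The first step, the TCP wait on the gateway port, can be sketched as follows. This is an illustration only: the sidecar's real entrypoint is not shown in this document, and `wait_for_port` with its parameters is a hypothetical helper. The 18789 port and 3-minute limit come from the list above.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 18789,
                  timeout_s: float = 180.0, interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening; close it right away.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: wait briefly and try again.
            time.sleep(interval_s)
    return False
```

A TCP check only proves the port is open; it does not prove the gateway's control plane is ready to create sessions.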
Verify:
```bash
kubectl get pods -n $CLAWBENCH_NAMESPACE
# Expect: openclaw-xxxx 2/2 Running (gateway + clawbench)
./scripts/k8s/deploy.sh --logs
# Should show "Waiting for gateway..." then "Starting eval..."
```
When finished, remove the sidecar:
```bash
./scripts/k8s/deploy.sh --remove-sidecar
```
---
## ConfigMap tuning
The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
behavior. Override at deploy time via env vars, or patch after deploy:
| Key | Default | What it controls |
|-----|---------|-----------------|
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) |
| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
| `CLAWBENCH_TASKS` | *(empty — runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
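To see how these keys combine, the sketch below computes the total number of runs from an environment mapping. It is a hypothetical helper (`plan_runs` is not part of ClawBench); the defaults and the 19-task suite size are taken from the table above, and the budget ceiling is simply the sum of per-run budgets, ignoring startup overhead and parallel lanes.

```python
import os
from typing import Mapping, Optional

SUITE_TASK_COUNT = 19  # size of the full suite, per the table above


def plan_runs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Summarize how many runs a given ConfigMap-style environment would trigger."""
    env = os.environ if env is None else env
    runs_per_task = int(env.get("CLAWBENCH_RUNS", "3"))
    budget_s = int(env.get("CLAWBENCH_PER_RUN_BUDGET_SECONDS", "600"))
    # CLAWBENCH_TASKS is space-separated; empty means the whole suite runs.
    selected = env.get("CLAWBENCH_TASKS", "").split()
    task_count = len(selected) or SUITE_TASK_COUNT
    total_runs = task_count * runs_per_task
    return {
        "task_count": task_count,
        "total_runs": total_runs,
        "budget_ceiling_s": total_runs * budget_s,
    }
```

With the defaults this gives 19 tasks and 57 runs, matching the `CLAWBENCH_RUNS` row.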
---
## MLflow integration
Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
**What gets logged:**
- **Params**: model, provider, benchmark version, OpenClaw version, judge model
- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
- **Tags**: submission ID, timestamp, certified flag
- **Artifacts**: full benchmark result JSON
---
## Building images
### ClawBench image
`quay.io/sallyom/clawbench:latest` is public.
For Kubernetes, use the lightweight sidecar image instead — it only includes
the eval harness and MLflow client:
```bash
docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
# For Kind clusters, load directly instead of pushing to a registry:
kind load docker-image clawbench:latest --name openclaw
# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
# Ensure you build for the right architecture, usually amd64 for non-local k8s
```
Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
---
## Cleanup
```bash
# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
./scripts/k8s/deploy.sh --remove-sidecar
# Delete eval namespace (keeps MLflow running)
./scripts/k8s/deploy.sh --teardown
# Delete the Kind cluster entirely
kind delete cluster --name openclaw
```

View File

@ -10,8 +10,7 @@ dependencies = [
"pydantic>=2.7,<3",
"pyyaml>=6.0,<7",
"datasets>=3.0,<4",
"gradio>=6.7.0,<7",
"pillow>=12.2.0,<13",
"gradio>=5.0,<6",
"httpx>=0.27,<1",
"numpy>=1.26,<3",
"rich>=13.0,<14",
@ -19,8 +18,8 @@ dependencies = [
# Runtime deps for the task completion verifier. The harness shells out
# to `pytest -q` / `pytest-asyncio` inside per-task workspaces as the
# execution check; the container must have them in PATH.
"pytest>=9.0.3,<10",
"pytest-asyncio>=1,<2",
"pytest>=8.0,<9",
"pytest-asyncio>=0.24,<1",
]
[project.optional-dependencies]
@ -28,14 +27,11 @@ dev = [
# Kept as an alias for historical `pip install .[dev]` invocations.
# pytest + pytest-asyncio are now in the base [dependencies] since the
# benchmark itself runs pytest in task workspaces.
"pytest>=9.0.3,<10",
"pytest-asyncio>=1,<2",
"pytest>=8.0,<9",
"pytest-asyncio>=0.24,<1",
"pre-commit>=4.0,<5",
"ruff>=0.9,<1",
]
mlflow = [
"mlflow>=2.10,<3",
]
hermes = [
"hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main",
]

View File

@ -1,198 +0,0 @@
#!/bin/bash
# Cherry-pick variant of container_sweep_single.sh: runs ONLY the tasks listed
# in $CHERRY_TASKS (comma-separated task IDs), with state-dir isolation.
#
# Required env vars:
# SWEEP_LABEL (e.g. opus47)
# SWEEP_MODEL (e.g. anthropic/claude-opus-4-7)
# SWEEP_PROFILE (absolute path in container)
# CHERRY_TASKS (comma-separated task IDs, e.g. "t2-ctx-pronoun-resolve,t3-fin-budget-monthly")
# Optional env vars:
# SWEEP_LOGDIR (default /data/drift_2026-04-20-cherry)
# SWEEP_OUT_TAG (default v2026-4-20-cherry)
set -u
: "${SWEEP_LABEL:?SWEEP_LABEL required}"
: "${SWEEP_MODEL:?SWEEP_MODEL required}"
: "${SWEEP_PROFILE:?SWEEP_PROFILE required}"
: "${CHERRY_TASKS:?CHERRY_TASKS required (comma-separated task IDs)}"
: "${SWEEP_LOGDIR:=/data/drift_2026-04-20-cherry}"
: "${SWEEP_OUT_TAG:=v2026-4-20-cherry}"
cd /data
LOGDIR="$SWEEP_LOGDIR"
mkdir -p "$LOGDIR"
export OPENCLAW_GATEWAY_TOKEN="local-dev-token-for-testing"
export CLAWBENCH_RUN_CACHE_DIR="/data/run_cache"
mkdir -p "$CLAWBENCH_RUN_CACHE_DIR"
export NODE_OPTIONS="--max-old-space-size=4096"
# OpenClaw 4.22+ has slower agents.create / sessions.create on cold start
# (we observed 72s for opus-4-7). Bump RPC timeouts so the harness doesn't
# cancel mid-flight. Override defaults of 30s / 60s respectively.
export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-120}"
export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
export CLAWBENCH_PER_RUN_BUDGET_SECONDS="${CLAWBENCH_PER_RUN_BUDGET_SECONDS:-900}"
export HERMES_STEP_TIMEOUT_SECONDS="${HERMES_STEP_TIMEOUT_SECONDS:-180}"
# State-dir isolation (same as container_sweep_single.sh)
SRC_STATE="/home/node/.openclaw"
FRESH_STATE="/tmp/openclaw-state-${SWEEP_LABEL}-$$"
echo "[state-isolate] cloning config from $SRC_STATE to $FRESH_STATE"
mkdir -p "$FRESH_STATE"
[ -f "$SRC_STATE/openclaw.json" ] && cp "$SRC_STATE/openclaw.json" "$FRESH_STATE/openclaw.json"
[ -f "$SRC_STATE/exec-approvals.json" ] && cp "$SRC_STATE/exec-approvals.json" "$FRESH_STATE/exec-approvals.json"
for d in identity devices tasks subagents flows cron; do
[ -d "$SRC_STATE/$d" ] && cp -r "$SRC_STATE/$d" "$FRESH_STATE/$d"
done
mkdir -p "$FRESH_STATE/agents" "$FRESH_STATE/workspace" "$FRESH_STATE/logs" "$FRESH_STATE/memory" "$FRESH_STATE/cache"
export OPENCLAW_STATE_DIR="$FRESH_STATE"
export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
echo "[state-isolate] OPENCLAW_STATE_DIR=$OPENCLAW_STATE_DIR"
python - <<'PY'
import json
import os
from pathlib import Path
cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
data = json.loads(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}
def set_nested(root, dotted, value):
cursor = root
parts = dotted.split(".")
for part in parts[:-1]:
child = cursor.get(part)
if not isinstance(child, dict):
child = {}
cursor[part] = child
cursor = child
cursor[parts[-1]] = value
exec_host = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
if exec_host not in {"auto", "gateway", "sandbox", "node"}:
raise SystemExit(f"invalid OPENCLAW_EXEC_HOST={exec_host!r}")
set_nested(data, "tools.exec.host", exec_host)
set_nested(data, "tools.exec.security", "full")
set_nested(data, "tools.exec.ask", "off")
set_nested(data, "approvals.exec.enabled", False)
cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
approvals_path = cfg_path.with_name("exec-approvals.json")
approvals = {
"version": 1,
"socket": {
"path": str(approvals_path.with_suffix(".sock")),
"token": "container-cherry-eval-token",
},
"defaults": {"security": "full", "ask": "off", "askFallback": "full"},
"agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
}
approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
PY
# Map model to cache subdir (for archiving)
case "$SWEEP_MODEL" in
anthropic/claude-opus-4-7) CACHE_SUB="anthropic_claude-opus-4-7" ;;
anthropic/claude-opus-4-6) CACHE_SUB="anthropic_claude-opus-4-6" ;;
anthropic/claude-sonnet-4-6) CACHE_SUB="anthropic_claude-sonnet-4-6" ;;
openai/gpt-5.5) CACHE_SUB="openai_gpt-5.5" ;;
openai/gpt-5.4) CACHE_SUB="openai_gpt-5.4" ;;
google/gemini-3.1-pro-preview) CACHE_SUB="google_gemini-3.1-pro-preview" ;;
openrouter/z-ai/glm-5.1) CACHE_SUB="openrouter_z-ai_glm-5.1" ;;
openrouter/qwen/qwen3.6-plus) CACHE_SUB="openrouter_qwen_qwen3.6-plus" ;;
openrouter/minimax/minimax-m2.7) CACHE_SUB="openrouter_minimax_minimax-m2.7" ;;
openrouter/moonshotai/kimi-k2.6) CACHE_SUB="openrouter_moonshotai_kimi-k2.6" ;;
openrouter/moonshotai/kimi-k2.5) CACHE_SUB="openrouter_moonshotai_kimi-k2.5" ;;
openrouter/deepseek/deepseek-v4-pro) CACHE_SUB="openrouter_deepseek_deepseek-v4-pro" ;;
deepseek/deepseek-v4-pro) CACHE_SUB="deepseek_deepseek-v4-pro" ;;
deepseek/v4-pro) CACHE_SUB="deepseek_v4-pro" ;;
*) CACHE_SUB="" ;;
esac
OUT="$LOGDIR/docker_${SWEEP_LABEL}_${SWEEP_OUT_TAG}.json"
LOG="$LOGDIR/docker_${SWEEP_LABEL}_${SWEEP_OUT_TAG}.log"
GWLOG="$LOGDIR/gateway_${SWEEP_LABEL}.log"
echo "===== CHERRY-PICK SWEEP $(date '+%Y-%m-%d %H:%M:%S') ====="
echo "label: $SWEEP_LABEL"
echo "model: $SWEEP_MODEL"
echo "tasks: $CHERRY_TASKS"
echo "out: $OUT"
# Force-clear this model's run_cache (including fixed-task slots — so they
# actually re-run against the new image instead of hitting old cache).
if [ -n "$CACHE_SUB" ] && [ -d "$CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB" ]; then
echo "clearing cache: $CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB"
rm -rf "$CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB"
fi
[ -f "$OUT" ] && rm -f "$OUT"
# Start gateway with bumped heap
echo "Starting gateway on :18789 (heap=4GB) ..."
openclaw gateway --port 18789 > "$GWLOG" 2>&1 &
GATEWAY_PID=$!
ready=0
for i in $(seq 1 120); do
if curl -sf -H "Authorization: Bearer $OPENCLAW_GATEWAY_TOKEN" http://127.0.0.1:18789/ready > /dev/null 2>&1; then
echo "Gateway ready after ${i}s"
ready=1
break
fi
sleep 1
done
if [ $ready -ne 1 ]; then
echo "ERROR: gateway failed to become ready within 120s"
tail -30 "$GWLOG"
exit 1
fi
# Build -t args from comma-separated list
TASK_ARGS=()
IFS=',' read -ra TASK_ARR <<< "$CHERRY_TASKS"
for t in "${TASK_ARR[@]}"; do
TASK_ARGS+=("-t" "$t")
done
echo "===== $(date '+%H:%M:%S') running clawbench with tasks: ${TASK_ARR[*]} ====="
# NOTE: --profile intentionally OMITTED. The legacy frontier_*.yaml profile
# format is incompatible with OpenClaw 4.22+ (loads n_tools_total=0,
# starves the agent of tools, all runs fail with environment_unavailable
# or timeout). Running with the default openclaw tool stack — same for
# all models, so the comparison stays apples-to-apples.
PROFILE_ARG=""
if [ -n "${USE_PROFILE:-}" ] && [ -f "$SWEEP_PROFILE" ]; then
PROFILE_ARG="--profile $SWEEP_PROFILE"
fi
clawbench run \
--model "$SWEEP_MODEL" \
--runs 3 \
--concurrency "${CLAWBENCH_CONCURRENCY:-1}" \
$PROFILE_ARG \
--judge-model "anthropic/claude-sonnet-4-6" \
"${TASK_ARGS[@]}" \
-o "$OUT" \
> "$LOG" 2>&1
status=$?
if [ $status -eq 0 ]; then
echo "===== $(date '+%H:%M:%S') done $SWEEP_LABEL (exit 0) ====="
else
echo "===== $(date '+%H:%M:%S') FAILED $SWEEP_LABEL (exit $status) ====="
tail -20 "$LOG"
fi
# Archive cache to v2026-4-20-cherry tag
# shellcheck disable=SC1091
source "$(dirname "$0")/_archive_cache.sh" 2>/dev/null && archive_run_cache || echo "[archive] helper missing"
kill $GATEWAY_PID 2>/dev/null
wait $GATEWAY_PID 2>/dev/null
# Clean up isolated state dir
[ -n "${FRESH_STATE:-}" ] && [ -d "$FRESH_STATE" ] && rm -rf "$FRESH_STATE"
exit $status

View File

@ -1,231 +0,0 @@
#!/bin/bash
# Run one OpenClaw model/profile through the HF-style isolated lane worker.
set -Eeuo pipefail
: "${SWEEP_MODEL:?SWEEP_MODEL required}"
: "${SWEEP_LABEL:?SWEEP_LABEL required}"
: "${SWEEP_OUT_TAG:=lane-container}"
: "${SWEEP_LANES:=3}"
: "${SWEEP_RUNS:=1}"
: "${SWEEP_LOGDIR:=/data/results}"
: "${CLAWBENCH_PER_RUN_BUDGET_SECONDS:=900}"
: "${CLAWBENCH_PER_TURN_TIMEOUT_SECONDS:=300}"
: "${OPENCLAW_EXEC_HOST:=gateway}"
cd /home/node/app
export CLAWBENCH_LOCAL_QUEUE_DIR="${CLAWBENCH_LOCAL_QUEUE_DIR:-/data/queue/$SWEEP_LABEL}"
mkdir -p "$SWEEP_LOGDIR" /data/results "$CLAWBENCH_LOCAL_QUEUE_DIR" /data/run_cache /data/lane_runtime
export HF_TOKEN=""
export OPENCLAW_GATEWAY_TOKEN="${OPENCLAW_GATEWAY_TOKEN:-local-dev-token-for-testing}"
export OPENCLAW_SKIP_GMAIL_WATCHER=1
export OPENCLAW_SKIP_CANVAS_HOST=1
export OPENCLAW_NO_RESPAWN=1
export CLAWBENCH_DISABLE_GATEWAY_DEVICE_IDENTITY=1
export CLAWBENCH_PER_RUN_BUDGET_SECONDS
export CLAWBENCH_PER_TURN_TIMEOUT_SECONDS
export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-180}"
export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
export CLAWBENCH_GATEWAY_HEALTH_TIMEOUT_SECONDS="${CLAWBENCH_GATEWAY_HEALTH_TIMEOUT_SECONDS:-240}"
export CLAWBENCH_LANE_STARTUP_STAGGER_SECONDS="${CLAWBENCH_LANE_STARTUP_STAGGER_SECONDS:-90}"
export CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS="${CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS:-90}"
export CLAWBENCH_KEEP_PARALLEL_LANE_ROOT="${CLAWBENCH_KEEP_PARALLEL_LANE_ROOT:-0}"
export CLAWBENCH_PARALLEL_LANE_ROOT="/data/lane_runtime/$SWEEP_LABEL"
export CLAWBENCH_TOOL_PROFILE_NAME="${CLAWBENCH_TOOL_PROFILE_NAME:-$SWEEP_LABEL}"
export NODE_OPTIONS="${NODE_OPTIONS:-"--max-old-space-size=4096"}"
if command -v npm >/dev/null 2>&1; then
export NODE_PATH="${NODE_PATH:-$(npm root -g 2>/dev/null || true)}"
fi
SRC_STATE="${OPENCLAW_CONFIG_SOURCE:-/config/openclaw}"
if [ ! -d "$SRC_STATE" ]; then
SRC_STATE="/home/node/.openclaw"
fi
safe_model="${SWEEP_MODEL//\//_}"
safe_model="${safe_model//:/_}"
OUT="$SWEEP_LOGDIR/${SWEEP_LABEL}_openclaw_${safe_model}_${SWEEP_OUT_TAG}.json"
LOG="$SWEEP_LOGDIR/${SWEEP_LABEL}_openclaw_${safe_model}_${SWEEP_OUT_TAG}.log"
export SWEEP_OUTPUT_PATH="$OUT"
FRESH_HOME="/tmp/openclaw-home-${SWEEP_LABEL}-$$"
FRESH_STATE="$FRESH_HOME/.openclaw"
rm -rf "$FRESH_HOME" "$CLAWBENCH_PARALLEL_LANE_ROOT"
mkdir -p "$FRESH_STATE" "$FRESH_HOME/.config"
if [ -f "$SRC_STATE/openclaw.json" ]; then
cp "$SRC_STATE/openclaw.json" "$FRESH_STATE/openclaw.json"
fi
if [ -d "$SRC_STATE/plugins" ]; then
mkdir -p "$FRESH_STATE/plugins"
cp -R "$SRC_STATE/plugins/." "$FRESH_STATE/plugins/" 2>/dev/null || true
fi
mkdir -p \
"$FRESH_STATE/agents" \
"$FRESH_STATE/workspace" \
"$FRESH_STATE/logs" \
"$FRESH_STATE/memory" \
"$FRESH_STATE/cache" \
"$FRESH_STATE/identity" \
"$FRESH_STATE/devices" \
"$FRESH_STATE/tasks" \
"$FRESH_STATE/subagents" \
"$FRESH_STATE/flows" \
"$FRESH_STATE/cron"
export HOME="$FRESH_HOME"
export OPENCLAW_HOME="$FRESH_HOME"
export OPENCLAW_STATE_DIR="$FRESH_STATE"
export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
export XDG_CONFIG_HOME="$FRESH_HOME/.config"
python - <<'PY'
import json
import os
from pathlib import Path
cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
if not cfg_path.exists():
    raise SystemExit("missing openclaw.json")
data = json.loads(cfg_path.read_text(encoding="utf-8"))
def set_nested(root, dotted, value):
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value
agents = data.setdefault("agents", {})
if isinstance(agents, dict):
    agents["list"] = []
channels = data.get("channels")
if isinstance(channels, dict):
    for channel in channels.values():
        if isinstance(channel, dict):
            channel["enabled"] = False
            exec_approvals = channel.get("execApprovals")
            if not isinstance(exec_approvals, dict):
                exec_approvals = {}
                channel["execApprovals"] = exec_approvals
            exec_approvals["enabled"] = False
plugins = data.setdefault("plugins", {})
stale = {"marxbiotech-git-tools", "lab"}
allow = plugins.get("allow")
if isinstance(allow, list):
    plugins["allow"] = [item for item in allow if item not in stale]
entries = plugins.get("entries")
if isinstance(entries, dict):
    for item in stale:
        entries.pop(item, None)
set_nested(data, "browser.headless", True)
set_nested(data, "browser.noSandbox", True)
set_nested(data, "gateway.reload.mode", "off")
set_nested(data, "agents.defaults.skipBootstrap", True)
set_nested(data, "agents.defaults.sandbox.mode", "off")
set_nested(data, "agents.defaults.model.primary", os.environ["SWEEP_MODEL"])
set_nested(data, "agents.defaults.subagents.model.primary", os.environ["SWEEP_MODEL"])
set_nested(
    data,
    "agents.defaults.systemPromptOverride",
    "You are running an OpenClaw benchmark task. Complete the user's request in the current "
    "workspace using the available tools when needed. For file, code, browser, shell, or memory "
    "tasks, make the requested changes directly and verify them when practical. Do not ask "
    "follow-up questions during the benchmark. Keep any final reply brief.",
)
set_nested(data, "tools.exec.host", os.environ.get("OPENCLAW_EXEC_HOST", "gateway"))
set_nested(data, "tools.exec.security", "full")
set_nested(data, "tools.exec.ask", "off")
set_nested(data, "approvals.exec.enabled", False)
models = data.setdefault("agents", {}).setdefault("defaults", {}).setdefault("models", {})
model_entry = models.setdefault(os.environ["SWEEP_MODEL"], {})
params = model_entry.setdefault("params", {})
params["fastMode"] = True
if os.environ["SWEEP_MODEL"].startswith("openai/"):
    params["transport"] = "sse"
    params["openaiWsWarmup"] = False
cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
approvals_path = cfg_path.with_name("exec-approvals.json")
approvals = {
    "version": 1,
    "socket": {
        "path": str(approvals_path.with_suffix(".sock")),
        "token": "container-lane-eval-token",
    },
    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
}
approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
PY
echo "===== CONTAINER LANE EVAL START $(date '+%Y-%m-%d %H:%M:%S') ====="
echo "label: $SWEEP_LABEL"
echo "model: $SWEEP_MODEL"
echo "runs: $SWEEP_RUNS"
echo "lanes: $SWEEP_LANES"
echo "tasks: ${SWEEP_TASKS:-${CHERRY_TASKS:-all}}"
echo "out: $OUT"
echo "log: $LOG"
echo "home: $HOME"
echo "state: $OPENCLAW_STATE_DIR"
openclaw --version 2>/dev/null || true
set +e
python - <<'PY' > "$LOG" 2>&1
import asyncio
import json
import logging
import os
import shutil
from pathlib import Path
from clawbench.queue import JobQueue, JobStatus, SubmissionRequest
from clawbench.worker import EvalWorker, RESULTS_DIR
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")
async def main() -> int:
    queue = JobQueue()
    queue._jobs.clear()
    queue._save_local()
    task_ids_raw = os.environ.get("SWEEP_TASKS") or os.environ.get("CHERRY_TASKS") or ""
    task_ids = [item.strip() for item in task_ids_raw.split(",") if item.strip()]
    request = SubmissionRequest(
        model=os.environ["SWEEP_MODEL"],
        runs_per_task=int(os.environ["SWEEP_RUNS"]),
        max_parallel_lanes=int(os.environ["SWEEP_LANES"]),
        task_ids=task_ids,
        prompt_variant=os.environ.get("SWEEP_PROMPT_VARIANT", "clear"),
        judge_model=os.environ.get("CLAWBENCH_JUDGE_MODEL", ""),
        notes=os.environ.get("SWEEP_LABEL", ""),
    )
    job = await queue.submit(request)
    worker = EvalWorker(queue)
    await worker._process_job(job)
    final = await queue.get_status(job.job_id)
    print(json.dumps(final.model_dump() if final else {}, indent=2), flush=True)
    if final is None or final.status != JobStatus.FINISHED or not final.result_id:
        return 1
    result_path = RESULTS_DIR / f"{final.result_id}.json"
    output_path = Path(os.environ["SWEEP_OUTPUT_PATH"])
    output_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(result_path, output_path)
    return 0
raise SystemExit(asyncio.run(main()))
PY
status=$?
set -e
echo "===== lane eval exit=$status $(date '+%Y-%m-%d %H:%M:%S') ====="
tail -120 "$LOG" 2>/dev/null || true
exit "$status"
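
The config-patching step in the script above depends on a small `set_nested` helper that walks a dotted key path and creates intermediate dictionaries as it goes. A minimal standalone sketch of that behavior follows. The helper body mirrors the one in the script; the sample `config` values are illustrative assumptions, not taken from a real `openclaw.json`.

```python
def set_nested(root, dotted, value):
    """Set root[a][b][c] = value for dotted == "a.b.c", creating dicts as needed."""
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            # A missing or non-dict intermediate is replaced with a fresh dict.
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value


# Hypothetical starting config, for illustration only.
config = {"tools": {"exec": "legacy-string"}, "browser": {"headless": False}}
set_nested(config, "tools.exec.host", "gateway")
set_nested(config, "browser.headless", True)
set_nested(config, "gateway.reload.mode", "off")
```

Note that a non-dict intermediate value (here the string under `tools.exec`) is silently overwritten, which suits a benchmark harness that must force a known configuration but would be risky for a general-purpose config editor.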


@ -43,13 +43,6 @@ mkdir -p "$CLAWBENCH_RUN_CACHE_DIR"
# OOM fix: give the gateway Node process a 4GB old-space ceiling instead of the default ~2GB.
# Scoped via env so we don't stomp on other Node processes (clawbench itself is python).
export NODE_OPTIONS="--max-old-space-size=4096"
# OpenClaw 4.22+ has slower agents.create / sessions.create on cold start
# (we observed 72s for opus-4-7). Bump RPC timeouts so the harness doesn't
# cancel mid-flight. Override defaults of 30s / 60s respectively.
export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-120}"
export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
export CLAWBENCH_PER_RUN_BUDGET_SECONDS="${CLAWBENCH_PER_RUN_BUDGET_SECONDS:-900}"
export HERMES_STEP_TIMEOUT_SECONDS="${HERMES_STEP_TIMEOUT_SECONDS:-180}"
# State-dir isolation: the shared /home/node/.openclaw mount accumulates cruft
# across sweeps (agents/, workspace/, logs/, memory/, stale openclaw.json.*.tmp)
@ -80,68 +73,23 @@ done
# Ensure runtime dirs exist but are empty
mkdir -p "$FRESH_STATE/agents" "$FRESH_STATE/workspace" "$FRESH_STATE/logs" "$FRESH_STATE/memory" "$FRESH_STATE/cache"
export OPENCLAW_STATE_DIR="$FRESH_STATE"
export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
echo "[state-isolate] OPENCLAW_STATE_DIR=$OPENCLAW_STATE_DIR"
du -sh "$FRESH_STATE" 2>/dev/null | sed 's/^/[state-isolate] size: /'
python - <<'PY'
import json
import os
from pathlib import Path
cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
data = json.loads(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}
def set_nested(root, dotted, value):
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value
exec_host = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
if exec_host not in {"auto", "gateway", "sandbox", "node"}:
    raise SystemExit(f"invalid OPENCLAW_EXEC_HOST={exec_host!r}")
set_nested(data, "tools.exec.host", exec_host)
set_nested(data, "tools.exec.security", "full")
set_nested(data, "tools.exec.ask", "off")
set_nested(data, "approvals.exec.enabled", False)
cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
approvals_path = cfg_path.with_name("exec-approvals.json")
approvals = {
    "version": 1,
    "socket": {
        "path": str(approvals_path.with_suffix(".sock")),
        "token": "container-single-eval-token",
    },
    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
}
approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
PY
# Map label -> cache subdir (matches what clawbench writes)
case "$SWEEP_MODEL" in
anthropic/claude-opus-4-7) CACHE_SUB="anthropic_claude-opus-4-7" ;;
anthropic/claude-sonnet-4-7) CACHE_SUB="anthropic_claude-sonnet-4-7" ;;
anthropic/claude-opus-4-6) CACHE_SUB="anthropic_claude-opus-4-6" ;;
anthropic/claude-sonnet-4-6) CACHE_SUB="anthropic_claude-sonnet-4-6" ;;
openai/gpt-5.5) CACHE_SUB="openai_gpt-5.5" ;;
openai/gpt-5.4) CACHE_SUB="openai_gpt-5.4" ;;
openai/gpt-5.2) CACHE_SUB="openai_gpt-5.2" ;;
google/gemini-3.1-pro-preview) CACHE_SUB="google_gemini-3.1-pro-preview" ;;
openrouter/z-ai/glm-5.1) CACHE_SUB="openrouter_z-ai_glm-5.1" ;;
openrouter/qwen/qwen3.6-plus) CACHE_SUB="openrouter_qwen_qwen3.6-plus" ;;
openrouter/minimax/minimax-m2.7) CACHE_SUB="openrouter_minimax_minimax-m2.7" ;;
openrouter/moonshotai/kimi-k2.6) CACHE_SUB="openrouter_moonshotai_kimi-k2.6" ;;
openrouter/moonshotai/kimi-k2.5) CACHE_SUB="openrouter_moonshotai_kimi-k2.5" ;;
deepseek/v4-pro) CACHE_SUB="deepseek_v4-pro" ;;
# kimi-k2.6 is not yet supported in the openclaw version under test — skip.
*) CACHE_SUB="" ;;
esac
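
Every entry in the `case` table above appears to be the model id with each `/` replaced by `_`, with unknown models falling through to an empty string. Assuming that pattern holds for the cache layout ClawBench writes, the mapping can be sketched as below. `KNOWN_MODELS` and `cache_subdir` are hypothetical names introduced for illustration, and the set lists only a subset of the models in the table.

```python
# Subset of the model ids from the case table (illustrative, not exhaustive).
KNOWN_MODELS = {
    "anthropic/claude-opus-4-7",
    "anthropic/claude-sonnet-4-6",
    "openai/gpt-5.5",
    "google/gemini-3.1-pro-preview",
    "openrouter/z-ai/glm-5.1",
    "deepseek/v4-pro",
}


def cache_subdir(model: str) -> str:
    """Return the cache subdirectory for a known model, or "" for unknown ones."""
    if model not in KNOWN_MODELS:
        # Mirrors the `*) CACHE_SUB=""` fallback branch.
        return ""
    return model.replace("/", "_")


known = cache_subdir("openrouter/z-ai/glm-5.1")
unknown = cache_subdir("example/unlisted-model")
```

Keeping an explicit allowlist, as the shell script does, means a mistyped model id yields an empty subdirectory rather than a silently created new cache path.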
@ -191,19 +139,11 @@ if [ $ready -ne 1 ]; then
fi
echo "===== $(date '+%H:%M:%S') starting $SWEEP_LABEL ($SWEEP_MODEL) ====="
# NOTE: --profile intentionally OMITTED unless USE_PROFILE=1 is set. The
# legacy frontier_*.yaml profile format is incompatible with OpenClaw
# 4.22+ (loads n_tools_total=0). Running with the default openclaw tool
# stack — identical across all models, so comparisons stay valid.
PROFILE_ARG=""
if [ -n "${USE_PROFILE:-}" ] && [ -f "$SWEEP_PROFILE" ]; then
PROFILE_ARG="--profile $SWEEP_PROFILE"
fi
clawbench run \
--model "$SWEEP_MODEL" \
--runs 3 \
--concurrency "${CLAWBENCH_CONCURRENCY:-1}" \
$PROFILE_ARG \
--concurrency 4 \
--profile "$SWEEP_PROFILE" \
--judge-model "anthropic/claude-sonnet-4-6" \
-o "$OUT" \
> "$LOG" 2>&1


@ -1,33 +0,0 @@
# Lightweight ClawBench image for Kubernetes sidecar use.
# Does NOT include the full OpenClaw server or Chromium — the gateway runs
# in a separate container. Node.js is copied from the OpenClaw image for
# the device-identity handshake required by the gateway protocol.
FROM ghcr.io/openclaw/openclaw:latest AS openclaw
FROM python:3.12-slim
COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node
RUN apt-get update && \
apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
COPY clawbench/ clawbench/
COPY tasks-public/ tasks-public/
COPY tasks-domain/ tasks-domain/
COPY profiles/ profiles/
COPY baselines/ baselines/
COPY scripts/ scripts/
RUN pip install --no-cache-dir ".[mlflow]"
RUN mkdir -p /results && chmod 777 /results
RUN useradd -m -d /home/node clawbench
USER clawbench
ENV HOME=/home/node
ENTRYPOINT ["clawbench"]


@ -1,486 +0,0 @@
#!/usr/bin/env bash
# Deploy ClawBench evals on Kubernetes (works on OpenShift too).
#
# 0-to-hero pipeline:
# Step 0: Create a cluster (see --help for Kind instructions)
# Step 1: Deploy OpenClaw gateway (optional — bring your own)
# Step 2: Deploy MLflow tracking server (optional — bring your own)
# Step 3: Run evals via sidecar (add / remove)
#
# Usage:
# ./scripts/k8s/deploy.sh # Full deploy: OpenClaw + MLflow + eval
# ./scripts/k8s/deploy.sh --openclaw-only # Step 1: deploy OpenClaw gateway
# ./scripts/k8s/deploy.sh --mlflow-only # Step 2: deploy MLflow
# ./scripts/k8s/deploy.sh --add-sidecar # Step 3: add eval sidecar (starts eval)
# ./scripts/k8s/deploy.sh --remove-sidecar # Step 3: remove eval sidecar
# ./scripts/k8s/deploy.sh --logs # Tail clawbench sidecar logs
# ./scripts/k8s/deploy.sh --teardown # Delete eval namespace (keeps MLflow)
#
# Environment (required):
# CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval
# OPENAI_API_KEY Model provider API key (or another provider key)
#
# Environment (optional):
# CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest)
# OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
# OPENCLAW_GATEWAY_TOKEN Existing gateway token (generated if unset)
# CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5)
# MLFLOW_NAMESPACE MLflow namespace (default: mlflow)
# MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy if set)
# MLFLOW_EXPERIMENT_ID MLflow experiment ID
# MLFLOW_EXPERIMENT_NAME MLflow experiment name
# MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
# ANTHROPIC_API_KEY Anthropic key (added to secret if set)
# OPENROUTER_API_KEY OpenRouter key (added to secret if set)
# GEMINI_API_KEY Gemini key (added to secret if set)
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
NS="${CLAWBENCH_NAMESPACE:-}"
MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}"
CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}"
OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}"
MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}"
# ---------------------------------------------------------------------------
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
cat <<'HELP'
ClawBench Kubernetes Deployment
===============================
0-to-hero pipeline for running ClawBench evals on Kubernetes.
Step 0: Create a cluster
For local testing with Kind, see:
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
Step 1: Deploy OpenClaw gateway (optional — skip if you have one)
Step 2: Deploy MLflow tracking server (optional — skip if you have one)
Step 3: Run evals via sidecar (add/remove to OpenClaw deployment)
Usage:
./scripts/k8s/deploy.sh Full deploy (steps 1+2+3)
./scripts/k8s/deploy.sh --openclaw-only Step 1: OpenClaw only
./scripts/k8s/deploy.sh --mlflow-only Step 2: MLflow only
./scripts/k8s/deploy.sh --add-sidecar Step 3: add eval sidecar (starts eval)
./scripts/k8s/deploy.sh --remove-sidecar Step 3: remove eval sidecar
./scripts/k8s/deploy.sh --logs Tail clawbench sidecar logs
./scripts/k8s/deploy.sh --teardown Delete eval namespace (keeps MLflow)
Required environment:
CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval
OPENAI_API_KEY Model provider API key (or ANTHROPIC_API_KEY, etc.)
Optional environment:
CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest)
OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
OPENCLAW_GATEWAY_TOKEN Existing gateway token (generated if unset)
CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5)
MLFLOW_NAMESPACE MLflow namespace (default: mlflow)
MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy)
MLFLOW_EXPERIMENT_ID MLflow experiment ID
MLFLOW_EXPERIMENT_NAME MLflow experiment name
MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
ANTHROPIC_API_KEY Anthropic key (added to secret if set)
OPENROUTER_API_KEY OpenRouter key (added to secret if set)
GEMINI_API_KEY Gemini key (added to secret if set)
Works on Kubernetes and OpenShift.
HELP
exit 0
fi
command -v kubectl &>/dev/null || { echo "Missing: kubectl" >&2; exit 1; }
if [[ -z "$NS" ]]; then
echo "CLAWBENCH_NAMESPACE is required." >&2
echo " export CLAWBENCH_NAMESPACE=clawbench-eval" >&2
exit 1
fi
MODE="full"
while [[ $# -gt 0 ]]; do
case "$1" in
--openclaw-only) MODE="openclaw-only" ;;
--mlflow-only) MODE="mlflow-only" ;;
--add-sidecar) MODE="add-sidecar" ;;
--remove-sidecar) MODE="remove-sidecar" ;;
--logs) MODE="logs" ;;
--teardown) MODE="teardown" ;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
shift
done
kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; }
# ---------------------------------------------------------------------------
# --logs
# ---------------------------------------------------------------------------
if [[ "$MODE" == "logs" ]]; then
kubectl logs deploy/openclaw -c clawbench -n "$NS" -f
exit 0
fi
# ---------------------------------------------------------------------------
# --teardown
# ---------------------------------------------------------------------------
if [[ "$MODE" == "teardown" ]]; then
echo "Deleting namespace '$NS'..."
kubectl delete namespace "$NS" --ignore-not-found
echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted."
exit 0
fi
# ---------------------------------------------------------------------------
# --remove-sidecar
# ---------------------------------------------------------------------------
if [[ "$MODE" == "remove-sidecar" ]]; then
echo "Removing clawbench sidecar from openclaw in namespace '$NS'..."
INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))")
if [[ "$INDEX" == "-1" ]]; then
echo "No clawbench sidecar found."
else
kubectl patch deploy/openclaw -n "$NS" --type=json \
-p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]"
echo "Sidecar removed."
fi
exit 0
fi
# ---------------------------------------------------------------------------
# Create namespace + secret
# ---------------------------------------------------------------------------
ensure_namespace_and_secret() {
if ! kubectl get namespace "$NS" &>/dev/null; then
echo "Creating namespace '$NS'..."
kubectl create namespace "$NS"
fi
if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then
echo "Creating clawbench-secrets..."
if [[ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]]; then
GATEWAY_TOKEN="$OPENCLAW_GATEWAY_TOKEN"
GATEWAY_TOKEN_SOURCE="from OPENCLAW_GATEWAY_TOKEN"
else
GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())")
GATEWAY_TOKEN_SOURCE="generated"
fi
SECRET_ARGS=(
--from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN"
)
[[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY")
[[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY")
[[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY")
[[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY")
if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then
echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2
fi
kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}"
echo " Gateway token: $GATEWAY_TOKEN_SOURCE"
[[ -n "${OPENAI_API_KEY:-}" ]] && echo " OPENAI_API_KEY: set"
[[ -n "${ANTHROPIC_API_KEY:-}" ]] && echo " ANTHROPIC_API_KEY: set"
[[ -n "${OPENROUTER_API_KEY:-}" ]] && echo " OPENROUTER_API_KEY: set"
[[ -n "${GEMINI_API_KEY:-}" ]] && echo " GEMINI_API_KEY: set"
else
echo "Secret clawbench-secrets already exists in '$NS'."
fi
return 0
}
# ---------------------------------------------------------------------------
# Step 1: Deploy OpenClaw
# ---------------------------------------------------------------------------
deploy_openclaw() {
echo ""
echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..."
kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS"
# Patch gateway config with custom OpenAI-compatible base URL
if [[ -n "${OPENAI_API_BASE:-}" ]]; then
echo " Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE"
EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}')
PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c "
import json, sys, os
cfg = json.load(sys.stdin)
openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {})
openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE']
openai_cfg.setdefault('models', [])
json.dump(cfg, sys.stdout, indent=2)
")
kubectl create configmap openclaw-config -n "$NS" \
--from-literal="openclaw.json=$PATCHED_JSON" \
--dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null
fi
kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS"
kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS"
# Apply the manifest once, then override the image only when a custom one is requested.
kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then
kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS"
fi
echo "Waiting for OpenClaw rollout..."
kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \
echo " (rollout still in progress)"
echo "OpenClaw deployed."
}
# ---------------------------------------------------------------------------
# Step 2: Deploy MLflow
# ---------------------------------------------------------------------------
deploy_mlflow() {
if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then
echo ""
echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)"
return
fi
echo ""
echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..."
if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then
kubectl create namespace "$MLFLOW_NS"
fi
kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS"
kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS"
if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then
kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS"
else
kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
fi
echo "Waiting for MLflow rollout..."
kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \
echo " (rollout still in progress)"
MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000"
echo "MLflow deployed: $MLFLOW_TRACKING_URI"
}
# ---------------------------------------------------------------------------
# Step 3: Add clawbench sidecar (starts eval)
# ---------------------------------------------------------------------------
add_sidecar() {
echo ""
echo "Step 3: Adding clawbench eval sidecar..."
echo "Applying clawbench ConfigMap..."
kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null
if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then
kubectl patch configmap clawbench-config -n "$NS" \
--type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null
echo " Model: $CLAWBENCH_MODEL"
fi
if [[ -n "${OPENAI_API_BASE:-}" ]]; then
kubectl patch configmap clawbench-config -n "$NS" \
--type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null
echo " OpenAI API base: $OPENAI_API_BASE"
fi
# Patch MLflow settings into ConfigMap
PATCH_DATA=""
MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}"
PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\""
if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then
PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\""
fi
if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then
PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\""
fi
kubectl patch configmap clawbench-config -n "$NS" \
--type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null
echo " MLflow URI: $MLFLOW_URI"
[[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo " MLflow experiment ID: $MLFLOW_EXPERIMENT_ID"
[[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo " MLflow experiment name: $MLFLOW_EXPERIMENT_NAME"
# Check if sidecar already exists
HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')")
if [[ "$HAS_SIDECAR" == "yes" ]]; then
echo "Removing existing clawbench sidecar..."
INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))")
kubectl patch deploy/openclaw -n "$NS" --type=json \
-p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null
fi
# Find the OpenClaw home volume, and capture existing volumes so add-sidecar
# also works with bring-your-own deployments that lack this repo's PVC layout.
VOLUME_INFO=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "
import json, sys
spec = json.load(sys.stdin)['spec']['template']['spec']
volume_names = [v.get('name') for v in spec.get('volumes', []) if v.get('name')]
home_volume = 'openclaw-home'
for c in spec['containers']:
    if c['name'] == 'gateway':
        for vm in c.get('volumeMounts', []):
            if vm['mountPath'] == '/home/node/.openclaw':
                home_volume = vm['name']
                break
print(json.dumps({
    'home_volume': home_volume,
    'volumes_present': 'volumes' in spec,
    'volume_names': volume_names,
}))
")
echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..."
PATCH=$(VOLUME_INFO="$VOLUME_INFO" CLAWBENCH_IMG="$CLAWBENCH_IMG" python3 - <<'PY'
import json
import os
info = json.loads(os.environ["VOLUME_INFO"])
home_volume = info["home_volume"]
command = r"""echo "Waiting for gateway on localhost:18789..."
for i in $(seq 1 90); do
python3 -c "import socket; s=socket.create_connection((\"127.0.0.1\",18789),2); s.close()" 2>/dev/null && echo "Gateway ready" && break
sleep 2
done
if [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
echo "Checking MLflow at ${MLFLOW_TRACKING_URI}..."
python3 -c "import httpx,os; r=httpx.get(os.environ[\"MLFLOW_TRACKING_URI\"]+\"/health\"); print(\"MLflow OK:\",r.status_code)" 2>&1 || echo "MLflow pre-check failed (will retry at log time)"
fi
echo "Starting eval..."
clawbench run \
--model "${CLAWBENCH_MODEL}" \
--gateway-token "${OPENCLAW_GATEWAY_TOKEN}" \
--runs "${CLAWBENCH_RUNS}" \
--concurrency "${CLAWBENCH_CONCURRENCY}" \
${CLAWBENCH_JUDGE_MODEL:+--judge-model "${CLAWBENCH_JUDGE_MODEL}"} \
$([ -n "${CLAWBENCH_TASKS:-}" ] && for t in ${CLAWBENCH_TASKS}; do printf -- "-t %s " "$t"; done) \
-o /results/benchmark.json
RC=$?
if [ $RC -eq 0 ] && [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
python scripts/log_to_mlflow.py /results/benchmark.json
fi
echo "ClawBench finished (exit=$RC)"
sleep infinity"""
container = {
"name": "clawbench",
"image": os.environ["CLAWBENCH_IMG"],
"imagePullPolicy": "IfNotPresent",
"command": ["/bin/bash", "-c", command],
"envFrom": [{"configMapRef": {"name": "clawbench-config"}}],
"env": [
{
"name": "OPENCLAW_GATEWAY_TOKEN",
"valueFrom": {
"secretKeyRef": {
"name": "clawbench-secrets",
"key": "OPENCLAW_GATEWAY_TOKEN",
}
},
}
],
"resources": {
"requests": {"memory": "1Gi", "cpu": "500m"},
"limits": {"memory": "4Gi", "cpu": "2"},
},
"volumeMounts": [
{"name": home_volume, "mountPath": "/home/node/.openclaw"},
{"name": "clawbench-results", "mountPath": "/results"},
{"name": "tmp-volume", "mountPath": "/tmp"},
],
"securityContext": {
"allowPrivilegeEscalation": False,
"capabilities": {"drop": ["ALL"]},
},
}
patch = [{"op": "add", "path": "/spec/template/spec/containers/-", "value": container}]
existing_volumes = set(info["volume_names"])
required_volumes = [
{"name": home_volume, "emptyDir": {}},
{"name": "clawbench-results", "emptyDir": {}},
{"name": "tmp-volume", "emptyDir": {}},
]
missing_volumes = []
for volume in required_volumes:
    if volume["name"] not in existing_volumes and volume["name"] not in {
        item["name"] for item in missing_volumes
    }:
        missing_volumes.append(volume)
if missing_volumes:
    if info["volumes_present"]:
        patch.extend(
            {"op": "add", "path": "/spec/template/spec/volumes/-", "value": volume}
            for volume in missing_volumes
        )
    else:
        patch.append(
            {"op": "add", "path": "/spec/template/spec/volumes", "value": missing_volumes}
        )
print(json.dumps(patch))
PY
)
kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null
echo ""
echo "Waiting for rollout..."
kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \
echo " (rollout timeout — eval runs for 30-60 min)"
echo ""
echo "Eval is running. Follow logs with:"
echo " ./scripts/k8s/deploy.sh --logs"
echo ""
echo "When finished, remove the sidecar with:"
echo " ./scripts/k8s/deploy.sh --remove-sidecar"
}
# ---------------------------------------------------------------------------
# Execute
# ---------------------------------------------------------------------------
case "$MODE" in
full)
ensure_namespace_and_secret
deploy_openclaw
deploy_mlflow
add_sidecar
;;
openclaw-only)
ensure_namespace_and_secret
deploy_openclaw
echo ""
echo "OpenClaw is running. Next steps:"
echo " ./scripts/k8s/deploy.sh --mlflow-only # Deploy MLflow"
echo " ./scripts/k8s/deploy.sh --add-sidecar # Start eval"
;;
mlflow-only)
deploy_mlflow
;;
add-sidecar)
if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then
echo "Deployment 'openclaw' not found in namespace '$NS'." >&2
echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2
exit 1
fi
ensure_namespace_and_secret
add_sidecar
;;
esac
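
The `--remove-sidecar` path in the script above looks up the index of the `clawbench` container and then issues a JSON Patch `remove` operation against that index. The sketch below isolates that step so it can be exercised without a cluster. `build_remove_patch` is a hypothetical helper name, and the `deployment` dictionary is a hand-written stand-in for the output of `kubectl get deploy/openclaw -o json`.

```python
import json


def build_remove_patch(deployment: dict, name: str = "clawbench") -> list:
    """Return a JSON Patch that removes the named container, or [] if absent."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    # Same lookup as the script: first matching index, or -1 when not found.
    index = next((i for i, c in enumerate(containers) if c["name"] == name), -1)
    if index == -1:
        return []
    return [{"op": "remove", "path": f"/spec/template/spec/containers/{index}"}]


# Minimal stand-in for a Deployment object (illustrative only).
deployment = {
    "spec": {
        "template": {
            "spec": {"containers": [{"name": "gateway"}, {"name": "clawbench"}]}
        }
    }
}
patch = build_remove_patch(deployment)
print(json.dumps(patch))
```

Because JSON Patch addresses array elements by position, the index has to be recomputed immediately before each patch; a stale index could remove the gateway container instead.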


@ -1,18 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: clawbench-config
  labels:
    app: clawbench
data:
  CLAWBENCH_MODEL: "openai/gpt-5.5"
  OPENAI_API_BASE: ""
  CLAWBENCH_RUNS: "3"
  CLAWBENCH_CONCURRENCY: "4"
  CLAWBENCH_JUDGE_MODEL: ""
  CLAWBENCH_TASKS: ""
  CLAWBENCH_CONNECT_TIMEOUT: "120"
  CLAWBENCH_REQUEST_TIMEOUT: "300"
  CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600"
  MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000"
  MLFLOW_EXPERIMENT_NAME: "clawbench"
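
The sidecar command in `scripts/k8s/deploy.sh` turns these ConfigMap values into `clawbench run` flags: the judge model is passed only when non-empty, and `CLAWBENCH_TASKS` is split on whitespace into repeated `-t` arguments. The following sketch reproduces that expansion as a plain function. `build_clawbench_argv` is a hypothetical name, and the token and task ids in the example are placeholders.

```python
def build_clawbench_argv(env: dict) -> list:
    """Build the `clawbench run` argument list from ConfigMap-style settings."""
    argv = [
        "clawbench", "run",
        "--model", env["CLAWBENCH_MODEL"],
        "--gateway-token", env["OPENCLAW_GATEWAY_TOKEN"],
        "--runs", env["CLAWBENCH_RUNS"],
        "--concurrency", env["CLAWBENCH_CONCURRENCY"],
    ]
    # An empty judge model means the flag is omitted entirely.
    if env.get("CLAWBENCH_JUDGE_MODEL"):
        argv += ["--judge-model", env["CLAWBENCH_JUDGE_MODEL"]]
    # Tasks are whitespace-separated; each becomes its own -t flag.
    for task in env.get("CLAWBENCH_TASKS", "").split():
        argv += ["-t", task]
    argv += ["-o", "/results/benchmark.json"]
    return argv


example_env = {
    "CLAWBENCH_MODEL": "openai/gpt-5.5",
    "OPENCLAW_GATEWAY_TOKEN": "placeholder-token",
    "CLAWBENCH_RUNS": "3",
    "CLAWBENCH_CONCURRENCY": "4",
    "CLAWBENCH_JUDGE_MODEL": "",
    "CLAWBENCH_TASKS": "task-a task-b",
}
argv = build_clawbench_argv(example_env)
```

With the defaults shown in the ConfigMap (`CLAWBENCH_JUDGE_MODEL` and `CLAWBENCH_TASKS` both empty), neither `--judge-model` nor any `-t` flag is emitted.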


@ -1,15 +0,0 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: clawbench
type: Opaque
stringData:
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"


@ -1,68 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    app: mlflow
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.21.3
          command:
            - mlflow
            - server
            - --host
            - "0.0.0.0"
            - --port
            - "5000"
            - --backend-store-uri
            - sqlite:///mlflow/mlflow.db
            - --default-artifact-root
            - /mlflow/artifacts
            - --serve-artifacts
          ports:
            - name: http
              containerPort: 5000
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: mlflow-data
              mountPath: /mlflow
      volumes:
        - name: mlflow-data
          persistentVolumeClaim:
            claimName: mlflow-data-pvc


@ -1,12 +0,0 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data-pvc
  labels:
    app: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi


@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  labels:
    app: mlflow
spec:
  type: ClusterIP
  selector:
    app: mlflow
  ports:
    - name: http
      port: 5000
      targetPort: 5000
      protocol: TCP


@ -1,36 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config
  labels:
    app: openclaw
data:
  openclaw.json: |
    {
      "gateway": {
        "mode": "local",
        "bind": "loopback",
        "port": 18789,
        "auth": {
          "mode": "token"
        }
      },
      "browser": {
        "enabled": true,
        "headless": true,
        "noSandbox": true,
        "ssrfPolicy": {
          "allowedHostnames": ["localhost", "127.0.0.1"]
        }
      },
      "tools": {
        "profile": "coding",
        "alsoAllow": ["browser"]
      },
      "agents": {
        "defaults": {
          "workspace": "~/.openclaw/workspace"
        }
      },
      "cron": { "enabled": false }
    }

@ -1,146 +0,0 @@
# OpenClaw gateway deployment for ClawBench evals.
#
# Build the image with browser support:
# docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \
# -t quay.io/yourorg/openclaw:eval .
#
# Or use upstream without browser (browser eval tasks will score 0):
# image: ghcr.io/openclaw/openclaw:latest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      initContainers:
        - name: init-config
          image: registry.access.redhat.com/ubi9-minimal:latest
          command:
            - sh
            - -c
            - |
              cp /config/openclaw.json /home/node/.openclaw/openclaw.json
              chmod 666 /home/node/.openclaw/openclaw.json
              mkdir -p /home/node/.openclaw/workspace
              mkdir -p /home/node/.openclaw/agents
              chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents
              echo "Config initialized"
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: config-template
              mountPath: /config
          resources:
            limits:
              cpu: 200m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
      containers:
        - name: gateway
          image: ghcr.io/openclaw/openclaw:latest
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured
          env:
            - name: HOME
              value: /home/node
            - name: NODE_ENV
              value: production
            - name: OPENCLAW_CONFIG_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_STATE_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_GATEWAY_TOKEN
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENCLAW_GATEWAY_TOKEN
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENAI_API_KEY
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: ANTHROPIC_API_KEY
                  optional: true
            - name: OPENROUTER_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENROUTER_API_KEY
                  optional: true
            - name: GEMINI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: GEMINI_API_KEY
                  optional: true
          ports:
            - name: gateway
              containerPort: 18789
              protocol: TCP
          livenessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 250m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: tmp-volume
              mountPath: /tmp
      terminationGracePeriodSeconds: 30
      volumes:
        - name: openclaw-home
          persistentVolumeClaim:
            claimName: openclaw-home-pvc
        - name: config-template
          configMap:
            name: openclaw-config
        - name: tmp-volume
          emptyDir: {}

@ -1,12 +0,0 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-home-pvc
  labels:
    app: openclaw
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

@ -1,17 +0,0 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: openclaw
type: Opaque
stringData:
  OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME"
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"
  # GEMINI_API_KEY: "REPLACE_ME"

@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  type: ClusterIP
  selector:
    app: openclaw
  ports:
    - name: gateway
      port: 18789
      targetPort: 18789
      protocol: TCP

@ -1,125 +0,0 @@
#!/usr/bin/env python3
"""Log a ClawBench BenchmarkResult to MLflow.

Standalone script -- not imported by the clawbench package.
Requires: pip install mlflow (or pip install clawbench[mlflow])

Usage:
    python scripts/log_to_mlflow.py /results/benchmark.json

Environment:
    MLFLOW_TRACKING_URI       MLflow tracking server (default: http://localhost:5000)
    MLFLOW_EXPERIMENT_NAME    Experiment name (default: clawbench)
"""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))


def main(result_path: str) -> None:
    try:
        import mlflow
    except ImportError:
        print(
            "mlflow is not installed. Install with: pip install mlflow"
            " (or pip install clawbench[mlflow])",
            file=sys.stderr,
        )
        sys.exit(1)

    from clawbench.schemas import BenchmarkResult

    with open(result_path, encoding="utf-8") as f:
        result = BenchmarkResult(**json.load(f))

    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
    if experiment_id:
        experiment = mlflow.set_experiment(experiment_id=experiment_id)
    else:
        experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench"))

    run_name = f"{result.model}-{result.submission_id[:8]}"

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(
            {
                "model": result.model,
                "provider": result.provider,
                "benchmark_version": result.benchmark_version,
                "openclaw_version": result.openclaw_version or "unknown",
                "judge_model": result.judge_model or "none",
                "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown",
            }
        )

        mlflow.log_metrics(
            {
                "overall_score": result.overall_score,
                "overall_completion": result.overall_completion,
                "overall_trajectory": result.overall_trajectory,
                "overall_behavior": result.overall_behavior,
                "overall_reliability": result.overall_reliability,
                "overall_pass_hat_k": result.overall_pass_hat_k,
                "overall_judge_score": result.overall_judge_score,
                "overall_judge_confidence": result.overall_judge_confidence,
                "overall_judge_pass_rate": result.overall_judge_pass_rate,
                "judge_task_coverage": result.judge_task_coverage,
                "overall_weighted_query_score": result.overall_weighted_query_score,
                "overall_median_latency_ms": result.overall_median_latency_ms,
                "overall_p95_latency_ms": result.overall_p95_latency_ms,
                "overall_total_tokens": result.overall_total_tokens,
                "overall_cost_usd": result.overall_cost_usd,
                "overall_tokens_per_pass": result.overall_tokens_per_pass,
                "overall_cost_per_pass": result.overall_cost_per_pass,
                "overall_ci_lower": result.overall_ci_lower,
                "overall_ci_upper": result.overall_ci_upper,
            }
        )

        for tier in result.tier_results:
            mlflow.log_metrics(
                {
                    f"{tier.tier}/score": tier.mean_task_score,
                    f"{tier.tier}/completion": tier.mean_completion,
                    f"{tier.tier}/trajectory": tier.mean_trajectory,
                    f"{tier.tier}/behavior": tier.mean_behavior,
                    f"{tier.tier}/reliability": tier.mean_reliability,
                }
            )

        for i, task in enumerate(result.task_results):
            mlflow.log_metrics(
                {
                    f"task/{task.task_id}/score": task.mean_task_score,
                    f"task/{task.task_id}/reliability": task.reliability_score,
                },
                step=i,
            )

        mlflow.set_tags(
            {
                "submission_id": result.submission_id,
                "timestamp": result.timestamp,
                "certified": str(result.certified),
            }
        )

        try:
            mlflow.log_artifact(result_path)
        except Exception as e:
            print(f"Warning: artifact upload failed: {e}", file=sys.stderr)
            print("Metrics and params were logged successfully.", file=sys.stderr)

    print(f"Logged to MLflow: experiment={experiment.name} run={run_name}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <result.json>", file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])

@ -5,23 +5,13 @@ from __future__ import annotations
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import unquote, urlsplit
ROOT = Path(__file__).parent / "articles"
ARTICLES = {path.stem: path for path in ROOT.glob("*.html") if path.is_file()}
def article_for_request_path(request_path: str) -> Path | None:
path = unquote(urlsplit(request_path).path)
if not path.startswith("/article/"):
return None
slug = path.removeprefix("/article/")
return ARTICLES.get(slug)
class Handler(BaseHTTPRequestHandler):
def do_GET(self) -> None: # noqa: N802
path = unquote(urlsplit(self.path).path)
path = self.path.split("?")[0]
if path == "/health":
self.send_response(200)
self.send_header("Content-Type", "application/json")
@ -32,8 +22,9 @@ class Handler(BaseHTTPRequestHandler):
self._index()
return
if path.startswith("/article/"):
article = article_for_request_path(self.path)
if article is not None:
slug = path.split("/", 2)[2]
article = ROOT / f"{slug}.html"
if article.exists():
self._html(article.read_bytes())
return
self.send_response(404)
@ -42,7 +33,8 @@ class Handler(BaseHTTPRequestHandler):
def _index(self) -> None:
items = []
for slug in sorted(ARTICLES):
for f in sorted(ROOT.glob("*.html")):
slug = f.stem
items.append(f'<li><a href="/article/{slug}">{slug}</a></li>')
body = (
"<!doctype html><html><body>"

@ -20,46 +20,6 @@ def test_testbox_workflow_hydrates_secrets_and_dotfiles():
assert "CLAWBENCH_CODEX_AUTH_JSON" in workflow
def test_crabbox_config_uses_actions_hydration():
config = Path(".crabbox.yaml").read_text(encoding="utf-8")
assert "profile: clawbench-check" in config
assert "provider: aws" in config
assert "workflow: .github/workflows/crabbox-hydrate.yml" in config
assert "job: hydrate" in config
assert "baseRef: main" in config
assert "- clawbench" in config
assert "- CLAWBENCH_*" in config
assert "- OPENCLAW_*" in config
def test_crabbox_workflow_hydrates_secrets_dotfiles_and_ready_marker():
workflow = Path(".github/workflows/crabbox-hydrate.yml").read_text(encoding="utf-8")
assert "crabbox_id:" in workflow
assert "crabbox_runner_label:" in workflow
assert 'runs-on: [self-hosted, "${{ inputs.crabbox_runner_label }}"]' in workflow
assert "actions/setup-python@v5" in workflow
assert "python -m pip install -e ." in workflow
assert "scripts/ci-hydrate-testbox-env.sh" in workflow
assert "HF_TOKEN" in workflow
assert "OPENCLAW_CODEX_AUTH_JSON" in workflow
assert "CLAWBENCH_CODEX_AUTH_JSON" in workflow
assert "/usr/local/bin/clawbench-testbox-env" in workflow
assert "$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env" in workflow
assert "crabbox_keep_alive_minutes" in workflow
def test_crabbox_skill_documents_clawbench_flow():
skill = Path(".agents/skills/crabbox/SKILL.md").read_text(encoding="utf-8")
assert "openclaw/crabbox" in skill
assert ".crabbox.yaml" in skill
assert "crabbox actions hydrate" in skill
assert "clawbench-testbox-env" in skill
assert ".github/workflows/crabbox-hydrate.yml" in skill
def test_testbox_helper_sources_hydrated_profile():
script = Path("scripts/ci-hydrate-testbox-env.sh").read_text(encoding="utf-8")

@ -107,7 +107,7 @@ async def test_gateway_client_retries_transient_drain_errors(monkeypatch: pytest
async def fake_wait_event(self, event_name: str, *, timeout: float):
return {"payload": {"nonce": ""}}
async def fake_rpc(self, method: str, params=None, **kwargs):
async def fake_rpc(self, method: str, params=None):
return {"payload": {"type": "hello-ok", "protocol": 3}}
async def fake_listener(self):
@ -144,7 +144,7 @@ async def test_gateway_client_retries_half_closed_handshake_errors(
async def fake_wait_event(self, event_name: str, *, timeout: float):
return {"payload": {"nonce": ""}}
async def fake_rpc(self, method: str, params=None, **kwargs):
async def fake_rpc(self, method: str, params=None):
return {"payload": {"type": "hello-ok", "protocol": 3}}
async def fake_listener(self):
@ -226,71 +226,3 @@ async def test_rpc_timeout_cleans_pending_request():
assert sent_frames[0]["method"] == "sessions.create"
assert client._pending == {}
@pytest.mark.asyncio
async def test_send_and_wait_passes_gateway_timeout_and_waits_for_run():
    client = GatewayClient(GatewayConfig(request_timeout=1))
    session_key = "session-1"
    calls: list[tuple[str, dict | None, dict]] = []

    async def fake_rpc(method: str, params=None, **kwargs):
        calls.append((method, params, kwargs))
        if method == "sessions.send":
            return {"ok": True, "payload": {"runId": "run-1"}}
        if method == "agent.wait":
            return {"ok": True, "payload": {"runId": "run-1", "status": "completed"}}
        if method == "sessions.get":
            return {
                "ok": True,
                "payload": {
                    "messages": [
                        {
                            "role": "assistant",
                            "content": [{"type": "text", "text": "Done."}],
                        }
                    ]
                },
            }
        return {"ok": True, "payload": {}}

    client._rpc = fake_rpc  # type: ignore[method-assign]

    transcript = await client.send_and_wait(session_key, "hello", timeout=1.5)

    send_call = next(call for call in calls if call[0] == "sessions.send")
    assert send_call[1] == {
        "key": session_key,
        "message": "hello",
        "idempotencyKey": send_call[1]["idempotencyKey"],
        "timeoutMs": 1500,
    }
    wait_call = next(call for call in calls if call[0] == "agent.wait")
    assert wait_call[1] == {"runId": "run-1", "timeoutMs": 1500}
    assert wait_call[2]["timeout"] == 11.5
    assert [message.text for message in transcript.assistant_messages] == ["Done."]
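
The assertions in this test pin two numeric relationships: a 1.5 second caller timeout becomes `timeoutMs: 1500` in the RPC params, and the local timeout on the `agent.wait` call is 11.5 seconds. A minimal sketch of arithmetic consistent with those values follows. The helper names and the 10 second grace margin are assumptions inferred from the asserted numbers, not taken from the `GatewayClient` source.

```python
# Hypothetical helpers; names and the grace value are illustrative assumptions.
RPC_GRACE_SECONDS = 10.0  # assumption: 11.5 - 1.5 in the test above


def to_timeout_ms(timeout_seconds: float) -> int:
    """Convert a timeout in seconds to whole milliseconds for the RPC payload."""
    return int(round(timeout_seconds * 1000))


def wait_rpc_timeout(timeout_seconds: float, grace: float = RPC_GRACE_SECONDS) -> float:
    """Give the local RPC call extra time so the gateway can report its own timeout first."""
    return timeout_seconds + grace
```

Keeping the local deadline longer than the gateway-side `timeoutMs` means the gateway's own timeout response can arrive before the client gives up on the call.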


@pytest.mark.asyncio
async def test_send_and_wait_aborts_run_when_no_terminal_state_arrives():
    client = GatewayClient(GatewayConfig(request_timeout=1))
    session_key = "session-1"
    calls: list[tuple[str, dict | None, dict]] = []

    async def fake_rpc(method: str, params=None, **kwargs):
        calls.append((method, params, kwargs))
        if method == "sessions.send":
            return {"ok": True, "payload": {"runId": "run-timeout"}}
        if method == "agent.wait":
            await asyncio.sleep(60)
        if method == "sessions.abort":
            return {"ok": True, "payload": {"status": "aborted"}}
        if method == "sessions.get":
            return {"ok": True, "payload": {"messages": []}}
        return {"ok": True, "payload": {}}

    client._rpc = fake_rpc  # type: ignore[method-assign]

    await client.send_and_wait(session_key, "hello", timeout=0.01)

    assert ("sessions.abort", {"key": session_key, "runId": "run-timeout"}, {"timeout": 1}) in calls

@ -20,13 +20,6 @@ def test_submission_request_defaults_to_single_parallel_lane():
assert request.max_parallel_lanes == 1
assert request.runs_per_task == 3
assert request.judge_affects_score is False
assert request.task_ids == []
def test_local_queue_dir_honors_env_override(tmp_path, monkeypatch):
monkeypatch.setenv("CLAWBENCH_LOCAL_QUEUE_DIR", str(tmp_path / "queue"))
assert queue_module._resolve_local_queue_dir() == tmp_path / "queue"
def test_submission_request_fingerprint_includes_judge_score_gate():
@ -40,29 +33,6 @@ def test_submission_request_fingerprint_includes_judge_score_gate():
assert advisory.active_fingerprint() != weighted.active_fingerprint()
def test_submission_request_fingerprint_includes_task_ids():
    all_tasks = SubmissionRequest(model="anthropic/claude-sonnet-4-6")
    subset = SubmissionRequest(
        model="anthropic/claude-sonnet-4-6",
        task_ids=["t1-fs-quick-note"],
    )

    assert all_tasks.active_fingerprint() != subset.active_fingerprint()


def test_submission_request_fingerprint_canonicalizes_task_ids():
    first = SubmissionRequest(
        model="anthropic/claude-sonnet-4-6",
        task_ids=[" t2-demo ", "t1-demo", "t2-demo"],
    )
    second = SubmissionRequest(
        model="anthropic/claude-sonnet-4-6",
        task_ids=["t1-demo", "t2-demo"],
    )

    assert first.active_fingerprint() == second.active_fingerprint()
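
The two fingerprint tests above require that task IDs differing only in whitespace, order, or duplication hash identically, while a task subset hashes differently from the full run. One way to get that behavior is sketched below. This is an illustration only: `canonical_task_ids` and `fingerprint` are hypothetical names, and the real `SubmissionRequest.active_fingerprint()` may hash different fields or use a different digest.

```python
import hashlib
import json
from typing import Iterable, List


def canonical_task_ids(task_ids: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty entries and duplicates, then sort."""
    return sorted({task_id.strip() for task_id in task_ids if task_id.strip()})


def fingerprint(model: str, task_ids: Iterable[str] = ()) -> str:
    """Hash a canonical JSON payload so equivalent requests share a fingerprint."""
    payload = json.dumps(
        {"model": model, "task_ids": canonical_task_ids(task_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Canonicalizing before hashing keeps deduplication of queued jobs independent of how the caller happened to spell or order the list.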
def test_save_local_replaces_queue_file_atomically(tmp_path, monkeypatch):
monkeypatch.setattr(queue_module, "LOCAL_QUEUE_DIR", tmp_path)
monkeypatch.setattr(queue_module, "HF_TOKEN", "")

@ -1,25 +0,0 @@
from importlib import util
from pathlib import Path


def load_serve_module():
    serve_path = (
        Path(__file__).resolve().parents[1]
        / "tasks-public"
        / "assets"
        / "t3_web_research_and_cite"
        / "serve.py"
    )
    spec = util.spec_from_file_location("t3_web_research_serve", serve_path)
    module = util.module_from_spec(spec)
    assert spec.loader is not None
    spec.loader.exec_module(module)
    return module


def test_article_paths_resolve_only_known_article_slugs():
    serve = load_serve_module()

    assert serve.article_for_request_path("/article/01_grid_basics").name == "01_grid_basics.html"
    assert serve.article_for_request_path("/article/../../serve.py") is None
    assert serve.article_for_request_path("/article/%2e%2e/%2e%2e/serve.py") is None
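
The test above exercises an allow-list lookup: the request path is URL-decoded, and the remaining slug must be an exact key in a prebuilt mapping, so traversal strings never reach the filesystem. A self-contained sketch of that pattern is below. The function body follows the `article_for_request_path` shown in the `serve.py` hunk, but the in-memory `ARTICLES` mapping here is a stand-in assumption for the `ROOT.glob("*.html")` scan.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlsplit

# Stand-in for the glob-built mapping in serve.py (assumption for illustration).
ARTICLES = {"01_grid_basics": Path("articles/01_grid_basics.html")}


def article_for_request_path(request_path: str) -> Optional[Path]:
    """Return the article file for a known slug, or None for anything else."""
    # Drop the query string and decode percent-escapes such as %2e before matching.
    path = unquote(urlsplit(request_path).path)
    if not path.startswith("/article/"):
        return None
    slug = path.removeprefix("/article/")
    # Exact dictionary lookup: "../../serve.py" is simply not a key.
    return ARTICLES.get(slug)
```

Because the slug is only ever used as a dictionary key, no path is constructed from user input, which is what distinguishes this from the `ROOT / f"{slug}.html"` variant in the same hunk.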

@ -6,14 +6,7 @@ from types import SimpleNamespace
import pytest
from clawbench.queue import Job, JobQueue, JobStatus, SubmissionRequest
from clawbench.worker import (
GATEWAY_PORT,
GATEWAY_PORT_SPACING,
OPENCLAW_EVAL_SYSTEM_PROMPT,
EvalWorker,
JobProgressTracker,
ParallelLane,
)
from clawbench.worker import GATEWAY_PORT, GATEWAY_PORT_SPACING, EvalWorker, JobProgressTracker, ParallelLane
class DummyTask:
@ -98,12 +91,7 @@ def test_configure_browser_runtime_sets_benchmark_safe_openclaw_config(monkeypat
assert json.loads(config_path.read_text(encoding="utf-8")) == {
"agents": {"defaults": {"skipBootstrap": True}},
"browser": {"headless": True, "noSandbox": True},
"tools": {"exec": {"host": "gateway", "security": "full", "ask": "off"}},
"approvals": {"exec": {"enabled": False}},
}
approvals = json.loads((state_dir / "exec-approvals.json").read_text(encoding="utf-8"))
assert approvals["defaults"] == {"security": "full", "ask": "off", "askFallback": "full"}
assert approvals["agents"]["*"] == {"security": "full", "ask": "off", "askFallback": "full"}
def test_configure_browser_runtime_pins_subagents_to_active_model(monkeypatch):
@ -126,56 +114,10 @@ def test_configure_browser_runtime_pins_subagents_to_active_model(monkeypatch):
"defaults": {
"skipBootstrap": True,
"model": {"primary": "openai-codex/gpt-5.4"},
"models": {"openai-codex/gpt-5.4": {"params": {"fastMode": True}}},
"systemPromptOverride": OPENCLAW_EVAL_SYSTEM_PROMPT,
"subagents": {"model": {"primary": "openai-codex/gpt-5.4"}},
}
},
"browser": {"headless": True, "noSandbox": True},
"tools": {"exec": {"host": "gateway", "security": "full", "ask": "off"}},
"approvals": {"exec": {"enabled": False}},
}
def test_configure_browser_runtime_uses_gateway_env_config_path(tmp_path: Path, monkeypatch):
worker = EvalWorker(JobQueue())
worker.set_active_model("openai-codex/gpt-5.4")
parent_state = tmp_path / "parent"
lane_state = tmp_path / "lane"
parent_state.mkdir()
lane_state.mkdir()
parent_config = parent_state / "openclaw.json"
lane_config = lane_state / "openclaw.json"
parent_config.write_text("{}", encoding="utf-8")
lane_config.write_text("{}", encoding="utf-8")
monkeypatch.setenv("OPENCLAW_STATE_DIR", str(parent_state))
worker._configure_browser_runtime(
["node", "/openclaw/dist/cli.js"],
{
"OPENCLAW_STATE_DIR": str(lane_state),
"OPENCLAW_CONFIG_PATH": str(lane_config),
},
)
assert json.loads(parent_config.read_text(encoding="utf-8")) == {}
lane_data = json.loads(lane_config.read_text(encoding="utf-8"))
assert lane_data["agents"]["defaults"]["model"]["primary"] == "openai-codex/gpt-5.4"
assert lane_data["tools"]["exec"] == {"host": "gateway", "security": "full", "ask": "off"}
assert (lane_state / "exec-approvals.json").exists()
assert not (parent_state / "exec-approvals.json").exists()
def test_eval_model_defaults_pin_openai_to_sse_transport() -> None:
data: dict[str, object] = {}
changed = EvalWorker._apply_eval_model_defaults(data, "openai/gpt-5.5")
assert changed is True
assert data["agents"]["defaults"]["models"]["openai/gpt-5.5"]["params"] == {
"fastMode": True,
"transport": "sse",
"openaiWsWarmup": False,
}
@ -273,11 +215,6 @@ def test_materialize_lane_runtime_spaces_ports_and_copies_auth(tmp_path: Path, m
assert lane1.port == GATEWAY_PORT + GATEWAY_PORT_SPACING
assert lane1.state_dir is not None
assert (lane1.state_dir / "agents" / "main" / "agent" / "auth-profiles.json").exists()
lane_cfg = json.loads((lane1.state_dir / "openclaw.json").read_text(encoding="utf-8"))
assert lane_cfg["tools"]["exec"] == {"host": "gateway", "security": "full", "ask": "off"}
assert lane_cfg["approvals"]["exec"] == {"enabled": False}
lane_approvals = json.loads((lane1.state_dir / "exec-approvals.json").read_text(encoding="utf-8"))
assert lane_approvals["defaults"] == {"security": "full", "ask": "off", "askFallback": "full"}
@pytest.mark.asyncio