Compare commits: main...codex/adap (3 commits)

| Author | SHA1 | Date |
|---|---|---|
| | 69a2311681 | |
| | 82eaadbc61 | |
| | 30334cac88 | |
@ -10,11 +10,6 @@ agent dotfiles, Docker, or a benchmark run that is too heavy for the local
machine. Keep normal unit-test iteration local unless the user asks for
Testbox proof.

Crabbox is the sibling lane for reusable owned-capacity proof. Use
`.agents/skills/crabbox/SKILL.md` and `.crabbox.yaml` when ClawBench needs
AWS-backed reusable boxes or Crabbox sync/log/result inspection. Keep this
skill focused on Blacksmith CI parity.

## Warmup

Run from the repository root:

@ -1,122 +0,0 @@
---
name: crabbox
description: Use Crabbox for ClawBench remote Linux validation, warmed reusable boxes, GitHub Actions hydration, sync timing, logs, results, caches, and lease cleanup.
---

# Crabbox

Use Crabbox when ClawBench needs remote Linux proof on owned capacity, a large
runner class, reusable warm state, or a Blacksmith alternative.

## Before Running

- Run from the repo root. Crabbox sync mirrors the current checkout.
- Prefer local targeted tests for tight edit loops.
- Prefer Blacksmith Testbox when the task explicitly asks for Blacksmith or a
  Blacksmith-specific CI comparison.
- Use Crabbox for broad ClawBench gates when owned AWS capacity is the right
  remote lane.
- Check `.crabbox.yaml` for repo defaults before adding flags.
- Sanity-check the selected binary before remote work. Prefer the local
  `openclaw/crabbox` checkout when present because the user PATH shim can be
  stale: `command -v crabbox; ../crabbox/bin/crabbox --version`.
- Install with `brew install openclaw/tap/crabbox`; auth is required before use:
  `crabbox login --url https://crabbox.openclaw.ai --provider aws`.
- On macOS the user config is `~/Library/Application Support/crabbox/config.yaml`;
  it must include `broker.url`, `broker.token`, and usually `provider: aws`.

## ClawBench Flow

AWS/owned-capacity flow for Python tests:

```sh
crabbox warmup --class standard --idle-timeout 90m
crabbox actions hydrate --id <cbx_id-or-slug>
crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "python -m pytest -q"
```

For commands that need hydrated HF/provider credentials or agent dotfiles, use
the helper installed by the hydration workflow:

```sh
crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "clawbench-testbox-env python -m pytest -q"
crabbox run --id <cbx_id-or-slug> --timing-json --shell -- "clawbench-testbox-env clawbench run --model anthropic/claude-sonnet-4-6 --adapter simulated"
```

Blacksmith-backed Crabbox flow can delegate setup to the existing Testbox
workflow:

```sh
crabbox run --provider blacksmith-testbox --blacksmith-org openclaw --blacksmith-workflow .github/workflows/ci-check-testbox.yml --blacksmith-job check --blacksmith-ref main --idle-timeout 90m --timing-json --shell -- "python -m pytest -q"
```

Stop boxes you created before handoff:

```sh
crabbox stop <cbx_id-or-slug>
```

## Owned AWS Capacity

When AWS capacity is under pressure, do not start with `class=beast`.
`beast` begins at 48xlarge instances and can burn 192 vCPU quota per request.
ClawBench's owned-cloud default is `standard`; escalate to `fast`, then
`large`, and only use `beast` when the work is explicitly CPU-bound and the
smaller class already failed the goal.

Keep capacity hints enabled so brokered AWS leases print selected
region/market, quota pressure, Spot fallback, and high-pressure class warnings.
The ClawBench repo config sets `capacity.hints: true`; use
`CRABBOX_CAPACITY_HINTS=0` only when debugging hint rendering itself.

Use `beast` only for exceptional lanes:

- full benchmark sweeps where wall time is dominated by CPU, not dependency
  install or network;
- release/blocker validation where a maintainer explicitly asks for the largest
  owned AWS class;
- performance profiling where the point is to compare high-core behavior.

Do not use `beast` for ordinary `python -m pytest -q`, docs-only work, small
task repros, Blacksmith outage triage, or focused lint/type/test checks. Those
should use `standard` first and `fast` only when the extra cores materially
help.
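
The escalation order in this section can be sketched as a small helper. This is only an illustration under stated assumptions: `next_class` and its `cpu_bound` flag are hypothetical names and not part of the Crabbox CLI; only the class names (`standard`, `fast`, `large`, `beast`) come from the text.

```python
from __future__ import annotations

# Hypothetical helper illustrating the class ladder; not a Crabbox API.
LADDER = ["standard", "fast", "large", "beast"]


def next_class(current: str, *, cpu_bound: bool = False) -> str | None:
    """Return the next class to try after `current` failed the goal.

    Returns None when nothing larger exists, or when the only remaining
    step is `beast` and the work is not explicitly CPU-bound.
    """
    index = LADDER.index(current)  # raises ValueError for unknown classes
    if index + 1 >= len(LADDER):
        return None
    candidate = LADDER[index + 1]
    if candidate == "beast" and not cpu_bound:
        return None
    return candidate
```

Under these assumptions, `next_class("large")` returns `None` unless `cpu_bound=True` is passed, which mirrors the rule that `beast` is reserved for explicitly CPU-bound work.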

## Useful Commands

```sh
crabbox status --id <id-or-slug> --wait
crabbox inspect --id <id-or-slug> --json
crabbox sync-plan
crabbox history --lease <id-or-slug>
crabbox logs <run_id>
crabbox results <run_id>
crabbox cache stats --id <id-or-slug>
crabbox ssh --id <id-or-slug>
```

Use `--debug` on `run` when measuring sync timing.
Use `--timing-json` on warmup, hydrate, and run when comparing AWS and
blacksmith-testbox timings.
Use `--market spot|on-demand` on AWS warmup or one-shot run when testing quota
or capacity behavior without changing `.crabbox.yaml`.

## Hydration Boundary

`.github/workflows/crabbox-hydrate.yml` is repo-specific on purpose. It owns
ClawBench checkout, setup-python, pip install, provider/HF env hydration,
agent-dotfile restoration, ready marker, and keepalive. Crabbox owns runner
registration, workflow dispatch, SSH sync, command execution, logs/results,
local lease claims, and idle cleanup.

Do not add ClawBench-specific setup to Crabbox. Put repo setup in the hydration
workflow and generic lease/sync behavior in Crabbox.

## Cleanup

Crabbox has coordinator-owned idle expiry and local lease claims, so ClawBench
does not need a custom ledger. Default idle timeout is 30 minutes unless config
or flags set a different value. Still stop boxes you created when done.
If `crabbox list` prints `orphan=no-active-lease`, treat it as an operator
review hint; do not delete `keep=true` machines without checking provider and
coordinator state.
@ -1,48 +0,0 @@
|
||||
profile: clawbench-check
|
||||
provider: aws
|
||||
class: standard
|
||||
capacity:
|
||||
market: spot
|
||||
strategy: most-available
|
||||
fallback: on-demand-after-120s
|
||||
hints: true
|
||||
regions:
|
||||
- eu-west-1
|
||||
actions:
|
||||
workflow: .github/workflows/crabbox-hydrate.yml
|
||||
job: hydrate
|
||||
ref: main
|
||||
runnerLabels:
|
||||
- crabbox
|
||||
- clawbench
|
||||
runnerVersion: latest
|
||||
ephemeral: true
|
||||
aws:
|
||||
region: eu-west-1
|
||||
rootGB: 400
|
||||
sync:
|
||||
delete: true
|
||||
checksum: false
|
||||
gitSeed: true
|
||||
fingerprint: true
|
||||
baseRef: main
|
||||
exclude:
|
||||
- .artifacts
|
||||
- .codex
|
||||
- .DS_Store
|
||||
- .pytest_cache
|
||||
- .ruff_cache
|
||||
- .venv
|
||||
- dist
|
||||
- htmlcov
|
||||
- playwright-report
|
||||
- test-results
|
||||
env:
|
||||
allow:
|
||||
- CI
|
||||
- CLAWBENCH_*
|
||||
- OPENCLAW_*
|
||||
- PYTHON*
|
||||
ssh:
|
||||
user: crabbox
|
||||
port: "2222"
|
||||
1
.github/CODEOWNERS
vendored
@ -1 +0,0 @@
|
||||
* @openclaw/openclaw-evals
|
||||
16
.github/workflows/README.md
vendored
@ -29,22 +29,6 @@ It installs ClawBench, hydrates provider/HF secrets into
dotfiles from repo or org secrets, and installs
`~/.local/bin/clawbench-testbox-env` for commands that need that live auth.

## `crabbox-hydrate.yml` — Crabbox Actions hydration

This workflow exists for the Crabbox CLI from `openclaw/crabbox`:

```bash
crabbox warmup --idle-timeout 90m
crabbox actions hydrate --id <cbx_id-or-slug>
crabbox run --id <cbx_id-or-slug> --shell -- "python -m pytest -q"
```

It runs on the dynamic self-hosted runner label registered by Crabbox, installs
ClawBench, hydrates the same provider/HF secrets and agent dotfiles as the
Blacksmith Testbox workflow, writes the Crabbox ready marker under
`~/.crabbox/actions/`, and keeps the job alive for follow-up SSH sync/run
commands.

## `sync-to-hf-space.yml` — auto-mirror main to the HF Space

Mirrors every push to `main` into the HF Space git remote so

166
.github/workflows/crabbox-hydrate.yml
vendored
@ -1,166 +0,0 @@
|
||||
name: Crabbox Hydrate
|
||||
|
||||
on:
|
||||
workflow_dispatch:
|
||||
inputs:
|
||||
crabbox_id:
|
||||
description: "Crabbox lease ID"
|
||||
required: true
|
||||
type: string
|
||||
ref:
|
||||
description: "Git ref to hydrate"
|
||||
required: false
|
||||
type: string
|
||||
crabbox_runner_label:
|
||||
description: "Dynamic Crabbox runner label"
|
||||
required: true
|
||||
type: string
|
||||
crabbox_job:
|
||||
description: "Hydration job identifier expected by Crabbox"
|
||||
required: false
|
||||
default: "hydrate"
|
||||
type: string
|
||||
crabbox_keep_alive_minutes:
|
||||
description: "Minutes to keep the hydrated job alive"
|
||||
required: false
|
||||
default: "90"
|
||||
type: string
|
||||
|
||||
permissions:
|
||||
contents: read
|
||||
|
||||
jobs:
|
||||
hydrate:
|
||||
name: hydrate
|
||||
runs-on: [self-hosted, "${{ inputs.crabbox_runner_label }}"]
|
||||
timeout-minutes: 120
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v4
|
||||
with:
|
||||
ref: ${{ inputs.ref || github.ref }}
|
||||
|
||||
- name: Set up Python
|
||||
uses: actions/setup-python@v5
|
||||
with:
|
||||
python-version: "3.12"
|
||||
cache: pip
|
||||
|
||||
- name: Install project
|
||||
run: |
|
||||
python -m pip install --upgrade pip
|
||||
python -m pip install -e .
|
||||
|
||||
- name: Prepare Crabbox shell
|
||||
shell: bash
|
||||
run: |
|
||||
set -euo pipefail
|
||||
git fetch --no-tags --depth=50 origin "+refs/heads/main:refs/remotes/origin/main"
|
||||
python_dir="$(dirname "$(python -c 'import sys; print(sys.executable)')")"
|
||||
sudo ln -sf "$python_dir/python" /usr/local/bin/python
|
||||
sudo ln -sf "$python_dir/python" /usr/local/bin/python3
|
||||
sudo ln -sf "$python_dir/pip" /usr/local/bin/pip
|
||||
sudo ln -sf "$python_dir/pip" /usr/local/bin/pip3
|
||||
sudo ln -sf "$python_dir/pytest" /usr/local/bin/pytest
|
||||
|
||||
- name: Hydrate Crabbox env helper
|
||||
shell: bash
|
||||
env:
|
||||
HF_TOKEN: ${{ secrets.HF_TOKEN }}
|
||||
HF_USERNAME: ${{ secrets.HF_USERNAME }}
|
||||
CLAWBENCH_QUEUE_DATASET: ${{ vars.CLAWBENCH_QUEUE_DATASET || 'openclaw/clawbench-results' }}
|
||||
CLAWBENCH_JUDGE_MODEL: ${{ vars.CLAWBENCH_JUDGE_MODEL }}
|
||||
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
|
||||
ANTHROPIC_API_KEY_OLD: ${{ secrets.ANTHROPIC_API_KEY_OLD }}
|
||||
ANTHROPIC_API_TOKEN: ${{ secrets.ANTHROPIC_API_TOKEN }}
|
||||
CEREBRAS_API_KEY: ${{ secrets.CEREBRAS_API_KEY }}
|
||||
DEEPINFRA_API_KEY: ${{ secrets.DEEPINFRA_API_KEY }}
|
||||
FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
|
||||
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
|
||||
GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
|
||||
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
|
||||
KIMI_API_KEY: ${{ secrets.KIMI_API_KEY }}
|
||||
MINIMAX_API_KEY: ${{ secrets.MINIMAX_API_KEY }}
|
||||
MISTRAL_API_KEY: ${{ secrets.MISTRAL_API_KEY }}
|
||||
MOONSHOT_API_KEY: ${{ secrets.MOONSHOT_API_KEY }}
|
||||
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
|
||||
OPENAI_BASE_URL: ${{ secrets.OPENAI_BASE_URL }}
|
||||
OPENROUTER_API_KEY: ${{ secrets.OPENROUTER_API_KEY }}
|
||||
QWEN_API_KEY: ${{ secrets.QWEN_API_KEY }}
|
||||
TOGETHER_API_KEY: ${{ secrets.TOGETHER_API_KEY }}
|
||||
XAI_API_KEY: ${{ secrets.XAI_API_KEY }}
|
||||
ZAI_API_KEY: ${{ secrets.ZAI_API_KEY }}
|
||||
Z_AI_API_KEY: ${{ secrets.Z_AI_API_KEY }}
|
||||
OPENCLAW_CODEX_AUTH_JSON: ${{ secrets.OPENCLAW_CODEX_AUTH_JSON }}
|
||||
OPENCLAW_CODEX_CONFIG_TOML: ${{ secrets.OPENCLAW_CODEX_CONFIG_TOML }}
|
||||
OPENCLAW_CLAUDE_JSON: ${{ secrets.OPENCLAW_CLAUDE_JSON }}
|
||||
OPENCLAW_CLAUDE_CREDENTIALS_JSON: ${{ secrets.OPENCLAW_CLAUDE_CREDENTIALS_JSON }}
|
||||
OPENCLAW_CLAUDE_SETTINGS_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_JSON }}
|
||||
OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.OPENCLAW_CLAUDE_SETTINGS_LOCAL_JSON }}
|
||||
OPENCLAW_GEMINI_SETTINGS_JSON: ${{ secrets.OPENCLAW_GEMINI_SETTINGS_JSON }}
|
||||
CLAWBENCH_CODEX_AUTH_JSON: ${{ secrets.CLAWBENCH_CODEX_AUTH_JSON }}
|
||||
CLAWBENCH_CODEX_CONFIG_TOML: ${{ secrets.CLAWBENCH_CODEX_CONFIG_TOML }}
|
||||
CLAWBENCH_CLAUDE_JSON: ${{ secrets.CLAWBENCH_CLAUDE_JSON }}
|
||||
CLAWBENCH_CLAUDE_CREDENTIALS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_CREDENTIALS_JSON }}
|
||||
CLAWBENCH_CLAUDE_SETTINGS_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_JSON }}
|
||||
CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON: ${{ secrets.CLAWBENCH_CLAUDE_SETTINGS_LOCAL_JSON }}
|
||||
CLAWBENCH_GEMINI_SETTINGS_JSON: ${{ secrets.CLAWBENCH_GEMINI_SETTINGS_JSON }}
|
||||
run: |
|
||||
bash scripts/ci-hydrate-testbox-env.sh
|
||||
sudo ln -sf "$HOME/.local/bin/clawbench-testbox-env" /usr/local/bin/clawbench-testbox-env
|
||||
|
||||
- name: Mark Crabbox ready
|
||||
shell: bash
|
||||
run: |
|
||||
set -euo pipefail
|
||||
job="${{ inputs.crabbox_job }}"
|
||||
if [ -z "$job" ]; then job=hydrate; fi
|
||||
mkdir -p "$HOME/.crabbox/actions"
|
||||
state="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env"
|
||||
env_file="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env.sh"
|
||||
services_file="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.services"
|
||||
write_export() {
|
||||
key="$1"
|
||||
value="${!key-}"
|
||||
if [ -n "$value" ]; then
|
||||
printf 'export %s=%q\n' "$key" "$value"
|
||||
fi
|
||||
}
|
||||
{
|
||||
for key in CI GITHUB_ACTIONS GITHUB_WORKSPACE GITHUB_REPOSITORY GITHUB_RUN_ID GITHUB_RUN_NUMBER GITHUB_RUN_ATTEMPT GITHUB_REF GITHUB_REF_NAME GITHUB_SHA GITHUB_EVENT_NAME GITHUB_ACTOR RUNNER_OS RUNNER_ARCH RUNNER_TEMP RUNNER_TOOL_CACHE; do
|
||||
write_export "$key"
|
||||
done
|
||||
} > "${env_file}.tmp"
|
||||
mv "${env_file}.tmp" "$env_file"
|
||||
{
|
||||
echo "# Docker containers visible from the hydrated runner"
|
||||
docker ps --format '{{.Names}}\t{{.Image}}\t{{.Ports}}' 2>/dev/null || true
|
||||
} > "${services_file}.tmp"
|
||||
mv "${services_file}.tmp" "$services_file"
|
||||
tmp="${state}.tmp"
|
||||
{
|
||||
echo "WORKSPACE=${GITHUB_WORKSPACE}"
|
||||
echo "RUN_ID=${GITHUB_RUN_ID}"
|
||||
echo "JOB=${job}"
|
||||
echo "ENV_FILE=${env_file}"
|
||||
echo "SERVICES_FILE=${services_file}"
|
||||
echo "READY_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
|
||||
} > "$tmp"
|
||||
mv "$tmp" "$state"
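
The "Mark Crabbox ready" step above writes each file to a `.tmp` path and then renames it with `mv`, so a reader never sees a half-written marker. A minimal Python sketch of the same write-then-rename pattern follows; the function name `write_state_atomically` is hypothetical and is not taken from the workflow or from Crabbox.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def write_state_atomically(path: Path, fields: dict[str, str]) -> None:
    """Write KEY=value lines to `path` so readers never see a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the final rename stays
    # on one filesystem, where os.replace is atomic on POSIX.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            for key, value in fields.items():
                handle.write(f"{key}={value}\n")
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the temp file if writing or renaming failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The design point is the same as in the shell step: the marker either does not exist yet or is complete, which matters when another process polls for it.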
|
||||
|
||||
- name: Keep Crabbox job alive
|
||||
shell: bash
|
||||
run: |
|
||||
set -euo pipefail
|
||||
minutes="${{ inputs.crabbox_keep_alive_minutes }}"
|
||||
case "$minutes" in
|
||||
''|*[!0-9]*) minutes=90 ;;
|
||||
esac
|
||||
stop="$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.stop"
|
||||
deadline=$(( $(date +%s) + minutes * 60 ))
|
||||
while [ "$(date +%s)" -lt "$deadline" ]; do
|
||||
if [ -f "$stop" ]; then
|
||||
exit 0
|
||||
fi
|
||||
sleep 15
|
||||
done
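
The keep-alive step above validates the `crabbox_keep_alive_minutes` input, then polls for a `.stop` file every 15 seconds until a deadline passes. The following Python sketch restates that logic for readers less familiar with shell `case` patterns. The names `parse_minutes` and `keep_alive` and the string return values are assumptions made for illustration; the workflow itself is the shell script shown in the diff.

```python
from __future__ import annotations

import time
from pathlib import Path


def parse_minutes(raw: str, default: int = 90) -> int:
    """Mirror the shell `case`: empty or non-digit input falls back to default."""
    return int(raw) if raw.isascii() and raw.isdigit() else default


def keep_alive(stop_file: Path, minutes: int, *, poll_seconds: float = 15.0) -> str:
    """Block until `stop_file` exists or `minutes` have elapsed.

    Returns "stopped" or "expired" so a caller can tell the two exits apart
    (the shell version simply ends the step in both cases).
    """
    deadline = time.monotonic() + minutes * 60
    while time.monotonic() < deadline:
        if stop_file.is_file():
            return "stopped"
        time.sleep(poll_seconds)
    return "expired"
```

Note that the sketch uses a monotonic clock, whereas the shell loop compares wall-clock seconds from `date +%s`; for a 90-minute window the difference is unlikely to matter.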
|
||||
@ -14,7 +14,7 @@ RUN apt-get update && \
|
||||
RUN ln -s /app /openclaw
|
||||
|
||||
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
|
||||
RUN cd /tmp && npx -y playwright@1.59.1 install --with-deps chromium && \
|
||||
RUN npx -y playwright@1.59.1 install --with-deps chromium && \
|
||||
CHROME_PATH="$(find /ms-playwright -path '*/chrome' -type f | sort | head -n 1)" && \
|
||||
test -x "$CHROME_PATH" && \
|
||||
ln -sf "$CHROME_PATH" /usr/bin/chromium
|
||||
@ -28,7 +28,6 @@ COPY --chown=node:node tasks-public/ tasks-public/
|
||||
COPY --chown=node:node tasks-domain/ tasks-domain/
|
||||
COPY --chown=node:node profiles/ profiles/
|
||||
COPY --chown=node:node baselines/ baselines/
|
||||
COPY --chown=node:node scripts/ scripts/
|
||||
COPY --chown=node:node app.py .
|
||||
|
||||
RUN python3 -m pip install --break-system-packages --no-cache-dir .
|
||||
|
||||
20
README.md
@ -461,26 +461,6 @@ python3 scripts/run_posterior_dynamics_pipeline.py \
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
```

### Running on Kubernetes

See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
version:

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
export CLAWBENCH_MODEL="openai/gpt-5.5"
# export MLFLOW_NAMESPACE="mlflow" # MLflow deploys in a separate namespace (default: mlflow)

./scripts/k8s/deploy.sh # deploys OpenClaw + MLflow + starts eval
./scripts/k8s/deploy.sh --logs # follow progress
./scripts/k8s/deploy.sh --teardown # tear down openclaw & eval (does not delete MLflow)
```

API keys are stored in a Kubernetes Secret created by the deploy script.
MLflow is deployed in its own namespace (default: `mlflow`, configurable via
`MLFLOW_NAMESPACE`).

---

## Partner Trace Spec

@ -226,73 +226,14 @@ class GatewayClient:
|
||||
attempt += 1
|
||||
try:
|
||||
remaining = max(1.0, deadline - asyncio.get_running_loop().time())
|
||||
attempt_timeout = min(30.0, remaining)
|
||||
self._ws = await websockets.connect(
|
||||
self.config.url,
|
||||
max_size=10 * 1024 * 1024,
|
||||
open_timeout=attempt_timeout,
|
||||
open_timeout=min(self.config.connect_timeout, remaining),
|
||||
additional_headers={"Origin": host},
|
||||
)
|
||||
self._listen_task = asyncio.create_task(self._listener())
|
||||
challenge = await self._wait_event(
|
||||
"connect.challenge", timeout=attempt_timeout
|
||||
)
|
||||
challenge_payload = challenge.get("payload", {})
|
||||
nonce = ""
|
||||
if isinstance(challenge_payload, dict):
|
||||
raw_nonce = challenge_payload.get("nonce", "")
|
||||
if isinstance(raw_nonce, str):
|
||||
nonce = raw_nonce.strip()
|
||||
|
||||
role = "operator"
|
||||
scopes = [
|
||||
"operator.admin",
|
||||
"operator.read",
|
||||
"operator.write",
|
||||
"operator.approvals",
|
||||
"operator.pairing",
|
||||
]
|
||||
client_info = {
|
||||
"id": "openclaw-control-ui",
|
||||
"version": __version__,
|
||||
"platform": "linux",
|
||||
"mode": "ui",
|
||||
}
|
||||
connect_params: dict[str, Any] = {
|
||||
"minProtocol": PROTOCOL_VERSION,
|
||||
"maxProtocol": PROTOCOL_VERSION,
|
||||
"client": client_info,
|
||||
"role": role,
|
||||
"scopes": scopes,
|
||||
"caps": [],
|
||||
"commands": [],
|
||||
"permissions": {},
|
||||
"auth": {"token": self.config.token} if self.config.token else {},
|
||||
}
|
||||
device = _build_connect_device(
|
||||
nonce=nonce,
|
||||
token=self.config.token,
|
||||
client_id=str(client_info["id"]),
|
||||
client_mode=str(client_info["mode"]),
|
||||
role=role,
|
||||
scopes=scopes,
|
||||
platform=str(client_info["platform"]),
|
||||
)
|
||||
if device:
|
||||
connect_params["device"] = device
|
||||
|
||||
response = await self._rpc(
|
||||
"connect",
|
||||
connect_params,
|
||||
timeout=attempt_timeout,
|
||||
)
|
||||
payload = response.get("payload", {})
|
||||
if payload.get("type") != "hello-ok":
|
||||
raise ConnectionError(f"Expected hello-ok, got: {payload}")
|
||||
logger.info("Connected to gateway (protocol v%s)", payload.get("protocol", "?"))
|
||||
return
|
||||
break
|
||||
except Exception as exc:
|
||||
await self.close()
|
||||
if not _is_transient_gateway_connect_error(exc):
|
||||
raise
|
||||
if asyncio.get_running_loop().time() >= deadline:
|
||||
@ -304,6 +245,60 @@ class GatewayClient:
|
||||
delay,
|
||||
)
|
||||
await asyncio.sleep(delay)
|
||||
self._listen_task = asyncio.create_task(self._listener())
|
||||
challenge = await self._wait_event("connect.challenge", timeout=self.config.connect_timeout)
|
||||
challenge_payload = challenge.get("payload", {})
|
||||
nonce = ""
|
||||
if isinstance(challenge_payload, dict):
|
||||
raw_nonce = challenge_payload.get("nonce", "")
|
||||
if isinstance(raw_nonce, str):
|
||||
nonce = raw_nonce.strip()
|
||||
|
||||
role = "operator"
|
||||
scopes = [
|
||||
"operator.admin",
|
||||
"operator.read",
|
||||
"operator.write",
|
||||
"operator.approvals",
|
||||
"operator.pairing",
|
||||
]
|
||||
client_info = {
|
||||
"id": "openclaw-control-ui",
|
||||
"version": __version__,
|
||||
"platform": "linux",
|
||||
"mode": "ui",
|
||||
}
|
||||
connect_params: dict[str, Any] = {
|
||||
"minProtocol": PROTOCOL_VERSION,
|
||||
"maxProtocol": PROTOCOL_VERSION,
|
||||
"client": client_info,
|
||||
"role": role,
|
||||
"scopes": scopes,
|
||||
"caps": [],
|
||||
"commands": [],
|
||||
"permissions": {},
|
||||
"auth": {"token": self.config.token} if self.config.token else {},
|
||||
}
|
||||
device = _build_connect_device(
|
||||
nonce=nonce,
|
||||
token=self.config.token,
|
||||
client_id=str(client_info["id"]),
|
||||
client_mode=str(client_info["mode"]),
|
||||
role=role,
|
||||
scopes=scopes,
|
||||
platform=str(client_info["platform"]),
|
||||
)
|
||||
if device:
|
||||
connect_params["device"] = device
|
||||
|
||||
response = await self._rpc(
|
||||
"connect",
|
||||
connect_params,
|
||||
)
|
||||
payload = response.get("payload", {})
|
||||
if payload.get("type") != "hello-ok":
|
||||
raise ConnectionError(f"Expected hello-ok, got: {payload}")
|
||||
logger.info("Connected to gateway (protocol v%s)", payload.get("protocol", "?"))
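
One side of the connect hunks above bounds every step of an attempt by `min(30.0, remaining)`, where `remaining` is the time left until an overall deadline, and retries only on transient errors. A stripped-down sketch of that pattern is shown below. It is not the `GatewayClient` code: `connect_with_deadline`, `_default_is_transient`, and the choice to re-raise once the deadline has passed are assumptions, since the diff elides what happens at that point.

```python
from __future__ import annotations

import asyncio
from typing import Any, Awaitable, Callable


def _default_is_transient(exc: Exception) -> bool:
    # Assumption for this sketch: timeouts and connection errors are retryable.
    return isinstance(exc, (TimeoutError, asyncio.TimeoutError, ConnectionError))


async def connect_with_deadline(
    attempt: Callable[[], Awaitable[Any]],
    *,
    total_timeout: float,
    per_attempt_cap: float = 30.0,
    delay: float = 0.5,
    is_transient: Callable[[Exception], bool] = _default_is_transient,
) -> Any:
    """Retry `attempt` until it succeeds or the overall deadline passes."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + total_timeout
    while True:
        # Same arithmetic as the hunk: at least 1s, at most the cap.
        remaining = max(1.0, deadline - loop.time())
        attempt_timeout = min(per_attempt_cap, remaining)
        try:
            return await asyncio.wait_for(attempt(), timeout=attempt_timeout)
        except Exception as exc:
            if not is_transient(exc):
                raise
            if loop.time() >= deadline:
                raise  # assumed behavior; the original branch is not shown
            await asyncio.sleep(delay)
```

Capping each attempt by the remaining budget keeps a single slow handshake from consuming the whole connect window, which appears to be the point of threading `attempt_timeout` through the challenge wait and the `connect` RPC.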
|
||||
|
||||
async def close(self) -> None:
|
||||
if self._listen_task and not self._listen_task.done():
|
||||
@ -399,15 +394,6 @@ class GatewayClient:
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to delete session %s: %s", session_key, exc)
|
||||
|
||||
async def abort_session(self, session_key: str, *, run_id: str | None = None) -> None:
|
||||
params: dict[str, Any] = {"key": session_key}
|
||||
if run_id:
|
||||
params["runId"] = run_id
|
||||
try:
|
||||
await self._rpc("sessions.abort", params, timeout=min(self.config.request_timeout, 10.0))
|
||||
except Exception as exc:
|
||||
logger.warning("Failed to abort session %s run %s: %s", session_key, run_id or "-", exc)
|
||||
|
||||
async def get_effective_tools(self, session_key: str) -> dict[str, Any]:
|
||||
response = await self._rpc("tools.effective", {"sessionKey": session_key})
|
||||
return response.get("payload", {})
|
||||
@ -427,27 +413,15 @@ class GatewayClient:
|
||||
msg_queue: asyncio.Queue[dict[str, Any]] = asyncio.Queue()
|
||||
self._event_queues[chat_queue_key] = chat_queue
|
||||
self._event_queues[msg_queue_key] = msg_queue
|
||||
timeout_ms = max(1, min(int(timeout * 1000), 2_147_483_647))
|
||||
|
||||
send_response = await self._rpc(
|
||||
await self._rpc(
|
||||
"sessions.send",
|
||||
{
|
||||
"key": session_key,
|
||||
"message": message,
|
||||
"idempotencyKey": idempotency_key,
|
||||
"timeoutMs": timeout_ms,
|
||||
},
|
||||
)
|
||||
send_payload = send_response.get("payload", {})
|
||||
run_id = idempotency_key
|
||||
if isinstance(send_payload, dict):
|
||||
raw_run_id = send_payload.get("runId")
|
||||
if isinstance(raw_run_id, str) and raw_run_id.strip():
|
||||
run_id = raw_run_id.strip()
|
||||
|
||||
wait_task = asyncio.create_task(
|
||||
self._wait_for_agent_run(run_id, timeout_ms=timeout_ms)
|
||||
)
|
||||
|
||||
collected_messages: list[TranscriptMessage] = []
|
||||
done = False
|
||||
@ -456,31 +430,8 @@ class GatewayClient:
|
||||
while not done:
|
||||
remaining = deadline - asyncio.get_running_loop().time()
|
||||
if remaining <= 0:
|
||||
logger.warning(
|
||||
"Timeout waiting for final state on session %s run %s",
|
||||
session_key,
|
||||
run_id,
|
||||
)
|
||||
logger.warning("Timeout waiting for final state on session %s", session_key)
|
||||
break
|
||||
if wait_task.done():
|
||||
wait_payload = _task_result_or_empty(wait_task)
|
||||
status = str(wait_payload.get("status", ""))
|
||||
if status and status != "timeout":
|
||||
logger.info(
|
||||
"agent.wait observed terminal status for session %s run %s: %s",
|
||||
session_key,
|
||||
run_id,
|
||||
status,
|
||||
)
|
||||
done = True
|
||||
break
|
||||
if status == "timeout":
|
||||
logger.warning(
|
||||
"agent.wait timed out for session %s run %s",
|
||||
session_key,
|
||||
run_id,
|
||||
)
|
||||
break
|
||||
try:
|
||||
event = await asyncio.wait_for(chat_queue.get(), timeout=min(0.5, remaining))
|
||||
state = event.get("payload", {}).get("state", "")
|
||||
@ -489,9 +440,6 @@ class GatewayClient:
|
||||
except asyncio.TimeoutError:
|
||||
pass
|
||||
|
||||
if not done:
|
||||
await self.abort_session(session_key, run_id=run_id)
|
||||
|
||||
collected_messages.extend(
|
||||
await _drain_message_queue(
|
||||
msg_queue,
|
||||
@ -516,30 +464,11 @@ class GatewayClient:
|
||||
):
|
||||
collected_messages = history_messages
|
||||
finally:
|
||||
if not wait_task.done():
|
||||
wait_task.cancel()
|
||||
try:
|
||||
await wait_task
|
||||
except asyncio.CancelledError:
|
||||
pass
|
||||
self._event_queues.pop(chat_queue_key, None)
|
||||
self._event_queues.pop(msg_queue_key, None)
|
||||
|
||||
return _correlate_transcript(Transcript(messages=collected_messages))
|
||||
|
||||
async def _wait_for_agent_run(self, run_id: str, *, timeout_ms: int) -> dict[str, Any]:
|
||||
try:
|
||||
response = await self._rpc(
|
||||
"agent.wait",
|
||||
{"runId": run_id, "timeoutMs": timeout_ms},
|
||||
timeout=(timeout_ms / 1000.0) + 10.0,
|
||||
)
|
||||
except Exception as exc:
|
||||
logger.warning("agent.wait failed for run %s: %s", run_id, exc)
|
||||
return {}
|
||||
payload = response.get("payload", {})
|
||||
return payload if isinstance(payload, dict) else {}
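
The `timeoutMs` value sent with `sessions.send` and `agent.wait` in the hunks above is clamped with `max(1, min(int(timeout * 1000), 2_147_483_647))`, and the `agent.wait` RPC itself is given ten extra seconds. A small worked example of that arithmetic follows; the helper names are hypothetical, and the reading that the upper bound exists to fit a signed 32-bit integer is an inference from the constant (2^31 - 1), not something the diff states.

```python
INT32_MAX = 2_147_483_647  # 2**31 - 1, the upper bound used in the hunk


def to_timeout_ms(timeout_seconds: float) -> int:
    """Convert seconds to whole milliseconds, clamped to [1, INT32_MAX]."""
    return max(1, min(int(timeout_seconds * 1000), INT32_MAX))


def rpc_timeout_seconds(timeout_ms: int) -> float:
    """Client-side RPC timeout: the server-side wait plus a 10 second margin."""
    return (timeout_ms / 1000.0) + 10.0


print(to_timeout_ms(30))       # 30000
print(to_timeout_ms(0.0004))   # 1 (sub-millisecond values round up to the floor)
print(to_timeout_ms(1e9))      # 2147483647 (clamped)
print(rpc_timeout_seconds(to_timeout_ms(30)))  # 40.0
```

The margin means the client stops waiting only after the gateway has had a chance to report its own timeout status.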
|
||||
|
||||
async def get_session_messages(self, session_key: str) -> list[TranscriptMessage]:
|
||||
try:
|
||||
response = await self._rpc("sessions.get", {"key": session_key})
|
||||
@ -648,13 +577,6 @@ def _build_connect_device(
|
||||
platform: str,
|
||||
device_family: str | None = None,
|
||||
) -> dict[str, Any] | None:
|
||||
if os.environ.get("CLAWBENCH_DISABLE_GATEWAY_DEVICE_IDENTITY", "").strip().lower() in {
|
||||
"1",
|
||||
"true",
|
||||
"yes",
|
||||
"on",
|
||||
}:
|
||||
return None
|
||||
if not nonce:
|
||||
return None
|
||||
|
||||
@ -724,10 +646,6 @@ def _resolve_node_executable() -> str | None:
|
||||
|
||||
|
||||
def _is_transient_gateway_connect_error(exc: Exception) -> bool:
|
||||
if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
|
||||
return True
|
||||
if isinstance(exc, websockets.exceptions.ConnectionClosed):
|
||||
return True
|
||||
if isinstance(exc, InvalidStatus):
|
||||
return exc.response.status_code in {502, 503, 504}
|
||||
if isinstance(exc, InvalidMessage):
|
||||
@ -743,13 +661,6 @@ def _describe_connect_error(exc: Exception) -> str:
|
||||
return exc.__class__.__name__
|
||||
|
||||
|
||||
def _task_result_or_empty(task: asyncio.Task[dict[str, Any]]) -> dict[str, Any]:
|
||||
try:
|
||||
return task.result()
|
||||
except Exception:
|
||||
return {}
|
||||
|
||||
|
||||
def _parse_single_message(message_data: dict[str, Any]) -> TranscriptMessage | None:
|
||||
role = message_data.get("role", "")
|
||||
if not role:
|
||||
|
||||
@ -19,7 +19,6 @@ from rich.console import Console
|
||||
from rich.table import Table
|
||||
|
||||
from clawbench import __version__
|
||||
from clawbench.ablation import build_ablation_profile
|
||||
from clawbench.client import GatewayClient, GatewayConfig
|
||||
from clawbench.releases import compute_task_snapshot_fingerprint, load_active_release
|
||||
from clawbench.schemas import (
|
||||
@ -87,9 +86,6 @@ class BenchmarkHarness:
|
||||
browser_concurrency: int = 1,
|
||||
adapter: str = "openclaw",
|
||||
judge_affects_score: bool = False,
|
||||
tool_profile_name: str | None = None,
|
||||
enabled_toolsets: list[str] | None = None,
|
||||
disabled_toolsets: list[str] | None = None,
|
||||
) -> None:
|
||||
self.gateway_config = gateway_config
|
||||
self.model = model
|
||||
@ -115,9 +111,6 @@ class BenchmarkHarness:
|
||||
self.concurrency = max(1, int(concurrency))
|
||||
self.browser_concurrency = max(1, int(browser_concurrency))
|
||||
self.adapter = adapter
|
||||
self.tool_profile_name = tool_profile_name
|
||||
self.enabled_toolsets = enabled_toolsets or []
|
||||
self.disabled_toolsets = disabled_toolsets or []
|
||||
self.repo_root = Path(__file__).parent.parent
|
||||
self.last_task_runs: dict[str, list[TaskRunResult]] = {}
|
||||
|
||||
@ -555,9 +548,6 @@ class BenchmarkHarness:
|
||||
"prompt_variant": self.prompt_variant,
|
||||
"judge_model": self.judge_model,
|
||||
"judge_affects_score": self.judge_affects_score,
|
||||
"tool_profile_name": self.tool_profile_name,
|
||||
"enabled_toolsets": self.enabled_toolsets,
|
||||
"disabled_toolsets": self.disabled_toolsets,
|
||||
"benchmark_version": __version__,
|
||||
"task_fingerprint": _task_definition_fingerprint(task),
|
||||
}
|
||||
@ -763,15 +753,6 @@ class BenchmarkHarness:
|
||||
for _ in range(count)
|
||||
)
|
||||
active_release = load_active_release()
|
||||
ablation_profile = build_ablation_profile(
|
||||
model=self.model,
|
||||
adapter=self.adapter,
|
||||
prompt_profile=self.prompt_variant,
|
||||
harness_version=__version__,
|
||||
tool_profile_name=self.tool_profile_name,
|
||||
enabled_toolsets=self.enabled_toolsets,
|
||||
disabled_toolsets=self.disabled_toolsets,
|
||||
)
|
||||
result = BenchmarkResult(
|
||||
submission_id=str(uuid.uuid4()),
|
||||
model=self.model,
|
||||
@ -789,7 +770,6 @@ class BenchmarkHarness:
|
||||
"judge_model": self.judge_model,
|
||||
"judge_affects_score": self.judge_affects_score,
|
||||
"adapter": self.adapter,
|
||||
"ablation_profile": ablation_profile.model_dump(),
|
||||
"known_adapters": list(KNOWN_ADAPTERS),
|
||||
"executable_adapters": sorted(EXECUTABLE_ADAPTERS),
|
||||
"subsets": self.subsets,
|
||||
|
||||
@ -28,14 +28,7 @@ logger = logging.getLogger(__name__)
|
||||
HF_TOKEN = os.environ.get("HF_TOKEN", "")
|
||||
|
||||
# Local fallback when HF is unavailable
|
||||
def _resolve_local_queue_dir() -> Path:
|
||||
override = os.environ.get("CLAWBENCH_LOCAL_QUEUE_DIR", "").strip()
|
||||
if override:
|
||||
return Path(override).expanduser()
|
||||
return Path("/data/queue") if Path("/data").exists() else Path("data/queue")
|
||||
|
||||
|
||||
LOCAL_QUEUE_DIR = _resolve_local_queue_dir()
|
||||
LOCAL_QUEUE_DIR = Path("/data/queue") if Path("/data").exists() else Path("data/queue")
|
||||
|
||||
|
||||
class JobStatus(str, Enum):
|
||||
@ -57,7 +50,6 @@ class SubmissionRequest(BaseModel):
|
||||
runs_per_task: int = Field(default=3, ge=1, le=10)
|
||||
max_parallel_lanes: int = Field(default=1, ge=1, le=8)
|
||||
tier: str | None = None # Filter to a specific tier
|
||||
task_ids: list[str] = Field(default_factory=list)
|
||||
scenario: str | None = None
|
||||
prompt_variant: str = "clear"
|
||||
submitter: str = "" # HF username
|
||||
@ -73,7 +65,6 @@ class SubmissionRequest(BaseModel):
|
||||
"runs_per_task": self.runs_per_task,
|
||||
"max_parallel_lanes": self.max_parallel_lanes,
|
||||
"tier": self.tier or "",
|
||||
"task_ids": sorted({task_id.strip() for task_id in self.task_ids if task_id.strip()}),
|
||||
"scenario": self.scenario or "",
|
||||
"prompt_variant": self.prompt_variant,
|
||||
}
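
The `task_ids` entry in the dictionary above strips whitespace, drops empty entries, removes duplicates, and sorts the result, so two submissions that list the same tasks in a different order produce the same value. A short illustration, assuming a standalone helper named `normalize_task_ids` (hypothetical; in the diff the expression is inlined):

```python
from __future__ import annotations


def normalize_task_ids(task_ids: list[str]) -> list[str]:
    """Strip, de-duplicate, and sort task ids; blank entries are dropped."""
    return sorted({task_id.strip() for task_id in task_ids if task_id.strip()})


print(normalize_task_ids([" b ", "a", "b", "", "   "]))  # ['a', 'b']
print(normalize_task_ids(["a", "b"]) == normalize_task_ids(["b", "a", "a"]))  # True
```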
|
||||
|
||||
@ -34,13 +34,6 @@ STALE_EVALUATION_SECONDS = max(
|
||||
JOB_HEARTBEAT_INTERVAL_SECONDS * 4,
|
||||
int(os.environ.get("CLAWBENCH_STALE_EVALUATION_SECONDS", "1800")),
|
||||
)
|
||||
OPENCLAW_EVAL_EXEC_HOSTS = {"auto", "gateway", "sandbox", "node"}
|
||||
OPENCLAW_EVAL_SYSTEM_PROMPT = (
|
||||
"You are running an OpenClaw benchmark task. Complete the user's request in the current "
|
||||
"workspace using the available tools when needed. For file, code, browser, shell, or memory "
|
||||
"tasks, make the requested changes directly and verify them when practical. Do not ask "
|
||||
"follow-up questions during the benchmark. Keep any final reply brief."
|
||||
)
|
||||
|
||||
|
||||
@dataclass
|
||||
@ -53,12 +46,6 @@ class ParallelLane:
|
||||
state_dir: Path | None = None
|
||||
log_path: Path | None = None
|
||||
|
||||
@property
|
||||
def home_dir(self) -> Path | None:
|
||||
if self.state_dir is None:
|
||||
return None
|
||||
return self.state_dir.parent / "home"
|
||||
|
||||
@property
|
||||
def ws_url(self) -> str:
|
||||
return f"ws://localhost:{self.port}"
|
||||
@ -315,7 +302,6 @@ class EvalWorker:
|
||||
prompt_variant=job.request.prompt_variant,
|
||||
prepare_run=prepare_run,
|
||||
progress_callback=progress_callback,
|
||||
tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
|
||||
)
|
||||
return await harness.run()
|
||||
|
||||
@ -386,7 +372,6 @@ class EvalWorker:
|
||||
tier=job.request.tier,
|
||||
scenario=job.request.scenario,
|
||||
prompt_variant=job.request.prompt_variant,
|
||||
tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
|
||||
)
|
||||
return summary_harness.compose_result_from_task_stats(
|
||||
ordered_stats,
|
||||
@ -400,8 +385,7 @@ class EvalWorker:
|
||||
)
|
||||
finally:
|
||||
self._stop_parallel_gateways()
|
||||
if os.environ.get("CLAWBENCH_KEEP_PARALLEL_LANE_ROOT", "").strip() != "1":
|
||||
shutil.rmtree(job_root, ignore_errors=True)
|
||||
shutil.rmtree(job_root, ignore_errors=True)
|
||||
|
||||
async def _run_parallel_lane(self, job, lane: ParallelLane, progress: JobProgressTracker):
|
||||
gateway_cmd = self._find_gateway_cmd()
|
||||
@ -450,7 +434,6 @@ class EvalWorker:
|
||||
progress_callback=progress_callback,
|
||||
print_report=False,
|
||||
quiet=True,
|
||||
tool_profile_name=os.environ.get("CLAWBENCH_TOOL_PROFILE_NAME", "") or None,
|
||||
)
|
||||
result = await harness.run()
|
||||
await self._sync_job_progress(job.job_id, progress.clear_lane(lane.index))
|
||||
@ -465,9 +448,6 @@ class EvalWorker:
|
||||
return load_all_tasks(
|
||||
tier=job.request.tier,
|
||||
scenario=job.request.scenario,
|
||||
task_ids=list(getattr(job.request, "task_ids", []) or None)
|
||||
if getattr(job.request, "task_ids", None)
|
||||
else None,
|
||||
prompt_variant=job.request.prompt_variant,
|
||||
)
|
||||
|
||||
@ -527,36 +507,10 @@ class EvalWorker:
|
||||
def _materialize_lane_runtime(self, lane: ParallelLane, job_root: Path) -> None:
|
||||
lane_root = job_root / f"lane-{lane.index}"
|
||||
lane.state_dir = lane_root / "state"
|
||||
lane_home = lane.home_dir
|
||||
if lane_home is not None:
|
||||
(lane_home / ".config").mkdir(parents=True, exist_ok=True)
|
||||
lane.log_path = lane_root / "gateway.log"
|
||||
lane.port = GATEWAY_PORT + (lane.index * GATEWAY_PORT_SPACING)
|
||||
self._seed_lane_state_dir(lane.state_dir)
|
||||
|
||||
def _run_lane_prepare_hook(self, lane: ParallelLane) -> None:
|
||||
hook = os.environ.get("CLAWBENCH_LANE_PREPARE_CMD", "").strip()
|
||||
if not hook:
|
||||
return
|
||||
if lane.state_dir is None:
|
||||
raise RuntimeError(f"Lane {lane.index + 1} state dir missing before prepare hook")
|
||||
lane_home = lane.home_dir
|
||||
if lane_home is None:
|
||||
raise RuntimeError(f"Lane {lane.index + 1} home dir missing before prepare hook")
|
||||
(lane_home / ".config").mkdir(parents=True, exist_ok=True)
|
||||
hook_env = {
|
||||
**os.environ,
|
||||
"HOME": str(lane_home),
|
||||
"OPENCLAW_HOME": str(lane_home),
|
||||
"OPENCLAW_STATE_DIR": str(lane.state_dir),
|
||||
"OPENCLAW_CONFIG_PATH": str(lane.state_dir / "openclaw.json"),
|
||||
"XDG_CONFIG_HOME": str(lane_home / ".config"),
|
||||
"CLAWBENCH_LANE_INDEX": str(lane.index),
|
||||
"CLAWBENCH_LANE_PORT": str(lane.port),
|
||||
}
|
||||
logger.info("Running lane %d prepare hook", lane.index + 1)
|
||||
subprocess.run([hook], env=hook_env, check=True)
|
||||
|
||||
def _seed_lane_state_dir(self, target_state_dir: Path) -> None:
|
||||
source_state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR", os.path.expanduser("~/.openclaw")))
|
||||
shutil.rmtree(target_state_dir, ignore_errors=True)
|
||||
@ -675,19 +629,13 @@ class EvalWorker:
|
||||
_set_nested(data, "browser.headless", True)
|
||||
_set_nested(data, "browser.noSandbox", True)
|
||||
_set_nested(data, "agents.defaults.skipBootstrap", True)
|
||||
_set_nested(data, "tools.exec.host", self._openclaw_eval_exec_host())
|
||||
_set_nested(data, "tools.exec.security", "full")
|
||||
_set_nested(data, "tools.exec.ask", "off")
|
||||
_set_nested(data, "approvals.exec.enabled", False)
|
||||
if self._active_model:
|
||||
_set_nested(data, "agents.defaults.model.primary", self._active_model)
|
||||
_set_nested(data, "agents.defaults.subagents.model.primary", self._active_model)
|
||||
self._apply_eval_model_defaults(data, self._active_model)
|
||||
|
||||
tmp_path = cfg_path.with_suffix(".json.tmp")
|
||||
tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
|
||||
tmp_path.replace(cfg_path)
|
||||
self._write_eval_exec_approvals(lane_state_dir)
|
||||
|
||||
def _order_task_stats(self, tasks: list[TaskDefinition], combined_stats: list) -> list:
|
||||
stats_by_id = {}
|
||||
@ -782,7 +730,6 @@ class EvalWorker:
|
||||
"token",
|
||||
"--token",
|
||||
gateway_token,
|
||||
"--compact",
|
||||
],
|
||||
stdout=log_handle,
|
||||
stderr=subprocess.STDOUT,
|
||||
@ -821,12 +768,6 @@ class EvalWorker:
|
||||
f"Gateway /health did not respond within {health_deadline_sec}s. Log:\n{self._read_gateway_log()}"
|
||||
)
|
||||
|
||||
await self._wait_for_gateway_ready_marker(
|
||||
process=self._gateway_process,
|
||||
log_reader=lambda: self._read_gateway_log(limit=20_000),
|
||||
description="Gateway",
|
||||
)
|
||||
|
||||
# Phase B: control-plane probe with retries (see the parallel
|
||||
# variant in _ensure_parallel_gateway for the detailed rationale).
|
||||
gateway_config = GatewayConfig(url=GATEWAY_WS_URL, token=GATEWAY_TOKEN)
|
||||
@ -876,30 +817,21 @@ class EvalWorker:
|
||||
# Re-inject the host config's env + plugins before every restart.
|
||||
if lane.state_dir is not None:
|
||||
self._reinject_host_env_to_lane(lane.state_dir)
|
||||
self._run_lane_prepare_hook(lane)
|
||||
if lane.state_dir is None or lane.log_path is None:
|
||||
raise RuntimeError(f"Lane {lane.index + 1} runtime was not materialized before gateway startup")
|
||||
lane_home = lane.home_dir
|
||||
if lane_home is None:
|
||||
raise RuntimeError(f"Lane {lane.index + 1} home was not materialized before gateway startup")
|
||||
(lane_home / ".config").mkdir(parents=True, exist_ok=True)
|
||||
|
||||
logger.info("Starting lane %d gateway on port %d", lane.index + 1, lane.port)
|
||||
gateway_token = os.environ.get("OPENCLAW_GATEWAY_TOKEN", "clawbench-internal-token")
|
||||
gateway_env = {
|
||||
**os.environ,
|
||||
"HOME": str(lane_home),
|
||||
"OPENCLAW_HOME": str(lane_home),
|
||||
"OPENCLAW_HOME": os.environ.get("OPENCLAW_HOME", os.path.expanduser("~")),
|
||||
"OPENCLAW_STATE_DIR": str(lane.state_dir),
|
||||
"OPENCLAW_CONFIG_PATH": str(lane.state_dir / "openclaw.json"),
|
||||
"XDG_CONFIG_HOME": str(lane_home / ".config"),
|
||||
"OPENCLAW_SKIP_GMAIL_WATCHER": "1",
|
||||
"OPENCLAW_SKIP_CANVAS_HOST": "1",
|
||||
"OPENCLAW_NO_RESPAWN": "1",
|
||||
}
|
||||
self._configure_browser_runtime(gateway_cmd, gateway_env)
|
||||
lane.log_path.parent.mkdir(parents=True, exist_ok=True)
|
||||
lane.log_path.write_text("", encoding="utf-8")
|
||||
log_handle = lane.log_path.open("a", encoding="utf-8")
|
||||
try:
|
||||
process = subprocess.Popen(
|
||||
@ -917,7 +849,6 @@ class EvalWorker:
|
||||
"token",
|
||||
"--token",
|
||||
gateway_token,
|
||||
"--compact",
|
||||
],
|
||||
stdout=log_handle,
|
||||
stderr=subprocess.STDOUT,
|
||||
@ -960,12 +891,6 @@ class EvalWorker:
|
||||
f"Log:\n{self._read_parallel_gateway_log(lane)}"
|
||||
)
|
||||
|
||||
await self._wait_for_gateway_ready_marker(
|
||||
process=process,
|
||||
log_reader=lambda: self._read_parallel_gateway_log(lane, limit=20_000),
|
||||
description=f"Lane {lane.index + 1} gateway",
|
||||
)
|
||||
|
||||
# Phase B: control-plane probe with explicit retries. A healthy
|
||||
# /health response does not guarantee sessions.create works
|
||||
# immediately — plugin registration races can leave the gateway
|
||||
@ -1077,10 +1002,6 @@ class EvalWorker:
|
||||
("agents.defaults.skipBootstrap", True),
|
||||
("browser.headless", True),
|
||||
("browser.noSandbox", True),
|
||||
("tools.exec.host", self._openclaw_eval_exec_host()),
|
||||
("tools.exec.security", "full"),
|
||||
("tools.exec.ask", "off"),
|
||||
("approvals.exec.enabled", False),
|
||||
]
|
||||
if self._active_model:
|
||||
config_pairs.extend(
|
||||
@ -1090,61 +1011,14 @@ class EvalWorker:
|
||||
]
|
||||
)
|
||||
try:
|
||||
state_dir = Path(
|
||||
gateway_env.get("OPENCLAW_STATE_DIR")
|
||||
or os.environ.get("OPENCLAW_STATE_DIR")
|
||||
or os.path.expanduser("~/.openclaw")
|
||||
)
|
||||
config_path = Path(gateway_env.get("OPENCLAW_CONFIG_PATH") or (state_dir / "openclaw.json"))
|
||||
self._patch_openclaw_config(config_pairs, config_path=config_path)
|
||||
self._write_eval_exec_approvals(state_dir)
|
||||
self._patch_openclaw_config(config_pairs)
|
||||
except Exception as exc:
|
||||
logger.warning("Direct openclaw.json patch failed: %s", exc)
|
||||
|
||||
@staticmethod
|
||||
def _openclaw_eval_exec_host() -> str:
|
||||
value = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
|
||||
if value in OPENCLAW_EVAL_EXEC_HOSTS:
|
||||
return value
|
||||
logger.warning("Invalid OPENCLAW_EXEC_HOST=%r; using gateway", value)
|
||||
return "gateway"
|
||||
|
||||
@staticmethod
|
||||
def _write_eval_exec_approvals(state_dir: Path) -> None:
|
||||
state_dir.mkdir(parents=True, exist_ok=True)
|
||||
approvals_path = state_dir / "exec-approvals.json"
|
||||
approvals = {
|
||||
"version": 1,
|
||||
"socket": {
|
||||
"path": str(approvals_path.with_suffix(".sock")),
|
||||
"token": "clawbench-eval-token",
|
||||
},
|
||||
"defaults": {
|
||||
"security": "full",
|
||||
"ask": "off",
|
||||
"askFallback": "full",
|
||||
},
|
||||
"agents": {
|
||||
"*": {
|
||||
"security": "full",
|
||||
"ask": "off",
|
||||
"askFallback": "full",
|
||||
}
|
||||
},
|
||||
}
|
||||
tmp_path = approvals_path.with_suffix(".json.tmp")
|
||||
tmp_path.write_text(json.dumps(approvals, indent=2), encoding="utf-8")
|
||||
tmp_path.replace(approvals_path)
|
||||
|
||||
def _patch_openclaw_config(
|
||||
self,
|
||||
pairs: list[tuple[str, object]],
|
||||
*,
|
||||
config_path: Path | None = None,
|
||||
) -> None:
|
||||
if config_path is None:
|
||||
state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR") or os.path.expanduser("~/.openclaw"))
|
||||
config_path = state_dir / "openclaw.json"
|
||||
def _patch_openclaw_config(pairs: list[tuple[str, object]]) -> None:
|
||||
state_dir = Path(os.environ.get("OPENCLAW_STATE_DIR") or os.path.expanduser("~/.openclaw"))
|
||||
config_path = state_dir / "openclaw.json"
|
||||
if not config_path.exists():
|
||||
logger.warning("openclaw.json not found at %s; skipping direct patch", config_path)
|
||||
return
|
||||
@ -1160,50 +1034,12 @@ class EvalWorker:
|
||||
if cursor.get(parts[-1]) != value:
|
||||
cursor[parts[-1]] = value
|
||||
changed = True
|
||||
if self._active_model:
|
||||
changed = self._apply_eval_model_defaults(data, self._active_model) or changed
|
||||
if not changed:
|
||||
return
|
||||
tmp_path = config_path.with_suffix(".json.tmp")
|
||||
tmp_path.write_text(json.dumps(data, indent=2), encoding="utf-8")
|
||||
tmp_path.replace(config_path)
|
||||
|
||||
@staticmethod
|
||||
def _apply_eval_model_defaults(data: dict, model: str) -> bool:
|
||||
"""Force eval model parameters that keep benchmark turns low-latency."""
|
||||
agents = data.setdefault("agents", {})
|
||||
if not isinstance(agents, dict):
|
||||
data["agents"] = agents = {}
|
||||
defaults = agents.setdefault("defaults", {})
|
||||
if not isinstance(defaults, dict):
|
||||
agents["defaults"] = defaults = {}
|
||||
models = defaults.setdefault("models", {})
|
||||
if not isinstance(models, dict):
|
||||
defaults["models"] = models = {}
|
||||
entry = models.setdefault(model, {})
|
||||
if not isinstance(entry, dict):
|
||||
entry = {}
|
||||
models[model] = entry
|
||||
params = entry.setdefault("params", {})
|
||||
if not isinstance(params, dict):
|
||||
params = {}
|
||||
entry["params"] = params
|
||||
changed = False
|
||||
if defaults.get("systemPromptOverride") != OPENCLAW_EVAL_SYSTEM_PROMPT:
|
||||
defaults["systemPromptOverride"] = OPENCLAW_EVAL_SYSTEM_PROMPT
|
||||
changed = True
|
||||
if params.get("fastMode") is not True:
|
||||
params["fastMode"] = True
|
||||
changed = True
|
||||
if model.startswith("openai/"):
|
||||
if params.get("transport") != "sse":
|
||||
params["transport"] = "sse"
|
||||
changed = True
|
||||
if params.get("openaiWsWarmup") is not False:
|
||||
params["openaiWsWarmup"] = False
|
||||
changed = True
|
||||
return changed
|
||||
|
||||
def _find_gateway_cmd(self) -> list[str] | None:
|
||||
import shutil
|
||||
|
||||
@ -1223,15 +1059,13 @@ class EvalWorker:
|
||||
# Use a generous dedicated config for the probe. A healthy gateway
|
||||
# usually responds to sessions.create in under a second, but plugin
|
||||
# initialization (especially OpenRouter model list fetch) can add
|
||||
# 10-30s after /health reports 200. On cold Docker lanes OpenClaw may
|
||||
# also install provider runtime SDKs during the first sessions.create,
|
||||
# so keep this bound configurable and separate from steady-state RPCs.
|
||||
probe_timeout = float(os.environ.get("CLAWBENCH_GATEWAY_PROBE_TIMEOUT_SECONDS", "180"))
|
||||
# 10-30s after /health reports 200. The 60s outer bound ensures we
|
||||
# don't give up during a cold-start scenario.
|
||||
probe_config = GatewayConfig(
|
||||
url=gateway_config.url,
|
||||
token=gateway_config.token,
|
||||
connect_timeout=gateway_config.connect_timeout,
|
||||
request_timeout=probe_timeout,
|
||||
request_timeout=30.0,
|
||||
)
|
||||
|
||||
async def _probe() -> None:
|
||||
@ -1242,67 +1076,25 @@ class EvalWorker:
|
||||
await client.delete_session(session_key)
|
||||
|
||||
try:
|
||||
await asyncio.wait_for(_probe(), timeout=probe_timeout + 10.0)
|
||||
await asyncio.wait_for(_probe(), timeout=60.0)
|
||||
except asyncio.TimeoutError as exc:
|
||||
raise RuntimeError(
|
||||
f"Gateway control-plane probe timed out after {probe_timeout:.0f}s "
|
||||
"Gateway control-plane probe timed out after 60s "
|
||||
"(sessions.create hung on a freshly-started gateway); "
|
||||
"lane will be retried by the queue."
|
||||
) from exc
|
||||
|
||||
async def _wait_for_gateway_ready_marker(self, process: subprocess.Popen, log_reader, description: str) -> None:
|
||||
# OpenClaw 2026.4.26 can answer /health before channels and sidecars
|
||||
# finish startup. Probing sessions.create during that window can hold the
|
||||
# session write lock for minutes. Some lane gateway modes do not emit
|
||||
# the final ready marker, so wait for it briefly after sidecar startup
|
||||
# and then let the bounded control-plane probe decide.
|
||||
ready_deadline_sec = int(os.environ.get("CLAWBENCH_GATEWAY_READY_TIMEOUT_SECONDS", "420"))
|
||||
marker_grace_sec = int(os.environ.get("CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS", "90"))
|
||||
saw_sidecar_start = False
|
||||
sidecar_start_elapsed: int | None = None
|
||||
for elapsed in range(ready_deadline_sec):
|
||||
if process.poll() is not None:
|
||||
raise RuntimeError(
|
||||
f"{description} exited with code {process.returncode}. Log:\n{log_reader()[-4_000:]}"
|
||||
)
|
||||
|
||||
log_text = log_reader()
|
||||
if "[gateway] ready" in log_text:
|
||||
logger.info("%s ready after %ss", description, elapsed)
|
||||
return
|
||||
if "[gateway] starting channels and sidecars" in log_text:
|
||||
saw_sidecar_start = True
|
||||
if sidecar_start_elapsed is None:
|
||||
sidecar_start_elapsed = elapsed
|
||||
if sidecar_start_elapsed is not None and elapsed - sidecar_start_elapsed >= marker_grace_sec:
|
||||
logger.info(
|
||||
"%s did not emit ready marker %ss after sidecar startup; probing control plane",
|
||||
description,
|
||||
marker_grace_sec,
|
||||
)
|
||||
return
|
||||
if not saw_sidecar_start and elapsed >= 15:
|
||||
return
|
||||
await asyncio.sleep(1)
|
||||
|
||||
logger.warning(
|
||||
"%s did not log ready within %ss; probing control plane anyway. Log:\n%s",
|
||||
description,
|
||||
ready_deadline_sec,
|
||||
log_reader()[-4_000:],
|
||||
)
|
||||
|
||||
def _read_gateway_log(self, limit: int = 4_000) -> str:
|
||||
def _read_gateway_log(self) -> str:
|
||||
try:
|
||||
return Path("/tmp/gateway.log").read_text(encoding="utf-8", errors="replace")[-limit:]
|
||||
return Path("/tmp/gateway.log").read_text(encoding="utf-8", errors="replace")[-4_000:]
|
||||
except Exception:
|
||||
return "(no gateway log)"
|
||||
|
||||
def _read_parallel_gateway_log(self, lane: ParallelLane, limit: int = 4_000) -> str:
|
||||
def _read_parallel_gateway_log(self, lane: ParallelLane) -> str:
|
||||
if lane.log_path is None:
|
||||
return "(no gateway log)"
|
||||
try:
|
||||
return lane.log_path.read_text(encoding="utf-8", errors="replace")[-limit:]
|
||||
return lane.log_path.read_text(encoding="utf-8", errors="replace")[-4_000:]
|
||||
except Exception:
|
||||
return "(no gateway log)"
|
||||
|
||||
|
||||
@ -1,367 +0,0 @@
|
||||
# Running ClawBench on Kubernetes
|
||||
|
||||
ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
|
||||
connects to the gateway over loopback (`ws://localhost:18789`), runs the
|
||||
19-task eval suite, and optionally logs results to MLflow.
|
||||
|
||||
```
┌─── OpenClaw Pod ─────────────────────────────┐
│  gateway container (ws://localhost:18789)    │
│  clawbench sidecar ──► gateway via loopback  │
└──────────────────────────────────────────────┘
        │                    │
        ▼                    ▼
Model provider API    MLflow (optional)
```
|
||||
|
||||
All commands use `scripts/k8s/deploy.sh`. The script has these modes:
|
||||
|
||||
| Flag | What it does |
|------|-------------|
| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
| `--openclaw-only` | Deploy OpenClaw gateway only |
| `--mlflow-only` | Deploy MLflow only |
| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
| `--remove-sidecar` | Remove clawbench sidecar |
| `--logs` | Tail sidecar logs |
| `--teardown` | Delete eval namespace (keeps MLflow) |
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
|
||||
- A container image for ClawBench (see [Building images](#building-images))
|
||||
- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
|
||||
|
||||
For local testing with Kind:
|
||||
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
|
||||
|
||||
---
|
||||
|
||||
## Environment variables
|
||||
|
||||
Set these **before** running `deploy.sh`.
|
||||
|
||||
### Required
|
||||
|
||||
| Variable | Purpose |
|----------|---------|
| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
| `OPENAI_API_KEY` | Model provider key (or use another provider; see table below) |
|
||||
|
||||
### Optional
|
||||
|
||||
| Variable | Default | Purpose |
|----------|---------|---------|
| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway |
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI; skips MLflow deploy if set |
| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
| `GEMINI_API_KEY` | | Added to K8s secret if set |
| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
|
||||
|
||||
### Model routing
|
||||
|
||||
The gateway routes by provider prefix:
|
||||
|
||||
| Model string | Required variables |
|-------------|-------------------|
| `openai/gpt-5.5` | `OPENAI_API_KEY` |
| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
|
||||
|
||||
For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
|
||||
server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
|
||||
prefix for the model name:
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
|
||||
export OPENAI_API_KEY="none" # dummy value if the endpoint doesn't require auth
|
||||
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
|
||||
```
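The routing table above can be read as a simple prefix lookup. The sketch below is illustrative only and is not part of ClawBench or `deploy.sh`: the names `PROVIDER_KEYS`, `required_env_vars`, and `missing_env_vars` are hypothetical, and the mapping covers only the prefixes the table lists (the real gateway may recognize more).

```python
import os

# Hypothetical mapping copied from the routing table; not read from the gateway.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def required_env_vars(model: str, custom_endpoint: bool = False) -> list[str]:
    """Return the variables the table lists for a model string."""
    # Only the first path segment selects the provider, so
    # "openrouter/anthropic/..." still routes through OpenRouter.
    provider = model.split("/", 1)[0]
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"no key mapping known for provider {provider!r}")
    required = [PROVIDER_KEYS[provider]]
    # OpenAI-compatible servers (vLLM, Ollama, TGI) also need a base URL.
    if provider == "openai" and custom_endpoint:
        required.append("OPENAI_API_BASE")
    return required


def missing_env_vars(model: str, custom_endpoint: bool = False, env=None) -> list[str]:
    """List required variables that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [name for name in required_env_vars(model, custom_endpoint) if not env.get(name)]
```

Running `missing_env_vars(os.environ.get("CLAWBENCH_MODEL", "openai/gpt-5.5"))` before `deploy.sh` is one way to catch an unset key early.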
|
||||
|
||||
---
|
||||
|
||||
## Full deploy (quick start)
|
||||
|
||||
Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
|
||||
# Export API keys before running. The script stores them in a K8s Secret
|
||||
# ("clawbench-secrets") that the gateway and sidecar containers read.
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
|
||||
# Model to evaluate (default: openai/gpt-5.5)
|
||||
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
|
||||
|
||||
./scripts/k8s/deploy.sh
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
# Should show 2/2 containers (gateway + clawbench)
|
||||
kubectl get pods -n clawbench-eval
|
||||
|
||||
# Follow eval progress
|
||||
./scripts/k8s/deploy.sh --logs
|
||||
```
|
||||
|
||||
When the eval finishes, copy results and clean up:
|
||||
|
||||
```bash
|
||||
# Copy results from the sidecar
|
||||
POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
|
||||
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
|
||||
|
||||
# Remove the sidecar (keeps OpenClaw + MLflow running)
|
||||
./scripts/k8s/deploy.sh --remove-sidecar
|
||||
|
||||
# Or tear down everything
|
||||
./scripts/k8s/deploy.sh --teardown
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Existing cluster + existing MLflow
|
||||
|
||||
If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
|
||||
you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
|
||||
required.
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
|
||||
# API keys — export before running deploy.sh. The script creates a
|
||||
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
|
||||
# At least one provider key is required.
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
# export ANTHROPIC_API_KEY="sk-ant-..."
|
||||
# export OPENROUTER_API_KEY="sk-or-..."
|
||||
# export GEMINI_API_KEY="..."
|
||||
|
||||
# Model to evaluate (default: openai/gpt-5.5)
|
||||
export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
|
||||
|
||||
# If attaching to an existing OpenClaw gateway, this must match that gateway.
|
||||
# If deploy.sh creates OpenClaw, it generates this token for you.
|
||||
# export OPENCLAW_GATEWAY_TOKEN="..."
|
||||
|
||||
# Point to your existing MLflow
|
||||
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
|
||||
export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5" # or use MLFLOW_EXPERIMENT_ID=42
|
||||
|
||||
# Deploy OpenClaw gateway into your cluster
|
||||
./scripts/k8s/deploy.sh --openclaw-only
|
||||
```
|
||||
|
||||
Verify OpenClaw is running:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n clawbench-eval
|
||||
# Expect: openclaw-xxxx 1/1 Running
|
||||
```
|
||||
|
||||
Then start the eval:
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --add-sidecar
|
||||
./scripts/k8s/deploy.sh --logs
|
||||
```
|
||||
|
||||
Because `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow
deployment and patches the experiment name/ID into the clawbench ConfigMap.
When the eval completes, `scripts/log_to_mlflow.py` logs results to your
MLflow instance under that experiment.
|
||||
|
||||
`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
|
||||
`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
|
||||
|
||||
---
|
||||
|
||||
## Step-by-step deploy
|
||||
|
||||
Use this when you want to deploy components individually or bring your own
|
||||
OpenClaw/MLflow.
|
||||
|
||||
### Step 1: Deploy OpenClaw gateway
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
./scripts/k8s/deploy.sh --openclaw-only
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n clawbench-eval
|
||||
# Expect: openclaw-xxxx 1/1 Running
|
||||
```
|
||||
|
||||
This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
|
||||
auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
|
||||
token and creates the `clawbench-secrets` Secret automatically.
|
||||
|
||||
**Skip this step** if you already have an OpenClaw deployment. Your existing
|
||||
gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
|
||||
|
||||
```json
|
||||
{
|
||||
"browser": {
|
||||
"enabled": true,
|
||||
"headless": true,
|
||||
"noSandbox": true,
|
||||
"ssrfPolicy": {
|
||||
"allowedHostnames": ["localhost", "127.0.0.1"]
|
||||
}
|
||||
},
|
||||
"tools": {
|
||||
"profile": "coding",
|
||||
"alsoAllow": ["browser"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Key requirements:
|
||||
- `browser.enabled: true` — activates the bundled browser plugin
|
||||
- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
|
||||
- `browser.ssrfPolicy` — several eval tasks need localhost access
|
||||
- Gateway must bind to loopback with token auth; export the matching
|
||||
`OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar`
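For an existing gateway, the config requirements listed above can be checked before attaching the sidecar. This is a minimal sketch under stated assumptions: `check_gateway_config` is a hypothetical helper (not a ClawBench or OpenClaw API), it inspects an already-parsed config dict, and it covers only the three JSON-level requirements, not the bind address or token.

```python
def check_gateway_config(config: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    browser = config.get("browser", {})
    tools = config.get("tools", {})

    # The bundled browser plugin must be switched on explicitly.
    if browser.get("enabled") is not True:
        problems.append("browser.enabled must be true")

    # The coding profile does not include the browser tool by default.
    if "browser" not in tools.get("alsoAllow", []):
        problems.append('tools.alsoAllow must include "browser"')

    # Several eval tasks fetch pages served from the pod itself.
    allowed = browser.get("ssrfPolicy", {}).get("allowedHostnames", [])
    for host in ("localhost", "127.0.0.1"):
        if host not in allowed:
            problems.append(f"browser.ssrfPolicy.allowedHostnames is missing {host}")

    return problems
```

Loading the gateway's config with `json.load` and printing the returned list gives a quick pass/fail before running `--add-sidecar`.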
|
||||
|
||||
### Step 2: Deploy MLflow
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --mlflow-only
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n mlflow
|
||||
# Expect: mlflow-xxxx 1/1 Running
|
||||
```
|
||||
|
||||
Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
|
||||
namespace. The clawbench ConfigMap defaults to
|
||||
`http://mlflow-service.mlflow.svc.cluster.local:5000`.
|
||||
|
||||
**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
|
||||
|
||||
```bash
|
||||
export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
|
||||
export MLFLOW_EXPERIMENT_ID=4 # or MLFLOW_EXPERIMENT_NAME
|
||||
```
|
||||
|
||||
### Step 3: Run the eval
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --add-sidecar
|
||||
```
|
||||
|
||||
This patches the OpenClaw deployment to inject a clawbench sidecar that:
|
||||
|
||||
1. Waits for the gateway (TCP check on port 18789, up to 3 min)
2. Checks MLflow connectivity if configured
3. Runs `clawbench run` with settings from the ConfigMap
4. Logs results to MLflow on success
5. Sleeps indefinitely so you can retrieve logs and results
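The first step in the list above, the TCP readiness wait, can be sketched in a few lines. This is not the sidecar's actual entrypoint; `wait_for_tcp` and its parameters are hypothetical, and the 180-second default only mirrors the "up to 3 min" figure given above.

```python
import socket
import time


def wait_for_tcp(host: str, port: int, timeout_s: float = 180.0, interval_s: float = 2.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect only proves the port is listening,
            # not that the gateway control plane is ready.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A caller would use `wait_for_tcp("localhost", 18789)` and abort the eval if it returns `False`.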
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n $CLAWBENCH_NAMESPACE
|
||||
# Expect: openclaw-xxxx 2/2 Running (gateway + clawbench)
|
||||
|
||||
./scripts/k8s/deploy.sh --logs
|
||||
# Should show "Waiting for gateway..." then "Starting eval..."
|
||||
```
|
||||
|
||||
When finished, remove the sidecar:
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --remove-sidecar
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ConfigMap tuning
|
||||
|
||||
The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
|
||||
behavior. Override at deploy time via env vars, or patch after deploy:
|
||||
|
||||
| Key | Default | What it controls |
|-----|---------|-----------------|
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) |
| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
| `CLAWBENCH_TASKS` | *(empty; runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
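The defaults in the table imply a rough upper bound on eval duration. The sketch below is a back-of-the-envelope estimate, not something the harness computes: `worst_case_wall_seconds` is a hypothetical helper, and it assumes every run consumes its full budget and that lanes are filled in even rounds.

```python
import math


def worst_case_wall_seconds(tasks: int = 19, runs_per_task: int = 3,
                            lanes: int = 4, budget_s: int = 600) -> tuple[int, int]:
    """Return (total runs, pessimistic wall time in seconds)."""
    total_runs = tasks * runs_per_task
    # With `lanes` runs in flight at once, the suite needs this many rounds.
    rounds = math.ceil(total_runs / lanes)
    return total_runs, rounds * budget_s
```

With the default values this gives 57 runs and 9000 seconds (2.5 hours) as a ceiling; real runs usually finish well inside their budget, so treat the figure as a scheduling bound rather than a prediction.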
|
||||
|
||||
---
|
||||
|
||||
## MLflow integration
|
||||
|
||||
Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
|
||||
|
||||
**What gets logged:**
|
||||
- **Params**: model, provider, benchmark version, OpenClaw version, judge model
|
||||
- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
|
||||
reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
|
||||
- **Tags**: submission ID, timestamp, certified flag
|
||||
- **Artifacts**: full benchmark result JSON
|
||||
|
||||
---
|
||||
|
||||
## Building images
|
||||
|
||||
### ClawBench image
|
||||
|
||||
`quay.io/sallyom/clawbench:latest` is a public image.
|
||||
|
||||
For Kubernetes, use the lightweight sidecar image instead — it only includes
|
||||
the eval harness and MLflow client:
|
||||
|
||||
```bash
|
||||
docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
|
||||
|
||||
# For Kind clusters, load directly instead of pushing to a registry:
|
||||
kind load docker-image clawbench:latest --name openclaw
|
||||
|
||||
# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
|
||||
# Ensure you build for the right architecture, usually amd64 for non-local k8s
|
||||
```
|
||||
|
||||
Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
|
||||
|
||||
---
|
||||
|
||||
## Cleanup
|
||||
|
||||
```bash
|
||||
# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
|
||||
./scripts/k8s/deploy.sh --remove-sidecar
|
||||
|
||||
# Delete eval namespace (keeps MLflow running)
|
||||
./scripts/k8s/deploy.sh --teardown
|
||||
|
||||
# Delete the Kind cluster entirely
|
||||
kind delete cluster --name openclaw
|
||||
```
|
||||
@ -10,8 +10,7 @@ dependencies = [
    "pydantic>=2.7,<3",
    "pyyaml>=6.0,<7",
    "datasets>=3.0,<4",
    "gradio>=6.7.0,<7",
    "pillow>=12.2.0,<13",
    "gradio>=5.0,<6",
    "httpx>=0.27,<1",
    "numpy>=1.26,<3",
    "rich>=13.0,<14",
@ -19,8 +18,8 @@ dependencies = [
    # Runtime deps for the task completion verifier. The harness shells out
    # to `pytest -q` / `pytest-asyncio` inside per-task workspaces as the
    # execution check; the container must have them in PATH.
    "pytest>=9.0.3,<10",
    "pytest-asyncio>=1,<2",
    "pytest>=8.0,<9",
    "pytest-asyncio>=0.24,<1",
]

[project.optional-dependencies]
@ -28,14 +27,11 @@ dev = [
    # Kept as an alias for historical `pip install .[dev]` invocations.
    # pytest + pytest-asyncio are now in the base [dependencies] since the
    # benchmark itself runs pytest in task workspaces.
    "pytest>=9.0.3,<10",
    "pytest-asyncio>=1,<2",
    "pytest>=8.0,<9",
    "pytest-asyncio>=0.24,<1",
    "pre-commit>=4.0,<5",
    "ruff>=0.9,<1",
]
mlflow = [
    "mlflow>=2.10,<3",
]
hermes = [
    "hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main",
]
@ -1,198 +0,0 @@
#!/bin/bash
# Cherry-pick variant of container_sweep_single.sh: runs ONLY the tasks listed
# in $CHERRY_TASKS (comma-separated task IDs), with state-dir isolation.
#
# Required env vars:
#   SWEEP_LABEL    (e.g. opus47)
#   SWEEP_MODEL    (e.g. anthropic/claude-opus-4-7)
#   SWEEP_PROFILE  (absolute path in container)
#   SWEEP_LOGDIR   (default /data/drift_2026-04-20-cherry)
#   SWEEP_OUT_TAG  (default v2026-4-20-cherry)
#   CHERRY_TASKS   (comma-separated task IDs, e.g. "t2-ctx-pronoun-resolve,t3-fin-budget-monthly")

set -u

: "${SWEEP_LABEL:?SWEEP_LABEL required}"
: "${SWEEP_MODEL:?SWEEP_MODEL required}"
: "${SWEEP_PROFILE:?SWEEP_PROFILE required}"
: "${CHERRY_TASKS:?CHERRY_TASKS required (comma-separated task IDs)}"

: "${SWEEP_LOGDIR:=/data/drift_2026-04-20-cherry}"
: "${SWEEP_OUT_TAG:=v2026-4-20-cherry}"

cd /data

LOGDIR="$SWEEP_LOGDIR"
mkdir -p "$LOGDIR"

export OPENCLAW_GATEWAY_TOKEN="local-dev-token-for-testing"
export CLAWBENCH_RUN_CACHE_DIR="/data/run_cache"
mkdir -p "$CLAWBENCH_RUN_CACHE_DIR"
export NODE_OPTIONS="--max-old-space-size=4096"
# OpenClaw 4.22+ has slower agents.create / sessions.create on cold start
# (we observed 72s for opus-4-7). Bump RPC timeouts so the harness doesn't
# cancel mid-flight. Override defaults of 30s / 60s respectively.
export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-120}"
export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
export CLAWBENCH_PER_RUN_BUDGET_SECONDS="${CLAWBENCH_PER_RUN_BUDGET_SECONDS:-900}"
export HERMES_STEP_TIMEOUT_SECONDS="${HERMES_STEP_TIMEOUT_SECONDS:-180}"

# State-dir isolation (same as container_sweep_single.sh)
SRC_STATE="/home/node/.openclaw"
FRESH_STATE="/tmp/openclaw-state-${SWEEP_LABEL}-$$"
echo "[state-isolate] cloning config from $SRC_STATE to $FRESH_STATE"
mkdir -p "$FRESH_STATE"
[ -f "$SRC_STATE/openclaw.json" ] && cp "$SRC_STATE/openclaw.json" "$FRESH_STATE/openclaw.json"
[ -f "$SRC_STATE/exec-approvals.json" ] && cp "$SRC_STATE/exec-approvals.json" "$FRESH_STATE/exec-approvals.json"
for d in identity devices tasks subagents flows cron; do
  [ -d "$SRC_STATE/$d" ] && cp -r "$SRC_STATE/$d" "$FRESH_STATE/$d"
done
mkdir -p "$FRESH_STATE/agents" "$FRESH_STATE/workspace" "$FRESH_STATE/logs" "$FRESH_STATE/memory" "$FRESH_STATE/cache"
export OPENCLAW_STATE_DIR="$FRESH_STATE"
export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
echo "[state-isolate] OPENCLAW_STATE_DIR=$OPENCLAW_STATE_DIR"

python - <<'PY'
import json
import os
from pathlib import Path

cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
data = json.loads(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}

def set_nested(root, dotted, value):
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value

exec_host = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
if exec_host not in {"auto", "gateway", "sandbox", "node"}:
    raise SystemExit(f"invalid OPENCLAW_EXEC_HOST={exec_host!r}")

set_nested(data, "tools.exec.host", exec_host)
set_nested(data, "tools.exec.security", "full")
set_nested(data, "tools.exec.ask", "off")
set_nested(data, "approvals.exec.enabled", False)
cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")

approvals_path = cfg_path.with_name("exec-approvals.json")
approvals = {
    "version": 1,
    "socket": {
        "path": str(approvals_path.with_suffix(".sock")),
        "token": "container-cherry-eval-token",
    },
    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
}
approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
PY

# Map model to cache subdir (for archiving)
case "$SWEEP_MODEL" in
  anthropic/claude-opus-4-7) CACHE_SUB="anthropic_claude-opus-4-7" ;;
  anthropic/claude-opus-4-6) CACHE_SUB="anthropic_claude-opus-4-6" ;;
  anthropic/claude-sonnet-4-6) CACHE_SUB="anthropic_claude-sonnet-4-6" ;;
  openai/gpt-5.5) CACHE_SUB="openai_gpt-5.5" ;;
  openai/gpt-5.4) CACHE_SUB="openai_gpt-5.4" ;;
  google/gemini-3.1-pro-preview) CACHE_SUB="google_gemini-3.1-pro-preview" ;;
  openrouter/z-ai/glm-5.1) CACHE_SUB="openrouter_z-ai_glm-5.1" ;;
  openrouter/qwen/qwen3.6-plus) CACHE_SUB="openrouter_qwen_qwen3.6-plus" ;;
  openrouter/minimax/minimax-m2.7) CACHE_SUB="openrouter_minimax_minimax-m2.7" ;;
  openrouter/moonshotai/kimi-k2.6) CACHE_SUB="openrouter_moonshotai_kimi-k2.6" ;;
  openrouter/moonshotai/kimi-k2.5) CACHE_SUB="openrouter_moonshotai_kimi-k2.5" ;;
  openrouter/deepseek/deepseek-v4-pro) CACHE_SUB="openrouter_deepseek_deepseek-v4-pro" ;;
  deepseek/deepseek-v4-pro) CACHE_SUB="deepseek_deepseek-v4-pro" ;;
  deepseek/v4-pro) CACHE_SUB="deepseek_v4-pro" ;;
  *) CACHE_SUB="" ;;
esac

OUT="$LOGDIR/docker_${SWEEP_LABEL}_${SWEEP_OUT_TAG}.json"
LOG="$LOGDIR/docker_${SWEEP_LABEL}_${SWEEP_OUT_TAG}.log"
GWLOG="$LOGDIR/gateway_${SWEEP_LABEL}.log"

echo "===== CHERRY-PICK SWEEP $(date '+%Y-%m-%d %H:%M:%S') ====="
echo "label: $SWEEP_LABEL"
echo "model: $SWEEP_MODEL"
echo "tasks: $CHERRY_TASKS"
echo "out: $OUT"

# Force-clear this model's run_cache (including fixed-task slots — so they
# actually re-run against the new image instead of hitting old cache).
if [ -n "$CACHE_SUB" ] && [ -d "$CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB" ]; then
  echo "clearing cache: $CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB"
  rm -rf "$CLAWBENCH_RUN_CACHE_DIR/$CACHE_SUB"
fi
[ -f "$OUT" ] && rm -f "$OUT"

# Start gateway with bumped heap
echo "Starting gateway on :18789 (heap=4GB) ..."
openclaw gateway --port 18789 > "$GWLOG" 2>&1 &
GATEWAY_PID=$!
ready=0
for i in $(seq 1 120); do
  if curl -sf -H "Authorization: Bearer $OPENCLAW_GATEWAY_TOKEN" http://127.0.0.1:18789/ready > /dev/null 2>&1; then
    echo "Gateway ready after ${i}s"
    ready=1
    break
  fi
  sleep 1
done
if [ $ready -ne 1 ]; then
  echo "ERROR: gateway failed to become ready within 120s"
  tail -30 "$GWLOG"
  exit 1
fi

# Build -t args from comma-separated list
TASK_ARGS=()
IFS=',' read -ra TASK_ARR <<< "$CHERRY_TASKS"
for t in "${TASK_ARR[@]}"; do
  TASK_ARGS+=("-t" "$t")
done

echo "===== $(date '+%H:%M:%S') running clawbench with tasks: ${TASK_ARR[*]} ====="
# NOTE: --profile intentionally OMITTED. The legacy frontier_*.yaml profile
# format is incompatible with OpenClaw 4.22+ (loads n_tools_total=0,
# starves the agent of tools, all runs fail with environment_unavailable
# or timeout). Running with the default openclaw tool stack — same for
# all models, so the comparison stays apples-to-apples.
PROFILE_ARG=""
if [ -n "${USE_PROFILE:-}" ] && [ -f "$SWEEP_PROFILE" ]; then
  PROFILE_ARG="--profile $SWEEP_PROFILE"
fi
clawbench run \
  --model "$SWEEP_MODEL" \
  --runs 3 \
  --concurrency "${CLAWBENCH_CONCURRENCY:-1}" \
  $PROFILE_ARG \
  --judge-model "anthropic/claude-sonnet-4-6" \
  "${TASK_ARGS[@]}" \
  -o "$OUT" \
  > "$LOG" 2>&1
status=$?

if [ $status -eq 0 ]; then
  echo "===== $(date '+%H:%M:%S') done $SWEEP_LABEL (exit 0) ====="
else
  echo "===== $(date '+%H:%M:%S') FAILED $SWEEP_LABEL (exit $status) ====="
  tail -20 "$LOG"
fi

# Archive cache to v2026-4-20-cherry tag
# shellcheck disable=SC1091
source "$(dirname "$0")/_archive_cache.sh" 2>/dev/null && archive_run_cache || echo "[archive] helper missing"

kill $GATEWAY_PID 2>/dev/null
wait $GATEWAY_PID 2>/dev/null

# Clean up isolated state dir
[ -n "${FRESH_STATE:-}" ] && [ -d "$FRESH_STATE" ] && rm -rf "$FRESH_STATE"

exit $status
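The `set_nested` helper embedded in the script above writes a value at a dotted path, creating intermediate dictionaries as it goes and replacing any non-dict value it finds on the way. A standalone sketch of that behavior follows; the starting `config` is made up for illustration and is not taken from a real `openclaw.json`.

```python
import json


def set_nested(root: dict, dotted: str, value) -> None:
    """Set root[a][b][c] = value for dotted == "a.b.c", creating dicts on the way."""
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            # Missing or non-dict entries are replaced with a fresh dict.
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value


# Illustrative input: "tools" holds a non-dict value that gets overwritten.
config = {"tools": "legacy", "gateway": {"port": 18789}}
set_nested(config, "tools.exec.host", "gateway")
set_nested(config, "approvals.exec.enabled", False)
print(json.dumps(config, indent=2))
```

Because non-dict intermediates are silently replaced, a malformed source config cannot make the override step fail, at the cost of discarding whatever value was there.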
@ -1,231 +0,0 @@
#!/bin/bash
# Run one OpenClaw model/profile through the HF-style isolated lane worker.
set -Eeuo pipefail

: "${SWEEP_MODEL:?SWEEP_MODEL required}"
: "${SWEEP_LABEL:?SWEEP_LABEL required}"
: "${SWEEP_OUT_TAG:=lane-container}"
: "${SWEEP_LANES:=3}"
: "${SWEEP_RUNS:=1}"
: "${SWEEP_LOGDIR:=/data/results}"
: "${CLAWBENCH_PER_RUN_BUDGET_SECONDS:=900}"
: "${CLAWBENCH_PER_TURN_TIMEOUT_SECONDS:=300}"
: "${OPENCLAW_EXEC_HOST:=gateway}"

cd /home/node/app
export CLAWBENCH_LOCAL_QUEUE_DIR="${CLAWBENCH_LOCAL_QUEUE_DIR:-/data/queue/$SWEEP_LABEL}"
mkdir -p "$SWEEP_LOGDIR" /data/results "$CLAWBENCH_LOCAL_QUEUE_DIR" /data/run_cache /data/lane_runtime

export HF_TOKEN=""
export OPENCLAW_GATEWAY_TOKEN="${OPENCLAW_GATEWAY_TOKEN:-local-dev-token-for-testing}"
export OPENCLAW_SKIP_GMAIL_WATCHER=1
export OPENCLAW_SKIP_CANVAS_HOST=1
export OPENCLAW_NO_RESPAWN=1
export CLAWBENCH_DISABLE_GATEWAY_DEVICE_IDENTITY=1
export CLAWBENCH_PER_RUN_BUDGET_SECONDS
export CLAWBENCH_PER_TURN_TIMEOUT_SECONDS
export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-180}"
export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
export CLAWBENCH_GATEWAY_HEALTH_TIMEOUT_SECONDS="${CLAWBENCH_GATEWAY_HEALTH_TIMEOUT_SECONDS:-240}"
export CLAWBENCH_LANE_STARTUP_STAGGER_SECONDS="${CLAWBENCH_LANE_STARTUP_STAGGER_SECONDS:-90}"
export CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS="${CLAWBENCH_GATEWAY_READY_MARKER_GRACE_SECONDS:-90}"
export CLAWBENCH_KEEP_PARALLEL_LANE_ROOT="${CLAWBENCH_KEEP_PARALLEL_LANE_ROOT:-0}"
export CLAWBENCH_PARALLEL_LANE_ROOT="/data/lane_runtime/$SWEEP_LABEL"
export CLAWBENCH_TOOL_PROFILE_NAME="${CLAWBENCH_TOOL_PROFILE_NAME:-$SWEEP_LABEL}"
export NODE_OPTIONS="${NODE_OPTIONS:-"--max-old-space-size=4096"}"
if command -v npm >/dev/null 2>&1; then
  export NODE_PATH="${NODE_PATH:-$(npm root -g 2>/dev/null || true)}"
fi

SRC_STATE="${OPENCLAW_CONFIG_SOURCE:-/config/openclaw}"
if [ ! -d "$SRC_STATE" ]; then
  SRC_STATE="/home/node/.openclaw"
fi

safe_model="${SWEEP_MODEL//\//_}"
safe_model="${safe_model//:/_}"
OUT="$SWEEP_LOGDIR/${SWEEP_LABEL}_openclaw_${safe_model}_${SWEEP_OUT_TAG}.json"
LOG="$SWEEP_LOGDIR/${SWEEP_LABEL}_openclaw_${safe_model}_${SWEEP_OUT_TAG}.log"
export SWEEP_OUTPUT_PATH="$OUT"

FRESH_HOME="/tmp/openclaw-home-${SWEEP_LABEL}-$$"
FRESH_STATE="$FRESH_HOME/.openclaw"
rm -rf "$FRESH_HOME" "$CLAWBENCH_PARALLEL_LANE_ROOT"
mkdir -p "$FRESH_STATE" "$FRESH_HOME/.config"
if [ -f "$SRC_STATE/openclaw.json" ]; then
  cp "$SRC_STATE/openclaw.json" "$FRESH_STATE/openclaw.json"
fi
if [ -d "$SRC_STATE/plugins" ]; then
  mkdir -p "$FRESH_STATE/plugins"
  cp -R "$SRC_STATE/plugins/." "$FRESH_STATE/plugins/" 2>/dev/null || true
fi
mkdir -p \
  "$FRESH_STATE/agents" \
  "$FRESH_STATE/workspace" \
  "$FRESH_STATE/logs" \
  "$FRESH_STATE/memory" \
  "$FRESH_STATE/cache" \
  "$FRESH_STATE/identity" \
  "$FRESH_STATE/devices" \
  "$FRESH_STATE/tasks" \
  "$FRESH_STATE/subagents" \
  "$FRESH_STATE/flows" \
  "$FRESH_STATE/cron"

export HOME="$FRESH_HOME"
export OPENCLAW_HOME="$FRESH_HOME"
export OPENCLAW_STATE_DIR="$FRESH_STATE"
export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
export XDG_CONFIG_HOME="$FRESH_HOME/.config"

python - <<'PY'
import json
import os
from pathlib import Path

cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
if not cfg_path.exists():
    raise SystemExit("missing openclaw.json")
data = json.loads(cfg_path.read_text(encoding="utf-8"))

def set_nested(root, dotted, value):
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value

agents = data.setdefault("agents", {})
if isinstance(agents, dict):
    agents["list"] = []

channels = data.get("channels")
if isinstance(channels, dict):
    for channel in channels.values():
        if isinstance(channel, dict):
            channel["enabled"] = False
            exec_approvals = channel.get("execApprovals")
            if not isinstance(exec_approvals, dict):
                exec_approvals = {}
                channel["execApprovals"] = exec_approvals
            exec_approvals["enabled"] = False

plugins = data.setdefault("plugins", {})
stale = {"marxbiotech-git-tools", "lab"}
allow = plugins.get("allow")
if isinstance(allow, list):
    plugins["allow"] = [item for item in allow if item not in stale]
entries = plugins.get("entries")
if isinstance(entries, dict):
    for item in stale:
        entries.pop(item, None)

set_nested(data, "browser.headless", True)
set_nested(data, "browser.noSandbox", True)
set_nested(data, "gateway.reload.mode", "off")
set_nested(data, "agents.defaults.skipBootstrap", True)
set_nested(data, "agents.defaults.sandbox.mode", "off")
set_nested(data, "agents.defaults.model.primary", os.environ["SWEEP_MODEL"])
set_nested(data, "agents.defaults.subagents.model.primary", os.environ["SWEEP_MODEL"])
set_nested(
    data,
    "agents.defaults.systemPromptOverride",
    "You are running an OpenClaw benchmark task. Complete the user's request in the current "
    "workspace using the available tools when needed. For file, code, browser, shell, or memory "
    "tasks, make the requested changes directly and verify them when practical. Do not ask "
    "follow-up questions during the benchmark. Keep any final reply brief.",
)
set_nested(data, "tools.exec.host", os.environ.get("OPENCLAW_EXEC_HOST", "gateway"))
set_nested(data, "tools.exec.security", "full")
set_nested(data, "tools.exec.ask", "off")
set_nested(data, "approvals.exec.enabled", False)

models = data.setdefault("agents", {}).setdefault("defaults", {}).setdefault("models", {})
model_entry = models.setdefault(os.environ["SWEEP_MODEL"], {})
params = model_entry.setdefault("params", {})
params["fastMode"] = True
if os.environ["SWEEP_MODEL"].startswith("openai/"):
    params["transport"] = "sse"
    params["openaiWsWarmup"] = False

cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")

approvals_path = cfg_path.with_name("exec-approvals.json")
approvals = {
    "version": 1,
    "socket": {
        "path": str(approvals_path.with_suffix(".sock")),
        "token": "container-lane-eval-token",
    },
    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
}
approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
PY

echo "===== CONTAINER LANE EVAL START $(date '+%Y-%m-%d %H:%M:%S') ====="
echo "label: $SWEEP_LABEL"
echo "model: $SWEEP_MODEL"
echo "runs: $SWEEP_RUNS"
echo "lanes: $SWEEP_LANES"
echo "tasks: ${SWEEP_TASKS:-${CHERRY_TASKS:-all}}"
echo "out: $OUT"
echo "log: $LOG"
echo "home: $HOME"
echo "state: $OPENCLAW_STATE_DIR"
openclaw --version 2>/dev/null || true

set +e
python - <<'PY' > "$LOG" 2>&1
import asyncio
import json
import logging
import os
import shutil
from pathlib import Path

from clawbench.queue import JobQueue, JobStatus, SubmissionRequest
from clawbench.worker import EvalWorker, RESULTS_DIR

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(name)s: %(message)s")

async def main() -> int:
    queue = JobQueue()
    queue._jobs.clear()
    queue._save_local()
    task_ids_raw = os.environ.get("SWEEP_TASKS") or os.environ.get("CHERRY_TASKS") or ""
    task_ids = [item.strip() for item in task_ids_raw.split(",") if item.strip()]
    request = SubmissionRequest(
        model=os.environ["SWEEP_MODEL"],
        runs_per_task=int(os.environ["SWEEP_RUNS"]),
        max_parallel_lanes=int(os.environ["SWEEP_LANES"]),
        task_ids=task_ids,
        prompt_variant=os.environ.get("SWEEP_PROMPT_VARIANT", "clear"),
        judge_model=os.environ.get("CLAWBENCH_JUDGE_MODEL", ""),
        notes=os.environ.get("SWEEP_LABEL", ""),
    )
    job = await queue.submit(request)
    worker = EvalWorker(queue)
    await worker._process_job(job)
    final = await queue.get_status(job.job_id)
    print(json.dumps(final.model_dump() if final else {}, indent=2), flush=True)
    if final is None or final.status != JobStatus.FINISHED or not final.result_id:
        return 1
    result_path = RESULTS_DIR / f"{final.result_id}.json"
    output_path = Path(os.environ["SWEEP_OUTPUT_PATH"])
    output_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(result_path, output_path)
    return 0

raise SystemExit(asyncio.run(main()))
PY
status=$?
set -e

echo "===== lane eval exit=$status $(date '+%Y-%m-%d %H:%M:%S') ====="
tail -120 "$LOG" 2>/dev/null || true
exit "$status"
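The lane script above derives its output file name by replacing `/` and `:` in the model ID with `_`. A small Python sketch of the same naming rule, useful when post-processing results outside the container; the function name `sweep_output_path` is hypothetical, and the `:free` suffix in the second call is an invented model ID used only to exercise the colon case.

```python
from pathlib import Path


def sweep_output_path(logdir: str, label: str, model: str, out_tag: str) -> Path:
    """Mirror the script's OUT naming: '/' and ':' in the model become '_'."""
    safe_model = model.replace("/", "_").replace(":", "_")
    return Path(logdir) / f"{label}_openclaw_{safe_model}_{out_tag}.json"


# Defaults from the script: SWEEP_LOGDIR=/data/results, SWEEP_OUT_TAG=lane-container.
print(sweep_output_path("/data/results", "opus47", "anthropic/claude-opus-4-7", "lane-container").name)
print(sweep_output_path("/data/results", "glm", "openrouter/z-ai/glm-5.1:free", "lane-container").name)
```

The matching log file uses the same stem with a `.log` extension.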
@ -43,13 +43,6 @@ mkdir -p "$CLAWBENCH_RUN_CACHE_DIR"
# OOM fix: give the gateway Node process a 4GB old-space ceiling instead of the default ~2GB.
# Scoped via env so we don't stomp on other Node processes (clawbench itself is python).
export NODE_OPTIONS="--max-old-space-size=4096"
# OpenClaw 4.22+ has slower agents.create / sessions.create on cold start
# (we observed 72s for opus-4-7). Bump RPC timeouts so the harness doesn't
# cancel mid-flight. Override defaults of 30s / 60s respectively.
export CLAWBENCH_CONNECT_TIMEOUT="${CLAWBENCH_CONNECT_TIMEOUT:-120}"
export CLAWBENCH_REQUEST_TIMEOUT="${CLAWBENCH_REQUEST_TIMEOUT:-300}"
export CLAWBENCH_PER_RUN_BUDGET_SECONDS="${CLAWBENCH_PER_RUN_BUDGET_SECONDS:-900}"
export HERMES_STEP_TIMEOUT_SECONDS="${HERMES_STEP_TIMEOUT_SECONDS:-180}"

# State-dir isolation: the shared /home/node/.openclaw mount accumulates cruft
# across sweeps (agents/, workspace/, logs/, memory/, stale openclaw.json.*.tmp)
@ -80,68 +73,23 @@ done
# Ensure runtime dirs exist but are empty
mkdir -p "$FRESH_STATE/agents" "$FRESH_STATE/workspace" "$FRESH_STATE/logs" "$FRESH_STATE/memory" "$FRESH_STATE/cache"
export OPENCLAW_STATE_DIR="$FRESH_STATE"
export OPENCLAW_CONFIG_PATH="$FRESH_STATE/openclaw.json"
echo "[state-isolate] OPENCLAW_STATE_DIR=$OPENCLAW_STATE_DIR"
du -sh "$FRESH_STATE" 2>/dev/null | sed 's/^/[state-isolate] size: /'

python - <<'PY'
import json
import os
from pathlib import Path

cfg_path = Path(os.environ["OPENCLAW_CONFIG_PATH"])
data = json.loads(cfg_path.read_text(encoding="utf-8")) if cfg_path.exists() else {}

def set_nested(root, dotted, value):
    cursor = root
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = cursor.get(part)
        if not isinstance(child, dict):
            child = {}
            cursor[part] = child
        cursor = child
    cursor[parts[-1]] = value

exec_host = os.environ.get("OPENCLAW_EXEC_HOST", "gateway").strip().lower()
if exec_host not in {"auto", "gateway", "sandbox", "node"}:
    raise SystemExit(f"invalid OPENCLAW_EXEC_HOST={exec_host!r}")

set_nested(data, "tools.exec.host", exec_host)
set_nested(data, "tools.exec.security", "full")
set_nested(data, "tools.exec.ask", "off")
set_nested(data, "approvals.exec.enabled", False)
cfg_path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")

approvals_path = cfg_path.with_name("exec-approvals.json")
approvals = {
    "version": 1,
    "socket": {
        "path": str(approvals_path.with_suffix(".sock")),
        "token": "container-single-eval-token",
    },
    "defaults": {"security": "full", "ask": "off", "askFallback": "full"},
    "agents": {"*": {"security": "full", "ask": "off", "askFallback": "full"}},
}
approvals_path.write_text(json.dumps(approvals, indent=2) + "\n", encoding="utf-8")
PY

# Map label -> cache subdir (matches what clawbench writes)
case "$SWEEP_MODEL" in
  anthropic/claude-opus-4-7) CACHE_SUB="anthropic_claude-opus-4-7" ;;
  anthropic/claude-sonnet-4-7) CACHE_SUB="anthropic_claude-sonnet-4-7" ;;
  anthropic/claude-opus-4-6) CACHE_SUB="anthropic_claude-opus-4-6" ;;
  anthropic/claude-sonnet-4-6) CACHE_SUB="anthropic_claude-sonnet-4-6" ;;
  openai/gpt-5.5) CACHE_SUB="openai_gpt-5.5" ;;
  openai/gpt-5.4) CACHE_SUB="openai_gpt-5.4" ;;
  openai/gpt-5.2) CACHE_SUB="openai_gpt-5.2" ;;
  google/gemini-3.1-pro-preview) CACHE_SUB="google_gemini-3.1-pro-preview" ;;
  openrouter/z-ai/glm-5.1) CACHE_SUB="openrouter_z-ai_glm-5.1" ;;
  openrouter/qwen/qwen3.6-plus) CACHE_SUB="openrouter_qwen_qwen3.6-plus" ;;
  openrouter/minimax/minimax-m2.7) CACHE_SUB="openrouter_minimax_minimax-m2.7" ;;
  openrouter/moonshotai/kimi-k2.6) CACHE_SUB="openrouter_moonshotai_kimi-k2.6" ;;
  openrouter/moonshotai/kimi-k2.5) CACHE_SUB="openrouter_moonshotai_kimi-k2.5" ;;
  deepseek/v4-pro) CACHE_SUB="deepseek_v4-pro" ;;
  # kimi-k2.6 is not yet supported in the openclaw version under test — skip.
  *) CACHE_SUB="" ;;
esac

@ -191,19 +139,11 @@ if [ $ready -ne 1 ]; then
fi

echo "===== $(date '+%H:%M:%S') starting $SWEEP_LABEL ($SWEEP_MODEL) ====="
# NOTE: --profile intentionally OMITTED unless USE_PROFILE=1 is set. The
# legacy frontier_*.yaml profile format is incompatible with OpenClaw
# 4.22+ (loads n_tools_total=0). Running with the default openclaw tool
# stack — identical across all models, so comparisons stay valid.
PROFILE_ARG=""
if [ -n "${USE_PROFILE:-}" ] && [ -f "$SWEEP_PROFILE" ]; then
  PROFILE_ARG="--profile $SWEEP_PROFILE"
fi
clawbench run \
  --model "$SWEEP_MODEL" \
  --runs 3 \
  --concurrency "${CLAWBENCH_CONCURRENCY:-1}" \
  $PROFILE_ARG \
  --concurrency 4 \
  --profile "$SWEEP_PROFILE" \
  --judge-model "anthropic/claude-sonnet-4-6" \
  -o "$OUT" \
  > "$LOG" 2>&1

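The sweep scripts gate the benchmark on a readiness loop: probe the gateway once per second, stop at the first success, and give up after 120 attempts. A minimal Python sketch of that pattern, with the probe injected so it can be exercised without a running gateway; `wait_until_ready` and its parameters are illustrative names, not part of ClawBench.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 120, delay: float = 1.0) -> int:
    """Call probe() up to `attempts` times; return the 1-based attempt that succeeded, or 0."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        time.sleep(delay)
    return 0


# Stand-in probe that succeeds on the third call; a real probe would issue the
# authenticated GET against the gateway's /ready endpoint.
calls = iter([False, False, True])
print(wait_until_ready(lambda: next(calls), attempts=5, delay=0.0))
```

Returning `0` on timeout lets the caller decide how to fail; the scripts print the tail of the gateway log and exit with status 1.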
@ -1,33 +0,0 @@
# Lightweight ClawBench image for Kubernetes sidecar use.
# Does NOT include the full OpenClaw server or Chromium — the gateway runs
# in a separate container. Node.js is copied from the OpenClaw image for
# the device-identity handshake required by the gateway protocol.
FROM ghcr.io/openclaw/openclaw:latest AS openclaw

FROM python:3.12-slim

COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node

RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
COPY clawbench/ clawbench/
COPY tasks-public/ tasks-public/
COPY tasks-domain/ tasks-domain/
COPY profiles/ profiles/
COPY baselines/ baselines/
COPY scripts/ scripts/

RUN pip install --no-cache-dir ".[mlflow]"

RUN mkdir -p /results && chmod 777 /results

RUN useradd -m -d /home/node clawbench
USER clawbench
ENV HOME=/home/node

ENTRYPOINT ["clawbench"]
@ -1,486 +0,0 @@
#!/usr/bin/env bash
# Deploy ClawBench evals on Kubernetes (works on OpenShift too).
#
# 0-to-hero pipeline:
#   Step 0: Create a cluster (see --help for Kind instructions)
#   Step 1: Deploy OpenClaw gateway (optional — bring your own)
#   Step 2: Deploy MLflow tracking server (optional — bring your own)
#   Step 3: Run evals via sidecar (add / remove)
#
# Usage:
#   ./scripts/k8s/deploy.sh                    # Full deploy: OpenClaw + MLflow + eval
#   ./scripts/k8s/deploy.sh --openclaw-only    # Step 1: deploy OpenClaw gateway
#   ./scripts/k8s/deploy.sh --mlflow-only      # Step 2: deploy MLflow
#   ./scripts/k8s/deploy.sh --add-sidecar      # Step 3: add eval sidecar (starts eval)
#   ./scripts/k8s/deploy.sh --remove-sidecar   # Step 3: remove eval sidecar
#   ./scripts/k8s/deploy.sh --logs             # Tail clawbench sidecar logs
#   ./scripts/k8s/deploy.sh --teardown         # Delete eval namespace (keeps MLflow)
#
# Environment (required):
#   CLAWBENCH_NAMESPACE      Namespace for OpenClaw + eval
#   OPENAI_API_KEY           Model provider API key (or another provider key)
#
# Environment (optional):
#   CLAWBENCH_IMAGE          Clawbench image (default: quay.io/sallyom/clawbench:latest)
#   OPENCLAW_IMAGE           OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
#   OPENCLAW_GATEWAY_TOKEN   Existing gateway token (generated if unset)
#   CLAWBENCH_MODEL          Model to eval (default: openai/gpt-5.5)
#   MLFLOW_NAMESPACE         MLflow namespace (default: mlflow)
#   MLFLOW_TRACKING_URI      External MLflow URI (skips MLflow deploy if set)
#   MLFLOW_EXPERIMENT_ID     MLflow experiment ID
#   MLFLOW_EXPERIMENT_NAME   MLflow experiment name
#   MLFLOW_IMAGE             MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
#   ANTHROPIC_API_KEY        Anthropic key (added to secret if set)
#   OPENROUTER_API_KEY       OpenRouter key (added to secret if set)
#   GEMINI_API_KEY           Gemini key (added to secret if set)
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
NS="${CLAWBENCH_NAMESPACE:-}"
MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}"
CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}"
OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}"
MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}"

# ---------------------------------------------------------------------------
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
  cat <<'HELP'
ClawBench Kubernetes Deployment
===============================

0-to-hero pipeline for running ClawBench evals on Kubernetes.

Step 0: Create a cluster
  For local testing with Kind, see:
  https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind

Step 1: Deploy OpenClaw gateway (optional — skip if you have one)
Step 2: Deploy MLflow tracking server (optional — skip if you have one)
Step 3: Run evals via sidecar (add/remove to OpenClaw deployment)

Usage:
  ./scripts/k8s/deploy.sh                    Full deploy (steps 1+2+3)
  ./scripts/k8s/deploy.sh --openclaw-only    Step 1: OpenClaw only
  ./scripts/k8s/deploy.sh --mlflow-only      Step 2: MLflow only
  ./scripts/k8s/deploy.sh --add-sidecar      Step 3: add eval sidecar (starts eval)
  ./scripts/k8s/deploy.sh --remove-sidecar   Step 3: remove eval sidecar
  ./scripts/k8s/deploy.sh --logs             Tail clawbench sidecar logs
  ./scripts/k8s/deploy.sh --teardown         Delete eval namespace (keeps MLflow)

Required environment:
  CLAWBENCH_NAMESPACE      Namespace for OpenClaw + eval
  OPENAI_API_KEY           Model provider API key (or ANTHROPIC_API_KEY, etc.)

Optional environment:
  CLAWBENCH_IMAGE          Clawbench image (default: quay.io/sallyom/clawbench:latest)
  OPENCLAW_IMAGE           OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
  OPENCLAW_GATEWAY_TOKEN   Existing gateway token (generated if unset)
  CLAWBENCH_MODEL          Model to eval (default: openai/gpt-5.5)
  MLFLOW_NAMESPACE         MLflow namespace (default: mlflow)
  MLFLOW_TRACKING_URI      External MLflow URI (skips MLflow deploy)
  MLFLOW_EXPERIMENT_ID     MLflow experiment ID
  MLFLOW_EXPERIMENT_NAME   MLflow experiment name
  MLFLOW_IMAGE             MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
  ANTHROPIC_API_KEY        Anthropic key (added to secret if set)
  OPENROUTER_API_KEY       OpenRouter key (added to secret if set)
  GEMINI_API_KEY           Gemini key (added to secret if set)

Works on Kubernetes and OpenShift.
HELP
  exit 0
fi

command -v kubectl &>/dev/null || { echo "Missing: kubectl" >&2; exit 1; }

if [[ -z "$NS" ]]; then
  echo "CLAWBENCH_NAMESPACE is required." >&2
  echo " export CLAWBENCH_NAMESPACE=clawbench-eval" >&2
  exit 1
fi

MODE="full"
while [[ $# -gt 0 ]]; do
  case "$1" in
    --openclaw-only) MODE="openclaw-only" ;;
    --mlflow-only) MODE="mlflow-only" ;;
    --add-sidecar) MODE="add-sidecar" ;;
    --remove-sidecar) MODE="remove-sidecar" ;;
    --logs) MODE="logs" ;;
    --teardown) MODE="teardown" ;;
    *) echo "Unknown option: $1" >&2; exit 1 ;;
  esac
  shift
done

kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; }

# ---------------------------------------------------------------------------
|
||||
# --logs
|
||||
# ---------------------------------------------------------------------------
|
||||
if [[ "$MODE" == "logs" ]]; then
  kubectl logs deploy/openclaw -c clawbench -n "$NS" -f
  exit 0
fi

# ---------------------------------------------------------------------------
# --teardown
# ---------------------------------------------------------------------------
if [[ "$MODE" == "teardown" ]]; then
  echo "Deleting namespace '$NS'..."
  kubectl delete namespace "$NS" --ignore-not-found
  echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted."
  exit 0
fi

# ---------------------------------------------------------------------------
# --remove-sidecar
# ---------------------------------------------------------------------------
if [[ "$MODE" == "remove-sidecar" ]]; then
  echo "Removing clawbench sidecar from openclaw in namespace '$NS'..."
  INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))")
  if [[ "$INDEX" == "-1" ]]; then
    echo "No clawbench sidecar found."
  else
    kubectl patch deploy/openclaw -n "$NS" --type=json \
      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]"
    echo "Sidecar removed."
  fi
  exit 0
fi

# ---------------------------------------------------------------------------
# Create namespace + secret
# ---------------------------------------------------------------------------
ensure_namespace_and_secret() {
  if ! kubectl get namespace "$NS" &>/dev/null; then
    echo "Creating namespace '$NS'..."
    kubectl create namespace "$NS"
  fi

  if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then
    echo "Creating clawbench-secrets..."
    if [[ -n "${OPENCLAW_GATEWAY_TOKEN:-}" ]]; then
      GATEWAY_TOKEN="$OPENCLAW_GATEWAY_TOKEN"
      GATEWAY_TOKEN_SOURCE="from OPENCLAW_GATEWAY_TOKEN"
    else
      GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())")
      GATEWAY_TOKEN_SOURCE="generated"
    fi

    SECRET_ARGS=(
      --from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN"
    )
    [[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY")
    [[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY")
    [[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY")
    [[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY")

    if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then
      echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2
    fi

    kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}"
    echo " Gateway token: $GATEWAY_TOKEN_SOURCE"
    [[ -n "${OPENAI_API_KEY:-}" ]] && echo " OPENAI_API_KEY: set"
    [[ -n "${ANTHROPIC_API_KEY:-}" ]] && echo " ANTHROPIC_API_KEY: set"
    [[ -n "${OPENROUTER_API_KEY:-}" ]] && echo " OPENROUTER_API_KEY: set"
    [[ -n "${GEMINI_API_KEY:-}" ]] && echo " GEMINI_API_KEY: set"
  else
    echo "Secret clawbench-secrets already exists in '$NS'."
  fi
  return 0
}

# ---------------------------------------------------------------------------
# Step 1: Deploy OpenClaw
# ---------------------------------------------------------------------------
deploy_openclaw() {
  echo ""
  echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..."

  kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS"

  # Patch gateway config with custom OpenAI-compatible base URL
  if [[ -n "${OPENAI_API_BASE:-}" ]]; then
    echo " Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE"
    EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}')
    PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c "
import json, sys, os
cfg = json.load(sys.stdin)
openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {})
openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE']
openai_cfg.setdefault('models', [])
json.dump(cfg, sys.stdout, indent=2)
")
    kubectl create configmap openclaw-config -n "$NS" \
      --from-literal="openclaw.json=$PATCHED_JSON" \
      --dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null
  fi

  kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS"
  kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS"

  if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then
    kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
    kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS"
  else
    kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
  fi

  echo "Waiting for OpenClaw rollout..."
  kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \
    echo " (rollout still in progress)"
  echo "OpenClaw deployed."
}

# ---------------------------------------------------------------------------
# Step 2: Deploy MLflow
# ---------------------------------------------------------------------------
deploy_mlflow() {
  if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then
    echo ""
    echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)"
    return
  fi

  echo ""
  echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..."

  if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then
    kubectl create namespace "$MLFLOW_NS"
  fi

  kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS"
  kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS"

  if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then
    kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
    kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS"
  else
    kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
  fi

  echo "Waiting for MLflow rollout..."
  kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \
    echo " (rollout still in progress)"

  MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000"
  echo "MLflow deployed: $MLFLOW_TRACKING_URI"
}

# ---------------------------------------------------------------------------
# Step 3: Add clawbench sidecar (starts eval)
# ---------------------------------------------------------------------------
add_sidecar() {
  echo ""
  echo "Step 3: Adding clawbench eval sidecar..."

  echo "Applying clawbench ConfigMap..."
  kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null

  if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then
    kubectl patch configmap clawbench-config -n "$NS" \
      --type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null
    echo " Model: $CLAWBENCH_MODEL"
  fi

  if [[ -n "${OPENAI_API_BASE:-}" ]]; then
    kubectl patch configmap clawbench-config -n "$NS" \
      --type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null
    echo " OpenAI API base: $OPENAI_API_BASE"
  fi

  # Patch MLflow settings into ConfigMap
  PATCH_DATA=""
  MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}"
  PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\""
  if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then
    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\""
  fi
  if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then
    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\""
  fi
  kubectl patch configmap clawbench-config -n "$NS" \
    --type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null
  echo " MLflow URI: $MLFLOW_URI"
  [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo " MLflow experiment ID: $MLFLOW_EXPERIMENT_ID"
  [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo " MLflow experiment name: $MLFLOW_EXPERIMENT_NAME"

  # Check if sidecar already exists
  HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \
    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')")

  if [[ "$HAS_SIDECAR" == "yes" ]]; then
    echo "Removing existing clawbench sidecar..."
    INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
      | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))")
    kubectl patch deploy/openclaw -n "$NS" --type=json \
      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null
  fi

  # Find the OpenClaw home volume, and capture existing volumes so add-sidecar
  # also works with bring-your-own deployments that lack this repo's PVC layout.
  VOLUME_INFO=$(kubectl get deploy/openclaw -n "$NS" -o json \
    | python3 -c "
import json, sys
spec = json.load(sys.stdin)['spec']['template']['spec']
volume_names = [v.get('name') for v in spec.get('volumes', []) if v.get('name')]
home_volume = 'openclaw-home'
for c in spec['containers']:
    if c['name'] == 'gateway':
        for vm in c.get('volumeMounts', []):
            if vm['mountPath'] == '/home/node/.openclaw':
                home_volume = vm['name']
                break
print(json.dumps({
    'home_volume': home_volume,
    'volumes_present': 'volumes' in spec,
    'volume_names': volume_names,
}))
")

  echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..."

  PATCH=$(VOLUME_INFO="$VOLUME_INFO" CLAWBENCH_IMG="$CLAWBENCH_IMG" python3 - <<'PY'
import json
import os

info = json.loads(os.environ["VOLUME_INFO"])
home_volume = info["home_volume"]

command = r"""echo "Waiting for gateway on localhost:18789..."
for i in $(seq 1 90); do
  python3 -c "import socket; s=socket.create_connection((\"127.0.0.1\",18789),2); s.close()" 2>/dev/null && echo "Gateway ready" && break
  sleep 2
done

if [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
  echo "Checking MLflow at ${MLFLOW_TRACKING_URI}..."
  python3 -c "import httpx,os; r=httpx.get(os.environ[\"MLFLOW_TRACKING_URI\"]+\"/health\"); print(\"MLflow OK:\",r.status_code)" 2>&1 || echo "MLflow pre-check failed (will retry at log time)"
fi

echo "Starting eval..."
clawbench run \
  --model "${CLAWBENCH_MODEL}" \
  --gateway-token "${OPENCLAW_GATEWAY_TOKEN}" \
  --runs "${CLAWBENCH_RUNS}" \
  --concurrency "${CLAWBENCH_CONCURRENCY}" \
  ${CLAWBENCH_JUDGE_MODEL:+--judge-model "${CLAWBENCH_JUDGE_MODEL}"} \
  $([ -n "${CLAWBENCH_TASKS:-}" ] && for t in ${CLAWBENCH_TASKS}; do printf -- "-t %s " "$t"; done) \
  -o /results/benchmark.json
RC=$?
if [ $RC -eq 0 ] && [ -n "${MLFLOW_TRACKING_URI:-}" ]; then
  python scripts/log_to_mlflow.py /results/benchmark.json
fi
echo "ClawBench finished (exit=$RC)"
sleep infinity"""

container = {
    "name": "clawbench",
    "image": os.environ["CLAWBENCH_IMG"],
    "imagePullPolicy": "IfNotPresent",
    "command": ["/bin/bash", "-c", command],
    "envFrom": [{"configMapRef": {"name": "clawbench-config"}}],
    "env": [
        {
            "name": "OPENCLAW_GATEWAY_TOKEN",
            "valueFrom": {
                "secretKeyRef": {
                    "name": "clawbench-secrets",
                    "key": "OPENCLAW_GATEWAY_TOKEN",
                }
            },
        }
    ],
    "resources": {
        "requests": {"memory": "1Gi", "cpu": "500m"},
        "limits": {"memory": "4Gi", "cpu": "2"},
    },
    "volumeMounts": [
        {"name": home_volume, "mountPath": "/home/node/.openclaw"},
        {"name": "clawbench-results", "mountPath": "/results"},
        {"name": "tmp-volume", "mountPath": "/tmp"},
    ],
    "securityContext": {
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    },
}

patch = [{"op": "add", "path": "/spec/template/spec/containers/-", "value": container}]

existing_volumes = set(info["volume_names"])
required_volumes = [
    {"name": home_volume, "emptyDir": {}},
    {"name": "clawbench-results", "emptyDir": {}},
    {"name": "tmp-volume", "emptyDir": {}},
]
missing_volumes = []
for volume in required_volumes:
    if volume["name"] not in existing_volumes and volume["name"] not in {
        item["name"] for item in missing_volumes
    }:
        missing_volumes.append(volume)

if missing_volumes:
    if info["volumes_present"]:
        patch.extend(
            {"op": "add", "path": "/spec/template/spec/volumes/-", "value": volume}
            for volume in missing_volumes
        )
    else:
        patch.append(
            {"op": "add", "path": "/spec/template/spec/volumes", "value": missing_volumes}
        )

print(json.dumps(patch))
PY
)

  kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null

  echo ""
  echo "Waiting for rollout..."
  kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \
    echo " (rollout timeout — eval runs for 30-60 min)"

  echo ""
  echo "Eval is running. Follow logs with:"
  echo " ./scripts/k8s/deploy.sh --logs"
  echo ""
  echo "When finished, remove the sidecar with:"
  echo " ./scripts/k8s/deploy.sh --remove-sidecar"
}

# ---------------------------------------------------------------------------
# Execute
# ---------------------------------------------------------------------------
case "$MODE" in
  full)
    ensure_namespace_and_secret
    deploy_openclaw
    deploy_mlflow
    add_sidecar
    ;;
  openclaw-only)
    ensure_namespace_and_secret
    deploy_openclaw
    echo ""
    echo "OpenClaw is running. Next steps:"
    echo " ./scripts/k8s/deploy.sh --mlflow-only # Deploy MLflow"
    echo " ./scripts/k8s/deploy.sh --add-sidecar # Start eval"
    ;;
  mlflow-only)
    deploy_mlflow
    ;;
  add-sidecar)
    if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then
      echo "Deployment 'openclaw' not found in namespace '$NS'." >&2
      echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2
      exit 1
    fi
    ensure_namespace_and_secret
    add_sidecar
    ;;
esac
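The volume handling in `add_sidecar` above depends on a JSON Patch detail: an `add` to `/spec/template/spec/volumes/-` only works when the `volumes` array already exists, so the script creates the whole array when it is absent. The sketch below restates that logic as a standalone function. The name `build_sidecar_patch` and its parameters are hypothetical and do not appear in the script; it also assumes every missing volume can be an `emptyDir`, as the script does.

```python
import json


def build_sidecar_patch(pod_spec, container, required_volume_names):
    """Return JSON Patch ops that append `container` plus any missing volumes.

    `pod_spec` is the dict found at spec.template.spec of a Deployment.
    This is an illustrative helper, not part of deploy.sh.
    """
    ops = [
        {"op": "add", "path": "/spec/template/spec/containers/-", "value": container}
    ]

    # Names of volumes the pod template already declares.
    existing = {v["name"] for v in pod_spec.get("volumes", []) if v.get("name")}

    # Collect each required volume once, skipping those already declared.
    missing = []
    for name in required_volume_names:
        if name not in existing and name not in {m["name"] for m in missing}:
            missing.append({"name": name, "emptyDir": {}})

    if missing:
        if "volumes" in pod_spec:
            # The array exists, so append entries one by one.
            ops.extend(
                {"op": "add", "path": "/spec/template/spec/volumes/-", "value": v}
                for v in missing
            )
        else:
            # No array yet: create it in a single operation.
            ops.append(
                {"op": "add", "path": "/spec/template/spec/volumes", "value": missing}
            )
    return ops


if __name__ == "__main__":
    spec = {"volumes": [{"name": "openclaw-home"}, {"name": "tmp-volume"}]}
    wanted = ["openclaw-home", "clawbench-results", "tmp-volume"]
    print(json.dumps(build_sidecar_patch(spec, {"name": "clawbench"}, wanted), indent=2))
```

With this repo's own Deployment, which already declares `openclaw-home` and `tmp-volume`, only `clawbench-results` would be appended.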
@ -1,18 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: clawbench-config
  labels:
    app: clawbench
data:
  CLAWBENCH_MODEL: "openai/gpt-5.5"
  OPENAI_API_BASE: ""
  CLAWBENCH_RUNS: "3"
  CLAWBENCH_CONCURRENCY: "4"
  CLAWBENCH_JUDGE_MODEL: ""
  CLAWBENCH_TASKS: ""
  CLAWBENCH_CONNECT_TIMEOUT: "120"
  CLAWBENCH_REQUEST_TIMEOUT: "300"
  CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600"
  MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000"
  MLFLOW_EXPERIMENT_NAME: "clawbench"
@ -1,15 +0,0 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: clawbench
type: Opaque
stringData:
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"
@ -1,68 +0,0 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    app: mlflow
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.21.3
          command:
            - mlflow
            - server
            - --host
            - "0.0.0.0"
            - --port
            - "5000"
            - --backend-store-uri
            - sqlite:///mlflow/mlflow.db
            - --default-artifact-root
            - /mlflow/artifacts
            - --serve-artifacts
          ports:
            - name: http
              containerPort: 5000
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: mlflow-data
              mountPath: /mlflow
      volumes:
        - name: mlflow-data
          persistentVolumeClaim:
            claimName: mlflow-data-pvc
@ -1,12 +0,0 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data-pvc
  labels:
    app: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  labels:
    app: mlflow
spec:
  type: ClusterIP
  selector:
    app: mlflow
  ports:
    - name: http
      port: 5000
      targetPort: 5000
      protocol: TCP
@ -1,36 +0,0 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config
  labels:
    app: openclaw
data:
  openclaw.json: |
    {
      "gateway": {
        "mode": "local",
        "bind": "loopback",
        "port": 18789,
        "auth": {
          "mode": "token"
        }
      },
      "browser": {
        "enabled": true,
        "headless": true,
        "noSandbox": true,
        "ssrfPolicy": {
          "allowedHostnames": ["localhost", "127.0.0.1"]
        }
      },
      "tools": {
        "profile": "coding",
        "alsoAllow": ["browser"]
      },
      "agents": {
        "defaults": {
          "workspace": "~/.openclaw/workspace"
        }
      },
      "cron": { "enabled": false }
    }
@ -1,146 +0,0 @@
# OpenClaw gateway deployment for ClawBench evals.
#
# Build the image with browser support:
# docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \
# -t quay.io/yourorg/openclaw:eval .
#
# Or use upstream without browser (browser eval tasks will score 0):
# image: ghcr.io/openclaw/openclaw:latest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      initContainers:
        - name: init-config
          image: registry.access.redhat.com/ubi9-minimal:latest
          command:
            - sh
            - -c
            - |
              cp /config/openclaw.json /home/node/.openclaw/openclaw.json
              chmod 666 /home/node/.openclaw/openclaw.json
              mkdir -p /home/node/.openclaw/workspace
              mkdir -p /home/node/.openclaw/agents
              chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents
              echo "Config initialized"
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: config-template
              mountPath: /config
          resources:
            limits:
              cpu: 200m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
      containers:
        - name: gateway
          image: ghcr.io/openclaw/openclaw:latest
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured
          env:
            - name: HOME
              value: /home/node
            - name: NODE_ENV
              value: production
            - name: OPENCLAW_CONFIG_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_STATE_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_GATEWAY_TOKEN
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENCLAW_GATEWAY_TOKEN
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENAI_API_KEY
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: ANTHROPIC_API_KEY
                  optional: true
            - name: OPENROUTER_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENROUTER_API_KEY
                  optional: true
            - name: GEMINI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: GEMINI_API_KEY
                  optional: true
          ports:
            - name: gateway
              containerPort: 18789
              protocol: TCP
          livenessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 250m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: tmp-volume
              mountPath: /tmp
      terminationGracePeriodSeconds: 30
      volumes:
        - name: openclaw-home
          persistentVolumeClaim:
            claimName: openclaw-home-pvc
        - name: config-template
          configMap:
            name: openclaw-config
        - name: tmp-volume
          emptyDir: {}
@ -1,12 +0,0 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-home-pvc
  labels:
    app: openclaw
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
@ -1,17 +0,0 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: openclaw
type: Opaque
stringData:
  OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME"
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"
  # GEMINI_API_KEY: "REPLACE_ME"
@ -1,15 +0,0 @@
apiVersion: v1
kind: Service
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  type: ClusterIP
  selector:
    app: openclaw
  ports:
    - name: gateway
      port: 18789
      targetPort: 18789
      protocol: TCP
@ -1,125 +0,0 @@
#!/usr/bin/env python3
"""Log a ClawBench BenchmarkResult to MLflow.

Standalone script -- not imported by the clawbench package.
Requires: pip install mlflow (or pip install clawbench[mlflow])

Usage:
    python scripts/log_to_mlflow.py /results/benchmark.json

Environment:
    MLFLOW_TRACKING_URI     MLflow tracking server (default: http://localhost:5000)
    MLFLOW_EXPERIMENT_NAME  Experiment name (default: clawbench)
"""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))


def main(result_path: str) -> None:
    try:
        import mlflow
    except ImportError:
        print(
            "mlflow is not installed. Install with: pip install mlflow"
            " (or pip install clawbench[mlflow])",
            file=sys.stderr,
        )
        sys.exit(1)

    from clawbench.schemas import BenchmarkResult

    with open(result_path, encoding="utf-8") as f:
        result = BenchmarkResult(**json.load(f))

    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
    if experiment_id:
        experiment = mlflow.set_experiment(experiment_id=experiment_id)
    else:
        experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench"))

    run_name = f"{result.model}-{result.submission_id[:8]}"
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(
            {
                "model": result.model,
                "provider": result.provider,
                "benchmark_version": result.benchmark_version,
                "openclaw_version": result.openclaw_version or "unknown",
                "judge_model": result.judge_model or "none",
                "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown",
            }
        )

        mlflow.log_metrics(
            {
                "overall_score": result.overall_score,
                "overall_completion": result.overall_completion,
                "overall_trajectory": result.overall_trajectory,
                "overall_behavior": result.overall_behavior,
                "overall_reliability": result.overall_reliability,
                "overall_pass_hat_k": result.overall_pass_hat_k,
                "overall_judge_score": result.overall_judge_score,
                "overall_judge_confidence": result.overall_judge_confidence,
                "overall_judge_pass_rate": result.overall_judge_pass_rate,
                "judge_task_coverage": result.judge_task_coverage,
                "overall_weighted_query_score": result.overall_weighted_query_score,
                "overall_median_latency_ms": result.overall_median_latency_ms,
                "overall_p95_latency_ms": result.overall_p95_latency_ms,
                "overall_total_tokens": result.overall_total_tokens,
                "overall_cost_usd": result.overall_cost_usd,
                "overall_tokens_per_pass": result.overall_tokens_per_pass,
                "overall_cost_per_pass": result.overall_cost_per_pass,
                "overall_ci_lower": result.overall_ci_lower,
                "overall_ci_upper": result.overall_ci_upper,
            }
        )

        for tier in result.tier_results:
            mlflow.log_metrics(
                {
                    f"{tier.tier}/score": tier.mean_task_score,
                    f"{tier.tier}/completion": tier.mean_completion,
                    f"{tier.tier}/trajectory": tier.mean_trajectory,
                    f"{tier.tier}/behavior": tier.mean_behavior,
                    f"{tier.tier}/reliability": tier.mean_reliability,
                }
            )

        for i, task in enumerate(result.task_results):
            mlflow.log_metrics(
                {
                    f"task/{task.task_id}/score": task.mean_task_score,
                    f"task/{task.task_id}/reliability": task.reliability_score,
                },
                step=i,
            )

        mlflow.set_tags(
            {
                "submission_id": result.submission_id,
                "timestamp": result.timestamp,
                "certified": str(result.certified),
            }
        )

        try:
            mlflow.log_artifact(result_path)
        except Exception as e:
            print(f"Warning: artifact upload failed: {e}", file=sys.stderr)
            print("Metrics and params were logged successfully.", file=sys.stderr)

    print(f"Logged to MLflow: experiment={experiment.name} run={run_name}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <result.json>", file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])
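The script above passes every `overall_*` field straight to `mlflow.log_metrics`, which takes a mapping of names to numeric values. If some of those fields can be `None` (for example, judge scores when no judge model ran), a small filter before logging would keep only numeric entries. That is an assumption, since the `BenchmarkResult` schema is not shown in this diff. The helper name `loggable_metrics` is hypothetical and is not part of the script.

```python
from numbers import Real


def loggable_metrics(candidates):
    """Return only the entries whose values are real numbers, as floats.

    Illustrative helper: None, strings and booleans are dropped so the
    remaining dict is safe to hand to a metrics API that expects numbers.
    """
    cleaned = {}
    for name, value in candidates.items():
        # bool is a subclass of int, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, Real):
            continue
        cleaned[name] = float(value)
    return cleaned


if __name__ == "__main__":
    sample = {"overall_score": 0.82, "overall_judge_score": None, "overall_total_tokens": 1200}
    print(loggable_metrics(sample))
```

In the script this would wrap the dict literal, e.g. `mlflow.log_metrics(loggable_metrics({...}))`, at the cost of silently omitting metrics that were never computed.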
@ -5,23 +5,13 @@ from __future__ import annotations
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from urllib.parse import unquote, urlsplit

ROOT = Path(__file__).parent / "articles"
ARTICLES = {path.stem: path for path in ROOT.glob("*.html") if path.is_file()}


def article_for_request_path(request_path: str) -> Path | None:
    path = unquote(urlsplit(request_path).path)
    if not path.startswith("/article/"):
        return None
    slug = path.removeprefix("/article/")
    return ARTICLES.get(slug)


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:  # noqa: N802
        path = unquote(urlsplit(self.path).path)
        path = self.path.split("?")[0]
        if path == "/health":
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
@ -32,8 +22,9 @@ class Handler(BaseHTTPRequestHandler):
            self._index()
            return
        if path.startswith("/article/"):
            article = article_for_request_path(self.path)
            if article is not None:
            slug = path.split("/", 2)[2]
            article = ROOT / f"{slug}.html"
            if article.exists():
                self._html(article.read_bytes())
                return
        self.send_response(404)
@ -42,7 +33,8 @@ class Handler(BaseHTTPRequestHandler):

    def _index(self) -> None:
        items = []
        for slug in sorted(ARTICLES):
        for f in sorted(ROOT.glob("*.html")):
            slug = f.stem
            items.append(f'<li><a href="/article/{slug}">{slug}</a></li>')
        body = (
            "<!doctype html><html><body>"
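The two variants of the article lookup in the hunks above behave differently for slugs that contain `..`. One resolves the slug against a precomputed `ARTICLES` mapping; the other joins the slug onto `ROOT` and checks `exists()`. The sketch below reproduces both in a temporary directory to show the difference. The function names `lookup_by_index` and `lookup_by_join` and the file names are made up for the demo and do not appear in the diff.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional
from urllib.parse import unquote, urlsplit


def build_index(root: Path) -> Dict[str, Path]:
    """Map each article slug to its file, scanning only the root directory."""
    return {p.stem: p for p in root.glob("*.html") if p.is_file()}


def lookup_by_index(index: Dict[str, Path], request_path: str) -> Optional[Path]:
    """Resolve a request path by exact slug match against the index."""
    path = unquote(urlsplit(request_path).path)
    if not path.startswith("/article/"):
        return None
    return index.get(path[len("/article/"):])


def lookup_by_join(root: Path, request_path: str) -> Optional[Path]:
    """Resolve a request path by joining the slug onto the root directory."""
    path = request_path.split("?")[0]
    if not path.startswith("/article/"):
        return None
    candidate = root / f"{path.split('/', 2)[2]}.html"
    return candidate if candidate.exists() else None


def demo() -> Dict[str, bool]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "articles"
        root.mkdir()
        (root / "intro.html").write_text("<p>intro</p>", encoding="utf-8")
        # A file outside the article root that should never be served.
        (Path(tmp) / "secret.html").write_text("<p>secret</p>", encoding="utf-8")
        index = build_index(root)
        return {
            "index_serves_article": lookup_by_index(index, "/article/intro?x=1") is not None,
            "index_serves_outside_file": lookup_by_index(index, "/article/../secret") is not None,
            "join_serves_outside_file": lookup_by_join(root, "/article/../secret") is not None,
        }


if __name__ == "__main__":
    print(demo())
```

The index variant can only return files that were found directly under the root when the index was built, so a slug such as `../secret` has nothing to match. The join variant follows the `..` segment and finds the outside file.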
@ -20,46 +20,6 @@ def test_testbox_workflow_hydrates_secrets_and_dotfiles():
|
||||
assert "CLAWBENCH_CODEX_AUTH_JSON" in workflow
|
||||
|
||||
|
||||
def test_crabbox_config_uses_actions_hydration():
|
||||
config = Path(".crabbox.yaml").read_text(encoding="utf-8")
|
||||
|
||||
assert "profile: clawbench-check" in config
|
||||
assert "provider: aws" in config
|
||||
assert "workflow: .github/workflows/crabbox-hydrate.yml" in config
|
||||
assert "job: hydrate" in config
|
||||
assert "baseRef: main" in config
|
||||
assert "- clawbench" in config
|
||||
assert "- CLAWBENCH_*" in config
|
||||
assert "- OPENCLAW_*" in config
|
||||
|
||||
|
||||
def test_crabbox_workflow_hydrates_secrets_dotfiles_and_ready_marker():
|
||||
workflow = Path(".github/workflows/crabbox-hydrate.yml").read_text(encoding="utf-8")
|
||||
|
||||
assert "crabbox_id:" in workflow
|
||||
assert "crabbox_runner_label:" in workflow
|
||||
assert 'runs-on: [self-hosted, "${{ inputs.crabbox_runner_label }}"]' in workflow
|
||||
assert "actions/setup-python@v5" in workflow
|
||||
assert "python -m pip install -e ." in workflow
|
||||
assert "scripts/ci-hydrate-testbox-env.sh" in workflow
|
||||
assert "HF_TOKEN" in workflow
|
||||
assert "OPENCLAW_CODEX_AUTH_JSON" in workflow
|
||||
assert "CLAWBENCH_CODEX_AUTH_JSON" in workflow
|
||||
assert "/usr/local/bin/clawbench-testbox-env" in workflow
|
||||
assert "$HOME/.crabbox/actions/${{ inputs.crabbox_id }}.env" in workflow
|
||||
assert "crabbox_keep_alive_minutes" in workflow
|
||||
|
||||
|
||||
def test_crabbox_skill_documents_clawbench_flow():
|
||||
skill = Path(".agents/skills/crabbox/SKILL.md").read_text(encoding="utf-8")
|
||||
|
||||
assert "openclaw/crabbox" in skill
|
||||
assert ".crabbox.yaml" in skill
|
||||
assert "crabbox actions hydrate" in skill
|
||||
assert "clawbench-testbox-env" in skill
|
||||
assert ".github/workflows/crabbox-hydrate.yml" in skill
|
||||
|
||||
|
||||
def test_testbox_helper_sources_hydrated_profile():
|
||||
script = Path("scripts/ci-hydrate-testbox-env.sh").read_text(encoding="utf-8")
|
||||
|
||||
|
@@ -107,7 +107,7 @@ async def test_gateway_client_retries_transient_drain_errors(monkeypatch: pytest
     async def fake_wait_event(self, event_name: str, *, timeout: float):
         return {"payload": {"nonce": ""}}

-    async def fake_rpc(self, method: str, params=None, **kwargs):
+    async def fake_rpc(self, method: str, params=None):
         return {"payload": {"type": "hello-ok", "protocol": 3}}

     async def fake_listener(self):
@@ -144,7 +144,7 @@ async def test_gateway_client_retries_half_closed_handshake_errors(
     async def fake_wait_event(self, event_name: str, *, timeout: float):
         return {"payload": {"nonce": ""}}

-    async def fake_rpc(self, method: str, params=None, **kwargs):
+    async def fake_rpc(self, method: str, params=None):
         return {"payload": {"type": "hello-ok", "protocol": 3}}

     async def fake_listener(self):
@@ -226,71 +226,3 @@ async def test_rpc_timeout_cleans_pending_request():

     assert sent_frames[0]["method"] == "sessions.create"
     assert client._pending == {}
-
-
-@pytest.mark.asyncio
-async def test_send_and_wait_passes_gateway_timeout_and_waits_for_run():
-    client = GatewayClient(GatewayConfig(request_timeout=1))
-    session_key = "session-1"
-    calls: list[tuple[str, dict | None, dict]] = []
-
-    async def fake_rpc(method: str, params=None, **kwargs):
-        calls.append((method, params, kwargs))
-        if method == "sessions.send":
-            return {"ok": True, "payload": {"runId": "run-1"}}
-        if method == "agent.wait":
-            return {"ok": True, "payload": {"runId": "run-1", "status": "completed"}}
-        if method == "sessions.get":
-            return {
-                "ok": True,
-                "payload": {
-                    "messages": [
-                        {
-                            "role": "assistant",
-                            "content": [{"type": "text", "text": "Done."}],
-                        }
-                    ]
-                },
-            }
-        return {"ok": True, "payload": {}}
-
-    client._rpc = fake_rpc  # type: ignore[method-assign]
-
-    transcript = await client.send_and_wait(session_key, "hello", timeout=1.5)
-
-    send_call = next(call for call in calls if call[0] == "sessions.send")
-    assert send_call[1] == {
-        "key": session_key,
-        "message": "hello",
-        "idempotencyKey": send_call[1]["idempotencyKey"],
-        "timeoutMs": 1500,
-    }
-    wait_call = next(call for call in calls if call[0] == "agent.wait")
-    assert wait_call[1] == {"runId": "run-1", "timeoutMs": 1500}
-    assert wait_call[2]["timeout"] == 11.5
-    assert [message.text for message in transcript.assistant_messages] == ["Done."]
-
-
-@pytest.mark.asyncio
-async def test_send_and_wait_aborts_run_when_no_terminal_state_arrives():
-    client = GatewayClient(GatewayConfig(request_timeout=1))
-    session_key = "session-1"
-    calls: list[tuple[str, dict | None, dict]] = []
-
-    async def fake_rpc(method: str, params=None, **kwargs):
-        calls.append((method, params, kwargs))
-        if method == "sessions.send":
-            return {"ok": True, "payload": {"runId": "run-timeout"}}
-        if method == "agent.wait":
-            await asyncio.sleep(60)
-        if method == "sessions.abort":
-            return {"ok": True, "payload": {"status": "aborted"}}
-        if method == "sessions.get":
-            return {"ok": True, "payload": {"messages": []}}
-        return {"ok": True, "payload": {}}
-
-    client._rpc = fake_rpc  # type: ignore[method-assign]
-
-    await client.send_and_wait(session_key, "hello", timeout=0.01)
-
-    assert ("sessions.abort", {"key": session_key, "runId": "run-timeout"}, {"timeout": 1}) in calls

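Reviewer note: the removed `test_send_and_wait_passes_gateway_timeout_and_waits_for_run` asserts two derived values for `timeout=1.5`: a `timeoutMs` of 1500 sent to the gateway and a client-side RPC timeout of 11.5 for `agent.wait`. The real `send_and_wait` body is not part of this diff, so the following is only a sketch of one derivation consistent with those assertions. The helper name `gateway_timeouts` and the 10-second grace value are assumptions, not names from the ClawBench code.

```python
from typing import Tuple


def gateway_timeouts(timeout: float, grace: float = 10.0) -> Tuple[int, float]:
    """Hypothetical helper: derive gateway and client-side timeouts.

    `grace` is an assumed constant chosen so that timeout=1.5 yields the
    11.5 seconds asserted in the removed test.
    """
    # Gateway-facing timeout in whole milliseconds.
    timeout_ms = int(round(timeout * 1000))
    # Client-side RPC timeout: let the gateway time out first, then give up.
    rpc_timeout = timeout + grace
    return timeout_ms, rpc_timeout


timeout_ms, rpc_timeout = gateway_timeouts(1.5)
```

Under this reading the client always waits slightly longer than the gateway, so a gateway-side timeout is reported as a normal response rather than as a dropped request.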
@@ -20,13 +20,6 @@ def test_submission_request_defaults_to_single_parallel_lane():
     assert request.max_parallel_lanes == 1
     assert request.runs_per_task == 3
     assert request.judge_affects_score is False
-    assert request.task_ids == []
-
-
-def test_local_queue_dir_honors_env_override(tmp_path, monkeypatch):
-    monkeypatch.setenv("CLAWBENCH_LOCAL_QUEUE_DIR", str(tmp_path / "queue"))
-
-    assert queue_module._resolve_local_queue_dir() == tmp_path / "queue"


 def test_submission_request_fingerprint_includes_judge_score_gate():
@@ -40,29 +33,6 @@ def test_submission_request_fingerprint_includes_judge_score_gate():
     assert advisory.active_fingerprint() != weighted.active_fingerprint()


-def test_submission_request_fingerprint_includes_task_ids():
-    all_tasks = SubmissionRequest(model="anthropic/claude-sonnet-4-6")
-    subset = SubmissionRequest(
-        model="anthropic/claude-sonnet-4-6",
-        task_ids=["t1-fs-quick-note"],
-    )
-
-    assert all_tasks.active_fingerprint() != subset.active_fingerprint()
-
-
-def test_submission_request_fingerprint_canonicalizes_task_ids():
-    first = SubmissionRequest(
-        model="anthropic/claude-sonnet-4-6",
-        task_ids=[" t2-demo ", "t1-demo", "t2-demo"],
-    )
-    second = SubmissionRequest(
-        model="anthropic/claude-sonnet-4-6",
-        task_ids=["t1-demo", "t2-demo"],
-    )
-
-    assert first.active_fingerprint() == second.active_fingerprint()
-
-
 def test_save_local_replaces_queue_file_atomically(tmp_path, monkeypatch):
     monkeypatch.setattr(queue_module, "LOCAL_QUEUE_DIR", tmp_path)
     monkeypatch.setattr(queue_module, "HF_TOKEN", "")

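Reviewer note: the two removed fingerprint tests require that a task subset changes the fingerprint, and that whitespace, ordering, and duplicates in `task_ids` do not. `SubmissionRequest.active_fingerprint` itself is not shown in this diff, so the sketch below is only one way to satisfy both properties. The function names `canonical_task_ids` and `fingerprint` and the choice of SHA-256 over a JSON payload are assumptions made for illustration.

```python
import hashlib
import json
from typing import Iterable, List


def canonical_task_ids(task_ids: Iterable[str]) -> List[str]:
    """Strip whitespace, drop empty entries, dedupe, and sort (assumed rules)."""
    return sorted({task_id.strip() for task_id in task_ids if task_id.strip()})


def fingerprint(model: str, task_ids: Iterable[str] = ()) -> str:
    """Hypothetical fingerprint: hash a canonical JSON payload."""
    payload = json.dumps(
        {"model": model, "task_ids": canonical_task_ids(task_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


messy = fingerprint("anthropic/claude-sonnet-4-6", [" t2-demo ", "t1-demo", "t2-demo"])
clean = fingerprint("anthropic/claude-sonnet-4-6", ["t1-demo", "t2-demo"])
```

Canonicalizing before hashing keeps two logically identical submissions from being queued as distinct jobs.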
@@ -1,25 +0,0 @@
-from importlib import util
-from pathlib import Path
-
-
-def load_serve_module():
-    serve_path = (
-        Path(__file__).resolve().parents[1]
-        / "tasks-public"
-        / "assets"
-        / "t3_web_research_and_cite"
-        / "serve.py"
-    )
-    spec = util.spec_from_file_location("t3_web_research_serve", serve_path)
-    module = util.module_from_spec(spec)
-    assert spec.loader is not None
-    spec.loader.exec_module(module)
-    return module
-
-
-def test_article_paths_resolve_only_known_article_slugs():
-    serve = load_serve_module()
-
-    assert serve.article_for_request_path("/article/01_grid_basics").name == "01_grid_basics.html"
-    assert serve.article_for_request_path("/article/../../serve.py") is None
-    assert serve.article_for_request_path("/article/%2e%2e/%2e%2e/serve.py") is None
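Reviewer note: the deleted test pins down the contract of `article_for_request_path`: a known slug resolves to its `.html` file, while plain and percent-encoded `..` traversal attempts return `None`. The function body is not visible in this part of the diff, so the following is a minimal sketch of an allowlist-based resolver that would pass those three assertions. The `ROOT` location and the contents of `ARTICLES` are placeholders, not values taken from `serve.py`.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlsplit

ROOT = Path("articles")          # placeholder directory (assumption)
ARTICLES = {"01_grid_basics"}    # placeholder allowlist of known slugs (assumption)


def article_for_request_path(request_path: str) -> Optional[Path]:
    """Resolve /article/<slug> to a file only when <slug> is allowlisted."""
    # Drop any query string, then decode percent-escapes such as %2e%2e.
    path = unquote(urlsplit(request_path).path)
    prefix = "/article/"
    if not path.startswith(prefix):
        return None
    slug = path[len(prefix):]
    # Exact allowlist match: anything containing "/" or ".." cannot match.
    if slug not in ARTICLES:
        return None
    return ROOT / f"{slug}.html"
```

By contrast, the branch side of this compare builds the path directly from `path.split("/", 2)[2]` and only checks `article.exists()`, which does not by itself reject `..` segments.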
@@ -6,14 +6,7 @@ from types import SimpleNamespace
 import pytest

 from clawbench.queue import Job, JobQueue, JobStatus, SubmissionRequest
-from clawbench.worker import (
-    GATEWAY_PORT,
-    GATEWAY_PORT_SPACING,
-    OPENCLAW_EVAL_SYSTEM_PROMPT,
-    EvalWorker,
-    JobProgressTracker,
-    ParallelLane,
-)
+from clawbench.worker import GATEWAY_PORT, GATEWAY_PORT_SPACING, EvalWorker, JobProgressTracker, ParallelLane


 class DummyTask:
@@ -98,12 +91,7 @@ def test_configure_browser_runtime_sets_benchmark_safe_openclaw_config(monkeypat
     assert json.loads(config_path.read_text(encoding="utf-8")) == {
         "agents": {"defaults": {"skipBootstrap": True}},
         "browser": {"headless": True, "noSandbox": True},
-        "tools": {"exec": {"host": "gateway", "security": "full", "ask": "off"}},
-        "approvals": {"exec": {"enabled": False}},
     }
-    approvals = json.loads((state_dir / "exec-approvals.json").read_text(encoding="utf-8"))
-    assert approvals["defaults"] == {"security": "full", "ask": "off", "askFallback": "full"}
-    assert approvals["agents"]["*"] == {"security": "full", "ask": "off", "askFallback": "full"}


 def test_configure_browser_runtime_pins_subagents_to_active_model(monkeypatch):
@@ -126,56 +114,10 @@ def test_configure_browser_runtime_pins_subagents_to_active_model(monkeypatch):
             "defaults": {
                 "skipBootstrap": True,
                 "model": {"primary": "openai-codex/gpt-5.4"},
-                "models": {"openai-codex/gpt-5.4": {"params": {"fastMode": True}}},
-                "systemPromptOverride": OPENCLAW_EVAL_SYSTEM_PROMPT,
                 "subagents": {"model": {"primary": "openai-codex/gpt-5.4"}},
             }
         },
         "browser": {"headless": True, "noSandbox": True},
-        "tools": {"exec": {"host": "gateway", "security": "full", "ask": "off"}},
-        "approvals": {"exec": {"enabled": False}},
-    }
-
-
-def test_configure_browser_runtime_uses_gateway_env_config_path(tmp_path: Path, monkeypatch):
-    worker = EvalWorker(JobQueue())
-    worker.set_active_model("openai-codex/gpt-5.4")
-    parent_state = tmp_path / "parent"
-    lane_state = tmp_path / "lane"
-    parent_state.mkdir()
-    lane_state.mkdir()
-    parent_config = parent_state / "openclaw.json"
-    lane_config = lane_state / "openclaw.json"
-    parent_config.write_text("{}", encoding="utf-8")
-    lane_config.write_text("{}", encoding="utf-8")
-    monkeypatch.setenv("OPENCLAW_STATE_DIR", str(parent_state))
-
-    worker._configure_browser_runtime(
-        ["node", "/openclaw/dist/cli.js"],
-        {
-            "OPENCLAW_STATE_DIR": str(lane_state),
-            "OPENCLAW_CONFIG_PATH": str(lane_config),
-        },
-    )
-
-    assert json.loads(parent_config.read_text(encoding="utf-8")) == {}
-    lane_data = json.loads(lane_config.read_text(encoding="utf-8"))
-    assert lane_data["agents"]["defaults"]["model"]["primary"] == "openai-codex/gpt-5.4"
-    assert lane_data["tools"]["exec"] == {"host": "gateway", "security": "full", "ask": "off"}
-    assert (lane_state / "exec-approvals.json").exists()
-    assert not (parent_state / "exec-approvals.json").exists()
-
-
-def test_eval_model_defaults_pin_openai_to_sse_transport() -> None:
-    data: dict[str, object] = {}
-
-    changed = EvalWorker._apply_eval_model_defaults(data, "openai/gpt-5.5")
-
-    assert changed is True
-    assert data["agents"]["defaults"]["models"]["openai/gpt-5.5"]["params"] == {
-        "fastMode": True,
-        "transport": "sse",
-        "openaiWsWarmup": False,
     }


@@ -273,11 +215,6 @@ def test_materialize_lane_runtime_spaces_ports_and_copies_auth(tmp_path: Path, m
     assert lane1.port == GATEWAY_PORT + GATEWAY_PORT_SPACING
     assert lane1.state_dir is not None
     assert (lane1.state_dir / "agents" / "main" / "agent" / "auth-profiles.json").exists()
-    lane_cfg = json.loads((lane1.state_dir / "openclaw.json").read_text(encoding="utf-8"))
-    assert lane_cfg["tools"]["exec"] == {"host": "gateway", "security": "full", "ask": "off"}
-    assert lane_cfg["approvals"]["exec"] == {"enabled": False}
-    lane_approvals = json.loads((lane1.state_dir / "exec-approvals.json").read_text(encoding="utf-8"))
-    assert lane_approvals["defaults"] == {"security": "full", "ask": "off", "askFallback": "full"}


 @pytest.mark.asyncio