# Running ClawBench on Kubernetes

ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar connects to the gateway over loopback (`ws://localhost:18789`), runs the 19-task eval suite, and optionally logs results to MLflow.

```
┌─── OpenClaw Pod ─────────────────────────────┐
│  gateway container (ws://localhost:18789)    │
│  clawbench sidecar ──► gateway via loopback  │
└──────────────────────────────────────────────┘
         │                    │
         ▼                    ▼
  Model provider API    MLflow (optional)
```

All commands use `scripts/k8s/deploy.sh`. The script has these modes:

| Flag | What it does |
|------|-------------|
| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
| `--openclaw-only` | Deploy OpenClaw gateway only |
| `--mlflow-only` | Deploy MLflow only |
| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
| `--remove-sidecar` | Remove clawbench sidecar |
| `--logs` | Tail sidecar logs |
| `--teardown` | Delete eval namespace (keeps MLflow) |

---

## Prerequisites

- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
- A container image for ClawBench (see [Building images](#building-images))
- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)

For local testing with Kind: https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind

---

## Environment variables

Set these **before** running `deploy.sh`.

### Required

| Variable | Purpose |
|----------|---------|
| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
| `OPENAI_API_KEY` | Model provider key (or use another provider; see the Model routing table below) |

### Optional

| Variable | Default | Purpose |
|----------|---------|---------|
| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway |
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI; skips MLflow deploy if set |
| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
| `GEMINI_API_KEY` | | Added to K8s secret if set |
| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |

### Model routing

The gateway routes by provider prefix:

| Model string | Required variables |
|-------------|-------------------|
| `openai/gpt-5.5` | `OPENAI_API_KEY` |
| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |

For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/` prefix for the model name:

```bash
export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
export OPENAI_API_KEY="none"  # dummy value if the endpoint doesn't require auth
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
```

---

## Full deploy (quick start)

Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval

# Export API keys before running. The script stores them in a K8s Secret
# ("clawbench-secrets") that the gateway and sidecar containers read.
export OPENAI_API_KEY="sk-..."
# Model to evaluate (default: openai/gpt-5.5)
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"

./scripts/k8s/deploy.sh
```

Verify:

```bash
# Should show 2/2 containers (gateway + clawbench)
kubectl get pods -n clawbench-eval

# Follow eval progress
./scripts/k8s/deploy.sh --logs
```

When the eval finishes, copy results and clean up:

```bash
# Copy results from the sidecar
POD=$(kubectl get pod -n "$CLAWBENCH_NAMESPACE" -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json

# Remove the sidecar (keeps OpenClaw + MLflow running)
./scripts/k8s/deploy.sh --remove-sidecar

# Or delete the eval namespace (keeps MLflow running)
./scripts/k8s/deploy.sh --teardown
```

---

## Existing cluster + existing MLflow

If you already have an OpenShift or Kubernetes cluster and an MLflow instance, you only need to deploy OpenClaw and run the eval. No cluster or MLflow setup is required.

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval

# API keys: export before running deploy.sh. The script creates a
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
# At least one provider key is required.
export OPENAI_API_KEY="sk-..."
# export ANTHROPIC_API_KEY="sk-ant-..."
# export OPENROUTER_API_KEY="sk-or-..."
# export GEMINI_API_KEY="..."

# Model to evaluate (default: openai/gpt-5.5)
export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"

# If attaching to an existing OpenClaw gateway, this must match that gateway.
# If deploy.sh creates OpenClaw, it generates this token for you.
# export OPENCLAW_GATEWAY_TOKEN="..."
# Point to your existing MLflow
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5"  # or use MLFLOW_EXPERIMENT_ID=42

# Deploy OpenClaw gateway into your cluster
./scripts/k8s/deploy.sh --openclaw-only
```

Verify OpenClaw is running:

```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx 1/1 Running
```

Then start the eval:

```bash
./scripts/k8s/deploy.sh --add-sidecar
./scripts/k8s/deploy.sh --logs
```

When `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow deployment and patches the experiment name/ID into the clawbench ConfigMap. When the eval completes, `scripts/log_to_mlflow.py` logs results to your MLflow under that experiment.

`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist. `MLFLOW_EXPERIMENT_ID` requires an existing experiment.

---

## Step-by-step deploy

Use this when you want to deploy components individually or bring your own OpenClaw/MLflow.

### Step 1: Deploy OpenClaw gateway

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..."

./scripts/k8s/deploy.sh --openclaw-only
```

Verify:

```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx 1/1 Running
```

This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway token and creates the `clawbench-secrets` Secret automatically.

**Skip this step** if you already have an OpenClaw deployment.
Your existing gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):

```json
{
  "browser": {
    "enabled": true,
    "headless": true,
    "noSandbox": true,
    "ssrfPolicy": {
      "allowedHostnames": ["localhost", "127.0.0.1"]
    }
  },
  "tools": {
    "profile": "coding",
    "alsoAllow": ["browser"]
  }
}
```

Key requirements:

- `browser.enabled: true`: activates the bundled browser plugin
- `tools.alsoAllow: ["browser"]`: the `coding` profile does NOT include browser by default
- `browser.ssrfPolicy`: several eval tasks need localhost access
- Gateway must bind to loopback with token auth; export the matching `OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar`

### Step 2: Deploy MLflow

```bash
./scripts/k8s/deploy.sh --mlflow-only
```

Verify:

```bash
kubectl get pods -n mlflow
# Expect: mlflow-xxxx 1/1 Running
```

Deploys a single-replica MLflow server with SQLite backend into the `mlflow` namespace. The clawbench ConfigMap defaults to `http://mlflow-service.mlflow.svc.cluster.local:5000`.

**Skip this step** if you have an external MLflow. Set `MLFLOW_TRACKING_URI` instead:

```bash
export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
export MLFLOW_EXPERIMENT_ID=4  # or MLFLOW_EXPERIMENT_NAME
```

### Step 3: Run the eval

```bash
./scripts/k8s/deploy.sh --add-sidecar
```

This patches the OpenClaw deployment to inject a clawbench sidecar that:

1. Waits for the gateway (TCP check on port 18789, up to 3 min)
2. Checks MLflow connectivity if configured
3. Runs `clawbench run` with settings from the ConfigMap
4. Logs results to MLflow on success
5. Sleeps indefinitely so you can retrieve logs and results

Verify:

```bash
kubectl get pods -n $CLAWBENCH_NAMESPACE
# Expect: openclaw-xxxx 2/2 Running (gateway + clawbench)

./scripts/k8s/deploy.sh --logs
# Should show "Waiting for gateway..." then "Starting eval..."
```

When finished, remove the sidecar:

```bash
./scripts/k8s/deploy.sh --remove-sidecar
```

---

## ConfigMap tuning

The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval behavior. Override at deploy time via env vars, or patch after deploy:

| Key | Default | What it controls |
|-----|---------|-----------------|
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks × 3 runs = 57 total) |
| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
| `CLAWBENCH_TASKS` | *(empty: runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run, in seconds |
| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |

---

## MLflow integration

Results are logged via `scripts/log_to_mlflow.py` after a successful eval.

**What gets logged:**

- **Params**: model, provider, benchmark version, OpenClaw version, judge model
- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior, reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
- **Tags**: submission ID, timestamp, certified flag
- **Artifacts**: full benchmark result JSON

---

## Building images

### ClawBench image

`quay.io/sallyom/clawbench:latest` is public.

To build your own image for Kubernetes, use the lightweight sidecar Dockerfile (`scripts/k8s/Dockerfile`); the resulting image only includes the eval harness and MLflow client:

```bash
docker build -t clawbench:latest -f scripts/k8s/Dockerfile .

# For Kind clusters, load directly instead of pushing to a registry:
kind load docker-image clawbench:latest --name openclaw

# For non-Kind clusters, push to a registry and set CLAWBENCH_IMAGE accordingly.
# Ensure you build for the right architecture, usually amd64 for non-local k8s.
```

Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.

---

## Cleanup

```bash
# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
./scripts/k8s/deploy.sh --remove-sidecar

# Delete eval namespace (keeps MLflow running)
./scripts/k8s/deploy.sh --teardown

# Delete the Kind cluster entirely
kind delete cluster --name openclaw
```
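
After a teardown, a new shell session may no longer have the variables from the Environment variables section exported. The following is a minimal sketch of a pre-flight check to run before calling `deploy.sh` again. `clawbench_preflight` is a hypothetical helper name and is not part of the repository; it only mirrors the requirements stated in this document (a namespace plus at least one of the four listed provider keys) and does not validate the keys themselves.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; NOT part of scripts/k8s/deploy.sh.
# Returns 0 when CLAWBENCH_NAMESPACE and at least one provider key are set,
# and 1 otherwise, printing what is missing to stderr.
clawbench_preflight() {
  local status=0

  if [ -z "${CLAWBENCH_NAMESPACE:-}" ]; then
    echo "missing: CLAWBENCH_NAMESPACE" >&2
    status=1
  fi

  # Any one of the provider keys listed in this document is enough.
  if [ -z "${OPENAI_API_KEY:-}${ANTHROPIC_API_KEY:-}${OPENROUTER_API_KEY:-}${GEMINI_API_KEY:-}" ]; then
    echo "missing: at least one provider API key" >&2
    status=1
  fi

  return "$status"
}
```

One possible way to use it is `clawbench_preflight && ./scripts/k8s/deploy.sh`, so the deploy only starts when the check passes.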