clawbench/docs/kubernetes.md

# Running ClawBench on Kubernetes

ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
connects to the gateway over loopback (`ws://localhost:18789`), runs the
19-task eval suite, and optionally logs results to MLflow.

```
┌─── OpenClaw Pod ─────────────────────────────┐
│  gateway container  (ws://localhost:18789)   │
│  clawbench sidecar  ──► gateway via loopback │
└──────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
   Model provider API         MLflow (optional)
```

All commands use `scripts/k8s/deploy.sh`. The script has these modes:

| Flag | What it does |
|------|-------------|
| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
| `--openclaw-only` | Deploy OpenClaw gateway only |
| `--mlflow-only` | Deploy MLflow only |
| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
| `--remove-sidecar` | Remove clawbench sidecar |
| `--logs` | Tail sidecar logs |
| `--teardown` | Delete eval namespace (keeps MLflow) |

---

## Prerequisites

- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
- A container image for ClawBench (see [Building images](#building-images))
- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)

For local testing with Kind:
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind

---

## Environment variables

Set these **before** running `deploy.sh`.

### Required

| Variable | Purpose |
|----------|---------|
| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) |

### Optional

| Variable | Default | Purpose |
|----------|---------|---------|
| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
| `OPENCLAW_GATEWAY_TOKEN` | *(generated by script)* | Gateway token; set this when attaching the sidecar to an existing gateway |
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set |
| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
| `GEMINI_API_KEY` | | Added to K8s secret if set |
| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |

### Model routing

The gateway routes by provider prefix:

| Model string | Required variables |
|-------------|-------------------|
| `openai/gpt-5.5` | `OPENAI_API_KEY` |
| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |

For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
prefix for the model name:

```bash
export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
export OPENAI_API_KEY="none"  # dummy value if the endpoint doesn't require auth
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
```

---

## Full deploy (quick start)

Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval

# Export API keys before running. The script stores them in a K8s Secret
# ("clawbench-secrets") that the gateway and sidecar containers read.
export OPENAI_API_KEY="sk-..."

# Model to evaluate (default: openai/gpt-5.5)
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"

./scripts/k8s/deploy.sh
```

Verify:

```bash
# Should show 2/2 containers (gateway + clawbench)
kubectl get pods -n clawbench-eval

# Follow eval progress
./scripts/k8s/deploy.sh --logs
```

When the eval finishes, copy results and clean up:

```bash
# Copy results from the sidecar
POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json

# Remove the sidecar (keeps OpenClaw + MLflow running)
./scripts/k8s/deploy.sh --remove-sidecar

# Or tear down everything
./scripts/k8s/deploy.sh --teardown
```

---

## Existing cluster + existing MLflow

If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
required.

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval

# API keys — export before running deploy.sh. The script creates a
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
# At least one provider key is required.
export OPENAI_API_KEY="sk-..."
# export ANTHROPIC_API_KEY="sk-ant-..."
# export OPENROUTER_API_KEY="sk-or-..."
# export GEMINI_API_KEY="..."

# Model to evaluate (default: openai/gpt-5.5)
export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"

# If attaching to an existing OpenClaw gateway, this must match that gateway.
# If deploy.sh creates OpenClaw, it generates this token for you.
# export OPENCLAW_GATEWAY_TOKEN="..."

# Point to your existing MLflow
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5"  # or use MLFLOW_EXPERIMENT_ID=42

# Deploy OpenClaw gateway into your cluster
./scripts/k8s/deploy.sh --openclaw-only
```

Verify OpenClaw is running:

```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx  1/1  Running
```

Then start the eval:

```bash
./scripts/k8s/deploy.sh --add-sidecar
./scripts/k8s/deploy.sh --logs
```

The deploy script sets `MLFLOW_TRACKING_URI` to skip its own MLflow deployment
and patches the experiment name/ID into the clawbench ConfigMap. When the eval
completes, `scripts/log_to_mlflow.py` logs results to your MLflow under that
experiment.

`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
`MLFLOW_EXPERIMENT_ID` requires an existing experiment.

---

## Step-by-step deploy

Use this when you want to deploy components individually or bring your own
OpenClaw/MLflow.

### Step 1: Deploy OpenClaw gateway

```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..."
./scripts/k8s/deploy.sh --openclaw-only
```

Verify:

```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx  1/1  Running
```

This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
token and creates the `clawbench-secrets` Secret automatically.

**Skip this step** if you already have an OpenClaw deployment. Your existing
gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):

```json
{
  "browser": {
    "enabled": true,
    "headless": true,
    "noSandbox": true,
    "ssrfPolicy": {
      "allowedHostnames": ["localhost", "127.0.0.1"]
    }
  },
  "tools": {
    "profile": "coding",
    "alsoAllow": ["browser"]
  }
}
```

Key requirements:
- `browser.enabled: true` — activates the bundled browser plugin
- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
- `browser.ssrfPolicy` — several eval tasks need localhost access
- Gateway must bind to loopback with token auth; export the matching
  `OPENCLAW_GATEWAY_TOKEN` before running `--add-sidecar`

### Step 2: Deploy MLflow

```bash
./scripts/k8s/deploy.sh --mlflow-only
```

Verify:

```bash
kubectl get pods -n mlflow
# Expect: mlflow-xxxx  1/1  Running
```

Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
namespace. The clawbench ConfigMap defaults to
`http://mlflow-service.mlflow.svc.cluster.local:5000`.

**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:

```bash
export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
export MLFLOW_EXPERIMENT_ID=4  # or MLFLOW_EXPERIMENT_NAME
```

### Step 3: Run the eval

```bash
./scripts/k8s/deploy.sh --add-sidecar
```

This patches the OpenClaw deployment to inject a clawbench sidecar that:

1. Waits for the gateway (TCP check on port 18789, up to 3 min)
2. Checks MLflow connectivity if configured
3. Runs `clawbench run` with settings from the ConfigMap
4. Logs results to MLflow on success
5. Sleeps indefinitely so you can retrieve logs and results

Verify:

```bash
kubectl get pods -n $CLAWBENCH_NAMESPACE
# Expect: openclaw-xxxx  2/2  Running  (gateway + clawbench)

./scripts/k8s/deploy.sh --logs
# Should show "Waiting for gateway..." then "Starting eval..."
```

When finished, remove the sidecar:

```bash
./scripts/k8s/deploy.sh --remove-sidecar
```

---

## ConfigMap tuning

The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
behavior. Override at deploy time via env vars, or patch after deploy:

| Key | Default | What it controls |
|-----|---------|-----------------|
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) |
| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
| `CLAWBENCH_TASKS` | *(empty — runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |

---

## MLflow integration

Results are logged via `scripts/log_to_mlflow.py` after a successful eval.

**What gets logged:**
- **Params**: model, provider, benchmark version, OpenClaw version, judge model
- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
  reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
- **Tags**: submission ID, timestamp, certified flag
- **Artifacts**: full benchmark result JSON

---

## Building images

### ClawBench image

`quay.io/sallyom/clawbench:latest` is public

For Kubernetes, use the lightweight sidecar image instead — it only includes
the eval harness and MLflow client:

```bash
docker build -t clawbench:latest -f scripts/k8s/Dockerfile .

# For Kind clusters, load directly instead of pushing to a registry:
kind load docker-image clawbench:latest --name openclaw

# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
# Ensure you build for the right architecture, usually amd64 for non-local k8s
```

Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.

---

## Cleanup

```bash
# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
./scripts/k8s/deploy.sh --remove-sidecar

# Delete eval namespace (keeps MLflow running)
./scripts/k8s/deploy.sh --teardown

# Delete the Kind cluster entirely
kind delete cluster --name openclaw
```