Signed-off-by: sallyom <somalley@redhat.com>

2026-05-06 08:19:58 -04:00

11 KiB

Raw Blame History

Running ClawBench on Kubernetes

ClawBench runs as a sidecar in the OpenClaw gateway pod. The sidecar connects to the gateway over loopback (ws://localhost:18789), runs the 19-task eval suite, and optionally logs results to MLflow.

┌─── OpenClaw Pod ─────────────────────────────┐
│  gateway container  (ws://localhost:18789)   │
│  clawbench sidecar  ──► gateway via loopback │
└──────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
   Model provider API         MLflow (optional)

All commands use scripts/k8s/deploy.sh. The script has these modes:

Flag	What it does
(none)	Full deploy: OpenClaw + MLflow + eval sidecar
`--openclaw-only`	Deploy OpenClaw gateway only
`--mlflow-only`	Deploy MLflow only
`--add-sidecar`	Inject clawbench sidecar (starts eval)
`--remove-sidecar`	Remove clawbench sidecar
`--logs`	Tail sidecar logs
`--teardown`	Delete eval namespace (keeps MLflow)

Prerequisites

kubectl on PATH, connected to a cluster (kubectl cluster-info succeeds)
A container image for ClawBench (see Building images)
At least one model provider API key (OPENAI_API_KEY, ANTHROPIC_API_KEY, etc.)

For local testing with Kind: https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind

Environment variables

Set these before running deploy.sh.

Required

Variable	Purpose
`CLAWBENCH_NAMESPACE`	Namespace for OpenClaw + eval (e.g. `clawbench-eval`)
`OPENAI_API_KEY`	Model provider key (or use another provider — see table below)

Optional

Variable	Default	Purpose
`CLAWBENCH_IMAGE`	`quay.io/sallyom/clawbench:latest`	ClawBench sidecar image
`OPENCLAW_IMAGE`	`ghcr.io/openclaw/openclaw:latest`	OpenClaw gateway image
`CLAWBENCH_MODEL`	`openai/gpt-5.5`	Model to evaluate
`MLFLOW_NAMESPACE`	`mlflow`	MLflow namespace
`MLFLOW_TRACKING_URI`	(deployed by script)	External MLflow URI — skips MLflow deploy if set
`MLFLOW_EXPERIMENT_ID`		MLflow experiment ID
`MLFLOW_EXPERIMENT_NAME`	`clawbench`	MLflow experiment name
`MLFLOW_IMAGE`	`ghcr.io/mlflow/mlflow:v2.21.3`	MLflow server image
`ANTHROPIC_API_KEY`		Added to K8s secret if set
`OPENROUTER_API_KEY`		Added to K8s secret if set
`GEMINI_API_KEY`		Added to K8s secret if set
`OPENAI_API_BASE`		Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config

Model routing

The gateway routes by provider prefix:

Model string	Required variables
`openai/gpt-5.5`	`OPENAI_API_KEY`
`anthropic/claude-sonnet-4-6`	`ANTHROPIC_API_KEY`
`openrouter/anthropic/claude-sonnet-4-6`	`OPENROUTER_API_KEY`
`openai/my-local-model`	`OPENAI_API_KEY` + `OPENAI_API_BASE`

For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model server), set OPENAI_API_BASE to the endpoint URL and use the openai/ prefix for the model name:

export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
export OPENAI_API_KEY="none"  # dummy value if the endpoint doesn't require auth
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"

Full deploy (quick start)

Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.

export CLAWBENCH_NAMESPACE=clawbench-eval

# Export API keys before running. The script stores them in a K8s Secret
# ("clawbench-secrets") that the gateway and sidecar containers read.
export OPENAI_API_KEY="sk-..."

# Model to evaluate (default: openai/gpt-5.5)
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"

./scripts/k8s/deploy.sh

Verify:

# Should show 2/2 containers (gateway + clawbench)
kubectl get pods -n clawbench-eval

# Follow eval progress
./scripts/k8s/deploy.sh --logs

When the eval finishes, copy results and clean up:

# Copy results from the sidecar
POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json

# Remove the sidecar (keeps OpenClaw + MLflow running)
./scripts/k8s/deploy.sh --remove-sidecar

# Or tear down everything
./scripts/k8s/deploy.sh --teardown

Existing cluster + existing MLflow

If you already have an OpenShift or Kubernetes cluster and an MLflow instance, you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup required.

export CLAWBENCH_NAMESPACE=clawbench-eval

# API keys — export before running deploy.sh. The script creates a
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
# At least one provider key is required.
export OPENAI_API_KEY="sk-..."
# export ANTHROPIC_API_KEY="sk-ant-..."
# export OPENROUTER_API_KEY="sk-or-..."
# export GEMINI_API_KEY="..."

# Model to evaluate (default: openai/gpt-5.5)
export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"

# Point to your existing MLflow
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5"  # or use MLFLOW_EXPERIMENT_ID=42

# Deploy OpenClaw gateway into your cluster
./scripts/k8s/deploy.sh --openclaw-only

Verify OpenClaw is running:

kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx  1/1  Running

Then start the eval:

./scripts/k8s/deploy.sh --add-sidecar
./scripts/k8s/deploy.sh --logs

The deploy script sets MLFLOW_TRACKING_URI to skip its own MLflow deployment and patches the experiment name/ID into the clawbench ConfigMap. When the eval completes, scripts/log_to_mlflow.py logs results to your MLflow under that experiment.

MLFLOW_EXPERIMENT_NAME creates the experiment if it doesn't exist. MLFLOW_EXPERIMENT_ID requires an existing experiment.

Step-by-step deploy

Use this when you want to deploy components individually or bring your own OpenClaw/MLflow.

Step 1: Deploy OpenClaw gateway

export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..."
./scripts/k8s/deploy.sh --openclaw-only

Verify:

kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx  1/1  Running

This deploys from scripts/k8s/openclaw/: a single gateway pod with token auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway token and creates the clawbench-secrets Secret automatically.

Skip this step if you already have an OpenClaw deployment. Your existing gateway must have this config (see scripts/k8s/openclaw/configmap.yaml):

{
  "browser": {
    "enabled": true,
    "headless": true,
    "noSandbox": true,
    "ssrfPolicy": {
      "allowedHostnames": ["localhost", "127.0.0.1"]
    }
  },
  "tools": {
    "profile": "coding",
    "alsoAllow": ["browser"]
  }
}

Key requirements:

browser.enabled: true — activates the bundled browser plugin
tools.alsoAllow: ["browser"] — the coding profile does NOT include browser by default
browser.ssrfPolicy — several eval tasks need localhost access
Gateway must bind to loopback with token auth

Step 2: Deploy MLflow

./scripts/k8s/deploy.sh --mlflow-only

Verify:

kubectl get pods -n mlflow
# Expect: mlflow-xxxx  1/1  Running

Deploys a single-replica MLflow server with SQLite backend into the mlflow namespace. The clawbench ConfigMap defaults to http://mlflow-service.mlflow.svc.cluster.local:5000.

Skip this step if you have an external MLflow — set MLFLOW_TRACKING_URI:

export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
export MLFLOW_EXPERIMENT_ID=4  # or MLFLOW_EXPERIMENT_NAME

Step 3: Run the eval

./scripts/k8s/deploy.sh --add-sidecar

This patches the OpenClaw deployment to inject a clawbench sidecar that:

Waits for the gateway (TCP check on port 18789, up to 3 min)
Checks MLflow connectivity if configured
Runs clawbench run with settings from the ConfigMap
Logs results to MLflow on success
Sleeps indefinitely so you can retrieve logs and results

Verify:

kubectl get pods -n $CLAWBENCH_NAMESPACE
# Expect: openclaw-xxxx  2/2  Running  (gateway + clawbench)

./scripts/k8s/deploy.sh --logs
# Should show "Waiting for gateway..." then "Starting eval..."

When finished, remove the sidecar:

./scripts/k8s/deploy.sh --remove-sidecar

ConfigMap tuning

The clawbench ConfigMap (scripts/k8s/manifests/configmap.yaml) controls eval behavior. Override at deploy time via env vars, or patch after deploy:

Key	Default	What it controls
`CLAWBENCH_MODEL`	`openai/gpt-5.5`	Model under test
`CLAWBENCH_RUNS`	`3`	Runs per task (19 tasks x 3 = 57 total)
`CLAWBENCH_CONCURRENCY`	`4`	Parallel eval lanes
`CLAWBENCH_JUDGE_MODEL`	(empty)	Separate judge model (optional)
`CLAWBENCH_TASKS`	(empty — runs all)	Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`)
`CLAWBENCH_CONNECT_TIMEOUT`	`120`	Gateway connect timeout in seconds
`CLAWBENCH_REQUEST_TIMEOUT`	`300`	Per-request timeout in seconds
`CLAWBENCH_PER_RUN_BUDGET_SECONDS`	`600`	Max wall time per run
`MLFLOW_TRACKING_URI`	`http://mlflow-service.mlflow.svc.cluster.local:5000`	MLflow endpoint
`MLFLOW_EXPERIMENT_NAME`	`clawbench`	MLflow experiment name

MLflow integration

Results are logged via scripts/log_to_mlflow.py after a successful eval.

What gets logged:

Params: model, provider, benchmark version, OpenClaw version, judge model
Metrics: overall score, per-axis scores (completion, trajectory, behavior, reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
Tags: submission ID, timestamp, certified flag
Artifacts: full benchmark result JSON

Building images

ClawBench image

quay.io/sallyom/clawbench:latest is public

For Kubernetes, use the lightweight sidecar image instead — it only includes the eval harness and MLflow client:

docker build -t clawbench:latest -f scripts/k8s/Dockerfile .

# For Kind clusters, load directly instead of pushing to a registry:
kind load docker-image clawbench:latest --name openclaw

# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
# Ensure you build for the right architecture, usually amd64 for non-local k8s

Set CLAWBENCH_IMAGE=clawbench:latest when running deploy.sh to use it.

Cleanup

# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
./scripts/k8s/deploy.sh --remove-sidecar

# Delete eval namespace (keeps MLflow running)
./scripts/k8s/deploy.sh --teardown

# Delete the Kind cluster entirely
kind delete cluster --name openclaw

11 KiB Raw Blame History

Running ClawBench on Kubernetes

Prerequisites

Environment variables

Required

Optional

Model routing

Full deploy (quick start)

Existing cluster + existing MLflow

Step-by-step deploy

Step 1: Deploy OpenClaw gateway

Step 2: Deploy MLflow

Step 3: Run the eval

ConfigMap tuning

MLflow integration

Building images

ClawBench image

Cleanup

11 KiB

Raw Blame History