add docs, manifests for k8s

Signed-off-by: sallyom <somalley@redhat.com>
2026-05-05 21:36:44 -04:00 · 2026-05-05 21:36:44 -04:00 · 7d75d99643
commit 7d75d99643
parent d57e4a697d
16 changed files with 1290 additions and 0 deletions
--- a/README.md
+++ b/README.md
@ -461,6 +461,26 @@ python3 scripts/run_posterior_dynamics_pipeline.py \
 clawbench diagnose profiles/local_ollama_gpt_oss.yaml
 ```

+### Running on Kubernetes
+
+See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
+version:
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+export OPENAI_API_KEY="sk-..."       # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
+export CLAWBENCH_MODEL="openai/gpt-5.5"
+# export MLFLOW_NAMESPACE="mlflow"   # MLflow deploys in a separate namespace (default: mlflow)
+
+./scripts/k8s/deploy.sh              # deploys OpenClaw + MLflow + starts eval
+./scripts/k8s/deploy.sh --logs       # follow progress
+./scripts/k8s/deploy.sh --teardown   # tear down openclaw & eval (does not delete MLflow)
+```
+
+API keys are stored in a Kubernetes Secret created by the deploy script.
+MLflow is deployed in its own namespace (default: `mlflow`, configurable via
+`MLFLOW_NAMESPACE`).
+
 ---

 ## Partner Trace Spec
--- a/docs/kubernetes.md
+++ b/docs/kubernetes.md
@ -0,0 +1,361 @@
+# Running ClawBench on Kubernetes
+
+ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
+connects to the gateway over loopback (`ws://localhost:18789`), runs the
+19-task eval suite, and optionally logs results to MLflow.
+
+```
+┌─── OpenClaw Pod ─────────────────────────────┐
+│  gateway container  (ws://localhost:18789)   │
+│  clawbench sidecar  ──► gateway via loopback │
+└──────────────────────────────────────────────┘
+         │                          │
+         ▼                          ▼
+   Model provider API         MLflow (optional)
+```
+
+All commands use `scripts/k8s/deploy.sh`. The script has these modes:
+
+| Flag | What it does |
+|------|-------------|
+| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
+| `--openclaw-only` | Deploy OpenClaw gateway only |
+| `--mlflow-only` | Deploy MLflow only |
+| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
+| `--remove-sidecar` | Remove clawbench sidecar |
+| `--logs` | Tail sidecar logs |
+| `--teardown` | Delete eval namespace (keeps MLflow) |
+
+---
+
+## Prerequisites
+
+- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
+- A container image for ClawBench (see [Building images](#building-images))
+- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
+
+For local testing with Kind, see the OpenClaw Kubernetes install docs:
+https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
+
+---
+
+## Environment variables
+
+Set these **before** running `deploy.sh`.
+
+### Required
+
+| Variable | Purpose |
+|----------|---------|
+| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
+| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) |
+
+### Optional
+
+| Variable | Default | Purpose |
+|----------|---------|---------|
+| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
+| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
+| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
+| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
+| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set |
+| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
+| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
+| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
+| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
+| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
+| `GEMINI_API_KEY` | | Added to K8s secret if set |
+| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
+
+### Model routing
+
+The gateway routes by provider prefix:
+
+| Model string | Required variables |
+|-------------|-------------------|
+| `openai/gpt-5.5` | `OPENAI_API_KEY` |
+| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
+| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
+| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
+
+For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
+server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
+prefix for the model name:
+
+```bash
+export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
+export OPENAI_API_KEY="none"  # dummy value if the endpoint doesn't require auth
+export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
+```
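
The prefix routing in the table above can be sketched as a small pre-flight helper. This is only an illustration under stated assumptions: `required_key_for_model` is a hypothetical function that does not exist in `deploy.sh`, and it covers only the three prefixes the table lists.

```shell
# Hypothetical helper (not part of deploy.sh): print the API-key variable a
# model string needs, based on the provider prefix before the first "/".
required_key_for_model() {
  case "${1%%/*}" in
    openai)     echo "OPENAI_API_KEY" ;;
    anthropic)  echo "ANTHROPIC_API_KEY" ;;
    openrouter) echo "OPENROUTER_API_KEY" ;;
    *)          echo "unknown provider prefix: ${1%%/*}" >&2; return 1 ;;
  esac
}

# Only the first path segment decides the provider.
required_key_for_model "openrouter/anthropic/claude-sonnet-4-6"
```

For an OpenAI-compatible endpoint the helper would still report `OPENAI_API_KEY`; `OPENAI_API_BASE` has to be checked separately.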
+
+---
+
+## Full deploy (quick start)
+
+Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+
+# Export API keys before running. The script stores them in a K8s Secret
+# ("clawbench-secrets") that the gateway and sidecar containers read.
+export OPENAI_API_KEY="sk-..."
+
+# Model to evaluate (default: openai/gpt-5.5)
+# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
+
+./scripts/k8s/deploy.sh
+```
+
+Verify:
+
+```bash
+# Should show 2/2 containers (gateway + clawbench)
+kubectl get pods -n clawbench-eval
+
+# Follow eval progress
+./scripts/k8s/deploy.sh --logs
+```
+
+When the eval finishes, copy results and clean up:
+
+```bash
+# Copy results from the sidecar
+POD=$(kubectl get pod -n "$CLAWBENCH_NAMESPACE" -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
+kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
+
+# Remove the sidecar (keeps OpenClaw + MLflow running)
+./scripts/k8s/deploy.sh --remove-sidecar
+
+# Or tear down everything
+./scripts/k8s/deploy.sh --teardown
+```
+
+---
+
+## Existing cluster + existing MLflow
+
+If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
+you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
+required.
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+
+# API keys — export before running deploy.sh. The script creates a
+# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
+# At least one provider key is required.
+export OPENAI_API_KEY="sk-..."
+# export ANTHROPIC_API_KEY="sk-ant-..."
+# export OPENROUTER_API_KEY="sk-or-..."
+# export GEMINI_API_KEY="..."
+
+# Model to evaluate (default: openai/gpt-5.5)
+export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
+
+# Point to your existing MLflow
+export MLFLOW_TRACKING_URI="https://mlflow.example.com"
+export MLFLOW_EXPERIMENT_NAME="clawbench-claude-sonnet-4-6"  # or use MLFLOW_EXPERIMENT_ID=42
+
+# Deploy OpenClaw gateway into your cluster
+./scripts/k8s/deploy.sh --openclaw-only
+```
+
+Verify OpenClaw is running:
+
+```bash
+kubectl get pods -n clawbench-eval
+# Expect: openclaw-xxxx  1/1  Running
+```
+
+Then start the eval:
+
+```bash
+./scripts/k8s/deploy.sh --add-sidecar
+./scripts/k8s/deploy.sh --logs
+```
+
+Because `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow
+deployment and patches the URI and the experiment name/ID into the clawbench
+ConfigMap. When the eval completes, `scripts/log_to_mlflow.py` logs results to
+your MLflow under that experiment.
+
+`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
+`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
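
A minimal sketch of how these MLflow settings could be assembled into a ConfigMap merge patch, modelled on the string-building in `add_sidecar`. The function name `mlflow_patch_json` is hypothetical (the script inlines this logic), and like the script it does no JSON escaping, so values containing double quotes would produce invalid JSON.

```shell
# Hypothetical helper modelled on add_sidecar in deploy.sh: build the JSON
# merge patch for the clawbench-config ConfigMap. No JSON escaping is done.
# Args: <tracking-uri> [experiment-id] [experiment-name]
mlflow_patch_json() (
  uri="$1"; exp_id="${2:-}"; exp_name="${3:-}"
  data="\"MLFLOW_TRACKING_URI\":\"$uri\""
  if [ -n "$exp_id" ]; then
    data="$data,\"MLFLOW_EXPERIMENT_ID\":\"$exp_id\""
  fi
  if [ -n "$exp_name" ]; then
    data="$data,\"MLFLOW_EXPERIMENT_NAME\":\"$exp_name\""
  fi
  printf '{"data":{%s}}\n' "$data"
)

# Name only, no ID: the ID key is simply left out of the patch.
mlflow_patch_json "https://mlflow.example.com" "" "clawbench"
```

The script passes a string of this shape to `kubectl patch configmap clawbench-config --type merge -p`.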
+
+---
+
+## Step-by-step deploy
+
+Use this when you want to deploy components individually or bring your own
+OpenClaw/MLflow.
+
+### Step 1: Deploy OpenClaw gateway
+
+```bash
+export CLAWBENCH_NAMESPACE=clawbench-eval
+export OPENAI_API_KEY="sk-..."
+./scripts/k8s/deploy.sh --openclaw-only
+```
+
+Verify:
+
+```bash
+kubectl get pods -n clawbench-eval
+# Expect: openclaw-xxxx  1/1  Running
+```
+
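+This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
+auth, a ClusterIP service, and a 10Gi PVC. The deploy script generates a gateway
+token and creates the `clawbench-secrets` Secret automatically if it does not
+already exist. An existing Secret is left unchanged, so delete it before
+re-running if you need to change keys.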
+
+**Skip this step** if you already have an OpenClaw deployment. Your existing
+gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
+
+```json
+{
+  "browser": {
+    "enabled": true,
+    "headless": true,
+    "noSandbox": true,
+    "ssrfPolicy": {
+      "allowedHostnames": ["localhost", "127.0.0.1"]
+    }
+  },
+  "tools": {
+    "profile": "coding",
+    "alsoAllow": ["browser"]
+  }
+}
+```
+
+Key requirements:
+- `browser.enabled: true` — activates the bundled browser plugin
+- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
+- `browser.ssrfPolicy` — several eval tasks need localhost access
+- Gateway must bind to loopback with token auth
+
+### Step 2: Deploy MLflow
+
+```bash
+./scripts/k8s/deploy.sh --mlflow-only
+```
+
+Verify:
+
+```bash
+kubectl get pods -n mlflow
+# Expect: mlflow-xxxx  1/1  Running
+```
+
+Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
+namespace. The clawbench ConfigMap defaults to
+`http://mlflow-service.mlflow.svc.cluster.local:5000`.
+
+**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
+
+```bash
+export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
+export MLFLOW_EXPERIMENT_ID=4  # or MLFLOW_EXPERIMENT_NAME
+```
+
+### Step 3: Run the eval
+
+```bash
+./scripts/k8s/deploy.sh --add-sidecar
+```
+
+This patches the OpenClaw deployment to inject a clawbench sidecar that:
+
+1. Waits for the gateway (TCP check on port 18789, up to 3 min)
+2. Checks MLflow connectivity if configured
+3. Runs `clawbench run` with settings from the ConfigMap
+4. Logs results to MLflow on success
+5. Sleeps indefinitely so you can retrieve logs and results
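
The wait in step 1 is a bounded retry loop. The sketch below shows the pattern in isolation. `wait_until` and `gateway_up` are hypothetical names, not functions from `deploy.sh`, and the probe assumes `python3` is on PATH, as it is in the sidecar image. Unlike this sketch, the sidecar command in `deploy.sh` does not stop when the loop runs out; it carries on to the MLflow check and the eval.

```shell
# Generic retry loop: run a probe command up to $1 times, sleeping $2
# seconds after each failed attempt. Returns 0 on the first success,
# 1 if every attempt fails.
wait_until() {
  attempts="$1"; delay="$2"; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Probe equivalent to the sidecar's TCP check on the gateway port.
gateway_up() {
  python3 -c 'import socket; socket.create_connection(("127.0.0.1", 18789), 2).close()'
}

# 90 attempts x 2 s matches the 3-minute budget described above:
# wait_until 90 2 gateway_up && echo "Gateway ready"
```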
+
+Verify:
+
+```bash
+kubectl get pods -n $CLAWBENCH_NAMESPACE
+# Expect: openclaw-xxxx  2/2  Running  (gateway + clawbench)
+
+./scripts/k8s/deploy.sh --logs
+# Should show "Waiting for gateway..." then "Starting eval..."
+```
+
+When finished, remove the sidecar:
+
+```bash
+./scripts/k8s/deploy.sh --remove-sidecar
+```
+
+---
+
+## ConfigMap tuning
+
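+The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
+behavior. `deploy.sh` overrides `CLAWBENCH_MODEL`, `OPENAI_API_BASE`, and the
+MLflow settings from environment variables each time it adds the sidecar. To
+change other keys, edit the manifest before running `--add-sidecar` (which
+re-applies it), or patch the live ConfigMap and restart the pod, since the
+sidecar reads these values as environment variables at container start: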
+
+| Key | Default | What it controls |
+|-----|---------|-----------------|
+| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
+| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) |
+| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
+| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
+| `CLAWBENCH_TASKS` | *(empty — runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
+| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
+| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
+| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
+| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
+| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
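
As an illustration of how `CLAWBENCH_TASKS` is consumed, the sidecar command in `deploy.sh` expands the space-separated value into repeated `-t` flags for `clawbench run`. The sketch below reproduces that expansion on its own. `task_flags` is a hypothetical name used here for clarity.

```shell
# Turn a space-separated task list into repeated "-t <id>" flags.
# An empty value yields no flags, which means "run all tasks".
task_flags() {
  for t in $1; do   # intentional word splitting on whitespace
    printf '%s %s ' -t "$t"
  done
}

task_flags "t1-bugfix-discount t2-config-loader"; echo
```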
+
+---
+
+## MLflow integration
+
+Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
+
+**What gets logged:**
+- **Params**: model, provider, benchmark version, OpenClaw version, judge model
+- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
+  reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
+- **Tags**: submission ID, timestamp, certified flag
+- **Artifacts**: full benchmark result JSON
+
+---
+
+## Building images
+
+### ClawBench image
+
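+A prebuilt image, `quay.io/sallyom/clawbench:latest`, is public and is the
+default for `CLAWBENCH_IMAGE`.
+
+To build your own image for Kubernetes, use the lightweight sidecar Dockerfile
+(`scripts/k8s/Dockerfile`). It includes only the eval harness and MLflow client,
+not the full OpenClaw server or Chromium: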
+
+```bash
+docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
+
+# For Kind clusters, load directly instead of pushing to a registry:
+kind load docker-image clawbench:latest --name openclaw
+
+# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
+# Ensure you build for the right architecture, usually amd64 for non-local k8s
+```
+
+Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
+
+---
+
+## Cleanup
+
+```bash
+# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
+./scripts/k8s/deploy.sh --remove-sidecar
+
+# Delete eval namespace (keeps MLflow running)
+./scripts/k8s/deploy.sh --teardown
+
+# Delete the Kind cluster entirely
+kind delete cluster --name openclaw
+```
--- a/pyproject.toml
+++ b/pyproject.toml
@ -33,6 +33,9 @@ dev = [
    "pre-commit>=4.0,<5",
    "ruff>=0.9,<1",
 ]
+mlflow = [
+    "mlflow>=2.10,<3",
+]
 hermes = [
    "hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main",
 ]
--- a/scripts/k8s/Dockerfile
+++ b/scripts/k8s/Dockerfile
@ -0,0 +1,33 @@
+# Lightweight ClawBench image for Kubernetes sidecar use.
+# Does NOT include the full OpenClaw server or Chromium — the gateway runs
+# in a separate container. Node.js is copied from the OpenClaw image for
+# the device-identity handshake required by the gateway protocol.
+FROM ghcr.io/openclaw/openclaw:latest AS openclaw
+
+FROM python:3.12-slim
+
+COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends git && \
+    rm -rf /var/lib/apt/lists/*
+
+WORKDIR /app
+
+COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
+COPY clawbench/ clawbench/
+COPY tasks-public/ tasks-public/
+COPY tasks-domain/ tasks-domain/
+COPY profiles/ profiles/
+COPY baselines/ baselines/
+COPY scripts/ scripts/
+
+RUN pip install --no-cache-dir ".[mlflow]"
+
+RUN mkdir -p /results && chmod 777 /results
+
+RUN useradd -m -d /home/node clawbench
+USER clawbench
+ENV HOME=/home/node
+
+ENTRYPOINT ["clawbench"]
--- a/scripts/k8s/deploy.sh
+++ b/scripts/k8s/deploy.sh
@ -0,0 +1,394 @@
+#!/usr/bin/env bash
+# Deploy ClawBench evals on Kubernetes (works on OpenShift too).
+#
+# 0-to-hero pipeline:
+#   Step 0: Create a cluster (see --help for Kind instructions)
+#   Step 1: Deploy OpenClaw gateway         (optional — bring your own)
+#   Step 2: Deploy MLflow tracking server   (optional — bring your own)
+#   Step 3: Run evals via sidecar           (add / remove)
+#
+# Usage:
+#   ./scripts/k8s/deploy.sh                        # Full deploy: OpenClaw + MLflow + eval
+#   ./scripts/k8s/deploy.sh --openclaw-only         # Step 1: deploy OpenClaw gateway
+#   ./scripts/k8s/deploy.sh --mlflow-only           # Step 2: deploy MLflow
+#   ./scripts/k8s/deploy.sh --add-sidecar           # Step 3: add eval sidecar (starts eval)
+#   ./scripts/k8s/deploy.sh --remove-sidecar        # Step 3: remove eval sidecar
+#   ./scripts/k8s/deploy.sh --logs                  # Tail clawbench sidecar logs
+#   ./scripts/k8s/deploy.sh --teardown              # Delete eval namespace (keeps MLflow)
+#
+# Environment (required):
+#   CLAWBENCH_NAMESPACE            Namespace for OpenClaw + eval
+#   OPENAI_API_KEY                 Model provider API key (or another provider key)
+#
+# Environment (optional):
+#   CLAWBENCH_IMAGE                Clawbench image (default: quay.io/sallyom/clawbench:latest)
+#   OPENCLAW_IMAGE                 OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
+#   CLAWBENCH_MODEL                Model to eval (default: openai/gpt-5.5)
+#   MLFLOW_NAMESPACE               MLflow namespace (default: mlflow)
+#   MLFLOW_TRACKING_URI            External MLflow URI (skips MLflow deploy if set)
+#   MLFLOW_EXPERIMENT_ID           MLflow experiment ID
+#   MLFLOW_EXPERIMENT_NAME         MLflow experiment name
+#   MLFLOW_IMAGE                   MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
+#   ANTHROPIC_API_KEY              Anthropic key (added to secret if set)
+#   OPENROUTER_API_KEY             OpenRouter key (added to secret if set)
+#   GEMINI_API_KEY                 Gemini key (added to secret if set)
+#   OPENAI_API_BASE                Base URL for OpenAI-compatible endpoints (patched into gateway config)
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+NS="${CLAWBENCH_NAMESPACE:-}"
+MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}"
+CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}"
+OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}"
+MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}"
+
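+# Skip the cluster checks for -h/--help so usage prints without kubectl or a cluster.
+if [[ "${1:-}" != "-h" && "${1:-}" != "--help" ]]; then
+  command -v kubectl &>/dev/null || { echo "Missing: kubectl" >&2; exit 1; }
+  kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; }
+fi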
+
+# ---------------------------------------------------------------------------
+if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
+  cat <<'HELP'
+ClawBench Kubernetes Deployment
+===============================
+
+0-to-hero pipeline for running ClawBench evals on Kubernetes.
+
+  Step 0: Create a cluster
+          For local testing with Kind, see:
+          https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
+
+  Step 1: Deploy OpenClaw gateway (optional — skip if you have one)
+  Step 2: Deploy MLflow tracking server (optional — skip if you have one)
+  Step 3: Run evals via sidecar (add/remove to OpenClaw deployment)
+
+Usage:
+  ./scripts/k8s/deploy.sh                    Full deploy (steps 1+2+3)
+  ./scripts/k8s/deploy.sh --openclaw-only     Step 1: OpenClaw only
+  ./scripts/k8s/deploy.sh --mlflow-only       Step 2: MLflow only
+  ./scripts/k8s/deploy.sh --add-sidecar       Step 3: add eval sidecar (starts eval)
+  ./scripts/k8s/deploy.sh --remove-sidecar    Step 3: remove eval sidecar
+  ./scripts/k8s/deploy.sh --logs              Tail clawbench sidecar logs
+  ./scripts/k8s/deploy.sh --teardown          Delete eval namespace (keeps MLflow)
+
+Required environment:
+  CLAWBENCH_NAMESPACE          Namespace for OpenClaw + eval
+  OPENAI_API_KEY               Model provider API key (or ANTHROPIC_API_KEY, etc.)
+
+Optional environment:
+  CLAWBENCH_IMAGE              Clawbench image (default: quay.io/sallyom/clawbench:latest)
+  OPENCLAW_IMAGE               OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
+  CLAWBENCH_MODEL              Model to eval (default: openai/gpt-5.5)
+  MLFLOW_NAMESPACE             MLflow namespace (default: mlflow)
+  MLFLOW_TRACKING_URI          External MLflow URI (skips MLflow deploy)
+  MLFLOW_EXPERIMENT_ID         MLflow experiment ID
+  MLFLOW_EXPERIMENT_NAME       MLflow experiment name
+  MLFLOW_IMAGE                 MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
+  ANTHROPIC_API_KEY            Anthropic key (added to secret if set)
+  OPENROUTER_API_KEY           OpenRouter key (added to secret if set)
+  GEMINI_API_KEY               Gemini key (added to secret if set)
+  OPENAI_API_BASE              Base URL for OpenAI-compatible endpoints
+
+Works on Kubernetes and OpenShift.
+HELP
+  exit 0
+fi
+
+if [[ -z "$NS" ]]; then
+  echo "CLAWBENCH_NAMESPACE is required." >&2
+  echo "  export CLAWBENCH_NAMESPACE=clawbench-eval" >&2
+  exit 1
+fi
+
+MODE="full"
+while [[ $# -gt 0 ]]; do
+  case "$1" in
+    --openclaw-only)   MODE="openclaw-only" ;;
+    --mlflow-only)     MODE="mlflow-only" ;;
+    --add-sidecar)     MODE="add-sidecar" ;;
+    --remove-sidecar)  MODE="remove-sidecar" ;;
+    --logs)            MODE="logs" ;;
+    --teardown)        MODE="teardown" ;;
+    *) echo "Unknown option: $1" >&2; exit 1 ;;
+  esac
+  shift
+done
+
+# ---------------------------------------------------------------------------
+# --logs
+# ---------------------------------------------------------------------------
+if [[ "$MODE" == "logs" ]]; then
+  kubectl logs deploy/openclaw -c clawbench -n "$NS" -f
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# --teardown
+# ---------------------------------------------------------------------------
+if [[ "$MODE" == "teardown" ]]; then
+  echo "Deleting namespace '$NS'..."
+  kubectl delete namespace "$NS" --ignore-not-found
+  echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted."
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# --remove-sidecar
+# ---------------------------------------------------------------------------
+if [[ "$MODE" == "remove-sidecar" ]]; then
+  echo "Removing clawbench sidecar from openclaw in namespace '$NS'..."
+  INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))")
+  if [[ "$INDEX" == "-1" ]]; then
+    echo "No clawbench sidecar found."
+  else
+    kubectl patch deploy/openclaw -n "$NS" --type=json \
+      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]"
+    echo "Sidecar removed."
+  fi
+  exit 0
+fi
+
+# ---------------------------------------------------------------------------
+# Create namespace + secret
+# ---------------------------------------------------------------------------
+ensure_namespace_and_secret() {
+  if ! kubectl get namespace "$NS" &>/dev/null; then
+    echo "Creating namespace '$NS'..."
+    kubectl create namespace "$NS"
+  fi
+
+  if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then
+    echo "Creating clawbench-secrets..."
+    GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())")
+
+    SECRET_ARGS=(
+      --from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN"
+    )
+    [[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY")
+    [[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY")
+    [[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY")
+    [[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY")
+
+    if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then
+      echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2
+    fi
+
+    kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}"
+    echo "  Gateway token: generated"
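+    # Use if-blocks here: a trailing `[[ ... ]] && echo` that evaluates false
+    # would make this function return 1 and abort the script under `set -e`.
+    if [[ -n "${OPENAI_API_KEY:-}" ]]; then echo "  OPENAI_API_KEY: set"; fi
+    if [[ -n "${ANTHROPIC_API_KEY:-}" ]]; then echo "  ANTHROPIC_API_KEY: set"; fi
+    if [[ -n "${OPENROUTER_API_KEY:-}" ]]; then echo "  OPENROUTER_API_KEY: set"; fi
+    if [[ -n "${GEMINI_API_KEY:-}" ]]; then echo "  GEMINI_API_KEY: set"; fi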
+  else
+    echo "Secret clawbench-secrets already exists in '$NS'."
+  fi
+}
+
+# ---------------------------------------------------------------------------
+# Step 1: Deploy OpenClaw
+# ---------------------------------------------------------------------------
+deploy_openclaw() {
+  echo ""
+  echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..."
+
+  kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS"
+
+  # Patch gateway config with custom OpenAI-compatible base URL
+  if [[ -n "${OPENAI_API_BASE:-}" ]]; then
+    echo "  Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE"
+    EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}')
+    PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c "
+import json, sys, os
+cfg = json.load(sys.stdin)
+openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {})
+openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE']
+openai_cfg.setdefault('models', [])
+json.dump(cfg, sys.stdout, indent=2)
+")
+    kubectl create configmap openclaw-config -n "$NS" \
+      --from-literal="openclaw.json=$PATCHED_JSON" \
+      --dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null
+  fi
+
+  kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS"
+  kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS"
+
+  kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
+  if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then
+    kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS"
+  fi
+
+  echo "Waiting for OpenClaw rollout..."
+  kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \
+    echo "  (rollout still in progress)"
+  echo "OpenClaw deployed."
+}
+
+# ---------------------------------------------------------------------------
+# Step 2: Deploy MLflow
+# ---------------------------------------------------------------------------
+deploy_mlflow() {
+  if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then
+    echo ""
+    echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)"
+    return
+  fi
+
+  echo ""
+  echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..."
+
+  if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then
+    kubectl create namespace "$MLFLOW_NS"
+  fi
+
+  kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS"
+  kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS"
+
+  kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
+  if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then
+    kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS"
+  fi
+
+  echo "Waiting for MLflow rollout..."
+  kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \
+    echo "  (rollout still in progress)"
+
+  MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000"
+  echo "MLflow deployed: $MLFLOW_TRACKING_URI"
+}
+
+# ---------------------------------------------------------------------------
+# Step 3: Add clawbench sidecar (starts eval)
+# ---------------------------------------------------------------------------
+add_sidecar() {
+  echo ""
+  echo "Step 3: Adding clawbench eval sidecar..."
+
+  echo "Applying clawbench ConfigMap..."
+  kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null
+
+  if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then
+    kubectl patch configmap clawbench-config -n "$NS" \
+      --type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null
+    echo "  Model: $CLAWBENCH_MODEL"
+  fi
+
+  if [[ -n "${OPENAI_API_BASE:-}" ]]; then
+    kubectl patch configmap clawbench-config -n "$NS" \
+      --type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null
+    echo "  OpenAI API base: $OPENAI_API_BASE"
+  fi
+
+  # Patch MLflow settings into ConfigMap
+  MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}"
+  PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\""
+  if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then
+    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\""
+  fi
+  if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then
+    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\""
+  fi
+  kubectl patch configmap clawbench-config -n "$NS" \
+    --type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null
+  echo "  MLflow URI: $MLFLOW_URI"
+  [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo "  MLflow experiment ID: $MLFLOW_EXPERIMENT_ID"
+  [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo "  MLflow experiment name: $MLFLOW_EXPERIMENT_NAME"
+
+  # Check if sidecar already exists
+  HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')")
+
+  if [[ "$HAS_SIDECAR" == "yes" ]]; then
+    echo "Removing existing clawbench sidecar..."
+    INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
+      | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))")
+    kubectl patch deploy/openclaw -n "$NS" --type=json \
+      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null
+  fi
+
+  # Find openclaw-home volume name
+  HOME_VOLUME=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "
+import json, sys
+spec = json.load(sys.stdin)['spec']['template']['spec']
+for c in spec['containers']:
+    if c['name'] == 'gateway':
+        for vm in c.get('volumeMounts', []):
+            if vm['mountPath'] == '/home/node/.openclaw':
+                print(vm['name'])
+                sys.exit(0)
+print('openclaw-home')
+")
+
+  echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..."
+
+  # Check if results volume already exists
+  HAS_RESULTS_VOL=$(kubectl get deploy/openclaw -n "$NS" -o json \
+    | python3 -c "import json,sys; vs=json.load(sys.stdin)['spec']['template']['spec'].get('volumes',[]); print('yes' if any(v['name']=='clawbench-results' for v in vs) else 'no')")
+
+  PATCH='[{"op":"add","path":"/spec/template/spec/containers/-","value":{'
+  PATCH+='"name":"clawbench",'
+  PATCH+='"image":"'"$CLAWBENCH_IMG"'",'
+  PATCH+='"imagePullPolicy":"IfNotPresent",'
+  PATCH+='"command":["/bin/bash","-c","echo \"Waiting for gateway on localhost:18789...\"\nfor i in $(seq 1 90); do\n  python3 -c \"import socket; s=socket.create_connection((\\\"127.0.0.1\\\",18789),2); s.close()\" 2>/dev/null && echo \"Gateway ready\" && break\n  sleep 2\ndone\n\nif [ -n \"${MLFLOW_TRACKING_URI:-}\" ]; then\n  echo \"Checking MLflow at ${MLFLOW_TRACKING_URI}...\"\n  python3 -c \"import httpx,os; r=httpx.get(os.environ[\\\"MLFLOW_TRACKING_URI\\\"]+\\\"/health\\\"); print(\\\"MLflow OK:\\\",r.status_code)\" 2>&1 || echo \"MLflow pre-check failed (will retry at log time)\"\nfi\n\necho \"Starting eval...\"\nclawbench run \\\n  --model \"${CLAWBENCH_MODEL}\" \\\n  --gateway-token \"${OPENCLAW_GATEWAY_TOKEN}\" \\\n  --runs \"${CLAWBENCH_RUNS}\" \\\n  --concurrency \"${CLAWBENCH_CONCURRENCY}\" \\\n  ${CLAWBENCH_JUDGE_MODEL:+--judge-model \"${CLAWBENCH_JUDGE_MODEL}\"} \\\n  $([ -n \"${CLAWBENCH_TASKS:-}\" ] && for t in ${CLAWBENCH_TASKS}; do printf -- \"-t %s \" \"$t\"; done) \\\n  -o /results/benchmark.json\nRC=$?\nif [ $RC -eq 0 ] && [ -n \"${MLFLOW_TRACKING_URI:-}\" ]; then\n  python scripts/log_to_mlflow.py /results/benchmark.json\nfi\necho \"ClawBench finished (exit=$RC)\"\nsleep infinity"],'
+  PATCH+='"envFrom":[{"configMapRef":{"name":"clawbench-config"}}],'
+  PATCH+='"env":[{"name":"OPENCLAW_GATEWAY_TOKEN","valueFrom":{"secretKeyRef":{"name":"clawbench-secrets","key":"OPENCLAW_GATEWAY_TOKEN"}}}],'
+  PATCH+='"resources":{"requests":{"memory":"1Gi","cpu":"500m"},"limits":{"memory":"4Gi","cpu":"2"}},'
+  PATCH+='"volumeMounts":[{"name":"'"$HOME_VOLUME"'","mountPath":"/home/node/.openclaw"},{"name":"clawbench-results","mountPath":"/results"},{"name":"tmp-volume","mountPath":"/tmp"}],'
+  PATCH+='"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]}}'
+  PATCH+='}}'
+
+  if [[ "$HAS_RESULTS_VOL" == "no" ]]; then
+    PATCH+=',{"op":"add","path":"/spec/template/spec/volumes/-","value":{"name":"clawbench-results","emptyDir":{}}}'
+  fi
+
+  PATCH+=']'
+
+  kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null
+
+  echo ""
+  echo "Waiting for rollout..."
+  kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \
+    echo "  (rollout timeout — eval runs for 30-60 min)"
+
+  echo ""
+  echo "Eval is running. Follow logs with:"
+  echo "  ./scripts/k8s/deploy.sh --logs"
+  echo ""
+  echo "When finished, remove the sidecar with:"
+  echo "  ./scripts/k8s/deploy.sh --remove-sidecar"
+}
+
+# ---------------------------------------------------------------------------
+# Execute
+# ---------------------------------------------------------------------------
+case "$MODE" in
+  full)
+    ensure_namespace_and_secret
+    deploy_openclaw
+    deploy_mlflow
+    add_sidecar
+    ;;
+  openclaw-only)
+    ensure_namespace_and_secret
+    deploy_openclaw
+    echo ""
+    echo "OpenClaw is running. Next steps:"
+    echo "  ./scripts/k8s/deploy.sh --mlflow-only       # Deploy MLflow"
+    echo "  ./scripts/k8s/deploy.sh --add-sidecar       # Start eval"
+    ;;
+  mlflow-only)
+    deploy_mlflow
+    ;;
+  add-sidecar)
+    if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then
+      echo "Deployment 'openclaw' not found in namespace '$NS'." >&2
+      echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2
+      exit 1
+    fi
+    add_sidecar
+    ;;
+esac
--- a/scripts/k8s/manifests/configmap.yaml
+++ b/scripts/k8s/manifests/configmap.yaml
@ -0,0 +1,18 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: clawbench-config
+  labels:
+    app: clawbench
+data:
+  CLAWBENCH_MODEL: "openai/gpt-5.5"
+  OPENAI_API_BASE: ""
+  CLAWBENCH_RUNS: "3"
+  CLAWBENCH_CONCURRENCY: "4"
+  CLAWBENCH_JUDGE_MODEL: ""
+  CLAWBENCH_TASKS: ""
+  CLAWBENCH_CONNECT_TIMEOUT: "120"
+  CLAWBENCH_REQUEST_TIMEOUT: "300"
+  CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600"
+  MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000"
+  MLFLOW_EXPERIMENT_NAME: "clawbench"
--- a/scripts/k8s/manifests/secret.yaml
+++ b/scripts/k8s/manifests/secret.yaml
@ -0,0 +1,15 @@
+# Reference template — do NOT apply directly.
+# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
+# from exported environment variables (OPENAI_API_KEY, etc.).
+apiVersion: v1
+kind: Secret
+metadata:
+  name: clawbench-secrets
+  labels:
+    app: clawbench
+type: Opaque
+stringData:
+  OPENAI_API_KEY: "REPLACE_ME"
+  # Add other provider keys as needed:
+  # ANTHROPIC_API_KEY: "REPLACE_ME"
+  # OPENROUTER_API_KEY: "REPLACE_ME"
--- a/scripts/k8s/mlflow/deployment.yaml
+++ b/scripts/k8s/mlflow/deployment.yaml
@ -0,0 +1,68 @@
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: mlflow
+  labels:
+    app: mlflow
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: mlflow
+  template:
+    metadata:
+      labels:
+        app: mlflow
+    spec:
+      containers:
+        - name: mlflow
+          image: ghcr.io/mlflow/mlflow:v2.21.3
+          command:
+            - mlflow
+            - server
+            - --host
+            - "0.0.0.0"
+            - --port
+            - "5000"
+            - --backend-store-uri
+            - sqlite:////mlflow/mlflow.db
+            - --artifacts-destination
+            - /mlflow/artifacts
+            - --serve-artifacts
+          ports:
+            - name: http
+              containerPort: 5000
+              protocol: TCP
+          livenessProbe:
+            httpGet:
+              path: /health
+              port: 5000
+            initialDelaySeconds: 15
+            periodSeconds: 30
+          readinessProbe:
+            httpGet:
+              path: /health
+              port: 5000
+            initialDelaySeconds: 5
+            periodSeconds: 10
+          resources:
+            requests:
+              cpu: 100m
+              memory: 256Mi
+            limits:
+              cpu: 500m
+              memory: 1Gi
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
+          volumeMounts:
+            - name: mlflow-data
+              mountPath: /mlflow
+      volumes:
+        - name: mlflow-data
+          persistentVolumeClaim:
+            claimName: mlflow-data-pvc
--- a/scripts/k8s/mlflow/pvc.yaml
+++ b/scripts/k8s/mlflow/pvc.yaml
@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: mlflow-data-pvc
+  labels:
+    app: mlflow
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 5Gi
--- a/scripts/k8s/mlflow/service.yaml
+++ b/scripts/k8s/mlflow/service.yaml
@ -0,0 +1,15 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: mlflow-service
+  labels:
+    app: mlflow
+spec:
+  type: ClusterIP
+  selector:
+    app: mlflow
+  ports:
+    - name: http
+      port: 5000
+      targetPort: 5000
+      protocol: TCP
--- a/scripts/k8s/openclaw/configmap.yaml
+++ b/scripts/k8s/openclaw/configmap.yaml
@ -0,0 +1,36 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: openclaw-config
+  labels:
+    app: openclaw
+data:
+  openclaw.json: |
+    {
+      "gateway": {
+        "mode": "local",
+        "bind": "loopback",
+        "port": 18789,
+        "auth": {
+          "mode": "token"
+        }
+      },
+      "browser": {
+        "enabled": true,
+        "headless": true,
+        "noSandbox": true,
+        "ssrfPolicy": {
+          "allowedHostnames": ["localhost", "127.0.0.1"]
+        }
+      },
+      "tools": {
+        "profile": "coding",
+        "alsoAllow": ["browser"]
+      },
+      "agents": {
+        "defaults": {
+          "workspace": "~/.openclaw/workspace"
+        }
+      },
+      "cron": { "enabled": false }
+    }
--- a/scripts/k8s/openclaw/deployment.yaml
+++ b/scripts/k8s/openclaw/deployment.yaml
@ -0,0 +1,146 @@
+# OpenClaw gateway deployment for ClawBench evals.
+#
+# Build the image with browser support:
+#   docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \
+#     -t quay.io/yourorg/openclaw:eval .
+#
+# Or use upstream without browser (browser eval tasks will score 0):
+#   image: ghcr.io/openclaw/openclaw:latest
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: openclaw
+  labels:
+    app: openclaw
+spec:
+  replicas: 1
+  strategy:
+    type: Recreate
+  selector:
+    matchLabels:
+      app: openclaw
+  template:
+    metadata:
+      labels:
+        app: openclaw
+    spec:
+      initContainers:
+        - name: init-config
+          image: registry.access.redhat.com/ubi9-minimal:latest
+          command:
+            - sh
+            - -c
+            - |
+              cp /config/openclaw.json /home/node/.openclaw/openclaw.json
+              chmod 666 /home/node/.openclaw/openclaw.json
+              mkdir -p /home/node/.openclaw/workspace
+              mkdir -p /home/node/.openclaw/agents
+              chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents
+              echo "Config initialized"
+          volumeMounts:
+            - name: openclaw-home
+              mountPath: /home/node/.openclaw
+            - name: config-template
+              mountPath: /config
+          resources:
+            limits:
+              cpu: 200m
+              memory: 128Mi
+            requests:
+              cpu: 50m
+              memory: 64Mi
+      containers:
+        - name: gateway
+          image: ghcr.io/openclaw/openclaw:latest
+          imagePullPolicy: IfNotPresent
+          command:
+            - sh
+            - -c
+            - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured
+          env:
+            - name: HOME
+              value: /home/node
+            - name: NODE_ENV
+              value: production
+            - name: OPENCLAW_CONFIG_DIR
+              value: /home/node/.openclaw
+            - name: OPENCLAW_STATE_DIR
+              value: /home/node/.openclaw
+            - name: OPENCLAW_GATEWAY_TOKEN
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: OPENCLAW_GATEWAY_TOKEN
+            - name: OPENAI_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: OPENAI_API_KEY
+                  optional: true
+            - name: ANTHROPIC_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: ANTHROPIC_API_KEY
+                  optional: true
+            - name: OPENROUTER_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: OPENROUTER_API_KEY
+                  optional: true
+            - name: GEMINI_API_KEY
+              valueFrom:
+                secretKeyRef:
+                  name: clawbench-secrets
+                  key: GEMINI_API_KEY
+                  optional: true
+          ports:
+            - name: gateway
+              containerPort: 18789
+              protocol: TCP
+          livenessProbe:
+            exec:
+              command:
+                - node
+                - -e
+                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
+            initialDelaySeconds: 60
+            periodSeconds: 30
+            timeoutSeconds: 10
+          readinessProbe:
+            exec:
+              command:
+                - node
+                - -e
+                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
+            initialDelaySeconds: 30
+            periodSeconds: 10
+            timeoutSeconds: 5
+          resources:
+            requests:
+              cpu: 250m
+              memory: 1Gi
+            limits:
+              cpu: "2"
+              memory: 4Gi
+          securityContext:
+            allowPrivilegeEscalation: false
+            capabilities:
+              drop:
+                - ALL
+          volumeMounts:
+            - name: openclaw-home
+              mountPath: /home/node/.openclaw
+            - name: tmp-volume
+              mountPath: /tmp
+      terminationGracePeriodSeconds: 30
+      volumes:
+        - name: openclaw-home
+          persistentVolumeClaim:
+            claimName: openclaw-home-pvc
+        - name: config-template
+          configMap:
+            name: openclaw-config
+        - name: tmp-volume
+          emptyDir: {}
--- a/scripts/k8s/openclaw/pvc.yaml
+++ b/scripts/k8s/openclaw/pvc.yaml
@ -0,0 +1,12 @@
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: openclaw-home-pvc
+  labels:
+    app: openclaw
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 10Gi
--- a/scripts/k8s/openclaw/secret.yaml
+++ b/scripts/k8s/openclaw/secret.yaml
@ -0,0 +1,17 @@
+# Reference template — do NOT apply directly.
+# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
+# from exported environment variables (OPENAI_API_KEY, etc.).
+apiVersion: v1
+kind: Secret
+metadata:
+  name: clawbench-secrets
+  labels:
+    app: openclaw
+type: Opaque
+stringData:
+  OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME"
+  OPENAI_API_KEY: "REPLACE_ME"
+  # Add other provider keys as needed:
+  # ANTHROPIC_API_KEY: "REPLACE_ME"
+  # OPENROUTER_API_KEY: "REPLACE_ME"
+  # GEMINI_API_KEY: "REPLACE_ME"
--- a/scripts/k8s/openclaw/service.yaml
+++ b/scripts/k8s/openclaw/service.yaml
@ -0,0 +1,15 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: openclaw
+  labels:
+    app: openclaw
+spec:
+  type: ClusterIP
+  selector:
+    app: openclaw
+  ports:
+    - name: gateway
+      port: 18789
+      targetPort: 18789
+      protocol: TCP
--- a/scripts/log_to_mlflow.py
+++ b/scripts/log_to_mlflow.py
@ -0,0 +1,126 @@
+#!/usr/bin/env python3
+"""Log a ClawBench BenchmarkResult to MLflow.
+
+Standalone script -- not imported by the clawbench package.
+Requires: pip install mlflow  (or pip install clawbench[mlflow])
+
+Usage:
+    python scripts/log_to_mlflow.py /results/benchmark.json
+
+Environment:
+    MLFLOW_TRACKING_URI      MLflow tracking server (if unset, MLflow uses its own default store)
+    MLFLOW_EXPERIMENT_ID     Experiment ID (takes precedence over MLFLOW_EXPERIMENT_NAME)
+    MLFLOW_EXPERIMENT_NAME   Experiment name (default: clawbench)
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import sys
+from pathlib import Path
+
+sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
+
+
+def main(result_path: str) -> None:
+    try:
+        import mlflow
+    except ImportError:
+        print(
+            "mlflow is not installed. Install with: pip install mlflow"
+            "  (or pip install clawbench[mlflow])",
+            file=sys.stderr,
+        )
+        sys.exit(1)
+
+    from clawbench.schemas import BenchmarkResult
+
+    with open(result_path, encoding="utf-8") as f:
+        result = BenchmarkResult(**json.load(f))
+
+    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
+    if experiment_id:
+        experiment = mlflow.set_experiment(experiment_id=experiment_id)
+    else:
+        experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench"))
+
+    run_name = f"{result.model}-{result.submission_id[:8]}"
+    with mlflow.start_run(run_name=run_name):
+        mlflow.log_params(
+            {
+                "model": result.model,
+                "provider": result.provider,
+                "benchmark_version": result.benchmark_version,
+                "openclaw_version": result.openclaw_version or "unknown",
+                "judge_model": result.judge_model or "none",
+                "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown",
+            }
+        )
+
+        mlflow.log_metrics(
+            {
+                "overall_score": result.overall_score,
+                "overall_completion": result.overall_completion,
+                "overall_trajectory": result.overall_trajectory,
+                "overall_behavior": result.overall_behavior,
+                "overall_reliability": result.overall_reliability,
+                "overall_pass_hat_k": result.overall_pass_hat_k,
+                "overall_judge_score": result.overall_judge_score,
+                "overall_judge_confidence": result.overall_judge_confidence,
+                "overall_judge_pass_rate": result.overall_judge_pass_rate,
+                "judge_task_coverage": result.judge_task_coverage,
+                "overall_weighted_query_score": result.overall_weighted_query_score,
+                "overall_median_latency_ms": result.overall_median_latency_ms,
+                "overall_p95_latency_ms": result.overall_p95_latency_ms,
+                "overall_total_tokens": result.overall_total_tokens,
+                "overall_cost_usd": result.overall_cost_usd,
+                "overall_tokens_per_pass": result.overall_tokens_per_pass,
+                "overall_cost_per_pass": result.overall_cost_per_pass,
+                "overall_ci_lower": result.overall_ci_lower,
+                "overall_ci_upper": result.overall_ci_upper,
+            }
+        )
+
+        for tier in result.tier_results:
+            mlflow.log_metrics(
+                {
+                    f"{tier.tier}/score": tier.mean_task_score,
+                    f"{tier.tier}/completion": tier.mean_completion,
+                    f"{tier.tier}/trajectory": tier.mean_trajectory,
+                    f"{tier.tier}/behavior": tier.mean_behavior,
+                    f"{tier.tier}/reliability": tier.mean_reliability,
+                }
+            )
+
+        for i, task in enumerate(result.task_results):
+            mlflow.log_metrics(
+                {
+                    f"task/{task.task_id}/score": task.mean_task_score,
+                    f"task/{task.task_id}/reliability": task.reliability_score,
+                },
+                step=i,
+            )
+
+        mlflow.set_tags(
+            {
+                "submission_id": result.submission_id,
+                "timestamp": result.timestamp,
+                "certified": str(result.certified),
+            }
+        )
+
+        try:
+            mlflow.log_artifact(result_path)
+        except Exception as e:
+            print(f"Warning: artifact upload failed: {e}", file=sys.stderr)
+            print("Metrics and params were logged successfully.", file=sys.stderr)
+
+    print(f"Logged to MLflow: experiment={experiment.name} run={run_name}")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        print(f"Usage: {sys.argv[0]} <result.json>", file=sys.stderr)
+        sys.exit(1)
+    main(sys.argv[1])
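
Note on `scripts/log_to_mlflow.py`: the script passes every `overall_*` field straight to `mlflow.log_metrics`. If any of those fields can be `None` or non-finite, MLflow is likely to reject the call. This is an assumption, since the `BenchmarkResult` schema is not part of this diff, but the `judge_model or "none"` fallback suggests judge fields may be unset when no judge model runs. A minimal sketch of a filter that could wrap each metrics dict follows; `numeric_metrics` is a hypothetical helper, not something in the commit:

```python
import math
from numbers import Real


def numeric_metrics(raw):
    """Return only the entries of `raw` whose values are finite real numbers."""
    clean = {}
    for key, value in raw.items():
        # bool is a subclass of int; skip it so flags are not logged as 0/1.
        if isinstance(value, bool) or not isinstance(value, Real):
            continue
        number = float(value)
        if math.isfinite(number):
            clean[key] = number
    return clean


if __name__ == "__main__":
    # Hypothetical values standing in for fields of a BenchmarkResult.
    sample = {
        "overall_score": 0.82,
        "overall_judge_score": None,  # e.g. no judge model configured
        "overall_total_tokens": 15000,
        "overall_cost_per_pass": float("nan"),
    }
    print(numeric_metrics(sample))
```

In the script this would be used as `mlflow.log_metrics(numeric_metrics({...}))`, so missing values are skipped and the rest of the run is still logged.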