add docs, manifests for k8s
Signed-off-by: sallyom <somalley@redhat.com>
parent
d57e4a697d
commit
7d75d99643
20
README.md
@@ -461,6 +461,26 @@ python3 scripts/run_posterior_dynamics_pipeline.py \
|
||||
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
|
||||
```
|
||||
|
||||
### Running on Kubernetes
|
||||
|
||||
See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
|
||||
version:
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
|
||||
export CLAWBENCH_MODEL="openai/gpt-5.5"
|
||||
# export MLFLOW_NAMESPACE="mlflow" # MLflow deploys in a separate namespace (default: mlflow)
|
||||
|
||||
./scripts/k8s/deploy.sh # deploys OpenClaw + MLflow + starts eval
|
||||
./scripts/k8s/deploy.sh --logs # follow progress
|
||||
./scripts/k8s/deploy.sh --teardown # tear down openclaw & eval (does not delete MLflow)
|
||||
```
|
||||
|
||||
API keys are stored in a Kubernetes Secret created by the deploy script.
|
||||
MLflow is deployed in its own namespace (default: `mlflow`, configurable via
|
||||
`MLFLOW_NAMESPACE`).
|
||||
|
||||
---
|
||||
|
||||
## Partner Trace Spec
|
||||
|
||||
361
docs/kubernetes.md
Normal file
@@ -0,0 +1,361 @@
|
||||
# Running ClawBench on Kubernetes
|
||||
|
||||
ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
|
||||
connects to the gateway over loopback (`ws://localhost:18789`), runs the
|
||||
19-task eval suite, and optionally logs results to MLflow.
|
||||
|
||||
```
┌─── OpenClaw Pod ─────────────────────────────┐
│  gateway container (ws://localhost:18789)    │
│  clawbench sidecar ──► gateway via loopback  │
└──────────────────────────────────────────────┘
         │                        │
         ▼                        ▼
  Model provider API       MLflow (optional)
```
|
||||
|
||||
All commands use `scripts/k8s/deploy.sh`. The script has these modes:
|
||||
|
||||
| Flag | What it does |
|------|-------------|
| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
| `--openclaw-only` | Deploy OpenClaw gateway only |
| `--mlflow-only` | Deploy MLflow only |
| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
| `--remove-sidecar` | Remove clawbench sidecar |
| `--logs` | Tail sidecar logs |
| `--teardown` | Delete eval namespace (keeps MLflow) |
|
||||
|
||||
---
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
- `python3` on PATH (`deploy.sh` uses it to generate the gateway token and to edit JSON)
|
||||
- A container image for ClawBench (see [Building images](#building-images))
|
||||
- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
|
||||
|
||||
For local testing with Kind:
|
||||
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
|
||||
|
||||
---
|
||||
|
||||
## Environment variables
|
||||
|
||||
Set these **before** running `deploy.sh`.
|
||||
|
||||
### Required
|
||||
|
||||
| Variable | Purpose |
|----------|---------|
| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
| `OPENAI_API_KEY` | Model provider key (or use another provider; see table below) |
|
||||
|
||||
### Optional
|
||||
|
||||
| Variable | Default | Purpose |
|----------|---------|---------|
| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI; skips MLflow deploy if set |
| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
| `GEMINI_API_KEY` | | Added to K8s secret if set |
| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
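
As a rough illustration of the `OPENAI_API_BASE` row, the sketch below mirrors the JSON edit that `deploy.sh` applies to the gateway config: it sets `models.providers.openai.baseUrl` and leaves every other key untouched. The function name `patch_openai_base` and the sample config are illustrative assumptions, not part of the repository.

```python
import json


def patch_openai_base(cfg: dict, base_url: str) -> dict:
    """Set models.providers.openai.baseUrl, creating parent keys as needed."""
    openai_cfg = (
        cfg.setdefault("models", {})
        .setdefault("providers", {})
        .setdefault("openai", {})
    )
    openai_cfg["baseUrl"] = base_url
    # deploy.sh also ensures a (possibly empty) model list exists.
    openai_cfg.setdefault("models", [])
    return cfg


if __name__ == "__main__":
    # Hypothetical starting config; the real one lives in the
    # openclaw-config ConfigMap.
    sample = {"tools": {"profile": "coding", "alsoAllow": ["browser"]}}
    patched = patch_openai_base(
        sample, "http://vllm-service.my-ns.svc.cluster.local:8000/v1"
    )
    print(json.dumps(patched, indent=2))
```

Using `setdefault` at each level means an existing provider entry is extended rather than replaced, so any models already listed under `openai` survive the patch.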
|
||||
|
||||
### Model routing
|
||||
|
||||
The gateway routes by provider prefix:
|
||||
|
||||
| Model string | Required variables |
|-------------|-------------------|
| `openai/gpt-5.5` | `OPENAI_API_KEY` |
| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
|
||||
|
||||
For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
|
||||
server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
|
||||
prefix for the model name:
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
|
||||
export OPENAI_API_KEY="none" # dummy value if the endpoint doesn't require auth
|
||||
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
|
||||
```
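
The routing table can be read as a lookup on the text before the first `/`. The sketch below encodes only the three prefixes listed in the table; it is an illustration of the documented mapping, not the gateway's actual routing code, and `required_variables` and `custom_endpoint` are hypothetical names.

```python
# Provider prefix -> API key variable, copied from the routing table.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def required_variables(model: str, custom_endpoint: bool = False) -> list[str]:
    """Return the variables to export for a given CLAWBENCH_MODEL string."""
    provider, sep, name = model.partition("/")
    if not sep or not name:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    if provider not in PROVIDER_KEYS:
        raise ValueError(f"prefix {provider!r} is not in the routing table")
    needed = [PROVIDER_KEYS[provider]]
    if custom_endpoint:
        # Only the openai/ prefix is documented for OpenAI-compatible servers.
        if provider != "openai":
            raise ValueError("custom endpoints use the 'openai/' prefix")
        needed.append("OPENAI_API_BASE")
    return needed


if __name__ == "__main__":
    print(required_variables("openrouter/anthropic/claude-sonnet-4-6"))
    print(required_variables("openai/meta-llama/Llama-4-Scout-17B", custom_endpoint=True))
```

Splitting on the first `/` only is what lets model names such as `openrouter/anthropic/claude-sonnet-4-6` keep their own slashes.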
|
||||
|
||||
---
|
||||
|
||||
## Full deploy (quick start)
|
||||
|
||||
Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
|
||||
# Export API keys before running. The script stores them in a K8s Secret
|
||||
# ("clawbench-secrets") that the gateway and sidecar containers read.
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
|
||||
# Model to evaluate (default: openai/gpt-5.5)
|
||||
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
|
||||
|
||||
./scripts/k8s/deploy.sh
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
# Should show 2/2 containers (gateway + clawbench)
|
||||
kubectl get pods -n clawbench-eval
|
||||
|
||||
# Follow eval progress
|
||||
./scripts/k8s/deploy.sh --logs
|
||||
```
|
||||
|
||||
When the eval finishes, copy results and clean up:
|
||||
|
||||
```bash
|
||||
# Copy results from the sidecar
|
||||
POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
|
||||
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
|
||||
|
||||
# Remove the sidecar (keeps OpenClaw + MLflow running)
|
||||
./scripts/k8s/deploy.sh --remove-sidecar
|
||||
|
||||
# Or delete the eval namespace (MLflow keeps running)
|
||||
./scripts/k8s/deploy.sh --teardown
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Existing cluster + existing MLflow
|
||||
|
||||
If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
|
||||
you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
|
||||
required.
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
|
||||
# API keys — export before running deploy.sh. The script creates a
|
||||
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
|
||||
# At least one provider key is required.
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
# export ANTHROPIC_API_KEY="sk-ant-..."
|
||||
# export OPENROUTER_API_KEY="sk-or-..."
|
||||
# export GEMINI_API_KEY="..."
|
||||
|
||||
# Model to evaluate (default: openai/gpt-5.5)
|
||||
export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
|
||||
|
||||
# Point to your existing MLflow
|
||||
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
|
||||
export MLFLOW_EXPERIMENT_NAME="clawbench-claude-sonnet-4-6" # or use MLFLOW_EXPERIMENT_ID=42
|
||||
|
||||
# Deploy OpenClaw gateway into your cluster
|
||||
./scripts/k8s/deploy.sh --openclaw-only
|
||||
```
|
||||
|
||||
Verify OpenClaw is running:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n clawbench-eval
|
||||
# Expect: openclaw-xxxx 1/1 Running
|
||||
```
|
||||
|
||||
Then start the eval:
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --add-sidecar
|
||||
./scripts/k8s/deploy.sh --logs
|
||||
```
|
||||
|
||||
With `MLFLOW_TRACKING_URI` set, the deploy script skips its own MLflow deployment
|
||||
and patches the experiment name/ID into the clawbench ConfigMap. When the eval
|
||||
completes, `scripts/log_to_mlflow.py` logs results to your MLflow under that
|
||||
experiment.
|
||||
|
||||
`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
|
||||
`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
|
||||
|
||||
---
|
||||
|
||||
## Step-by-step deploy
|
||||
|
||||
Use this when you want to deploy components individually or bring your own
|
||||
OpenClaw/MLflow.
|
||||
|
||||
### Step 1: Deploy OpenClaw gateway
|
||||
|
||||
```bash
|
||||
export CLAWBENCH_NAMESPACE=clawbench-eval
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
./scripts/k8s/deploy.sh --openclaw-only
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n clawbench-eval
|
||||
# Expect: openclaw-xxxx 1/1 Running
|
||||
```
|
||||
|
||||
This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
|
||||
auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
|
||||
token and creates the `clawbench-secrets` Secret automatically.
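
For reference, the token generation is a one-liner in `deploy.sh`: 32 random bytes, base64-encoded. A minimal standalone sketch of the same call follows; the function name `generate_gateway_token` is illustrative.

```python
import base64
import secrets


def generate_gateway_token(num_bytes: int = 32) -> str:
    """Return a base64-encoded random token (deploy.sh uses 32 bytes)."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode()


if __name__ == "__main__":
    token = generate_gateway_token()
    # 32 random bytes always encode to 44 base64 characters.
    print(len(token))
```

If you manage the Secret yourself, any sufficiently random string should serve as the token, provided the gateway and the sidecar read the same value from `OPENCLAW_GATEWAY_TOKEN`.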
|
||||
|
||||
**Skip this step** if you already have an OpenClaw deployment. Your existing
|
||||
gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
|
||||
|
||||
```json
|
||||
{
|
||||
"browser": {
|
||||
"enabled": true,
|
||||
"headless": true,
|
||||
"noSandbox": true,
|
||||
"ssrfPolicy": {
|
||||
"allowedHostnames": ["localhost", "127.0.0.1"]
|
||||
}
|
||||
},
|
||||
"tools": {
|
||||
"profile": "coding",
|
||||
"alsoAllow": ["browser"]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
Key requirements:
|
||||
- `browser.enabled: true` — activates the bundled browser plugin
|
||||
- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
|
||||
- `browser.ssrfPolicy` — several eval tasks need localhost access
|
||||
- Gateway must bind to loopback with token auth
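
If you bring your own gateway, a quick pre-flight check of its config against the requirements above might look like the following. This is a sketch under the assumption that the config is available as parsed JSON with the shape shown in the excerpt; `missing_requirements` is a hypothetical helper, and it does not check the loopback bind or token auth, which are not part of that excerpt.

```python
def missing_requirements(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config passes."""
    problems = []
    browser = cfg.get("browser", {})
    tools = cfg.get("tools", {})

    if browser.get("enabled") is not True:
        problems.append("browser.enabled must be true")
    if "browser" not in tools.get("alsoAllow", []):
        problems.append('tools.alsoAllow must include "browser"')

    allowed = browser.get("ssrfPolicy", {}).get("allowedHostnames", [])
    for host in ("localhost", "127.0.0.1"):
        if host not in allowed:
            problems.append(
                f"browser.ssrfPolicy.allowedHostnames must include {host}"
            )
    return problems


if __name__ == "__main__":
    # The config excerpt from this section, as a Python dict.
    cfg = {
        "browser": {
            "enabled": True,
            "headless": True,
            "noSandbox": True,
            "ssrfPolicy": {"allowedHostnames": ["localhost", "127.0.0.1"]},
        },
        "tools": {"profile": "coding", "alsoAllow": ["browser"]},
    }
    print(missing_requirements(cfg) or "config looks compatible")
```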
|
||||
|
||||
### Step 2: Deploy MLflow
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --mlflow-only
|
||||
```
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n mlflow
|
||||
# Expect: mlflow-xxxx 1/1 Running
|
||||
```
|
||||
|
||||
Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
|
||||
namespace. The clawbench ConfigMap defaults to
|
||||
`http://mlflow-service.mlflow.svc.cluster.local:5000`.
|
||||
|
||||
**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
|
||||
|
||||
```bash
|
||||
export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
|
||||
export MLFLOW_EXPERIMENT_ID=4 # or MLFLOW_EXPERIMENT_NAME
|
||||
```
|
||||
|
||||
### Step 3: Run the eval
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --add-sidecar
|
||||
```
|
||||
|
||||
This patches the OpenClaw deployment to inject a clawbench sidecar that:
|
||||
|
||||
1. Waits for the gateway (TCP check on port 18789, up to 3 min)
|
||||
2. Checks MLflow connectivity if configured
|
||||
3. Runs `clawbench run` with settings from the ConfigMap
|
||||
4. Logs results to MLflow on success
|
||||
5. Sleeps indefinitely so you can retrieve logs and results
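
Step 1 is a plain TCP connect loop: 90 attempts, 2 seconds apart, each with a 2-second connect timeout. The sketch below reproduces that loop as a reusable function; `wait_for_port` and its parameter names are illustrative. Unlike the sidecar's shell loop, which falls through and starts the eval even if the gateway never answers, this version reports failure to the caller.

```python
import socket
import time


def wait_for_port(
    host: str = "127.0.0.1",
    port: int = 18789,
    attempts: int = 90,
    delay: float = 2.0,
    timeout: float = 2.0,
) -> bool:
    """Return True once a TCP connection succeeds, False after all attempts fail."""
    for _ in range(attempts):
        try:
            # A successful connect is enough; close the socket right away.
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            time.sleep(delay)
    return False
```

Calling `wait_for_port()` with the defaults matches the sidecar's settings; pass a smaller `attempts` value when probing from a workstation through `kubectl port-forward`.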
|
||||
|
||||
Verify:
|
||||
|
||||
```bash
|
||||
kubectl get pods -n $CLAWBENCH_NAMESPACE
|
||||
# Expect: openclaw-xxxx 2/2 Running (gateway + clawbench)
|
||||
|
||||
./scripts/k8s/deploy.sh --logs
|
||||
# Should show "Waiting for gateway..." then "Starting eval..."
|
||||
```
|
||||
|
||||
When finished, remove the sidecar:
|
||||
|
||||
```bash
|
||||
./scripts/k8s/deploy.sh --remove-sidecar
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ConfigMap tuning
|
||||
|
||||
The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
|
||||
behavior. Override at deploy time via env vars, or patch after deploy:
|
||||
|
||||
| Key | Default | What it controls |
|-----|---------|-----------------|
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) |
| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
| `CLAWBENCH_TASKS` | *(empty: runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
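
To make the mapping from ConfigMap keys to the eval command concrete, the sketch below rebuilds the `clawbench run` argument list the way the sidecar's startup command in `deploy.sh` does. It covers only the flags that command passes explicitly (how the timeout and budget keys reach `clawbench` is not shown here), and `build_run_argv` is a hypothetical helper name.

```python
def build_run_argv(config: dict, gateway_token: str) -> list[str]:
    """Translate ConfigMap values into the argv used by the sidecar."""
    argv = [
        "clawbench", "run",
        "--model", config["CLAWBENCH_MODEL"],
        "--gateway-token", gateway_token,
        "--runs", str(config["CLAWBENCH_RUNS"]),
        "--concurrency", str(config["CLAWBENCH_CONCURRENCY"]),
    ]
    # The judge model flag is added only when the key is non-empty.
    judge = config.get("CLAWBENCH_JUDGE_MODEL", "")
    if judge:
        argv += ["--judge-model", judge]
    # Space-separated task IDs become one "-t <id>" pair per task.
    for task in config.get("CLAWBENCH_TASKS", "").split():
        argv += ["-t", task]
    argv += ["-o", "/results/benchmark.json"]
    return argv


if __name__ == "__main__":
    config = {
        "CLAWBENCH_MODEL": "openai/gpt-5.5",
        "CLAWBENCH_RUNS": "3",
        "CLAWBENCH_CONCURRENCY": "4",
        "CLAWBENCH_TASKS": "t1-bugfix-discount t2-config-loader",
    }
    print(" ".join(build_run_argv(config, "<token>")))
```

Leaving `CLAWBENCH_TASKS` empty produces no `-t` flags, which is how the default of running all tasks comes about.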
|
||||
|
||||
---
|
||||
|
||||
## MLflow integration
|
||||
|
||||
Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
|
||||
|
||||
**What gets logged:**
|
||||
- **Params**: model, provider, benchmark version, OpenClaw version, judge model
|
||||
- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
|
||||
reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
|
||||
- **Tags**: submission ID, timestamp, certified flag
|
||||
- **Artifacts**: full benchmark result JSON
|
||||
|
||||
---
|
||||
|
||||
## Building images
|
||||
|
||||
### ClawBench image
|
||||
|
||||
The prebuilt image `quay.io/sallyom/clawbench:latest` is public.
|
||||
|
||||
For Kubernetes, use the lightweight sidecar image instead — it only includes
|
||||
the eval harness and MLflow client:
|
||||
|
||||
```bash
|
||||
docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
|
||||
|
||||
# For Kind clusters, load directly instead of pushing to a registry:
|
||||
kind load docker-image clawbench:latest --name openclaw
|
||||
|
||||
# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
|
||||
# Ensure you build for the right architecture, usually amd64 for non-local k8s
|
||||
```
|
||||
|
||||
Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
|
||||
|
||||
---
|
||||
|
||||
## Cleanup
|
||||
|
||||
```bash
|
||||
# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
|
||||
./scripts/k8s/deploy.sh --remove-sidecar
|
||||
|
||||
# Delete eval namespace (keeps MLflow running)
|
||||
./scripts/k8s/deploy.sh --teardown
|
||||
|
||||
# Delete the Kind cluster entirely
|
||||
kind delete cluster --name openclaw
|
||||
```
|
||||
@@ -33,6 +33,9 @@ dev = [
|
||||
"pre-commit>=4.0,<5",
|
||||
"ruff>=0.9,<1",
|
||||
]
|
||||
mlflow = [
|
||||
"mlflow>=2.10,<3",
|
||||
]
|
||||
hermes = [
|
||||
"hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main",
|
||||
]
|
||||
|
||||
33
scripts/k8s/Dockerfile
Normal file
@@ -0,0 +1,33 @@
|
||||
# Lightweight ClawBench image for Kubernetes sidecar use.
|
||||
# Does NOT include the full OpenClaw server or Chromium — the gateway runs
|
||||
# in a separate container. Node.js is copied from the OpenClaw image for
|
||||
# the device-identity handshake required by the gateway protocol.
|
||||
FROM ghcr.io/openclaw/openclaw:latest AS openclaw
|
||||
|
||||
FROM python:3.12-slim
|
||||
|
||||
COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node
|
||||
|
||||
RUN apt-get update && \
|
||||
apt-get install -y --no-install-recommends git && \
|
||||
rm -rf /var/lib/apt/lists/*
|
||||
|
||||
WORKDIR /app
|
||||
|
||||
COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
|
||||
COPY clawbench/ clawbench/
|
||||
COPY tasks-public/ tasks-public/
|
||||
COPY tasks-domain/ tasks-domain/
|
||||
COPY profiles/ profiles/
|
||||
COPY baselines/ baselines/
|
||||
COPY scripts/ scripts/
|
||||
|
||||
RUN pip install --no-cache-dir ".[mlflow]"
|
||||
|
||||
RUN mkdir -p /results && chmod 777 /results
|
||||
|
||||
RUN useradd -m -d /home/node clawbench
|
||||
USER clawbench
|
||||
ENV HOME=/home/node
|
||||
|
||||
ENTRYPOINT ["clawbench"]
|
||||
394
scripts/k8s/deploy.sh
Executable file
@@ -0,0 +1,394 @@
|
||||
#!/usr/bin/env bash
|
||||
# Deploy ClawBench evals on Kubernetes (works on OpenShift too).
|
||||
#
|
||||
# 0-to-hero pipeline:
|
||||
# Step 0: Create a cluster (see --help for Kind instructions)
|
||||
# Step 1: Deploy OpenClaw gateway (optional — bring your own)
|
||||
# Step 2: Deploy MLflow tracking server (optional — bring your own)
|
||||
# Step 3: Run evals via sidecar (add / remove)
|
||||
#
|
||||
# Usage:
|
||||
# ./scripts/k8s/deploy.sh # Full deploy: OpenClaw + MLflow + eval
|
||||
# ./scripts/k8s/deploy.sh --openclaw-only # Step 1: deploy OpenClaw gateway
|
||||
# ./scripts/k8s/deploy.sh --mlflow-only # Step 2: deploy MLflow
|
||||
# ./scripts/k8s/deploy.sh --add-sidecar # Step 3: add eval sidecar (starts eval)
|
||||
# ./scripts/k8s/deploy.sh --remove-sidecar # Step 3: remove eval sidecar
|
||||
# ./scripts/k8s/deploy.sh --logs # Tail clawbench sidecar logs
|
||||
# ./scripts/k8s/deploy.sh --teardown # Delete eval namespace (keeps MLflow)
|
||||
#
|
||||
# Environment (required):
|
||||
# CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval
|
||||
# OPENAI_API_KEY Model provider API key (or another provider key)
|
||||
#
|
||||
# Environment (optional):
|
||||
# CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest)
|
||||
# OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
|
||||
# CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5)
|
||||
# MLFLOW_NAMESPACE MLflow namespace (default: mlflow)
|
||||
# MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy if set)
|
||||
# MLFLOW_EXPERIMENT_ID MLflow experiment ID
|
||||
# MLFLOW_EXPERIMENT_NAME MLflow experiment name
|
||||
# MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
|
||||
# ANTHROPIC_API_KEY Anthropic key (added to secret if set)
|
||||
# OPENROUTER_API_KEY OpenRouter key (added to secret if set)
|
||||
# GEMINI_API_KEY Gemini key (added to secret if set)
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
NS="${CLAWBENCH_NAMESPACE:-}"
|
||||
MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}"
|
||||
CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}"
|
||||
OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}"
|
||||
MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}"
|
||||
|
||||
command -v kubectl &>/dev/null || { echo "Missing: kubectl" >&2; exit 1; }
|
||||
kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; }
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
|
||||
cat <<'HELP'
|
||||
ClawBench Kubernetes Deployment
|
||||
===============================
|
||||
|
||||
0-to-hero pipeline for running ClawBench evals on Kubernetes.
|
||||
|
||||
Step 0: Create a cluster
|
||||
For local testing with Kind, see:
|
||||
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
|
||||
|
||||
Step 1: Deploy OpenClaw gateway (optional — skip if you have one)
|
||||
Step 2: Deploy MLflow tracking server (optional — skip if you have one)
|
||||
Step 3: Run evals via sidecar (add/remove to OpenClaw deployment)
|
||||
|
||||
Usage:
|
||||
./scripts/k8s/deploy.sh Full deploy (steps 1+2+3)
|
||||
./scripts/k8s/deploy.sh --openclaw-only Step 1: OpenClaw only
|
||||
./scripts/k8s/deploy.sh --mlflow-only Step 2: MLflow only
|
||||
./scripts/k8s/deploy.sh --add-sidecar Step 3: add eval sidecar (starts eval)
|
||||
./scripts/k8s/deploy.sh --remove-sidecar Step 3: remove eval sidecar
|
||||
./scripts/k8s/deploy.sh --logs Tail clawbench sidecar logs
|
||||
./scripts/k8s/deploy.sh --teardown Delete eval namespace (keeps MLflow)
|
||||
|
||||
Required environment:
|
||||
CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval
|
||||
OPENAI_API_KEY Model provider API key (or ANTHROPIC_API_KEY, etc.)
|
||||
|
||||
Optional environment:
|
||||
CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest)
|
||||
OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
|
||||
CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5)
|
||||
MLFLOW_NAMESPACE MLflow namespace (default: mlflow)
|
||||
MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy)
|
||||
MLFLOW_EXPERIMENT_ID MLflow experiment ID
|
||||
MLFLOW_EXPERIMENT_NAME MLflow experiment name
|
||||
MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
|
||||
ANTHROPIC_API_KEY Anthropic key (added to secret if set)
|
||||
OPENROUTER_API_KEY OpenRouter key (added to secret if set)
|
||||
GEMINI_API_KEY Gemini key (added to secret if set)
|
||||
|
||||
Works on Kubernetes and OpenShift.
|
||||
HELP
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ -z "$NS" ]]; then
|
||||
echo "CLAWBENCH_NAMESPACE is required." >&2
|
||||
echo " export CLAWBENCH_NAMESPACE=clawbench-eval" >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
MODE="full"
|
||||
while [[ $# -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--openclaw-only) MODE="openclaw-only" ;;
|
||||
--mlflow-only) MODE="mlflow-only" ;;
|
||||
--add-sidecar) MODE="add-sidecar" ;;
|
||||
--remove-sidecar) MODE="remove-sidecar" ;;
|
||||
--logs) MODE="logs" ;;
|
||||
--teardown) MODE="teardown" ;;
|
||||
*) echo "Unknown option: $1" >&2; exit 1 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# --logs
|
||||
# ---------------------------------------------------------------------------
|
||||
if [[ "$MODE" == "logs" ]]; then
|
||||
kubectl logs deploy/openclaw -c clawbench -n "$NS" -f
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# --teardown
|
||||
# ---------------------------------------------------------------------------
|
||||
if [[ "$MODE" == "teardown" ]]; then
|
||||
echo "Deleting namespace '$NS'..."
|
||||
kubectl delete namespace "$NS" --ignore-not-found
|
||||
echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# --remove-sidecar
|
||||
# ---------------------------------------------------------------------------
|
||||
if [[ "$MODE" == "remove-sidecar" ]]; then
|
||||
echo "Removing clawbench sidecar from openclaw in namespace '$NS'..."
|
||||
INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
|
||||
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))")
|
||||
if [[ "$INDEX" == "-1" ]]; then
|
||||
echo "No clawbench sidecar found."
|
||||
else
|
||||
kubectl patch deploy/openclaw -n "$NS" --type=json \
|
||||
-p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]"
|
||||
echo "Sidecar removed."
|
||||
fi
|
||||
exit 0
|
||||
fi
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Create namespace + secret
|
||||
# ---------------------------------------------------------------------------
|
||||
ensure_namespace_and_secret() {
|
||||
if ! kubectl get namespace "$NS" &>/dev/null; then
|
||||
echo "Creating namespace '$NS'..."
|
||||
kubectl create namespace "$NS"
|
||||
fi
|
||||
|
||||
if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then
|
||||
echo "Creating clawbench-secrets..."
|
||||
GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())")
|
||||
|
||||
SECRET_ARGS=(
|
||||
--from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN"
|
||||
)
|
||||
[[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY")
|
||||
[[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY")
|
||||
[[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY")
|
||||
[[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY")
|
||||
|
||||
if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then
|
||||
echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2
|
||||
fi
|
||||
|
||||
kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}"
|
||||
echo " Gateway token: generated"
|
||||
[[ -n "${OPENAI_API_KEY:-}" ]] && echo " OPENAI_API_KEY: set"
|
||||
[[ -n "${ANTHROPIC_API_KEY:-}" ]] && echo " ANTHROPIC_API_KEY: set"
|
||||
[[ -n "${OPENROUTER_API_KEY:-}" ]] && echo " OPENROUTER_API_KEY: set"
|
||||
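# "|| true" stops an unset key from becoming the function's exit status,
# which would abort the script under "set -e" when the function is called directly.
[[ -n "${GEMINI_API_KEY:-}" ]] && echo " GEMINI_API_KEY: set" || true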
|
||||
else
|
||||
echo "Secret clawbench-secrets already exists in '$NS'."
|
||||
fi
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 1: Deploy OpenClaw
|
||||
# ---------------------------------------------------------------------------
|
||||
deploy_openclaw() {
|
||||
echo ""
|
||||
echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..."
|
||||
|
||||
kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS"
|
||||
|
||||
# Patch gateway config with custom OpenAI-compatible base URL
|
||||
if [[ -n "${OPENAI_API_BASE:-}" ]]; then
|
||||
echo " Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE"
|
||||
EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}')
|
||||
PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c "
|
||||
import json, sys, os
|
||||
cfg = json.load(sys.stdin)
|
||||
openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {})
|
||||
openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE']
|
||||
openai_cfg.setdefault('models', [])
|
||||
json.dump(cfg, sys.stdout, indent=2)
|
||||
")
|
||||
kubectl create configmap openclaw-config -n "$NS" \
|
||||
--from-literal="openclaw.json=$PATCHED_JSON" \
|
||||
--dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null
|
||||
fi
|
||||
|
||||
kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS"
|
||||
kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS"
|
||||
|
||||
# Apply the manifest, then override the image only when a non-default one is requested.
kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then
  kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS"
fi
|
||||
|
||||
echo "Waiting for OpenClaw rollout..."
|
||||
kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \
|
||||
echo " (rollout still in progress)"
|
||||
echo "OpenClaw deployed."
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 2: Deploy MLflow
|
||||
# ---------------------------------------------------------------------------
|
||||
deploy_mlflow() {
|
||||
if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then
|
||||
echo ""
|
||||
echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)"
|
||||
return
|
||||
fi
|
||||
|
||||
echo ""
|
||||
echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..."
|
||||
|
||||
if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then
|
||||
kubectl create namespace "$MLFLOW_NS"
|
||||
fi
|
||||
|
||||
kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS"
|
||||
kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS"
|
||||
|
||||
# Apply the manifest, then override the image only when a non-default one is requested.
kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then
  kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS"
fi
|
||||
|
||||
echo "Waiting for MLflow rollout..."
|
||||
kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \
|
||||
echo " (rollout still in progress)"
|
||||
|
||||
MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000"
|
||||
echo "MLflow deployed: $MLFLOW_TRACKING_URI"
|
||||
}
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Step 3: Add clawbench sidecar (starts eval)
|
||||
# ---------------------------------------------------------------------------
|
||||
add_sidecar() {
|
||||
echo ""
|
||||
echo "Step 3: Adding clawbench eval sidecar..."
|
||||
|
||||
echo "Applying clawbench ConfigMap..."
|
||||
kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null
|
||||
|
||||
if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then
|
||||
kubectl patch configmap clawbench-config -n "$NS" \
|
||||
--type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null
|
||||
echo " Model: $CLAWBENCH_MODEL"
|
||||
fi
|
||||
|
||||
if [[ -n "${OPENAI_API_BASE:-}" ]]; then
|
||||
kubectl patch configmap clawbench-config -n "$NS" \
|
||||
--type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null
|
||||
echo " OpenAI API base: $OPENAI_API_BASE"
|
||||
fi
|
||||
|
||||
# Patch MLflow settings into ConfigMap
|
||||
PATCH_DATA=""
|
||||
MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}"
|
||||
PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\""
|
||||
if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then
|
||||
    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\""
  fi
  if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then
    PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\""
  fi
  kubectl patch configmap clawbench-config -n "$NS" \
    --type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null
  echo " MLflow URI: $MLFLOW_URI"
  [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo " MLflow experiment ID: $MLFLOW_EXPERIMENT_ID"
  [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo " MLflow experiment name: $MLFLOW_EXPERIMENT_NAME"

  # Check if sidecar already exists
  HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \
    | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')")

  if [[ "$HAS_SIDECAR" == "yes" ]]; then
    echo "Removing existing clawbench sidecar..."
    INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
      | python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))")
    kubectl patch deploy/openclaw -n "$NS" --type=json \
      -p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null
  fi

  # Find openclaw-home volume name
  HOME_VOLUME=$(kubectl get deploy/openclaw -n "$NS" -o json \
    | python3 -c "
import json, sys
spec = json.load(sys.stdin)['spec']['template']['spec']
for c in spec['containers']:
    if c['name'] == 'gateway':
        for vm in c.get('volumeMounts', []):
            if vm['mountPath'] == '/home/node/.openclaw':
                print(vm['name'])
                sys.exit(0)
print('openclaw-home')
")

  echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..."

  # Check if results volume already exists
  HAS_RESULTS_VOL=$(kubectl get deploy/openclaw -n "$NS" -o json \
    | python3 -c "import json,sys; vs=json.load(sys.stdin)['spec']['template']['spec'].get('volumes',[]); print('yes' if any(v['name']=='clawbench-results' for v in vs) else 'no')")

  PATCH='[{"op":"add","path":"/spec/template/spec/containers/-","value":{'
  PATCH+='"name":"clawbench",'
  PATCH+='"image":"'"$CLAWBENCH_IMG"'",'
  PATCH+='"imagePullPolicy":"IfNotPresent",'
  PATCH+='"command":["/bin/bash","-c","echo \"Waiting for gateway on localhost:18789...\"\nfor i in $(seq 1 90); do\n python3 -c \"import socket; s=socket.create_connection((\\\"127.0.0.1\\\",18789),2); s.close()\" 2>/dev/null && echo \"Gateway ready\" && break\n sleep 2\ndone\n\nif [ -n \"${MLFLOW_TRACKING_URI:-}\" ]; then\n echo \"Checking MLflow at ${MLFLOW_TRACKING_URI}...\"\n python3 -c \"import httpx,os; r=httpx.get(os.environ[\\\"MLFLOW_TRACKING_URI\\\"]+\\\"/health\\\"); print(\\\"MLflow OK:\\\",r.status_code)\" 2>&1 || echo \"MLflow pre-check failed (will retry at log time)\"\nfi\n\necho \"Starting eval...\"\nclawbench run \\\n --model \"${CLAWBENCH_MODEL}\" \\\n --gateway-token \"${OPENCLAW_GATEWAY_TOKEN}\" \\\n --runs \"${CLAWBENCH_RUNS}\" \\\n --concurrency \"${CLAWBENCH_CONCURRENCY}\" \\\n ${CLAWBENCH_JUDGE_MODEL:+--judge-model \"${CLAWBENCH_JUDGE_MODEL}\"} \\\n $([ -n \"${CLAWBENCH_TASKS:-}\" ] && for t in ${CLAWBENCH_TASKS}; do printf -- \"-t %s \" \"$t\"; done) \\\n -o /results/benchmark.json\nRC=$?\nif [ $RC -eq 0 ] && [ -n \"${MLFLOW_TRACKING_URI:-}\" ]; then\n python scripts/log_to_mlflow.py /results/benchmark.json\nfi\necho \"ClawBench finished (exit=$RC)\"\nsleep infinity"],'
  PATCH+='"envFrom":[{"configMapRef":{"name":"clawbench-config"}}],'
  PATCH+='"env":[{"name":"OPENCLAW_GATEWAY_TOKEN","valueFrom":{"secretKeyRef":{"name":"clawbench-secrets","key":"OPENCLAW_GATEWAY_TOKEN"}}}],'
  PATCH+='"resources":{"requests":{"memory":"1Gi","cpu":"500m"},"limits":{"memory":"4Gi","cpu":"2"}},'
  PATCH+='"volumeMounts":[{"name":"'"$HOME_VOLUME"'","mountPath":"/home/node/.openclaw"},{"name":"clawbench-results","mountPath":"/results"},{"name":"tmp-volume","mountPath":"/tmp"}],'
  PATCH+='"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]}}'
  PATCH+='}}'

  if [[ "$HAS_RESULTS_VOL" == "no" ]]; then
    PATCH+=',{"op":"add","path":"/spec/template/spec/volumes/-","value":{"name":"clawbench-results","emptyDir":{}}}'
  fi

  PATCH+=']'

  kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null

  echo ""
  echo "Waiting for rollout..."
  kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \
    echo " (rollout timeout — eval runs for 30-60 min)"

  echo ""
  echo "Eval is running. Follow logs with:"
  echo " ./scripts/k8s/deploy.sh --logs"
  echo ""
  echo "When finished, remove the sidecar with:"
  echo " ./scripts/k8s/deploy.sh --remove-sidecar"
}

# ---------------------------------------------------------------------------
# Execute
# ---------------------------------------------------------------------------
case "$MODE" in
  full)
    ensure_namespace_and_secret
    deploy_openclaw
    deploy_mlflow
    add_sidecar
    ;;
  openclaw-only)
    ensure_namespace_and_secret
    deploy_openclaw
    echo ""
    echo "OpenClaw is running. Next steps:"
    echo " ./scripts/k8s/deploy.sh --mlflow-only # Deploy MLflow"
    echo " ./scripts/k8s/deploy.sh --add-sidecar # Start eval"
    ;;
  mlflow-only)
    deploy_mlflow
    ;;
  add-sidecar)
    if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then
      echo "Deployment 'openclaw' not found in namespace '$NS'." >&2
      echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2
      exit 1
    fi
    add_sidecar
    ;;
esac
18
scripts/k8s/manifests/configmap.yaml
Normal file
@ -0,0 +1,18 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: clawbench-config
  labels:
    app: clawbench
data:
  CLAWBENCH_MODEL: "openai/gpt-5.5"
  OPENAI_API_BASE: ""
  CLAWBENCH_RUNS: "3"
  CLAWBENCH_CONCURRENCY: "4"
  CLAWBENCH_JUDGE_MODEL: ""
  CLAWBENCH_TASKS: ""
  CLAWBENCH_CONNECT_TIMEOUT: "120"
  CLAWBENCH_REQUEST_TIMEOUT: "300"
  CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600"
  MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000"
  MLFLOW_EXPERIMENT_NAME: "clawbench"
15
scripts/k8s/manifests/secret.yaml
Normal file
@ -0,0 +1,15 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: clawbench
type: Opaque
stringData:
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"
68
scripts/k8s/mlflow/deployment.yaml
Normal file
@ -0,0 +1,68 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    app: mlflow
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.21.3
          command:
            - mlflow
            - server
            - --host
            - "0.0.0.0"
            - --port
            - "5000"
            - --backend-store-uri
            # Four slashes: absolute path /mlflow/mlflow.db on the mounted PVC.
            - sqlite:////mlflow/mlflow.db
            - --default-artifact-root
            - /mlflow/artifacts
            - --serve-artifacts
          ports:
            - name: http
              containerPort: 5000
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: mlflow-data
              mountPath: /mlflow
      volumes:
        - name: mlflow-data
          persistentVolumeClaim:
            claimName: mlflow-data-pvc
12
scripts/k8s/mlflow/pvc.yaml
Normal file
@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data-pvc
  labels:
    app: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
15
scripts/k8s/mlflow/service.yaml
Normal file
@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  labels:
    app: mlflow
spec:
  type: ClusterIP
  selector:
    app: mlflow
  ports:
    - name: http
      port: 5000
      targetPort: 5000
      protocol: TCP
36
scripts/k8s/openclaw/configmap.yaml
Normal file
@ -0,0 +1,36 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config
  labels:
    app: openclaw
data:
  openclaw.json: |
    {
      "gateway": {
        "mode": "local",
        "bind": "loopback",
        "port": 18789,
        "auth": {
          "mode": "token"
        }
      },
      "browser": {
        "enabled": true,
        "headless": true,
        "noSandbox": true,
        "ssrfPolicy": {
          "allowedHostnames": ["localhost", "127.0.0.1"]
        }
      },
      "tools": {
        "profile": "coding",
        "alsoAllow": ["browser"]
      },
      "agents": {
        "defaults": {
          "workspace": "~/.openclaw/workspace"
        }
      },
      "cron": { "enabled": false }
    }
146
scripts/k8s/openclaw/deployment.yaml
Normal file
@ -0,0 +1,146 @@
# OpenClaw gateway deployment for ClawBench evals.
#
# Build the image with browser support:
#   docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \
#     -t quay.io/yourorg/openclaw:eval .
#
# Or use upstream without browser (browser eval tasks will score 0):
#   image: ghcr.io/openclaw/openclaw:latest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      initContainers:
        - name: init-config
          image: registry.access.redhat.com/ubi9-minimal:latest
          command:
            - sh
            - -c
            - |
              cp /config/openclaw.json /home/node/.openclaw/openclaw.json
              chmod 666 /home/node/.openclaw/openclaw.json
              mkdir -p /home/node/.openclaw/workspace
              mkdir -p /home/node/.openclaw/agents
              chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents
              echo "Config initialized"
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: config-template
              mountPath: /config
          resources:
            limits:
              cpu: 200m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
      containers:
        - name: gateway
          image: ghcr.io/openclaw/openclaw:latest
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured
          env:
            - name: HOME
              value: /home/node
            - name: NODE_ENV
              value: production
            - name: OPENCLAW_CONFIG_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_STATE_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_GATEWAY_TOKEN
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENCLAW_GATEWAY_TOKEN
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENAI_API_KEY
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: ANTHROPIC_API_KEY
                  optional: true
            - name: OPENROUTER_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENROUTER_API_KEY
                  optional: true
            - name: GEMINI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: GEMINI_API_KEY
                  optional: true
          ports:
            - name: gateway
              containerPort: 18789
              protocol: TCP
          livenessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 250m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: tmp-volume
              mountPath: /tmp
      terminationGracePeriodSeconds: 30
      volumes:
        - name: openclaw-home
          persistentVolumeClaim:
            claimName: openclaw-home-pvc
        - name: config-template
          configMap:
            name: openclaw-config
        - name: tmp-volume
          emptyDir: {}
12
scripts/k8s/openclaw/pvc.yaml
Normal file
@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-home-pvc
  labels:
    app: openclaw
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
17
scripts/k8s/openclaw/secret.yaml
Normal file
@ -0,0 +1,17 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: openclaw
type: Opaque
stringData:
  OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME"
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"
  # GEMINI_API_KEY: "REPLACE_ME"
15
scripts/k8s/openclaw/service.yaml
Normal file
@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  type: ClusterIP
  selector:
    app: openclaw
  ports:
    - name: gateway
      port: 18789
      targetPort: 18789
      protocol: TCP
125
scripts/log_to_mlflow.py
Normal file
@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""Log a ClawBench BenchmarkResult to MLflow.

Standalone script -- not imported by the clawbench package.
Requires: pip install mlflow (or pip install clawbench[mlflow])

Usage:
    python scripts/log_to_mlflow.py /results/benchmark.json

Environment:
    MLFLOW_TRACKING_URI     MLflow tracking server URI, read by the mlflow client itself
                            (if unset, mlflow falls back to its local default store)
    MLFLOW_EXPERIMENT_ID    Experiment ID; takes precedence over MLFLOW_EXPERIMENT_NAME
    MLFLOW_EXPERIMENT_NAME  Experiment name (default: clawbench)
"""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))


def main(result_path: str) -> None:
    try:
        import mlflow
    except ImportError:
        print(
            "mlflow is not installed. Install with: pip install mlflow"
            " (or pip install clawbench[mlflow])",
            file=sys.stderr,
        )
        sys.exit(1)

    from clawbench.schemas import BenchmarkResult

    with open(result_path, encoding="utf-8") as f:
        result = BenchmarkResult(**json.load(f))

    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
    if experiment_id:
        experiment = mlflow.set_experiment(experiment_id=experiment_id)
    else:
        experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench"))

    run_name = f"{result.model}-{result.submission_id[:8]}"
    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(
            {
                "model": result.model,
                "provider": result.provider,
                "benchmark_version": result.benchmark_version,
                "openclaw_version": result.openclaw_version or "unknown",
                "judge_model": result.judge_model or "none",
                "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown",
            }
        )

        mlflow.log_metrics(
            {
                "overall_score": result.overall_score,
                "overall_completion": result.overall_completion,
                "overall_trajectory": result.overall_trajectory,
                "overall_behavior": result.overall_behavior,
                "overall_reliability": result.overall_reliability,
                "overall_pass_hat_k": result.overall_pass_hat_k,
                "overall_judge_score": result.overall_judge_score,
                "overall_judge_confidence": result.overall_judge_confidence,
                "overall_judge_pass_rate": result.overall_judge_pass_rate,
                "judge_task_coverage": result.judge_task_coverage,
                "overall_weighted_query_score": result.overall_weighted_query_score,
                "overall_median_latency_ms": result.overall_median_latency_ms,
                "overall_p95_latency_ms": result.overall_p95_latency_ms,
                "overall_total_tokens": result.overall_total_tokens,
                "overall_cost_usd": result.overall_cost_usd,
                "overall_tokens_per_pass": result.overall_tokens_per_pass,
                "overall_cost_per_pass": result.overall_cost_per_pass,
                "overall_ci_lower": result.overall_ci_lower,
                "overall_ci_upper": result.overall_ci_upper,
            }
        )

        for tier in result.tier_results:
            mlflow.log_metrics(
                {
                    f"{tier.tier}/score": tier.mean_task_score,
                    f"{tier.tier}/completion": tier.mean_completion,
                    f"{tier.tier}/trajectory": tier.mean_trajectory,
                    f"{tier.tier}/behavior": tier.mean_behavior,
                    f"{tier.tier}/reliability": tier.mean_reliability,
                }
            )

        for i, task in enumerate(result.task_results):
            mlflow.log_metrics(
                {
                    f"task/{task.task_id}/score": task.mean_task_score,
                    f"task/{task.task_id}/reliability": task.reliability_score,
                },
                step=i,
            )

        mlflow.set_tags(
            {
                "submission_id": result.submission_id,
                "timestamp": result.timestamp,
                "certified": str(result.certified),
            }
        )

        try:
            mlflow.log_artifact(result_path)
        except Exception as e:
            print(f"Warning: artifact upload failed: {e}", file=sys.stderr)
            print("Metrics and params were logged successfully.", file=sys.stderr)

    print(f"Logged to MLflow: experiment={experiment.name} run={run_name}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <result.json>", file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])
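
A note on the metric dictionaries in `scripts/log_to_mlflow.py`: `mlflow.log_metrics` expects numeric values, so a field that is `None` (for instance a judge score when no judge model was configured) would likely be rejected. Whether any `BenchmarkResult` field can actually be `None` is an assumption here, since the schema is not part of this diff. A minimal sketch of a guard follows; the helper name `drop_unloggable` is hypothetical and not part of the commit, and dropping non-finite values is a design choice rather than an MLflow requirement.

```python
import math
from typing import Dict, Mapping, Optional


def drop_unloggable(metrics: Mapping[str, Optional[float]]) -> Dict[str, float]:
    """Return only the entries whose value is a finite number.

    Hypothetical helper (not in the commit): it skips None and
    non-finite values so the remaining dict is safe to pass to a
    metrics logger that only accepts numbers.
    """
    clean: Dict[str, float] = {}
    for name, value in metrics.items():
        # Skip missing values and booleans (bool is a subclass of int).
        if value is None or isinstance(value, bool):
            continue
        # Keep ints and floats, but not NaN or +/-inf.
        if isinstance(value, (int, float)) and math.isfinite(value):
            clean[name] = float(value)
    return clean
```

In the script this would wrap each dictionary before logging, for example `mlflow.log_metrics(drop_unloggable({...}))`, so that a run without a judge model still records the remaining metrics.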