add docs, manifests for k8s

Signed-off-by: sallyom <somalley@redhat.com>
sallyom 2026-05-05 21:36:44 -04:00
parent d57e4a697d
commit 7d75d99643
GPG Key ID: 0643052434ACDE18
16 changed files with 1290 additions and 0 deletions

@ -461,6 +461,26 @@ python3 scripts/run_posterior_dynamics_pipeline.py \
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
```
### Running on Kubernetes
See [`docs/kubernetes.md`](docs/kubernetes.md) for the full runbook. The short
version:
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..." # or ANTHROPIC_API_KEY, OPENROUTER_API_KEY, etc.
export CLAWBENCH_MODEL="openai/gpt-5.5"
# export MLFLOW_NAMESPACE="mlflow" # MLflow deploys in a separate namespace (default: mlflow)
./scripts/k8s/deploy.sh # deploys OpenClaw + MLflow + starts eval
./scripts/k8s/deploy.sh --logs # follow progress
./scripts/k8s/deploy.sh --teardown # tear down openclaw & eval (does not delete MLflow)
```
API keys are stored in a Kubernetes Secret created by the deploy script.
MLflow is deployed in its own namespace (default: `mlflow`, configurable via
`MLFLOW_NAMESPACE`).
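
As a rough illustration of how the Secret is assembled, the sketch below restates the deploy script's bash logic in Python: only provider keys that are actually exported become `--from-literal` arguments. The helper name `secret_literals` is hypothetical and exists only for this example.

```python
# Provider keys that scripts/k8s/deploy.sh looks for in the environment.
PROVIDER_KEYS = (
    "OPENAI_API_KEY",
    "ANTHROPIC_API_KEY",
    "OPENROUTER_API_KEY",
    "GEMINI_API_KEY",
)


def secret_literals(env, gateway_token):
    """Return `kubectl create secret` arguments for the keys that are set.

    Hypothetical helper for illustration; deploy.sh builds this list in bash.
    """
    args = [f"--from-literal=OPENCLAW_GATEWAY_TOKEN={gateway_token}"]
    for key in PROVIDER_KEYS:
        value = env.get(key, "")
        if value:  # unset or empty keys are skipped
            args.append(f"--from-literal={key}={value}")
    return args
```

If the returned list has a single entry, no provider key was exported, which is the case the script warns about.
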
---
## Partner Trace Spec

docs/kubernetes.md (new file, 361 lines)

@ -0,0 +1,361 @@
# Running ClawBench on Kubernetes
ClawBench runs as a **sidecar** in the OpenClaw gateway pod. The sidecar
connects to the gateway over loopback (`ws://localhost:18789`), runs the
19-task eval suite, and optionally logs results to MLflow.
```
┌─── OpenClaw Pod ─────────────────────────────┐
│  gateway container (ws://localhost:18789)    │
│  clawbench sidecar ──► gateway via loopback  │
└──────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
  Model provider API        MLflow (optional)
```
All commands use `scripts/k8s/deploy.sh`. The script has these modes:
| Flag | What it does |
|------|-------------|
| *(none)* | Full deploy: OpenClaw + MLflow + eval sidecar |
| `--openclaw-only` | Deploy OpenClaw gateway only |
| `--mlflow-only` | Deploy MLflow only |
| `--add-sidecar` | Inject clawbench sidecar (starts eval) |
| `--remove-sidecar` | Remove clawbench sidecar |
| `--logs` | Tail sidecar logs |
| `--teardown` | Delete eval namespace (keeps MLflow) |
---
## Prerequisites
- `kubectl` on PATH, connected to a cluster (`kubectl cluster-info` succeeds)
- A container image for ClawBench (see [Building images](#building-images))
- At least one model provider API key (`OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, etc.)
For local testing with Kind:
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
---
## Environment variables
Set these **before** running `deploy.sh`.
### Required
| Variable | Purpose |
|----------|---------|
| `CLAWBENCH_NAMESPACE` | Namespace for OpenClaw + eval (e.g. `clawbench-eval`) |
| `OPENAI_API_KEY` | Model provider key (or use another provider — see table below) |
### Optional
| Variable | Default | Purpose |
|----------|---------|---------|
| `CLAWBENCH_IMAGE` | `quay.io/sallyom/clawbench:latest` | ClawBench sidecar image |
| `OPENCLAW_IMAGE` | `ghcr.io/openclaw/openclaw:latest` | OpenClaw gateway image |
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model to evaluate |
| `MLFLOW_NAMESPACE` | `mlflow` | MLflow namespace |
| `MLFLOW_TRACKING_URI` | *(deployed by script)* | External MLflow URI — skips MLflow deploy if set |
| `MLFLOW_EXPERIMENT_ID` | | MLflow experiment ID |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
| `MLFLOW_IMAGE` | `ghcr.io/mlflow/mlflow:v2.21.3` | MLflow server image |
| `ANTHROPIC_API_KEY` | | Added to K8s secret if set |
| `OPENROUTER_API_KEY` | | Added to K8s secret if set |
| `GEMINI_API_KEY` | | Added to K8s secret if set |
| `OPENAI_API_BASE` | | Base URL for OpenAI-compatible endpoints (e.g. vLLM, Ollama); patched into gateway config |
### Model routing
The gateway routes by provider prefix:
| Model string | Required variables |
|-------------|-------------------|
| `openai/gpt-5.5` | `OPENAI_API_KEY` |
| `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| `openrouter/anthropic/claude-sonnet-4-6` | `OPENROUTER_API_KEY` |
| `openai/my-local-model` | `OPENAI_API_KEY` + `OPENAI_API_BASE` |
For OpenAI-compatible endpoints (vLLM, Ollama, TGI, or any in-cluster model
server), set `OPENAI_API_BASE` to the endpoint URL and use the `openai/`
prefix for the model name:
```bash
export CLAWBENCH_MODEL="openai/meta-llama/Llama-4-Scout-17B"
export OPENAI_API_KEY="none" # dummy value if the endpoint doesn't require auth
export OPENAI_API_BASE="http://vllm-service.my-ns.svc.cluster.local:8000/v1"
```
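
The routing table can be restated as a small lookup. This is a sketch only: the gateway does the real routing, and `required_variables` is a hypothetical helper. No prefix for Gemini appears in the table, so it is left out rather than guessed.

```python
# Provider prefix -> API key variable, taken from the routing table.
PROVIDER_KEY = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "openrouter": "OPENROUTER_API_KEY",
}


def required_variables(model, custom_endpoint=False):
    """List the environment variables a model string needs.

    Hypothetical helper: it only restates the documented table.
    """
    provider, sep, name = model.partition("/")
    if not sep or not name:
        raise ValueError(f"expected '<provider>/<model>', got {model!r}")
    if provider not in PROVIDER_KEY:
        raise ValueError(f"unknown provider prefix: {provider!r}")
    required = [PROVIDER_KEY[provider]]
    # OpenAI-compatible servers (vLLM, Ollama, TGI) also need a base URL.
    if provider == "openai" and custom_endpoint:
        required.append("OPENAI_API_BASE")
    return required
```

Only the first path segment selects the provider, so `openrouter/anthropic/claude-sonnet-4-6` needs `OPENROUTER_API_KEY`, not the Anthropic key.
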
---
## Full deploy (quick start)
Deploys OpenClaw gateway, MLflow, and the eval sidecar in one command.
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
# Export API keys before running. The script stores them in a K8s Secret
# ("clawbench-secrets") that the gateway and sidecar containers read.
export OPENAI_API_KEY="sk-..."
# Model to evaluate (default: openai/gpt-5.5)
# export CLAWBENCH_MODEL="anthropic/claude-sonnet-4-6"
./scripts/k8s/deploy.sh
```
Verify:
```bash
# Should show 2/2 containers (gateway + clawbench)
kubectl get pods -n clawbench-eval
# Follow eval progress
./scripts/k8s/deploy.sh --logs
```
When the eval finishes, copy results and clean up:
```bash
# Copy results from the sidecar
POD=$(kubectl get pod -n $CLAWBENCH_NAMESPACE -l app=openclaw -o jsonpath='{.items[0].metadata.name}')
kubectl cp "$CLAWBENCH_NAMESPACE/$POD:/results/benchmark.json" -c clawbench ./benchmark.json
# Remove the sidecar (keeps OpenClaw + MLflow running)
./scripts/k8s/deploy.sh --remove-sidecar
# Or delete the eval namespace (MLflow keeps running)
./scripts/k8s/deploy.sh --teardown
```
---
## Existing cluster + existing MLflow
If you already have an OpenShift or Kubernetes cluster and an MLflow instance,
you only need to deploy OpenClaw and run the eval — no cluster or MLflow setup
required.
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
# API keys — export before running deploy.sh. The script creates a
# Kubernetes Secret ("clawbench-secrets") from whichever keys are set.
# At least one provider key is required.
export OPENAI_API_KEY="sk-..."
# export ANTHROPIC_API_KEY="sk-ant-..."
# export OPENROUTER_API_KEY="sk-or-..."
# export GEMINI_API_KEY="..."
# Model to evaluate (default: openai/gpt-5.5). The provider prefix must match
# an exported key, e.g. anthropic/... needs ANTHROPIC_API_KEY.
export CLAWBENCH_MODEL="openai/gpt-5.5"
# Point to your existing MLflow
export MLFLOW_TRACKING_URI="https://mlflow.example.com"
export MLFLOW_EXPERIMENT_NAME="clawbench-gpt5.5" # or use MLFLOW_EXPERIMENT_ID=42
# Deploy OpenClaw gateway into your cluster
./scripts/k8s/deploy.sh --openclaw-only
```
Verify OpenClaw is running:
```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx 1/1 Running
```
Then start the eval:
```bash
./scripts/k8s/deploy.sh --add-sidecar
./scripts/k8s/deploy.sh --logs
```
Because `MLFLOW_TRACKING_URI` is set, the deploy script skips its own MLflow
deployment and patches the tracking URI and experiment name/ID into the
clawbench ConfigMap. When the eval completes, `scripts/log_to_mlflow.py` logs
results to your MLflow instance under that experiment.
`MLFLOW_EXPERIMENT_NAME` creates the experiment if it doesn't exist.
`MLFLOW_EXPERIMENT_ID` requires an existing experiment.
---
## Step-by-step deploy
Use this when you want to deploy components individually or bring your own
OpenClaw/MLflow.
### Step 1: Deploy OpenClaw gateway
```bash
export CLAWBENCH_NAMESPACE=clawbench-eval
export OPENAI_API_KEY="sk-..."
./scripts/k8s/deploy.sh --openclaw-only
```
Verify:
```bash
kubectl get pods -n clawbench-eval
# Expect: openclaw-xxxx 1/1 Running
```
This deploys from `scripts/k8s/openclaw/`: a single gateway pod with token
auth, ClusterIP service, and 10Gi PVC. The deploy script generates a gateway
token and creates the `clawbench-secrets` Secret automatically.
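
For reference, the token generation is a one-liner in the deploy script (`python3 -c` with `secrets` and `base64`). The same call, wrapped in a function whose name is only illustrative:

```python
import base64
import secrets


def generate_gateway_token(num_bytes=32):
    """Return a random token: num_bytes of entropy, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode()
```

With the default of 32 bytes the encoded token is 44 characters long.
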
**Skip this step** if you already have an OpenClaw deployment. Your existing
gateway must have this config (see `scripts/k8s/openclaw/configmap.yaml`):
```json
{
  "browser": {
    "enabled": true,
    "headless": true,
    "noSandbox": true,
    "ssrfPolicy": {
      "allowedHostnames": ["localhost", "127.0.0.1"]
    }
  },
  "tools": {
    "profile": "coding",
    "alsoAllow": ["browser"]
  }
}
```
Key requirements:
- `browser.enabled: true` — activates the bundled browser plugin
- `tools.alsoAllow: ["browser"]` — the `coding` profile does NOT include browser by default
- `browser.ssrfPolicy` — several eval tasks need localhost access
- Gateway must bind to loopback with token auth
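
If you bring your own gateway, a quick pre-flight check of its `openclaw.json` can catch a missing setting before an eval run. The sketch below is not part of ClawBench or OpenClaw; `missing_requirements` is a hypothetical helper, and it assumes both hostnames from the example config are needed. It does not check the gateway bind or auth settings.

```python
import json


def missing_requirements(config):
    """Return a list of problems; an empty list means the config passes.

    Checks only the browser and tools keys listed above.
    """
    problems = []
    browser = config.get("browser", {})
    tools = config.get("tools", {})
    if browser.get("enabled") is not True:
        problems.append("browser.enabled must be true")
    if "browser" not in tools.get("alsoAllow", []):
        problems.append('tools.alsoAllow must include "browser"')
    allowed = browser.get("ssrfPolicy", {}).get("allowedHostnames", [])
    for host in ("localhost", "127.0.0.1"):
        if host not in allowed:
            problems.append(f"browser.ssrfPolicy.allowedHostnames is missing {host}")
    return problems


def check_file(path):
    """Load a gateway config from disk and return its problems."""
    with open(path, encoding="utf-8") as fh:
        return missing_requirements(json.load(fh))
```
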
### Step 2: Deploy MLflow
```bash
./scripts/k8s/deploy.sh --mlflow-only
```
Verify:
```bash
kubectl get pods -n mlflow
# Expect: mlflow-xxxx 1/1 Running
```
Deploys a single-replica MLflow server with SQLite backend into the `mlflow`
namespace. The clawbench ConfigMap defaults to
`http://mlflow-service.mlflow.svc.cluster.local:5000`.
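
The in-cluster URI follows the usual `<service>.<namespace>.svc.cluster.local` pattern, so it changes with `MLFLOW_NAMESPACE`. A minimal sketch of how the deploy script picks the value (the function name is illustrative):

```python
def mlflow_tracking_uri(namespace="mlflow", external_uri=None):
    """Return the URI the sidecar should log to.

    An externally supplied URI wins; otherwise the in-cluster service
    address is derived from the MLflow namespace.
    """
    if external_uri:
        return external_uri
    return f"http://mlflow-service.{namespace}.svc.cluster.local:5000"
```
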
**Skip this step** if you have an external MLflow — set `MLFLOW_TRACKING_URI`:
```bash
export MLFLOW_TRACKING_URI=http://my-mlflow.example.com:5000
export MLFLOW_EXPERIMENT_ID=4 # or MLFLOW_EXPERIMENT_NAME
```
### Step 3: Run the eval
```bash
./scripts/k8s/deploy.sh --add-sidecar
```
This patches the OpenClaw deployment to inject a clawbench sidecar that:
1. Waits for the gateway (TCP check on port 18789, up to 3 min)
2. Checks MLflow connectivity if configured
3. Runs `clawbench run` with settings from the ConfigMap
4. Logs results to MLflow on success
5. Sleeps indefinitely so you can retrieve logs and results
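
Step 1 is a plain TCP probe. The sidecar runs it as an inline `python3 -c` call inside a bash loop of 90 attempts with a 2-second sleep; the sketch below expresses the same idea as one function. The name `wait_for_port` and its parameters are assumptions for illustration. Unlike the script's loop, which moves on to the eval either way, this version returns `False` so a caller can decide what to do.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=18789, attempts=90, delay=2.0, timeout=2.0):
    """Poll a TCP port until it accepts a connection or attempts run out."""
    for attempt in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Sleep between attempts, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False
```
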
Verify:
```bash
kubectl get pods -n $CLAWBENCH_NAMESPACE
# Expect: openclaw-xxxx 2/2 Running (gateway + clawbench)
./scripts/k8s/deploy.sh --logs
# Should show "Waiting for gateway..." then "Starting eval..."
```
When finished, remove the sidecar:
```bash
./scripts/k8s/deploy.sh --remove-sidecar
```
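
Removing the sidecar comes down to finding the `clawbench` container's index in the Deployment and issuing a JSON Patch `remove` for that path. A sketch of that lookup, assuming the input is the parsed output of `kubectl get deploy/openclaw -o json` (the helper name is hypothetical):

```python
import json


def sidecar_remove_patch(deployment, name="clawbench"):
    """Return a JSON Patch string removing the named container, or None."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    for index, container in enumerate(containers):
        if container["name"] == name:
            patch = [{"op": "remove", "path": f"/spec/template/spec/containers/{index}"}]
            return json.dumps(patch)
    return None
```

The returned string is what the script passes to `kubectl patch deploy/openclaw --type=json -p`. A `None` result corresponds to the "No clawbench sidecar found." message.
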
---
## ConfigMap tuning
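The clawbench ConfigMap (`scripts/k8s/manifests/configmap.yaml`) controls eval
behavior. The deploy script patches `CLAWBENCH_MODEL`, `OPENAI_API_BASE`, and
the MLflow keys from the matching environment variables. To change any other
key, edit the manifest before deploying or patch the ConfigMap afterwards: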
| Key | Default | What it controls |
|-----|---------|-----------------|
| `CLAWBENCH_MODEL` | `openai/gpt-5.5` | Model under test |
| `CLAWBENCH_RUNS` | `3` | Runs per task (19 tasks x 3 = 57 total) |
| `CLAWBENCH_CONCURRENCY` | `4` | Parallel eval lanes |
| `CLAWBENCH_JUDGE_MODEL` | *(empty)* | Separate judge model (optional) |
| `CLAWBENCH_TASKS` | *(empty — runs all)* | Space-separated task IDs (e.g. `t1-bugfix-discount t2-config-loader`) |
| `CLAWBENCH_CONNECT_TIMEOUT` | `120` | Gateway connect timeout in seconds |
| `CLAWBENCH_REQUEST_TIMEOUT` | `300` | Per-request timeout in seconds |
| `CLAWBENCH_PER_RUN_BUDGET_SECONDS` | `600` | Max wall time per run |
| `MLFLOW_TRACKING_URI` | `http://mlflow-service.mlflow.svc.cluster.local:5000` | MLflow endpoint |
| `MLFLOW_EXPERIMENT_NAME` | `clawbench` | MLflow experiment name |
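
When patching by hand, building the patch body with a JSON encoder avoids quoting mistakes. The sketch below is illustrative only (both function names are hypothetical); it produces a body suitable for `kubectl patch configmap clawbench-config --type merge -p`, the same patch type the deploy script uses.

```python
import json


def total_runs(num_tasks=19, runs_per_task=3):
    """Number of individual eval runs for a full suite."""
    return num_tasks * runs_per_task


def configmap_merge_patch(overrides):
    """Build a merge-patch body that sets ConfigMap data keys.

    ConfigMap values must be strings, so every value goes through str().
    """
    return json.dumps({"data": {key: str(value) for key, value in overrides.items()}})
```

For example, `configmap_merge_patch({"CLAWBENCH_RUNS": 5})` sets five runs per task, or `total_runs(runs_per_task=5)` = 95 runs in total. The sidecar reads these values as environment variables when its container starts, so a pod that is already running does not pick up the change until it restarts.
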
---
## MLflow integration
Results are logged via `scripts/log_to_mlflow.py` after a successful eval.
**What gets logged:**
- **Params**: model, provider, benchmark version, OpenClaw version, judge model
- **Metrics**: overall score, per-axis scores (completion, trajectory, behavior,
reliability), cost, tokens, latency, CI bounds, per-tier and per-task scores
- **Tags**: submission ID, timestamp, certified flag
- **Artifacts**: full benchmark result JSON
---
## Building images
### ClawBench image
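`quay.io/sallyom/clawbench:latest` is public and is the default
`CLAWBENCH_IMAGE`. To build your own, use the lightweight sidecar Dockerfile
(`scripts/k8s/Dockerfile`). The resulting image contains the eval harness and
the MLflow client, but not the OpenClaw server or Chromium: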
```bash
docker build -t clawbench:latest -f scripts/k8s/Dockerfile .
# For Kind clusters, load directly instead of pushing to a registry:
kind load docker-image clawbench:latest --name openclaw
# For non-Kind clusters, push to registry and set CLAWBENCH_IMAGE accordingly
# Ensure you build for the right architecture, usually amd64 for non-local k8s
```
Set `CLAWBENCH_IMAGE=clawbench:latest` when running `deploy.sh` to use it.
---
## Cleanup
```bash
# Remove eval sidecar only (keeps OpenClaw + MLflow running for another eval)
./scripts/k8s/deploy.sh --remove-sidecar
# Delete eval namespace (keeps MLflow running)
./scripts/k8s/deploy.sh --teardown
# Delete the Kind cluster entirely
kind delete cluster --name openclaw
```

@ -33,6 +33,9 @@ dev = [
"pre-commit>=4.0,<5",
"ruff>=0.9,<1",
]
mlflow = [
"mlflow>=2.10,<3",
]
hermes = [
"hermes-agent @ git+https://github.com/NousResearch/hermes-agent.git@main",
]

scripts/k8s/Dockerfile (new file, 33 lines)

@ -0,0 +1,33 @@
# Lightweight ClawBench image for Kubernetes sidecar use.
# Does NOT include the full OpenClaw server or Chromium — the gateway runs
# in a separate container. Node.js is copied from the OpenClaw image for
# the device-identity handshake required by the gateway protocol.
FROM ghcr.io/openclaw/openclaw:latest AS openclaw
FROM python:3.12-slim
COPY --from=openclaw /usr/local/bin/node /usr/local/bin/node
RUN apt-get update && \
apt-get install -y --no-install-recommends git && \
rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY pyproject.toml README.md CLAWBENCH_V0_4_SPEC.md PARTNER_TRACE_SPEC.md ./
COPY clawbench/ clawbench/
COPY tasks-public/ tasks-public/
COPY tasks-domain/ tasks-domain/
COPY profiles/ profiles/
COPY baselines/ baselines/
COPY scripts/ scripts/
RUN pip install --no-cache-dir ".[mlflow]"
RUN mkdir -p /results && chmod 777 /results
RUN useradd -m -d /home/node clawbench
USER clawbench
ENV HOME=/home/node
ENTRYPOINT ["clawbench"]

scripts/k8s/deploy.sh (new executable file, 394 lines)

@ -0,0 +1,394 @@
#!/usr/bin/env bash
# Deploy ClawBench evals on Kubernetes (works on OpenShift too).
#
# 0-to-hero pipeline:
# Step 0: Create a cluster (see --help for Kind instructions)
# Step 1: Deploy OpenClaw gateway (optional — bring your own)
# Step 2: Deploy MLflow tracking server (optional — bring your own)
# Step 3: Run evals via sidecar (add / remove)
#
# Usage:
# ./scripts/k8s/deploy.sh # Full deploy: OpenClaw + MLflow + eval
# ./scripts/k8s/deploy.sh --openclaw-only # Step 1: deploy OpenClaw gateway
# ./scripts/k8s/deploy.sh --mlflow-only # Step 2: deploy MLflow
# ./scripts/k8s/deploy.sh --add-sidecar # Step 3: add eval sidecar (starts eval)
# ./scripts/k8s/deploy.sh --remove-sidecar # Step 3: remove eval sidecar
# ./scripts/k8s/deploy.sh --logs # Tail clawbench sidecar logs
# ./scripts/k8s/deploy.sh --teardown # Delete eval namespace (keeps MLflow)
#
# Environment (required):
# CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval
# OPENAI_API_KEY Model provider API key (or another provider key)
#
# Environment (optional):
# CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest)
# OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
# CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5)
# MLFLOW_NAMESPACE MLflow namespace (default: mlflow)
# MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy if set)
# MLFLOW_EXPERIMENT_ID MLflow experiment ID
# MLFLOW_EXPERIMENT_NAME MLflow experiment name
# MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
# ANTHROPIC_API_KEY Anthropic key (added to secret if set)
# OPENROUTER_API_KEY OpenRouter key (added to secret if set)
# GEMINI_API_KEY Gemini key (added to secret if set)
# OPENAI_API_BASE Base URL for OpenAI-compatible endpoints (patched into gateway config)
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
NS="${CLAWBENCH_NAMESPACE:-}"
MLFLOW_NS="${MLFLOW_NAMESPACE:-mlflow}"
CLAWBENCH_IMG="${CLAWBENCH_IMAGE:-quay.io/sallyom/clawbench:latest}"
OPENCLAW_IMG="${OPENCLAW_IMAGE:-ghcr.io/openclaw/openclaw:latest}"
MLFLOW_IMG="${MLFLOW_IMAGE:-ghcr.io/mlflow/mlflow:v2.21.3}"
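# Skip the cluster checks for -h/--help so usage is readable without kubectl.
if [[ "${1:-}" != "-h" && "${1:-}" != "--help" ]]; then
  command -v kubectl &>/dev/null || { echo "Missing: kubectl" >&2; exit 1; }
  kubectl cluster-info &>/dev/null || { echo "Cannot connect to cluster. Check kubeconfig." >&2; exit 1; }
fi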
# ---------------------------------------------------------------------------
if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then
cat <<'HELP'
ClawBench Kubernetes Deployment
===============================
0-to-hero pipeline for running ClawBench evals on Kubernetes.
Step 0: Create a cluster
For local testing with Kind, see:
https://github.com/openclaw/openclaw/blob/main/docs/install/kubernetes.md#local-testing-with-kind
Step 1: Deploy OpenClaw gateway (optional — skip if you have one)
Step 2: Deploy MLflow tracking server (optional — skip if you have one)
Step 3: Run evals via sidecar (add/remove to OpenClaw deployment)
Usage:
./scripts/k8s/deploy.sh Full deploy (steps 1+2+3)
./scripts/k8s/deploy.sh --openclaw-only Step 1: OpenClaw only
./scripts/k8s/deploy.sh --mlflow-only Step 2: MLflow only
./scripts/k8s/deploy.sh --add-sidecar Step 3: add eval sidecar (starts eval)
./scripts/k8s/deploy.sh --remove-sidecar Step 3: remove eval sidecar
./scripts/k8s/deploy.sh --logs Tail clawbench sidecar logs
./scripts/k8s/deploy.sh --teardown Delete eval namespace (keeps MLflow)
Required environment:
CLAWBENCH_NAMESPACE Namespace for OpenClaw + eval
OPENAI_API_KEY Model provider API key (or ANTHROPIC_API_KEY, etc.)
Optional environment:
CLAWBENCH_IMAGE Clawbench image (default: quay.io/sallyom/clawbench:latest)
OPENCLAW_IMAGE OpenClaw image (default: ghcr.io/openclaw/openclaw:latest)
CLAWBENCH_MODEL Model to eval (default: openai/gpt-5.5)
MLFLOW_NAMESPACE MLflow namespace (default: mlflow)
MLFLOW_TRACKING_URI External MLflow URI (skips MLflow deploy)
MLFLOW_EXPERIMENT_ID MLflow experiment ID
MLFLOW_EXPERIMENT_NAME MLflow experiment name
MLFLOW_IMAGE MLflow image (default: ghcr.io/mlflow/mlflow:v2.21.3)
ANTHROPIC_API_KEY Anthropic key (added to secret if set)
OPENROUTER_API_KEY OpenRouter key (added to secret if set)
GEMINI_API_KEY Gemini key (added to secret if set)
OPENAI_API_BASE Base URL for OpenAI-compatible endpoints (patched into gateway config)
Works on Kubernetes and OpenShift.
HELP
exit 0
fi
if [[ -z "$NS" ]]; then
echo "CLAWBENCH_NAMESPACE is required." >&2
echo " export CLAWBENCH_NAMESPACE=clawbench-eval" >&2
exit 1
fi
MODE="full"
while [[ $# -gt 0 ]]; do
case "$1" in
--openclaw-only) MODE="openclaw-only" ;;
--mlflow-only) MODE="mlflow-only" ;;
--add-sidecar) MODE="add-sidecar" ;;
--remove-sidecar) MODE="remove-sidecar" ;;
--logs) MODE="logs" ;;
--teardown) MODE="teardown" ;;
*) echo "Unknown option: $1" >&2; exit 1 ;;
esac
shift
done
# ---------------------------------------------------------------------------
# --logs
# ---------------------------------------------------------------------------
if [[ "$MODE" == "logs" ]]; then
kubectl logs deploy/openclaw -c clawbench -n "$NS" -f
exit 0
fi
# ---------------------------------------------------------------------------
# --teardown
# ---------------------------------------------------------------------------
if [[ "$MODE" == "teardown" ]]; then
echo "Deleting namespace '$NS'..."
kubectl delete namespace "$NS" --ignore-not-found
echo "Done. MLflow namespace '$MLFLOW_NS' was not deleted."
exit 0
fi
# ---------------------------------------------------------------------------
# --remove-sidecar
# ---------------------------------------------------------------------------
if [[ "$MODE" == "remove-sidecar" ]]; then
echo "Removing clawbench sidecar from openclaw in namespace '$NS'..."
INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next((i for i,c in enumerate(cs) if c['name']=='clawbench'),-1))")
if [[ "$INDEX" == "-1" ]]; then
echo "No clawbench sidecar found."
else
kubectl patch deploy/openclaw -n "$NS" --type=json \
-p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]"
echo "Sidecar removed."
fi
exit 0
fi
# ---------------------------------------------------------------------------
# Create namespace + secret
# ---------------------------------------------------------------------------
ensure_namespace_and_secret() {
if ! kubectl get namespace "$NS" &>/dev/null; then
echo "Creating namespace '$NS'..."
kubectl create namespace "$NS"
fi
if ! kubectl get secret clawbench-secrets -n "$NS" &>/dev/null; then
echo "Creating clawbench-secrets..."
GATEWAY_TOKEN=$(python3 -c "import secrets,base64; print(base64.b64encode(secrets.token_bytes(32)).decode())")
SECRET_ARGS=(
--from-literal=OPENCLAW_GATEWAY_TOKEN="$GATEWAY_TOKEN"
)
[[ -n "${OPENAI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENAI_API_KEY="$OPENAI_API_KEY")
[[ -n "${ANTHROPIC_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=ANTHROPIC_API_KEY="$ANTHROPIC_API_KEY")
[[ -n "${OPENROUTER_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=OPENROUTER_API_KEY="$OPENROUTER_API_KEY")
[[ -n "${GEMINI_API_KEY:-}" ]] && SECRET_ARGS+=(--from-literal=GEMINI_API_KEY="$GEMINI_API_KEY")
if [[ ${#SECRET_ARGS[@]} -eq 1 ]]; then
echo "Warning: No API keys provided. Set OPENAI_API_KEY or another provider key." >&2
fi
kubectl create secret generic clawbench-secrets -n "$NS" "${SECRET_ARGS[@]}"
echo " Gateway token: generated"
[[ -n "${OPENAI_API_KEY:-}" ]] && echo " OPENAI_API_KEY: set"
[[ -n "${ANTHROPIC_API_KEY:-}" ]] && echo " ANTHROPIC_API_KEY: set"
[[ -n "${OPENROUTER_API_KEY:-}" ]] && echo " OPENROUTER_API_KEY: set"
[[ -n "${GEMINI_API_KEY:-}" ]] && echo " GEMINI_API_KEY: set"
else
echo "Secret clawbench-secrets already exists in '$NS'."
fi
}
# ---------------------------------------------------------------------------
# Step 1: Deploy OpenClaw
# ---------------------------------------------------------------------------
deploy_openclaw() {
echo ""
echo "Step 1: Deploying OpenClaw gateway (image: $OPENCLAW_IMG)..."
kubectl apply -f "$SCRIPT_DIR/openclaw/configmap.yaml" -n "$NS"
# Patch gateway config with custom OpenAI-compatible base URL
if [[ -n "${OPENAI_API_BASE:-}" ]]; then
echo " Patching gateway config: models.providers.openai.baseUrl = $OPENAI_API_BASE"
EXISTING_JSON=$(kubectl get configmap openclaw-config -n "$NS" -o jsonpath='{.data.openclaw\.json}')
PATCHED_JSON=$(echo "$EXISTING_JSON" | python3 -c "
import json, sys, os
cfg = json.load(sys.stdin)
openai_cfg = cfg.setdefault('models', {}).setdefault('providers', {}).setdefault('openai', {})
openai_cfg['baseUrl'] = os.environ['OPENAI_API_BASE']
openai_cfg.setdefault('models', [])
json.dump(cfg, sys.stdout, indent=2)
")
kubectl create configmap openclaw-config -n "$NS" \
--from-literal="openclaw.json=$PATCHED_JSON" \
--dry-run=client -o yaml | kubectl apply -f - -n "$NS" >/dev/null
fi
kubectl apply -f "$SCRIPT_DIR/openclaw/pvc.yaml" -n "$NS"
kubectl apply -f "$SCRIPT_DIR/openclaw/service.yaml" -n "$NS"
kubectl apply -f "$SCRIPT_DIR/openclaw/deployment.yaml" -n "$NS"
# Override the image only when a non-default one was requested.
if [[ "$OPENCLAW_IMG" != "ghcr.io/openclaw/openclaw:latest" ]]; then
  kubectl set image "deploy/openclaw" "gateway=$OPENCLAW_IMG" -n "$NS"
fi
echo "Waiting for OpenClaw rollout..."
kubectl rollout status deploy/openclaw -n "$NS" --timeout=180s || \
echo " (rollout still in progress)"
echo "OpenClaw deployed."
}
# ---------------------------------------------------------------------------
# Step 2: Deploy MLflow
# ---------------------------------------------------------------------------
deploy_mlflow() {
if [[ -n "${MLFLOW_TRACKING_URI:-}" ]]; then
echo ""
echo "Step 2: Skipping MLflow deploy (MLFLOW_TRACKING_URI is set: $MLFLOW_TRACKING_URI)"
return
fi
echo ""
echo "Step 2: Deploying MLflow (namespace: $MLFLOW_NS, image: $MLFLOW_IMG)..."
if ! kubectl get namespace "$MLFLOW_NS" &>/dev/null; then
kubectl create namespace "$MLFLOW_NS"
fi
kubectl apply -f "$SCRIPT_DIR/mlflow/pvc.yaml" -n "$MLFLOW_NS"
kubectl apply -f "$SCRIPT_DIR/mlflow/service.yaml" -n "$MLFLOW_NS"
kubectl apply -f "$SCRIPT_DIR/mlflow/deployment.yaml" -n "$MLFLOW_NS"
# Override the image only when a non-default one was requested.
if [[ "$MLFLOW_IMG" != "ghcr.io/mlflow/mlflow:v2.21.3" ]]; then
  kubectl set image "deploy/mlflow" "mlflow=$MLFLOW_IMG" -n "$MLFLOW_NS"
fi
echo "Waiting for MLflow rollout..."
kubectl rollout status deploy/mlflow -n "$MLFLOW_NS" --timeout=120s || \
echo " (rollout still in progress)"
MLFLOW_TRACKING_URI="http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000"
echo "MLflow deployed: $MLFLOW_TRACKING_URI"
}
# ---------------------------------------------------------------------------
# Step 3: Add clawbench sidecar (starts eval)
# ---------------------------------------------------------------------------
add_sidecar() {
echo ""
echo "Step 3: Adding clawbench eval sidecar..."
echo "Applying clawbench ConfigMap..."
kubectl apply -f "$SCRIPT_DIR/manifests/configmap.yaml" -n "$NS" >/dev/null
if [[ -n "${CLAWBENCH_MODEL:-}" ]]; then
kubectl patch configmap clawbench-config -n "$NS" \
--type merge -p "{\"data\":{\"CLAWBENCH_MODEL\":\"$CLAWBENCH_MODEL\"}}" >/dev/null
echo " Model: $CLAWBENCH_MODEL"
fi
if [[ -n "${OPENAI_API_BASE:-}" ]]; then
kubectl patch configmap clawbench-config -n "$NS" \
--type merge -p "{\"data\":{\"OPENAI_API_BASE\":\"$OPENAI_API_BASE\"}}" >/dev/null
echo " OpenAI API base: $OPENAI_API_BASE"
fi
# Patch MLflow settings into ConfigMap
PATCH_DATA=""
MLFLOW_URI="${MLFLOW_TRACKING_URI:-http://mlflow-service.${MLFLOW_NS}.svc.cluster.local:5000}"
PATCH_DATA="\"MLFLOW_TRACKING_URI\":\"$MLFLOW_URI\""
if [[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]]; then
PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_ID\":\"$MLFLOW_EXPERIMENT_ID\""
fi
if [[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]]; then
PATCH_DATA="$PATCH_DATA,\"MLFLOW_EXPERIMENT_NAME\":\"$MLFLOW_EXPERIMENT_NAME\""
fi
kubectl patch configmap clawbench-config -n "$NS" \
--type merge -p "{\"data\":{$PATCH_DATA}}" >/dev/null
echo " MLflow URI: $MLFLOW_URI"
[[ -n "${MLFLOW_EXPERIMENT_ID:-}" ]] && echo " MLflow experiment ID: $MLFLOW_EXPERIMENT_ID"
[[ -n "${MLFLOW_EXPERIMENT_NAME:-}" ]] && echo " MLflow experiment name: $MLFLOW_EXPERIMENT_NAME"
# Check if sidecar already exists
HAS_SIDECAR=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print('yes' if any(c['name']=='clawbench' for c in cs) else 'no')")
if [[ "$HAS_SIDECAR" == "yes" ]]; then
echo "Removing existing clawbench sidecar..."
INDEX=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; cs=json.load(sys.stdin)['spec']['template']['spec']['containers']; print(next(i for i,c in enumerate(cs) if c['name']=='clawbench'))")
kubectl patch deploy/openclaw -n "$NS" --type=json \
-p "[{\"op\":\"remove\",\"path\":\"/spec/template/spec/containers/$INDEX\"}]" >/dev/null
fi
# Find openclaw-home volume name
HOME_VOLUME=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "
import json, sys
spec = json.load(sys.stdin)['spec']['template']['spec']
for c in spec['containers']:
    if c['name'] == 'gateway':
        for vm in c.get('volumeMounts', []):
            if vm['mountPath'] == '/home/node/.openclaw':
                print(vm['name'])
                sys.exit(0)
print('openclaw-home')
")
echo "Adding clawbench sidecar (image: $CLAWBENCH_IMG)..."
# Check if results volume already exists
HAS_RESULTS_VOL=$(kubectl get deploy/openclaw -n "$NS" -o json \
| python3 -c "import json,sys; vs=json.load(sys.stdin)['spec']['template']['spec'].get('volumes',[]); print('yes' if any(v['name']=='clawbench-results' for v in vs) else 'no')")
PATCH='[{"op":"add","path":"/spec/template/spec/containers/-","value":{'
PATCH+='"name":"clawbench",'
PATCH+='"image":"'"$CLAWBENCH_IMG"'",'
PATCH+='"imagePullPolicy":"IfNotPresent",'
PATCH+='"command":["/bin/bash","-c","echo \"Waiting for gateway on localhost:18789...\"\nfor i in $(seq 1 90); do\n python3 -c \"import socket; s=socket.create_connection((\\\"127.0.0.1\\\",18789),2); s.close()\" 2>/dev/null && echo \"Gateway ready\" && break\n sleep 2\ndone\n\nif [ -n \"${MLFLOW_TRACKING_URI:-}\" ]; then\n echo \"Checking MLflow at ${MLFLOW_TRACKING_URI}...\"\n python3 -c \"import httpx,os; r=httpx.get(os.environ[\\\"MLFLOW_TRACKING_URI\\\"]+\\\"/health\\\"); print(\\\"MLflow OK:\\\",r.status_code)\" 2>&1 || echo \"MLflow pre-check failed (will retry at log time)\"\nfi\n\necho \"Starting eval...\"\nclawbench run \\\n --model \"${CLAWBENCH_MODEL}\" \\\n --gateway-token \"${OPENCLAW_GATEWAY_TOKEN}\" \\\n --runs \"${CLAWBENCH_RUNS}\" \\\n --concurrency \"${CLAWBENCH_CONCURRENCY}\" \\\n ${CLAWBENCH_JUDGE_MODEL:+--judge-model \"${CLAWBENCH_JUDGE_MODEL}\"} \\\n $([ -n \"${CLAWBENCH_TASKS:-}\" ] && for t in ${CLAWBENCH_TASKS}; do printf -- \"-t %s \" \"$t\"; done) \\\n -o /results/benchmark.json\nRC=$?\nif [ $RC -eq 0 ] && [ -n \"${MLFLOW_TRACKING_URI:-}\" ]; then\n python scripts/log_to_mlflow.py /results/benchmark.json\nfi\necho \"ClawBench finished (exit=$RC)\"\nsleep infinity"],'
PATCH+='"envFrom":[{"configMapRef":{"name":"clawbench-config"}}],'
PATCH+='"env":[{"name":"OPENCLAW_GATEWAY_TOKEN","valueFrom":{"secretKeyRef":{"name":"clawbench-secrets","key":"OPENCLAW_GATEWAY_TOKEN"}}}],'
PATCH+='"resources":{"requests":{"memory":"1Gi","cpu":"500m"},"limits":{"memory":"4Gi","cpu":"2"}},'
PATCH+='"volumeMounts":[{"name":"'"$HOME_VOLUME"'","mountPath":"/home/node/.openclaw"},{"name":"clawbench-results","mountPath":"/results"},{"name":"tmp-volume","mountPath":"/tmp"}],'
PATCH+='"securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]}}'
PATCH+='}}'
if [[ "$HAS_RESULTS_VOL" == "no" ]]; then
PATCH+=',{"op":"add","path":"/spec/template/spec/volumes/-","value":{"name":"clawbench-results","emptyDir":{}}}'
fi
PATCH+=']'
kubectl patch deploy/openclaw -n "$NS" --type=json -p "$PATCH" >/dev/null
echo ""
echo "Waiting for rollout..."
kubectl rollout status deploy/openclaw -n "$NS" --timeout=300s 2>/dev/null || \
echo " (rollout timeout — eval runs for 30-60 min)"
echo ""
echo "Eval is running. Follow logs with:"
echo " ./scripts/k8s/deploy.sh --logs"
echo ""
echo "When finished, remove the sidecar with:"
echo " ./scripts/k8s/deploy.sh --remove-sidecar"
}
# ---------------------------------------------------------------------------
# Execute
# ---------------------------------------------------------------------------
case "$MODE" in
full)
ensure_namespace_and_secret
deploy_openclaw
deploy_mlflow
add_sidecar
;;
openclaw-only)
ensure_namespace_and_secret
deploy_openclaw
echo ""
echo "OpenClaw is running. Next steps:"
echo " ./scripts/k8s/deploy.sh --mlflow-only # Deploy MLflow"
echo " ./scripts/k8s/deploy.sh --add-sidecar # Start eval"
;;
mlflow-only)
deploy_mlflow
;;
add-sidecar)
if ! kubectl get deploy/openclaw -n "$NS" &>/dev/null; then
echo "Deployment 'openclaw' not found in namespace '$NS'." >&2
echo "Deploy OpenClaw first with: ./scripts/k8s/deploy.sh --openclaw-only" >&2
exit 1
fi
add_sidecar
;;
esac

@ -0,0 +1,18 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: clawbench-config
  labels:
    app: clawbench
data:
  CLAWBENCH_MODEL: "openai/gpt-5.5"
  OPENAI_API_BASE: ""
  CLAWBENCH_RUNS: "3"
  CLAWBENCH_CONCURRENCY: "4"
  CLAWBENCH_JUDGE_MODEL: ""
  CLAWBENCH_TASKS: ""
  CLAWBENCH_CONNECT_TIMEOUT: "120"
  CLAWBENCH_REQUEST_TIMEOUT: "300"
  CLAWBENCH_PER_RUN_BUDGET_SECONDS: "600"
  MLFLOW_TRACKING_URI: "http://mlflow-service.mlflow.svc.cluster.local:5000"
  MLFLOW_EXPERIMENT_NAME: "clawbench"

@ -0,0 +1,15 @@
# Reference template — do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: clawbench
type: Opaque
stringData:
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"

@ -0,0 +1,68 @@
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow
  labels:
    app: mlflow
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:v2.21.3
          command:
            - mlflow
            - server
            - --host
            - "0.0.0.0"
            - --port
            - "5000"
            - --backend-store-uri
            - sqlite:///mlflow/mlflow.db
            - --default-artifact-root
            - /mlflow/artifacts
            - --serve-artifacts
          ports:
            - name: http
              containerPort: 5000
              protocol: TCP
          livenessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 15
            periodSeconds: 30
          readinessProbe:
            httpGet:
              path: /health
              port: 5000
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 1Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: mlflow-data
              mountPath: /mlflow
      volumes:
        - name: mlflow-data
          persistentVolumeClaim:
            claimName: mlflow-data-pvc


@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mlflow-data-pvc
  labels:
    app: mlflow
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi


@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: mlflow-service
  labels:
    app: mlflow
spec:
  type: ClusterIP
  selector:
    app: mlflow
  ports:
    - name: http
      port: 5000
      targetPort: 5000
      protocol: TCP


@ -0,0 +1,36 @@
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config
  labels:
    app: openclaw
data:
  openclaw.json: |
    {
      "gateway": {
        "mode": "local",
        "bind": "loopback",
        "port": 18789,
        "auth": {
          "mode": "token"
        }
      },
      "browser": {
        "enabled": true,
        "headless": true,
        "noSandbox": true,
        "ssrfPolicy": {
          "allowedHostnames": ["localhost", "127.0.0.1"]
        }
      },
      "tools": {
        "profile": "coding",
        "alsoAllow": ["browser"]
      },
      "agents": {
        "defaults": {
          "workspace": "~/.openclaw/workspace"
        }
      },
      "cron": { "enabled": false }
    }


@ -0,0 +1,146 @@
# OpenClaw gateway deployment for ClawBench evals.
#
# Build the image with browser support:
#   docker build --build-arg OPENCLAW_INSTALL_BROWSER=1 \
#     -t quay.io/yourorg/openclaw:eval .
#
# Or use upstream without browser (browser eval tasks will score 0):
#   image: ghcr.io/openclaw/openclaw:latest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  replicas: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: openclaw
  template:
    metadata:
      labels:
        app: openclaw
    spec:
      initContainers:
        - name: init-config
          image: registry.access.redhat.com/ubi9-minimal:latest
          command:
            - sh
            - -c
            - |
              cp /config/openclaw.json /home/node/.openclaw/openclaw.json
              chmod 666 /home/node/.openclaw/openclaw.json
              mkdir -p /home/node/.openclaw/workspace
              mkdir -p /home/node/.openclaw/agents
              chmod 777 /home/node/.openclaw /home/node/.openclaw/workspace /home/node/.openclaw/agents
              echo "Config initialized"
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: config-template
              mountPath: /config
          resources:
            limits:
              cpu: 200m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
      containers:
        - name: gateway
          image: ghcr.io/openclaw/openclaw:latest
          imagePullPolicy: IfNotPresent
          command:
            - sh
            - -c
            - umask 007 && exec node dist/index.js gateway run --bind loopback --port 18789 --allow-unconfigured
          env:
            - name: HOME
              value: /home/node
            - name: NODE_ENV
              value: production
            - name: OPENCLAW_CONFIG_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_STATE_DIR
              value: /home/node/.openclaw
            - name: OPENCLAW_GATEWAY_TOKEN
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENCLAW_GATEWAY_TOKEN
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENAI_API_KEY
                  optional: true
            - name: ANTHROPIC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: ANTHROPIC_API_KEY
                  optional: true
            - name: OPENROUTER_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: OPENROUTER_API_KEY
                  optional: true
            - name: GEMINI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: clawbench-secrets
                  key: GEMINI_API_KEY
                  optional: true
          ports:
            - name: gateway
              containerPort: 18789
              protocol: TCP
          livenessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            exec:
              command:
                - node
                - -e
                - "require('http').get('http://127.0.0.1:18789/',r=>process.exit(r.statusCode<400?0:1)).on('error',()=>process.exit(1))"
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
          resources:
            requests:
              cpu: 250m
              memory: 1Gi
            limits:
              cpu: "2"
              memory: 4Gi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - name: openclaw-home
              mountPath: /home/node/.openclaw
            - name: tmp-volume
              mountPath: /tmp
      terminationGracePeriodSeconds: 30
      volumes:
        - name: openclaw-home
          persistentVolumeClaim:
            claimName: openclaw-home-pvc
        - name: config-template
          configMap:
            name: openclaw-config
        - name: tmp-volume
          emptyDir: {}


@ -0,0 +1,12 @@
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: openclaw-home-pvc
  labels:
    app: openclaw
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi


@ -0,0 +1,17 @@
# Reference template: do NOT apply directly.
# The deploy script (scripts/k8s/deploy.sh) creates this secret automatically
# from exported environment variables (OPENAI_API_KEY, etc.).
apiVersion: v1
kind: Secret
metadata:
  name: clawbench-secrets
  labels:
    app: openclaw
type: Opaque
stringData:
  OPENCLAW_GATEWAY_TOKEN: "REPLACE_ME"
  OPENAI_API_KEY: "REPLACE_ME"
  # Add other provider keys as needed:
  # ANTHROPIC_API_KEY: "REPLACE_ME"
  # OPENROUTER_API_KEY: "REPLACE_ME"
  # GEMINI_API_KEY: "REPLACE_ME"


@ -0,0 +1,15 @@
apiVersion: v1
kind: Service
metadata:
  name: openclaw
  labels:
    app: openclaw
spec:
  type: ClusterIP
  selector:
    app: openclaw
  ports:
    - name: gateway
      port: 18789
      targetPort: 18789
      protocol: TCP

scripts/log_to_mlflow.py Normal file

@ -0,0 +1,125 @@
#!/usr/bin/env python3
"""Log a ClawBench BenchmarkResult to MLflow.

Standalone script -- not imported by the clawbench package.
Requires: pip install mlflow (or pip install clawbench[mlflow])

Usage:
    python scripts/log_to_mlflow.py /results/benchmark.json

Environment:
    MLFLOW_TRACKING_URI     MLflow tracking server, read by MLflow itself
                            (e.g. http://localhost:5000)
    MLFLOW_EXPERIMENT_ID    Experiment ID; takes precedence over the name when set
    MLFLOW_EXPERIMENT_NAME  Experiment name (default: clawbench)
"""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path

# Make the repository root importable so `clawbench` resolves when the script
# is run directly from a checkout.
sys.path.insert(0, str(Path(__file__).resolve().parent.parent))


def main(result_path: str) -> None:
    try:
        import mlflow
    except ImportError:
        print(
            "mlflow is not installed. Install with: pip install mlflow"
            " (or pip install clawbench[mlflow])",
            file=sys.stderr,
        )
        sys.exit(1)

    from clawbench.schemas import BenchmarkResult

    with open(result_path, encoding="utf-8") as f:
        result = BenchmarkResult(**json.load(f))

    experiment_id = os.environ.get("MLFLOW_EXPERIMENT_ID")
    if experiment_id:
        experiment = mlflow.set_experiment(experiment_id=experiment_id)
    else:
        experiment = mlflow.set_experiment(os.environ.get("MLFLOW_EXPERIMENT_NAME", "clawbench"))

    run_name = f"{result.model}-{result.submission_id[:8]}"

    with mlflow.start_run(run_name=run_name):
        mlflow.log_params(
            {
                "model": result.model,
                "provider": result.provider,
                "benchmark_version": result.benchmark_version,
                "openclaw_version": result.openclaw_version or "unknown",
                "judge_model": result.judge_model or "none",
                "task_snapshot_fingerprint": result.task_snapshot_fingerprint or "unknown",
            }
        )

        mlflow.log_metrics(
            {
                "overall_score": result.overall_score,
                "overall_completion": result.overall_completion,
                "overall_trajectory": result.overall_trajectory,
                "overall_behavior": result.overall_behavior,
                "overall_reliability": result.overall_reliability,
                "overall_pass_hat_k": result.overall_pass_hat_k,
                "overall_judge_score": result.overall_judge_score,
                "overall_judge_confidence": result.overall_judge_confidence,
                "overall_judge_pass_rate": result.overall_judge_pass_rate,
                "judge_task_coverage": result.judge_task_coverage,
                "overall_weighted_query_score": result.overall_weighted_query_score,
                "overall_median_latency_ms": result.overall_median_latency_ms,
                "overall_p95_latency_ms": result.overall_p95_latency_ms,
                "overall_total_tokens": result.overall_total_tokens,
                "overall_cost_usd": result.overall_cost_usd,
                "overall_tokens_per_pass": result.overall_tokens_per_pass,
                "overall_cost_per_pass": result.overall_cost_per_pass,
                "overall_ci_lower": result.overall_ci_lower,
                "overall_ci_upper": result.overall_ci_upper,
            }
        )

        # Per-tier aggregates, namespaced as "<tier>/<metric>".
        for tier in result.tier_results:
            mlflow.log_metrics(
                {
                    f"{tier.tier}/score": tier.mean_task_score,
                    f"{tier.tier}/completion": tier.mean_completion,
                    f"{tier.tier}/trajectory": tier.mean_trajectory,
                    f"{tier.tier}/behavior": tier.mean_behavior,
                    f"{tier.tier}/reliability": tier.mean_reliability,
                }
            )

        # Per-task metrics; the task's position in the result list is used as the step.
        for i, task in enumerate(result.task_results):
            mlflow.log_metrics(
                {
                    f"task/{task.task_id}/score": task.mean_task_score,
                    f"task/{task.task_id}/reliability": task.reliability_score,
                },
                step=i,
            )

        mlflow.set_tags(
            {
                "submission_id": result.submission_id,
                "timestamp": result.timestamp,
                "certified": str(result.certified),
            }
        )

        # Artifact upload is best-effort: metrics and params are already logged.
        try:
            mlflow.log_artifact(result_path)
        except Exception as e:
            print(f"Warning: artifact upload failed: {e}", file=sys.stderr)
            print("Metrics and params were logged successfully.", file=sys.stderr)

    print(f"Logged to MLflow: experiment={experiment.name} run={run_name}")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print(f"Usage: {sys.argv[0]} <result.json>", file=sys.stderr)
        sys.exit(1)
    main(sys.argv[1])
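
MLflow metric values must be numeric. If any of the `BenchmarkResult` fields logged above can be `None` (for example the judge scores when no judge model is configured), the corresponding `mlflow.log_metrics` call would be rejected. Whether those fields are nullable is an assumption here, since the schema is not part of this diff. A minimal sketch of a filter that could wrap each metrics dict, using a hypothetical helper name `numeric_metrics`:

```python
import math
from typing import Any, Dict, Mapping


def numeric_metrics(raw: Mapping[str, Any]) -> Dict[str, float]:
    """Keep only entries whose value is a finite real number (hypothetical helper)."""
    clean: Dict[str, float] = {}
    for name, value in raw.items():
        # bool is a subclass of int; skip it so flags are not logged as 0/1.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            continue
        # NaN and infinities are dropped too, as a conservative choice.
        if not math.isfinite(value):
            continue
        clean[name] = float(value)
    return clean
```

It would be used as `mlflow.log_metrics(numeric_metrics({...}))`, so a missing judge score skips one metric instead of failing the whole run.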