clawbench/Dockerfile
scoootscooob 8447ab1ca6 docker: revert OpenClaw base pin; remove reference scores
Per request: drop the Docker-base-pinning approach and the inline
reference scores. Treat published numbers as version-, provider-, and
seed-dependent.

Dockerfile: revert FROM ghcr.io/openclaw/openclaw:2026.4.15-beta.1
back to FROM ghcr.io/openclaw/openclaw:latest. Builds will track the
current OpenClaw release. The state-isolation patch and the rejudge
pipeline (the actually load-bearing reproducibility infra) stay in
place; only the pinned-version approach is reverted.

README.md:
  - drops the "Docker base pinning" row from the "What's new" table
    and replaces it with "Reproducibility-first infrastructure" framing
  - drops the "pinned" badge and adds a "Diagnostics" badge instead
  - updates "Reproducibility caveats" to recommend "build both sides
    of any comparison from the same OpenClaw release" rather than
    "pin to 2026.4.15-beta.1"
  - updates Quick Start to record (not assume) the OpenClaw version
    the build resolved to
  - drops the pinned-base row from the comparison table and replaces
    it with "State-isolation per run" (the actually distinguishing infra)
  - updates the version log entry for Core v1 to highlight the
    dynamical-systems diagnostics + state-isolation rather than the
    pinning that's no longer there

tasks-public/README.md:
  - drops the 8-row "Established ranking" table per request
  - replaces it with a "Selection criteria" section that explains how
    the 19 tasks were chosen (0 inversions, min-gap 0.0049) without
    publishing version-dependent scores
  - reframes the build instructions to track :latest with a comment
    about platform-version drift

tasks-public/MANIFEST.yaml:
  - drops `openclaw_version: 2026.4.15-beta.1` (could be misread as
    a hard requirement)
  - drops the `established_ranking` block
  - replaces it with `selection_basis`, which documents the methodology
    and explicitly states why scores are intentionally omitted

Test suite still green: 156 passed locally, 152 passed in the
CI-equivalent (no private tasks/) configuration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 21:24:42 -07:00


# ClawBench HF Docker Space
# Layer the benchmark harness on top of the official OpenClaw image.
FROM ghcr.io/openclaw/openclaw:latest
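# Because :latest moves, it can help to record which image a build resolved to.
# A minimal sketch using standard Docker CLI commands, run on the host (doing
# this as a separate step is an assumption, not part of the original workflow):
#   docker pull ghcr.io/openclaw/openclaw:latest
#   docker image inspect --format '{{index .RepoDigests 0}}' ghcr.io/openclaw/openclaw:latest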
USER root
ENV DEBIAN_FRONTEND=noninteractive
# Install pip and make `python` resolve to python3 (the CMD below calls `python`).
RUN apt-get update && \
    apt-get install -y python3-pip python-is-python3 && \
    rm -rf /var/lib/apt/lists/*
RUN ln -s /app /openclaw
ENV PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
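# Install Chromium via Playwright, then link the first `chrome` file found under
# /ms-playwright (in sorted order) to /usr/bin/chromium. The build fails if that
# file is missing or not executable.
RUN npx -y playwright@1.59.1 install --with-deps chromium && \
    CHROME_PATH="$(find /ms-playwright -path '*/chrome' -type f | sort | head -n 1)" && \
    test -x "$CHROME_PATH" && \
    ln -sf "$CHROME_PATH" /usr/bin/chromium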
ENV HOME=/home/node PATH=/home/node/.local/bin:$PATH
WORKDIR /home/node/app
COPY --chown=node:node pyproject.toml README.md ./
COPY --chown=node:node clawbench/ clawbench/
COPY --chown=node:node tasks/ tasks/
COPY --chown=node:node baselines/ baselines/
COPY --chown=node:node app.py .
RUN python3 -m pip install --break-system-packages --no-cache-dir .
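# Create the results/queue directories and the OpenClaw agent state directories,
# hand them to the node user, and make them writable by any UID.
RUN mkdir -p \
        /data/results \
        /data/queue \
        /home/node/.openclaw/agents/dev \
        /home/node/.openclaw/agents/main/agent && \
    chown -R node:node /data /home/node/.openclaw && \
    chmod -R 777 /data /home/node/.openclaw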
USER node
ENV GATEWAY_PORT=18789
ENV OPENCLAW_HOME=/home/node
ENV OPENCLAW_STATE_DIR=/home/node/.openclaw
EXPOSE 7860
CMD ["python", "app.py"]
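
# Usage sketch. Assumptions: the image tag `clawbench` is a hypothetical name,
# the build context is the directory holding pyproject.toml, README.md, app.py,
# clawbench/, tasks/, and baselines/ (as the COPY lines require), and the app
# listens on the port declared by EXPOSE:
#   docker build -t clawbench .
#   docker run --rm -p 7860:7860 clawbench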