docs: fix ollama profile guidance

scoootscooob 2026-04-16 19:49:04 -07:00
parent 517f2207b0
commit ee8ff79347
2 changed files with 17 additions and 15 deletions

@@ -258,11 +258,11 @@ A single consumer GPU running an open-weight model through
algorithmic ideas, and submit scored results — no API keys or cloud spend
required.
Profiles tested locally can be submitted as pull requests. The official
ClawBench CI re-evaluates merged profiles against frontier models, so
researchers and small teams can contribute configurations and novel
strategies (tool-routing, memory architectures, prompt scaffolding) while
the project handles the expensive runs.
Profiles tested locally can still be submitted as pull requests with
reference results. The built-in GitHub Actions workflows in this repo only
run the test suite and deployment sync, so treat local Ollama numbers as
contributor-side evidence unless a maintainer separately reruns them on
other infrastructure.
```bash
# Pull a model and set your gateway token
@@ -275,18 +275,18 @@ clawbench run --model ollama/gpt-oss:20b --task t1-fs-quick-note --runs 1
# Tier-1 sweep with confidence intervals
clawbench run --model ollama/gpt-oss:20b --tier tier1 --runs 5
# Full local eval with a plugin profile (see profiles/ for examples)
clawbench run --model ollama/gpt-oss:20b \
--profile profiles/local_ollama_gpt_oss.yaml \
--tier tier1 --tier tier2 --runs 5 --concurrency 2
# Tier-2 sweep (run separately; the CLI accepts one --tier at a time)
clawbench run --model ollama/gpt-oss:20b --tier tier2 --runs 5 --concurrency 2
# Inspect the reference profile's fingerprint and historical neighbors
clawbench diagnose profiles/local_ollama_gpt_oss.yaml
```
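Because the CLI accepts one `--tier` per invocation, a multi-tier sweep can be wrapped in a small loop. The sketch below is illustrative only: `sweep_tiers` and the `CLAWBENCH` override variable are hypothetical names that are not part of ClawBench, and only the `run`, `--model`, `--tier`, and `--runs` arguments shown above are used.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ClawBench): run one `clawbench run` per tier.
# CLAWBENCH is an assumed override for the binary name, handy for dry runs.
sweep_tiers() {
  local model="$1" runs="$2"
  shift 2
  local tier
  for tier in "$@"; do
    # Stop at the first failing tier so partial results are not mistaken for a full sweep.
    "${CLAWBENCH:-clawbench}" run --model "$model" --tier "$tier" --runs "$runs" || return 1
  done
}
```

Calling `sweep_tiers ollama/gpt-oss:20b 5 tier1 tier2` would then run the two sweeps back to back; setting `CLAWBENCH=echo` first prints the commands instead of executing them.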
**Reference results** (gpt-oss:20b, RTX 4090, Docker sandbox, network=none):
**Reference contributor-side results** (gpt-oss:20b, RTX 4090, Docker sandbox, network=none):
| Scope | Score | CI | Completion | Trajectory | Behavior |
|---|---|---|---|---|---|
| Tier-1 (6 tasks × 3 runs) | 0.397 | 0.346–0.447 | 0.056 | 0.522 | 1.000 |
| Tier-1+2 (14 tasks × 5 runs) | 0.467 | 0.375–0.545 | 0.202 | 0.666 | 0.850 |
High trajectory/behavior but low completion — the model uses tools correctly
but writes to wrong paths or misses format constraints. This gap is where

@@ -2,11 +2,13 @@ profile:
name: local-ollama-gpt-oss
base_model: ollama/gpt-oss:20b
notes: |
Reference profile for local Ollama models. Change base_model and
tools_allow to match your setup, run with --profile, and submit
your results as a PR.
Reference profile metadata for local Ollama experiments. Change
base_model and tools_allow to match your setup, then use
`clawbench diagnose` to inspect the fingerprint and attach your
benchmark results separately in a PR.
Baseline (gpt-oss:20b, RTX 4090, Docker sandbox, network=none):
Contributor-side baseline (gpt-oss:20b, RTX 4090, Docker sandbox,
network=none):
Tier-1+2: 0.467 (CI 0.375-0.545), T=0.666, B=0.850, C=0.202
plugins:
enabled: