WebSocket client + CLI harness + pytest suite that exercises each axis of a CKBunker + Coldcard Mk4 policy and asserts the expected outcomes, including the critical negative test that a large PSBT without TOTP is rejected with a specific 'rule #1: need user(s) confirmation' reason. Configuration via .env / YAML / CLI flags, two pre-crafted test PSBTs as fixtures (generation guide in fixtures/README.md), dashboard counter scraper as sanity check, design rationale in docs/.
# Why this harness exists, and why it's written the way it is
## Why a harness at all

The Coldcard HSM's whole value proposition is that the **policy on the
device is what enforces safety**: not the VM, not the network, not the
operator. That's a great story, until someone mis-installs a policy file
and nobody notices because the "happy path" (small, auto-approved txs)
still works.

Failure modes this harness is designed to catch:

1. **Policy rule collapse**: the auto-approve rule (Rule #2) is loaded
   but the user-auth rule (Rule #1) is missing or weakened, so large
   transactions sign without 2FA. The **`rule1_without_totp_rejects`
   test** is the single most important assertion: it attempts to sign an
   above-threshold transaction without TOTP and requires a specific
   rejection reason.

2. **TOTP secret drift**: authenticator app rotated, backup unclear, or
   a policy rewrite issued a new secret without updating the operator's
   phone. The **`rule1_with_totp_signs` test** catches this before you
   need to send a real transaction.

3. **Coldcard USB detach**: Proxmox USB passthrough occasionally
   detaches after host reboots. CKBunker starts, the UI renders, but the
   Coldcard isn't actually attached. The **`message_signing` test**
   catches this cheaply (no UTXO needed).

4. **Cloudflare Access regression**: an accident in the Zero Trust
   dashboard exposes the bunker to the internet. The harness doesn't
   directly test CF Access policy, but running it via the Tailscale IP
   while periodically curl-ing the public hostname catches the
   "SSO gate missing" case.

5. **Silent server rejection**: CKBunker returns an HTTP 200 with a
   rejection modal, not an HTTP error code. Automated clients that only
   check HTTP status can "succeed" against a server that refused to
   sign. The harness parses the modal and treats rejections as failures
   when a signature was expected.

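The silent-rejection case (item 5) is the easiest one to get wrong in client code, so here is a minimal sketch of the check. It is not the harness's actual implementation: the `Observed` type, its field names, and the assumption that "no rejection modal" means "signed" are all illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observed:
    """What a client saw for one signing attempt (hypothetical shape)."""
    http_status: int                 # transport-level status code
    rejection_reason: Optional[str]  # text parsed from the rejection modal, if any


def outcome_matches(observed: Observed, expect_signature: bool) -> bool:
    """Return True only when the server did what the test expected."""
    if observed.http_status != 200:
        # A transport-level failure is never an expected policy outcome.
        return False
    # Assumption: the absence of a rejection modal means the request was signed.
    signed = observed.rejection_reason is None
    return signed == expect_signature


# An HTTP 200 that carries a rejection must NOT count as a successful sign.
refused = Observed(200, "rule #1: need user(s) confirmation")
print(outcome_matches(refused, expect_signature=True))   # False
print(outcome_matches(refused, expect_signature=False))  # True
```

A client that only looked at `http_status` would report the first case as a success, which is exactly the failure mode described above.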
## Why WebSocket, not HTTP

CKBunker's web UI and its signing protocol live on the same WebSocket
endpoint. The HTTP endpoints render HTML only. If you only speak HTTP
you can **watch** the counters but can't **cause** a sign. The harness
needs to cause signs, hence WebSocket.

An unfortunate side-effect: Cloudflare Access with service tokens
doesn't pass the WebSocket upgrade cleanly. This is why the harness
assumes a private ingress (Tailscale) is available even for
CF-fronted deployments.

## Why a custom client and not upstream

Upstream CKBunker ships a `ckbunker` console script, but in `v0.9.1` it
has a broken import path (tries to `import main` from outside the
package). There is no packaged Python client. The 500-line client in
`ckbunker_hsm_sign/client.py` is hand-rolled against the observed
WebSocket protocol: small enough to audit, big enough to be useful,
and stable because CKBunker's own Vue front-end doesn't change often.

The cost: if upstream changes frame shapes, this harness will need an
update. The protocol doc (`PROTOCOL.md`) captures the current shapes so
future changes are easy to diff.

## Why the harness doesn't generate PSBTs

**Generating spendable PSBTs requires the Coldcard's xpub, a UTXO, and
a recipient.** That's significant state that differs per deployment. The
harness stays deployment-agnostic by accepting **pre-crafted PSBT
fixtures** (see [`fixtures/README.md`](../fixtures/README.md)).

This also means you don't risk spending real sats on a validation run.
The same `large.psbt` can be re-used indefinitely for the reject-path
test because the Coldcard rejects on **amount**, not UTXO availability.

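Because the fixtures are supplied by the operator, a cheap pre-flight check that a file really is a PSBT saves a confusing failure later. The sketch below relies only on the BIP 174 magic bytes (`psbt` followed by `0xff`). The function name and the choice to accept both binary and base64 input are assumptions, not a description of how the harness loads fixtures.

```python
import base64

PSBT_MAGIC = b"psbt\xff"  # magic bytes defined by BIP 174


def load_psbt_bytes(raw: bytes) -> bytes:
    """Return binary PSBT bytes from binary or base64 input (hypothetical helper).

    Raises ValueError when the input is neither.
    """
    if raw.startswith(PSBT_MAGIC):
        return raw  # already binary
    try:
        # Drop all whitespace so wrapped base64 files decode too.
        decoded = base64.b64decode(b"".join(raw.split()), validate=True)
    except ValueError as exc:  # binascii.Error is a ValueError subclass
        raise ValueError("fixture is neither binary nor base64 PSBT") from exc
    if not decoded.startswith(PSBT_MAGIC):
        raise ValueError("decoded fixture does not start with PSBT magic bytes")
    return decoded


# Round-trip a dummy payload that only carries the magic prefix.
dummy = PSBT_MAGIC + b"\x00"
print(load_psbt_bytes(base64.b64encode(dummy)) == dummy)  # True
```

This check says nothing about whether the PSBT is spendable or above the threshold; it only rules out an empty or mis-encoded fixture file.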
## Why config over code

Every deployment has its own policy shape. Rather than hard-code
"10,000 sats" as the auto-approve cap, the harness reads thresholds
from `config.yaml` and asserts them against outcomes. If your Rule #2
per-txn cap is 50,000 sats, you:

1. Edit `config.yaml` and set `policy.auto_approve.per_txn_sats: 50000`.
2. Craft `small.psbt` at 49,000 sats and `large.psbt` at 100,000 sats.
3. Run the harness.

No code changes. The **outcomes** the harness asserts are framed as
"this PSBT should/shouldn't sign in this path", not "this specific sat
amount should sign".

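The three steps above imply one invariant worth checking before a run: the small fixture must sit at or below the configured cap and the large one above it. The sketch below uses a plain dict that mirrors the `policy.auto_approve.per_txn_sats` key. The `check_fixture_amounts` helper, the operator-declared amounts table, and the assumption that the cap is inclusive are illustrative and not taken from the harness.

```python
from typing import Dict

# Mirrors the structure of config.yaml after step 1.
config = {"policy": {"auto_approve": {"per_txn_sats": 50_000}}}

# Amounts the operator crafted in step 2 (declared by hand, not parsed from the PSBTs).
fixture_amounts_sats = {"small.psbt": 49_000, "large.psbt": 100_000}


def check_fixture_amounts(config: dict, amounts: Dict[str, int]) -> None:
    """Raise ValueError if the fixtures cannot exercise both policy paths."""
    cap = config["policy"]["auto_approve"]["per_txn_sats"]
    # Assumption: the cap is inclusive, so "small" may equal it.
    if amounts["small.psbt"] > cap:
        raise ValueError(f"small.psbt ({amounts['small.psbt']}) exceeds cap {cap}")
    if amounts["large.psbt"] <= cap:
        raise ValueError(f"large.psbt ({amounts['large.psbt']}) does not exceed cap {cap}")


check_fixture_amounts(config, fixture_amounts_sats)  # passes silently
```

If the cap changes and the fixtures do not, this fails before any signing request is sent.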
## Why pytest AND a CLI

Different operators want different ergonomics:

- **`hsm_validate.py`** (CLI): human-readable coloured output, runs the
  tests in order, exits 0/1/2. Good for on-call dashboards, cron monitors,
  demoing to stakeholders.
- **`pytest tests/`**: integrates with existing CI, produces JUnit XML,
  lets you parametrise against multiple environments. Good for
  automated deploy gates.

Both paths share the same client, fixtures, and config loader, so there's
no duplication.

## Why the tests are numbered (`test_01`, `test_02` …)

pytest runs tests in collection order, and across files nothing pins
that order to the sequence you intended. The numbered prefixes ensure
the order reads top-to-bottom when presented (by collection order and
by `pytest -v` output), matching the narrative of the CLI harness. This
helps when screenshotting a run for an incident report, because the
sequence looks sensible.

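A small illustration of why the prefixes are zero-padded. The test names below are hypothetical (only `message_signing`, `rule1_without_totp_rejects`, and `rule1_with_totp_signs` appear in this document, and their real numbering is not stated here); the point is just how lexical sorting treats the prefixes.

```python
# Hypothetical names; the real numbering may differ.
names = [
    "test_03_rule1_with_totp_signs",
    "test_10_dashboard_counters",
    "test_01_message_signing",
    "test_02_rule1_without_totp_rejects",
]

# Zero-padded prefixes sort lexically into the intended narrative order.
ordered = sorted(names)
for name in ordered:
    print(name)

# Without padding, "10" sorts before "2" because comparison is per character.
unpadded = sorted(["test_2_reject", "test_10_counters"])
print(unpadded)  # ['test_10_counters', 'test_2_reject']
```

Within a single file pytest follows definition order rather than name order, so the prefixes there are mainly for the reader. They matter most for file names and for any tooling that sorts results alphabetically.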
## Why we scrape the dashboard at all

The counters test is a **sanity check against client-side deception**.
If a future bug in the client mis-identifies a rejection as a
signature (or vice versa), the dashboard deltas reveal it: the
Coldcard doesn't lie about whether it signed, and the dashboard
reflects Coldcard state. If the harness says "4 signs, 1 reject" but
the dashboard shows "0 signs, 0 rejects", something is wrong in the
client or at the network layer.

The scraper is tolerant: CKBunker versions vary in HTML shape, so if
the regex can't find the numbers the test skips rather than fails.
The real signing assertions already prove end-to-end correctness.

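A minimal sketch of the "skip rather than fail" behaviour described above. The regexes assume dashboard wording such as "4 signed" and "1 rejected", which is a guess at the HTML and not the real CKBunker markup; the function names are likewise hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed wording; real CKBunker HTML differs between versions.
_SIGNED = re.compile(r"(\d+)\s+signed", re.IGNORECASE)
_REJECTED = re.compile(r"(\d+)\s+rejected", re.IGNORECASE)


def scrape_counters(html: str) -> Optional[Tuple[int, int]]:
    """Return (signed, rejected), or None when the page shape is unrecognised."""
    signed = _SIGNED.search(html)
    rejected = _REJECTED.search(html)
    if signed is None or rejected is None:
        return None  # the caller should skip the test, not fail it
    return int(signed.group(1)), int(rejected.group(1))


def deltas_match(before: Tuple[int, int], after: Tuple[int, int],
                 signs: int, rejects: int) -> bool:
    """Compare dashboard deltas with what the client believes happened."""
    return (after[0] - before[0], after[1] - before[1]) == (signs, rejects)


before = scrape_counters("<td>10 signed</td><td>2 rejected</td>")
after = scrape_counters("<td>14 signed</td><td>3 rejected</td>")
print(deltas_match(before, after, signs=4, rejects=1))  # True
print(scrape_counters("<html>redesigned page</html>"))  # None
```

In a pytest test, a `None` result would typically be turned into `pytest.skip(...)` so that an unrecognised dashboard layout never masks or fakes a signing failure.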
## Why rejections aren't exceptions

A rejection is a successful policy evaluation: the **Coldcard did
exactly what it was configured to do**. Treating rejections as Python
exceptions would:

- force every call site into try/except
- conflate policy behaviour with transport errors (network, timeout)
- hide the rejection reason behind an exception type

Instead, `SignResult.status` is an enum with four values (`SIGNED`,
`REJECTED`, `TIMEOUT`, `WS_ERROR`) and the caller asserts the status it
expects. `is_expected_rejection("rule #1")` keeps the specific-reason
check terse.

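The shape described above could look roughly like the following. Only `SignResult.status`, the four status names, and `is_expected_rejection` come from this document; the `SignStatus` enum name, the `reason` field, the case-insensitive substring match, and placing `is_expected_rejection` as a method on `SignResult` are assumptions made for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SignStatus(Enum):  # hypothetical name for the status enum
    SIGNED = "signed"
    REJECTED = "rejected"
    TIMEOUT = "timeout"
    WS_ERROR = "ws_error"


@dataclass
class SignResult:
    status: SignStatus
    reason: Optional[str] = None  # rejection text, when there is one

    def is_expected_rejection(self, needle: str) -> bool:
        """True when this is a policy rejection whose reason mentions `needle`."""
        if self.status is not SignStatus.REJECTED:
            return False  # timeouts and socket errors are not policy decisions
        return needle.lower() in (self.reason or "").lower()


# The critical negative test: a large PSBT without TOTP must be refused by Rule #1.
result = SignResult(SignStatus.REJECTED, "rule #1: need user(s) confirmation")
assert result.is_expected_rejection("rule #1")

# A timeout is not a rejection, so it cannot satisfy the same assertion.
assert not SignResult(SignStatus.TIMEOUT).is_expected_rejection("rule #1")
```

Keeping `TIMEOUT` and `WS_ERROR` as separate statuses is what lets a test distinguish "the policy said no" from "we never got an answer".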
## Why "don't broadcast" is the default

`submit_psbt` accepts a `broadcast` flag; `broadcast=True` asks
CKBunker to push the signed tx. The harness always sends
`broadcast=false`. A validation run should never touch the mempool.
Operators who want to drive real signings via this client should use it
directly, not via the harness.

## Why there's no CI/CD templating

Every shop's CI is different (GitHub Actions, Drone, Gitea Actions,
Jenkins, Woodpecker). Providing a single-vendor pipeline template
would add maintenance burden without saving meaningful integration
time. The `hsm_validate.py` CLI returns exit code 0 on success and a
non-zero code on failure, which is all any CI needs. Integration
examples live in the README.