WebSocket client + CLI harness + pytest suite that exercises each axis of a CKBunker + Coldcard Mk4 policy and asserts the expected outcomes, including the critical negative test that a large PSBT without TOTP is rejected with a specific 'rule #1: need user(s) confirmation' reason. Configuration via .env / YAML / CLI flags, two pre-crafted test PSBTs as fixtures (generation guide in fixtures/README.md), dashboard counter scraper as sanity check, design rationale in docs/.
Why this harness exists, and why it's written the way it is
Why a harness at all
The Coldcard HSM's whole value proposition is that the policy on the device is what enforces safety — not the VM, not the network, not the operator. That's a great story, until someone mis-installs a policy file and nobody notices because the "happy path" (small, auto-approved txs) still works.
Failure modes this harness is designed to catch:
- Policy rule collapse: the auto-approve rule (Rule #2) is loaded but the user-auth rule (Rule #1) is missing or weakened, so large transactions sign without 2FA. The rule1_without_totp_rejects test is the single most important assertion: it attempts to sign an above-threshold transaction without TOTP and requires a specific rejection reason.
- TOTP secret drift: authenticator app rotated, backup unclear, or a policy rewrite issued a new secret without updating the operator's phone. The rule1_with_totp_signs test catches this before you need to send a real transaction.
- Coldcard USB detach: Proxmox USB passthrough occasionally detaches after host reboots. CKBunker starts, the UI renders, but the Coldcard isn't actually attached. The message_signing test catches this cheaply (no UTXO needed).
- Cloudflare Access regression: an accident in the Zero Trust dashboard exposes the bunker to the internet. The harness doesn't directly test CF Access policy, but running it via the Tailscale IP while periodically curl-ing the public hostname catches the "SSO gate missing" case.
- Silent server rejection: CKBunker returns an HTTP 200 with a rejection modal, not an HTTP error code. Automated clients that only check HTTP status can "succeed" against a server that refused to sign. The harness parses the modal and treats rejections as failures when a signature was expected.
Why WebSocket, not HTTP
CKBunker's web UI and its signing protocol live on the same WebSocket endpoint. The HTTP endpoints render HTML only. If you only speak HTTP you can watch the counters but can't cause a sign. The harness needs to cause signs — so WebSocket.
An unfortunate side-effect: Cloudflare Access with service tokens doesn't pass the WebSocket upgrade cleanly. This is why the harness assumes a private ingress (Tailscale) is available even for CF-fronted deployments.
Why a custom client and not upstream
Upstream CKBunker ships a ckbunker console script, but in v0.9.1 it
has a broken import path (tries to import main from outside the
package). There is no packaged Python client. The 500-line client in
ckbunker_hsm_sign/client.py is hand-rolled against the observed
WebSocket protocol — small enough to audit, big enough to be useful,
and stable because CKBunker's own Vue front-end doesn't change often.
The cost: if upstream changes frame shapes, this harness will need an
update. The protocol doc (PROTOCOL.md) captures the current shapes so
future changes are easy to diff.
Why the harness doesn't generate PSBTs
Generating spendable PSBTs requires the Coldcard's xpub, a UTXO, and
a recipient. That's significant state that differs per deployment. The
harness stays deployment-agnostic by accepting pre-crafted PSBT
fixtures (see fixtures/README.md).
This also means you don't risk spending real sats on a validation run.
The same large.psbt can be re-used indefinitely for the reject-path
test because the Coldcard rejects on amount, not UTXO availability.
Why config over code
Every deployment has its own policy shape. Rather than hard-code
"10,000 sats" as the auto-approve cap, the harness reads thresholds
from config.yaml and asserts them against outcomes. If your Rule #2
per-txn cap is 50,000 sats, you:
- Edit config.yaml and set policy.auto_approve.per_txn_sats: 50000.
- Craft small.psbt at 49,000 sats and large.psbt at 100,000 sats.
- Run the harness.
No code changes. The outcomes the harness asserts are framed as "this PSBT should/shouldn't sign in this path", not "this specific sat amount should sign".
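A minimal sketch of how a configured threshold could drive the expected outcome instead of a hard-coded amount. The dict stands in for the parsed config.yaml; only the policy.auto_approve.per_txn_sats key comes from the text above. The function names, the "auto_approve" / "needs_totp" labels, and the inclusive treatment of the cap are assumptions for illustration, not the harness's actual code.

```python
# Stand-in for the parsed contents of config.yaml.
config = {"policy": {"auto_approve": {"per_txn_sats": 50_000}}}


def expected_path(amount_sats: int, cfg: dict) -> str:
    """Return which policy path a PSBT of this amount should take."""
    cap = cfg["policy"]["auto_approve"]["per_txn_sats"]
    # Assumption: the cap is inclusive, so amount == cap still auto-approves.
    return "auto_approve" if amount_sats <= cap else "needs_totp"


def check_fixtures(small_sats: int, large_sats: int, cfg: dict) -> list[str]:
    """Return a list of problems with the fixture amounts (empty if fine)."""
    problems = []
    if expected_path(small_sats, cfg) != "auto_approve":
        problems.append("small.psbt exceeds the auto-approve cap")
    if expected_path(large_sats, cfg) != "needs_totp":
        problems.append("large.psbt does not exceed the auto-approve cap")
    return problems
```

A pre-flight check like this would flag a fixture that no longer straddles the cap after a config edit, before any PSBT reaches the Coldcard.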
Why pytest AND a CLI
Different operators want different ergonomics:
- hsm_validate.py (CLI): human-readable coloured output, runs the tests in order, exits 0/1/2. Good for oncall dashboards, cron monitors, demoing to stakeholders.
- pytest tests/: integrates with existing CI, produces JUnit XML, lets you parametrise against multiple environments. Good for automated deploy gates.
Both paths share the same client, fixtures, and config loader — there's no duplication.
Why the tests are numbered (test_01, test_02 …)
pytest doesn't guarantee execution order across files. The numbered
prefixes ensure the order reads top-to-bottom when presented (by
collection order and by pytest -v output), matching the narrative
of the CLI harness. This helps when screenshotting a run for an
incident report — the sequence looks sensible.
Why we scrape the dashboard at all
The counters test is a sanity check against client-side deception. If a future bug in the client mis-identifies a rejection as a signature (or vice versa), the dashboard deltas reveal it: the Coldcard doesn't lie about whether it signed, and the dashboard reflects Coldcard state. If the harness says "4 signs, 1 reject" but the dashboard shows "0 signs, 0 rejects", something is wrong at the network layer.
The scraper is tolerant: CKBunker versions vary in HTML shape, so if the regex can't find the numbers the test skips rather than fails. The real signing assertions already prove end-to-end correctness.
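The tolerant-scrape idea can be sketched as follows. The regexes and the "Signed" / "Rejected" labels are hypothetical, since the real dashboard markup differs between CKBunker versions; the function names are also illustrative. The point is the shape: a failed parse yields None rather than an error, and the comparison is on deltas, not absolute counts.

```python
import re
from typing import Optional

# Hypothetical labels; adjust to whatever the deployed dashboard renders.
SIGNED_RE = re.compile(r"signed\D*?(\d+)", re.IGNORECASE)
REJECTED_RE = re.compile(r"rejected\D*?(\d+)", re.IGNORECASE)


def scrape_counters(html: str) -> Optional[tuple[int, int]]:
    """Return (signed, rejected), or None if either number can't be found."""
    signed = SIGNED_RE.search(html)
    rejected = REJECTED_RE.search(html)
    if signed is None or rejected is None:
        return None
    return int(signed.group(1)), int(rejected.group(1))


def counters_match(
    before: Optional[tuple[int, int]],
    after: Optional[tuple[int, int]],
    harness_signed: int,
    harness_rejected: int,
) -> Optional[bool]:
    """Compare dashboard deltas with the harness's own tally.

    Returns None when either scrape failed, so the caller can skip.
    """
    if before is None or after is None:
        return None
    delta = (after[0] - before[0], after[1] - before[1])
    return delta == (harness_signed, harness_rejected)
```

In a pytest test, a None result would map naturally onto pytest.skip(...), while False would be a genuine failure.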
Why rejections aren't exceptions
A rejection is a successful policy evaluation — the Coldcard did exactly what it was configured to do. Treating rejections as Python exceptions would:
- force every call site into try/except
- conflate policy behaviour with transport errors (network, timeout)
- hide the rejection reason behind an exception type
Instead, SignResult.status is an enum with four values (SIGNED,
REJECTED, TIMEOUT, WS_ERROR) and the caller asserts the status it
expects. is_expected_rejection("rule #1") keeps the specific-reason
check terse.
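A minimal sketch of that result shape, assuming the status lives in a separate enum and the rejection text in a reason field. The four status names and is_expected_rejection come from the text above; SignStatus, the reason field, and the case-insensitive substring match are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SignStatus(Enum):
    SIGNED = "signed"
    REJECTED = "rejected"
    TIMEOUT = "timeout"
    WS_ERROR = "ws_error"


@dataclass
class SignResult:
    status: SignStatus
    reason: str = ""  # assumed field: rejection text reported by CKBunker

    def is_expected_rejection(self, needle: str) -> bool:
        """True only for a policy rejection whose reason mentions `needle`."""
        return (
            self.status is SignStatus.REJECTED
            and needle.lower() in self.reason.lower()
        )


# The negative test then reads as a plain assertion, with no try/except:
result = SignResult(SignStatus.REJECTED, "rule #1: need user(s) confirmation")
assert result.is_expected_rejection("rule #1")
```

A TIMEOUT or WS_ERROR result fails the same assertion, so a transport problem can never be mistaken for the policy doing its job.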
Why "don't broadcast" is the default
submit_psbt accepts a broadcast flag; broadcast=True asks CKBunker to
push the signed tx. The harness always sends broadcast=False. A
validation run should never touch the mempool. Operators who want to
drive real signings via this client should use it directly, not via
the harness.
Why there's no CI/CD templating
Every shop's CI is different (GitHub Actions, Drone, Gitea Actions,
Jenkins, Woodpecker). Providing a single-vendor pipeline template
would add maintenance burden without saving meaningful integration
time. The hsm_validate.py CLI returns exit code 0 on success, 1 on
failure — which is all any CI needs. Integration examples live in the
README.
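The exit-code contract is small enough to sketch. The function name and the outcomes mapping are hypothetical, and treating "no checks ran" as a failure is an assumption of this sketch rather than documented behaviour.

```python
import sys


def exit_code(outcomes: dict[str, bool]) -> int:
    """Return 0 if every named check passed, 1 otherwise.

    Assumption: an empty run counts as a failure, so a misconfigured
    harness that executes nothing cannot pass a deploy gate.
    """
    return 0 if outcomes and all(outcomes.values()) else 1


if __name__ == "__main__":
    # Illustrative outcomes only; a real run would fill these in from the checks.
    outcomes = {"message_signing": True, "rule1_without_totp_rejects": True}
    sys.exit(exit_code(outcomes))
```

Any CI system that fails a step on a non-zero exit status can consume this without a vendor-specific template.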