mineracks 9d380f5013 Initial import: CKBunker HSM validation harness

WebSocket client + CLI harness + pytest suite that exercises each axis of
a CKBunker + Coldcard Mk4 policy and asserts the expected outcomes, including
the critical negative test that a large PSBT without TOTP is rejected with
a specific 'rule #1: need user(s) confirmation' reason.

Configuration via .env / YAML / CLI flags, two pre-crafted test PSBTs as
fixtures (generation guide in fixtures/README.md), dashboard counter
scraper as sanity check, design rationale in docs/.

2026-04-14 10:50:04 +10:00

6.9 KiB


Why this harness exists, and why it's written the way it is

Why a harness at all

The Coldcard HSM's whole value proposition is that the policy on the device is what enforces safety — not the VM, not the network, not the operator. That's a great story, until someone mis-installs a policy file and nobody notices because the "happy path" (small, auto-approved txs) still works.

Failure modes this harness is designed to catch:

Policy rule collapse — the auto-approve rule (Rule #2) is loaded but the user-auth rule (Rule #1) is missing or weakened, so large transactions sign without 2FA. The rule1_without_totp_rejects test is the single most important assertion: it attempts to sign an above-threshold transaction without TOTP and requires a specific rejection reason.
TOTP secret drift — authenticator app rotated, backup unclear, or a policy rewrite issued a new secret without updating the operator's phone. The rule1_with_totp_signs test catches this before you need to send a real transaction.
Coldcard USB detach — Proxmox USB passthrough occasionally detaches after host reboots. CKBunker starts, the UI renders, but the Coldcard isn't actually attached. The message_signing test catches this cheaply (no UTXO needed).
Cloudflare Access regression — an accident in the Zero Trust dashboard exposes the bunker to the internet. The harness doesn't directly test CF Access policy, but running it via the Tailscale IP while periodically curl-ing the public hostname catches the "SSO gate missing" case.
Silent server rejection — CKBunker returns an HTTP 200 with a rejection modal, not an HTTP error code. Automated clients that only check HTTP status can "succeed" against a server that refused to sign. The harness parses the modal and treats rejections as failures when a signature was expected.
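The silent-rejection case can be sketched as follows. This is a minimal illustration, not the harness's actual parser: the reply layout (a `modal` object with a `text` field, and a `signed_psbt` key) is a hypothetical placeholder, since the real frame shapes are documented in PROTOCOL.md rather than here.

```python
from typing import Optional


def rejection_reason(reply: dict) -> Optional[str]:
    """Return the modal text if the reply carries a rejection modal, else None.

    The {"modal": {"text": ...}} layout is assumed for illustration only.
    """
    modal = reply.get("modal")
    if isinstance(modal, dict):
        return str(modal.get("text", ""))
    return None


def sign_succeeded(http_status: int, reply: dict) -> bool:
    """Decide the outcome from the reply body, not from the HTTP status alone."""
    if http_status != 200:
        return False
    if rejection_reason(reply) is not None:
        return False
    # "signed_psbt" is a hypothetical key standing in for the signed payload.
    return "signed_psbt" in reply


rejected = {"modal": {"text": "rule #1: need user(s) confirmation"}}
signed = {"signed_psbt": "<base64 placeholder>"}
```

A client that checked only `http_status == 200` would report both replies as successes; the body-based check separates them.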

Why WebSocket, not HTTP

CKBunker's web UI and its signing protocol live on the same WebSocket endpoint. The HTTP endpoints render HTML only. If you only speak HTTP you can watch the counters but can't cause a sign. The harness needs to cause signs — so WebSocket.

An unfortunate side-effect: Cloudflare Access with service tokens doesn't pass the WebSocket upgrade cleanly. This is why the harness assumes a private ingress (Tailscale) is available even for CF-fronted deployments.

Why a custom client and not upstream

Upstream CKBunker ships a ckbunker console script, but in v0.9.1 it has a broken import path (tries to import main from outside the package). There is no packaged Python client. The 500-line client in ckbunker_hsm_sign/client.py is hand-rolled against the observed WebSocket protocol — small enough to audit, big enough to be useful, and stable because CKBunker's own Vue front-end doesn't change often.

The cost: if upstream changes frame shapes, this harness will need an update. The protocol doc (PROTOCOL.md) captures the current shapes so future changes are easy to diff.

Why the harness doesn't generate PSBTs

Generating spendable PSBTs requires the Coldcard's xpub, a UTXO, and a recipient. That's significant state that differs per deployment. The harness stays deployment-agnostic by accepting pre-crafted PSBT fixtures (see fixtures/README.md).

This also means you don't risk spending real sats on a validation run. The same large.psbt can be re-used indefinitely for the reject-path test because the Coldcard rejects on amount, not UTXO availability.

Why config over code

Every deployment has its own policy shape. Rather than hard-code "10,000 sats" as the auto-approve cap, the harness reads thresholds from config.yaml and asserts them against outcomes. If your Rule #2 per-txn cap is 50,000 sats, you:

Edit config.yaml — set policy.auto_approve.per_txn_sats: 50000.
Craft small.psbt at 49,000 sats and large.psbt at 100,000 sats.
Run the harness.

No code changes. The outcomes the harness asserts are framed as "this PSBT should/shouldn't sign in this path", not "this specific sat amount should sign".
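A rough sketch of how an expectation could be derived from config rather than hard-coded. The key path `policy.auto_approve.per_txn_sats` comes from the steps above; the `PsbtFixture` type, the `expect_auto_approve` helper, and the assumption that the cap is inclusive are illustrative only. A plain dict stands in for the parsed config.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PsbtFixture:
    """Hypothetical description of a pre-crafted PSBT fixture."""
    name: str
    amount_sats: int


def expect_auto_approve(fixture: PsbtFixture, config: dict) -> bool:
    """True if the fixture should sign without TOTP under the configured cap.

    Assumes the cap is inclusive; check your policy if the boundary matters.
    """
    cap = config["policy"]["auto_approve"]["per_txn_sats"]
    return fixture.amount_sats <= cap


# Stand-in for the parsed config.yaml from the example above.
config = {"policy": {"auto_approve": {"per_txn_sats": 50000}}}
small = PsbtFixture("small.psbt", 49_000)
large = PsbtFixture("large.psbt", 100_000)
```

Changing the cap means editing only the config value and re-crafting the two fixtures; the assertion logic stays the same.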

Why pytest AND a CLI

Different operators want different ergonomics:

hsm_validate.py (CLI) — human-readable coloured output, runs the tests in order, exits 0/1/2. Good for oncall dashboards, cron monitors, demoing to stakeholders.
pytest tests/ — integrates with existing CI, produces JUnit XML, lets you parametrise against multiple environments. Good for automated deploy gates.

Both paths share the same client, fixtures, and config loader — there's no duplication.

Why the tests are numbered (`test_01`, `test_02` …)

pytest runs tests in collection order, which follows file names and definition order rather than any intended narrative. The numbered prefixes make that order read top-to-bottom (in collection and in pytest -v output), matching the narrative of the CLI harness. This helps when screenshotting a run for an incident report: the sequence looks sensible.

Why we scrape the dashboard at all

The counters test is a sanity check against client-side deception. If a future bug in the client mis-identifies a rejection as a signature (or vice versa), the dashboard deltas reveal it: the Coldcard doesn't lie about whether it signed, and the dashboard reflects Coldcard state. If the harness says "4 signs, 1 reject" but the dashboard shows "0 signs, 0 rejects", something is wrong between the client and the Coldcard: either the client misread the replies or the requests never reached the device.

The scraper is tolerant: CKBunker versions vary in HTML shape, so if the regex can't find the numbers the test skips rather than fails. The real signing assertions already prove end-to-end correctness.
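The tolerant scrape and the delta comparison might look roughly like this. The regex and the "Signed: N / Rejected: N" wording are assumptions for illustration; real CKBunker HTML will differ by version, which is exactly why a failed match yields `None` instead of an error.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern; the actual dashboard markup is not specified here.
_COUNTERS = re.compile(r"Signed:\s*(\d+).*?Rejected:\s*(\d+)", re.S | re.I)

Counters = Tuple[int, int]  # (signs, rejects)


def parse_counters(html: str) -> Optional[Counters]:
    """Return (signs, rejects), or None if the page shape is not recognised."""
    match = _COUNTERS.search(html)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def deltas_match(before: Optional[Counters], after: Optional[Counters],
                 harness_signs: int, harness_rejects: int) -> Optional[bool]:
    """Compare dashboard deltas with the harness tally.

    Returns None when either scrape failed, so the caller can skip.
    """
    if before is None or after is None:
        return None
    delta = (after[0] - before[0], after[1] - before[1])
    return delta == (harness_signs, harness_rejects)
```

In a pytest test, a `None` result would translate to `pytest.skip(...)`, while `False` would fail the test.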

Why rejections aren't exceptions

A rejection is a successful policy evaluation — the Coldcard did exactly what it was configured to do. Treating rejections as Python exceptions would:

force every call site into try/except
conflate policy behaviour with transport errors (network, timeout)
hide the rejection reason behind an exception type

Instead, SignResult.status is an enum with four values (SIGNED, REJECTED, TIMEOUT, WS_ERROR) and the caller asserts the status it expects. is_expected_rejection("rule #1") keeps the specific-reason check terse.
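A minimal sketch of that shape, assuming a `reason` field and a case-insensitive substring match; the real definitions in ckbunker_hsm_sign/client.py may differ, and `SignStatus` is a name chosen here for the enum.

```python
from dataclasses import dataclass
from enum import Enum


class SignStatus(Enum):
    SIGNED = "signed"
    REJECTED = "rejected"
    TIMEOUT = "timeout"
    WS_ERROR = "ws_error"


@dataclass(frozen=True)
class SignResult:
    status: SignStatus
    reason: str = ""  # assumed field: rejection text reported by the server

    def is_expected_rejection(self, needle: str) -> bool:
        """True only for a policy rejection whose reason mentions `needle`."""
        return (self.status is SignStatus.REJECTED
                and needle.lower() in self.reason.lower())


result = SignResult(SignStatus.REJECTED, "rule #1: need user(s) confirmation")
timed_out = SignResult(SignStatus.TIMEOUT)
```

A call site then reads as a plain assertion, for example `assert result.is_expected_rejection("rule #1")`, and a timeout can never be mistaken for a policy rejection.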

Why "don't broadcast" is the default

submit_psbt accepts a broadcast flag; setting broadcast=True asks CKBunker to push the signed tx. The harness always sends broadcast=false, because a validation run should never touch the mempool. Operators who want to drive real signings via this client should use it directly, not via the harness.

Why there's no CI/CD templating

Every shop's CI is different (GitHub Actions, Drone, Gitea Actions, Jenkins, Woodpecker). Providing a single-vendor pipeline template would add maintenance burden without saving meaningful integration time. The hsm_validate.py CLI exits 0 on success and non-zero on failure, which is all any CI needs. Integration examples live in the README.
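The exit-code contract a CI job relies on can be sketched in a few lines. The `exit_code` helper and the outcome mapping are hypothetical, and treating an empty run as a failure is a design choice made for this sketch, not something the CLI is known to do.

```python
from typing import Dict


def exit_code(outcomes: Dict[str, bool]) -> int:
    """Map per-test pass/fail results to a process exit code.

    0 means every test passed; 1 means at least one failed.
    An empty run counts as failure so a misconfigured job cannot pass silently.
    """
    if not outcomes:
        return 1
    return 0 if all(outcomes.values()) else 1


all_green = {"message_signing": True, "rule1_without_totp_rejects": True}
one_red = {"message_signing": True, "rule1_without_totp_rejects": False}
```

A CLI entry point would pass the result to `sys.exit(...)`, and any CI system can gate on it without vendor-specific glue.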
