docs: add DEMO.md — full walkthrough against a real production deployment

Adds a demonstration doc showing every harness test mapped to the UI state you should see on a correctly-configured CKBunker + Coldcard HSM. Each screenshot is paired with the specific test that asserts the outcome, plus guidance on what failure at that step means. Sensitive/site-specific identifiers (IPs, domain, device serial, CF tunnel UUID) are generalised so the doc reads as a template for any deployment. 15 screenshots in docs/images/ cover: physical rack installation, policy config UI, message signing end-to-end, sub-threshold auto-sign via web UI and CLI, the critical policy-rejection case, TOTP-authorised signing, and dashboard counter verification.
2026-04-14 11:00:33 +10:00
commit 3489ae6e8f
parent 9d380f5013
17 changed files with 313 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -6,6 +6,8 @@ Runs a short, structured sequence of tests against a live CKBunker and exits non
 > **The critical test**: a transaction above your auto-approve cap is submitted without 2FA. The Coldcard must reject it with a specific `rule #1: need user(s) confirmation` error. If it signs, something is catastrophically wrong with your policy and the harness exits with a loud failure.
 📖 **See [`docs/DEMO.md`](docs/DEMO.md) for a full walkthrough** against a real rack-mounted production deployment, with screenshots of every test showing the expected UI state. Use it as the reference for "what good looks like" before you run the harness on your own CKBunker.
 ---
 ## Table of contents
@@ -438,9 +440,11 @@ parameter from it.
 │   └── README.md                       ← how to generate test PSBTs
 │
 └── docs/
    ├── DEMO.md                         ← full demo against a real production deployment
    ├── PROTOCOL.md                     ← CKBunker WebSocket protocol reference
    ├── WHY.md                          ← design rationale
-    └── POLICY_RECOMMENDATIONS.md       ← how to design a two-tier policy
+    ├── POLICY_RECOMMENDATIONS.md       ← how to design a two-tier policy
    └── images/                         ← screenshots used in DEMO.md
 ```
 ---
--- a/docs/DEMO.md
+++ b/docs/DEMO.md
@@ -0,0 +1,308 @@
 # Demo — validating a real production CKBunker deployment
 This walkthrough shows the harness run against a live, rack-mounted
 CKBunker + Coldcard Mk4 in HSM mode. Every screenshot is from a real
 validation run on production hardware, paired with the exact test in this
 repo that asserts the outcome you see. Use it as a reference for what "good"
 looks like when you run the harness against your own deployment.
 Environment details (IPs, domain names, device serials) have been generalised;
 your values will differ.
 ---
 ## The deployment being validated
 ```
        ┌──────────────────────────────────────┐
        │  Client (laptop / CI runner)         │
        │    python hsm_validate.py            │
        └──────┬───────────────────────────────┘
               │  Tailscale WireGuard overlay
               ▼
        ┌──────────────────────────────────────┐
        │  CKBunker VM                         │
        │    Ubuntu 24.04, Python 3.12         │
        │    ckbunker.service (systemd)        │
        │    hsm.<your-domain>  (CF Tunnel)    │
        │    http://<tailnet-ip>:9823          │
        └──────┬───────────────────────────────┘
               │  USB HID passthrough
               ▼
        ┌──────────────────────────────────────┐
        │  Coldcard Mk4 in HSM mode            │
        │    "<your org> HSM approval" policy  │
        │    Rule #1 / #2 / TOTP enforcement   │
        └──────────────────────────────────────┘
 ```
 **Policy installed on the Coldcard** (abbreviated — yours may differ in
 thresholds):
 ```
 Rule #1: ≤ 0.001 BTC/txn, ≤ 0.005 BTC/period  (needs TOTP from user "operator")
 Rule #2: ≤ 0.0001 BTC/txn, ≤ 0.0005 BTC/period (auto-approved)
 Velocity period: 1440 min (24 h)
 Message signing: any path allowed
 MicroSD logging: on
 Boot to HSM: on (6-digit escape code)
 ```
 ---
 ## Physical setup
 The Coldcard Mk4 is rack-mounted and USB-attached to the Proxmox host
 that runs the CKBunker VM. It stays in HSM mode continuously; the keypad
 is the only channel for policy changes.
 <figure>
 <img src="images/14-production-rack.jpg" alt="Production rack — CKBunker HSM installation">
 <figcaption><em>Production rack view. The Coldcard Mk4 is installed in the lower shelf of the stack, USB-tethered to the host running the CKBunker VM. USB passthrough on the hypervisor is configured by vendor/product ID (<code>d13e:cc10</code>) so the device survives VM restarts.</em></figcaption>
 </figure>
 <figure>
 <img src="images/13-coldcard-installed-closeup.jpg" alt="Coldcard Mk4 installed — front panel">
 <figcaption><em>Coldcard Mk4 front panel in HSM mode. The keypad is the only path to change policy. Nothing the harness does — and nothing any remote attacker can do — affects what's shown here.</em></figcaption>
 </figure>
 <figure>
 <img src="images/15-coldcard-ports.jpg" alt="Coldcard rear — USB tether and ports">
 <figcaption><em>Rear view showing the USB tether. Once the policy is loaded and Boot-to-HSM is enabled, the only way back to the main menu is the 6-digit escape code entered within 60 seconds of power-on.</em></figcaption>
 </figure>
 ---
 ## The policy — configured once, enforced forever
 The Coldcard's policy is loaded on-device via keypad + MicroSD. The
 CKBunker web UI lets you *author* the policy file before it gets signed
 and shipped to the Coldcard, but **it cannot modify an already-installed
 policy over the wire**.
 <figure>
 <img src="images/07-policy-bunker-setup.png" alt="Bunker Setup — Other Policy">
 <figcaption><em>Bunker Setup → Other Policy. The 6-digit "Boot To HSM" escape code is the only secret that can take the Coldcard out of HSM mode once the policy is live. The free-form approval note shows on the Coldcard screen when signing, providing a human-readable identifier for the active policy.</em></figcaption>
 </figure>
 The harness reads your expected thresholds from `config.yaml` and asserts
 every outcome against them. Your policy shape can differ from the example
 — adjust `policy.auto_approve.per_txn_sats` etc. to match what you
 actually installed.
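
As a rough illustration of how the example thresholds above translate into expected outcomes, the sketch below converts the BTC caps to satoshis and classifies an amount by tier. `Tier`, `fits`, and `expected_outcome` are hypothetical names, not part of the harness, and the single running `spent_sats` total is a simplification; the Coldcard's own velocity accounting may differ.

```python
from dataclasses import dataclass
from decimal import Decimal

SATS_PER_BTC = 100_000_000


def btc_to_sats(btc: str) -> int:
    """Convert a BTC amount given as a string to satoshis without float error."""
    return int(Decimal(btc) * SATS_PER_BTC)


@dataclass(frozen=True)
class Tier:
    per_txn_sats: int
    per_period_sats: int


# Example thresholds copied from the policy summary in this document.
AUTO_APPROVE = Tier(btc_to_sats("0.0001"), btc_to_sats("0.0005"))  # Rule #2
USER_AUTH = Tier(btc_to_sats("0.001"), btc_to_sats("0.005"))       # Rule #1


def fits(tier: Tier, amount_sats: int, spent_sats: int) -> bool:
    """True if the amount is within both the per-txn and per-period caps."""
    return (amount_sats <= tier.per_txn_sats
            and spent_sats + amount_sats <= tier.per_period_sats)


def expected_outcome(amount_sats: int, spent_sats: int = 0,
                     has_totp: bool = False) -> str:
    """Return "sign" or "reject" for a simplified two-tier policy."""
    if fits(AUTO_APPROVE, amount_sats, spent_sats):
        return "sign"
    if fits(USER_AUTH, amount_sats, spent_sats):
        return "sign" if has_totp else "reject"
    return "reject"
```

With these example values, a 9,000-sat PSBT is expected to sign unattended, while a 100,000-sat PSBT is expected to sign only when a TOTP code accompanies it.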
 ---
 ## Test 1 — `connectivity`
 The cheapest check: HTTP reachable, WebSocket URL extractable from the
 page, session cookie obtained.
 ```bash
 ./hsm_validate.py --tests connectivity
 ```
 No UI screenshot — this happens before any user-visible action. On success
 you'll see:
 ```
 ✓ connectivity    HTTP + WS endpoint reachable  (0.3s)
    WebSocket URL: ws://<bunker>:9823/websocket/CBG5KH5BCCG6W3BXDH5QQY5Q
    Session cookies: yes
 ```
 If **Session cookies: none** appears, you're most likely hitting a
 CF-Access-protected URL without a service token — auth will fail on the
 WebSocket upgrade. Switch `CKBUNKER_URL` to your private ingress.
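
The "WebSocket URL extractable from the page" step can be pictured with a minimal sketch like the one below. It assumes the landing page embeds a path of the form `/websocket/<token>`, as the sample output suggests; the regex and the `extract_ws_url` name are illustrative and may not match how the harness actually does it.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Assumption: the token is made of uppercase letters and digits,
# as in the sample output shown in this document.
WS_PATH_RE = re.compile(r"/websocket/[A-Z0-9]+")


def extract_ws_url(base_url: str, html: str) -> Optional[str]:
    """Build a ws:// or wss:// URL from the first /websocket/<token> path in html."""
    match = WS_PATH_RE.search(html)
    if match is None:
        return None
    parts = urlsplit(base_url)
    # Mirror the page scheme: https pages upgrade to wss, http pages to ws.
    scheme = "wss" if parts.scheme == "https" else "ws"
    return f"{scheme}://{parts.netloc}{match.group(0)}"
```

Returning `None` rather than raising lets the caller report a clean "endpoint not found" result instead of a stack trace.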
 ---
 ## Test 2 — `message_signing`
 CKBunker can sign an arbitrary text message with a key derived from the
 Coldcard seed. The server never sees the key; it forwards the message to
 the Coldcard and returns the signature.
 <figure>
 <img src="images/02-message-signing-ui.png" alt="CK Bunker — Text Message Signing">
 <figcaption><em>Tools → Text Message Signing on the CKBunker UI. Derivation path <code>m/84'/0'/0'/1</code>, segwit (bech32) address. The "Sign Message" button triggers the same WebSocket action the harness invokes programmatically.</em></figcaption>
 </figure>
 The harness verifies the returned signature by sending it back through a
 wallet (Sparrow in this example) to confirm it validates against the
 expected address:
 <figure>
 <img src="images/01-message-signed-verified.png" alt="Sparrow — Verification Succeeded">
 <figcaption><em>Verification succeeded in Sparrow for a CKBunker-produced signature. If this fails, either the Coldcard isn't the device you think it is, or the derivation path in the harness config doesn't match the wallet you're verifying against.</em></figcaption>
 </figure>
 Why this test is cheap and valuable: it doesn't need a UTXO, doesn't
 affect spending counters, and catches most "the Coldcard is
 detached" or "wrong Coldcard" problems in about a second.
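
For context on what a wallet checks during verification: the standard Bitcoin signed-message scheme signs a double-SHA256 digest of a magic prefix plus the message. The sketch below computes that digest with the standard library only. The function names are illustrative, and the ECDSA public-key recovery that maps a signature back to an address needs a secp256k1 library and is not shown.

```python
import hashlib

MAGIC = b"Bitcoin Signed Message:\n"


def compact_size(n: int) -> bytes:
    """Encode a length as a Bitcoin CompactSize (variable-length) integer."""
    if n < 0xFD:
        return bytes([n])
    if n <= 0xFFFF:
        return b"\xfd" + n.to_bytes(2, "little")
    if n <= 0xFFFFFFFF:
        return b"\xfe" + n.to_bytes(4, "little")
    return b"\xff" + n.to_bytes(8, "little")


def signed_message_digest(message: str) -> bytes:
    """Return the 32-byte digest that a message signature commits to."""
    body = message.encode("utf-8")
    payload = (compact_size(len(MAGIC)) + MAGIC
               + compact_size(len(body)) + body)
    return hashlib.sha256(hashlib.sha256(payload).digest()).digest()
```

Because the digest depends only on the message text, a verification failure points at the key or derivation path rather than at the hashing step.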
 ---
 ## Test 3 — `rule2_auto_approve`
 A sub-threshold PSBT (under your Rule #2 per-txn cap) should sign with **no TOTP**.
 ### Via the web UI
 <figure>
 <img src="images/05-tx-small-signing.png" alt="CK Bunker — small tx signing page">
 <figcaption><em>Signing page for a 9,000-sat PSBT (under a 10,000-sat Rule #2 cap). The Transaction Preview expands the PSBT; "Authorizing User" / "One-Time Code" fields are left empty because Rule #2 does not require them. The policy summary at the bottom is always visible so operators can verify against what's displayed.</em></figcaption>
 </figure>
 <figure>
 <img src="images/06-tx-small-success.png" alt="CK Bunker — Transaction signed">
 <figcaption><em>Coldcard approved and signed without any human interaction. Approvals counter ticks up; Amount Spent accumulates against the 24-hour velocity budget. The signature came back under a second later.</em></figcaption>
 </figure>
 ### Via the harness
 ```bash
 ./hsm_validate.py --tests rule2_auto_approve
 ```
 The harness uses the identical WebSocket protocol the browser uses:
 <figure>
 <img src="images/08-cli-sign-small.png" alt="CLI — cksign small transaction">
 <figcaption><em>Terminal output from the harness client signing a sub-threshold PSBT. It opens a WebSocket, uploads the PSBT, waits for the Coldcard to evaluate policy, and writes the signed PSBT. No TOTP prompt because Rule #2 does not require one.</em></figcaption>
 </figure>
 To check the output is actually valid, load it in a wallet:
 <figure>
 <img src="images/09-tx-small-broadcast-ready.png" alt="Sparrow — signed small PSBT ready to broadcast">
 <figcaption><em>The resulting signed PSBT loaded into Sparrow: "Pay 9,000 sats", the Coldcard signature row is fully filled, Broadcast button is live. End-to-end: harness → CKBunker → Coldcard → signed PSBT → wallet → (would be) broadcast. The harness itself never broadcasts; that's the operator's choice.</em></figcaption>
 </figure>
 Don't broadcast these test PSBTs during a validation run: broadcasting
 spends the UTXO and invalidates the fixture. While the UTXO stays unspent
 in your watch-only wallet, you can re-use the same `small.psbt` fixture
 across runs.
 ---
 ## Test 4 — `rule1_without_totp_rejects` — **the critical assertion**
 The single most important test in this harness. A PSBT over your Rule #2
 cap is submitted **without** a TOTP code. The Coldcard must reject it.
 <figure>
 <img src="images/03-tx-large-unsigned.png" alt="Sparrow — unsigned 100,000 sat PSBT">
 <figcaption><em>An unsigned 100,000-sat test PSBT (0.001 BTC) — above the 10,000-sat auto-approve cap but within the 100,000-sat user-auth cap. A correctly-configured policy should <strong>refuse</strong> to sign this without TOTP.</em></figcaption>
 </figure>
 <figure>
 <img src="images/04-tx-large-rejected.png" alt="CK Bunker — Failed: rejected by Coldcard">
 <figcaption><em>The Coldcard responds: <strong>"Rejected: rule #1: need user(s) confirmation, rule #2: would exceed period spending"</strong>. The CKBunker VM had <strong>no power to override this</strong> — the rejection comes from the Coldcard's policy engine. The Refusals counter increments.</em></figcaption>
 </figure>
 The harness asserts not just "some rejection happened" but **that the
 reason contains "rule #1"**:
 ```python
 # tests/test_04_rule1_without_totp_rejects.py
 # Pass only if the Coldcard refused AND the refusal reason cites rule #1.
 assert res.is_expected_rejection("rule #1"), (
     "expected a 'rule #1: need user(s) confirmation' rejection, "
     f"got status={res.status.value} reason={res.reason!r}"
 )
 ```
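
To make the assertion above concrete, here is a minimal sketch of what a result object with an `is_expected_rejection` method could look like. The `Status` enum values and the `SignResult` fields are assumptions for illustration; the real class in this repo may be shaped differently.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    SIGNED = "signed"
    REJECTED = "rejected"
    ERROR = "error"


@dataclass
class SignResult:
    status: Status
    reason: str = ""

    def is_expected_rejection(self, needle: str) -> bool:
        """True only for a rejection whose reason mentions `needle`."""
        return (self.status is Status.REJECTED
                and needle.lower() in self.reason.lower())


# A signed result must never count as an expected rejection,
# even if its reason text happens to mention the rule.
signed = SignResult(Status.SIGNED, "rule #1 satisfied")
print(signed.is_expected_rejection("rule #1"))
```

Checking the status and the reason together is what separates "the policy refused for the right reason" from "something else went wrong and the request happened to fail".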
 ### What failure looks like
 If this test **fails** because the Coldcard signed an above-threshold PSBT
 without TOTP, your policy is broken. The harness explicitly flags this
 case:
 ```
 ✗ rule1_without_totp_rejects    policy NOT enforced: large PSBT was signed
                                without TOTP — STOP AND INVESTIGATE
 ```
 Action: exit HSM mode via the escape code and re-install the policy.
 ---
 ## Test 5 — `rule1_with_totp_signs`
 The same large PSBT. A fresh TOTP code. Should sign cleanly.
 <figure>
 <img src="images/10-cli-sign-totp.png" alt="CLI — cksign with TOTP">
 <figcaption><em>Terminal output from the harness: the <code>--totp</code> flag auto-generates a 6-digit code from the stored <code>TOTP_SECRET</code> (shown here as <code>579322</code>, valid for 6 seconds). The client submits the code as a user authorisation, then uploads the PSBT. The Coldcard verifies TOTP against its seeded secret, sees Rule #1 is satisfied, and signs.</em></figcaption>
 </figure>
 <figure>
 <img src="images/11-tx-large-broadcast-ready.png" alt="Sparrow — signed large PSBT ready to broadcast">
 <figcaption><em>The signed large PSBT in Sparrow — same 100k-sat transaction that was rejected in test 4, now with a valid Coldcard signature. The only difference: a 6-digit code held exclusively by the authorised user. The key and policy never moved.</em></figcaption>
 </figure>
 If this test fails with a `bad TOTP code` reason:
 - Your Mac / runner clock is out of sync. TOTP has a 30-second window;
  check `ntpdate -q time.apple.com` or equivalent.
 - Your `TOTP_SECRET` env var is stale (TOTP was rotated on the Coldcard
  but the secret on disk wasn't updated).
 - The user name in your config doesn't match the user named in the
  policy's Rule #1.
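
When debugging the clock and secret issues listed above, it can help to compute a code independently. The sketch below is a plain RFC 6238 TOTP (HMAC-SHA1, 30-second step, 6 digits) using only the standard library. That the Coldcard user is enrolled with these default parameters is an assumption, and `totp` and `seconds_remaining` are illustrative names, not harness functions.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret_b32: str, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code for a base32-encoded secret."""
    now = time.time() if at is None else at
    counter = int(now // step)
    # Base32 secrets are often stored without padding; restore it.
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation as defined in RFC 4226.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def seconds_remaining(at: Optional[float] = None, step: int = 30) -> int:
    """Seconds left before the current code rolls over."""
    now = time.time() if at is None else at
    return step - int(now % step)
```

If a code generated this way on the runner differs from the one your authenticator app shows at the same moment, suspect the clock or the stored secret before suspecting the policy.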
 ---
 ## Test 6 — `counters_tracked`
 A sanity check that the **server-visible counters** moved as expected.
 This catches the unlikely case where the harness thinks a sign happened
 but the CKBunker / Coldcard don't agree.
 <figure>
 <img src="images/12-dashboard-counters.png" alt="CK Bunker — dashboard counters after demo">
 <figcaption><em>Dashboard after the full harness run: <strong>4 Approvals</strong> (message sign, small PSBT via UI, small PSBT via CLI, large PSBT via CLI+TOTP), <strong>1 Refusal</strong> (the large PSBT attempted without TOTP), <strong>0.00218 BTC</strong> cumulative "amount spent" in the current velocity window. The refusal is the smoking-gun proof that the policy is active.</em></figcaption>
 </figure>
 The harness snapshots the counters before and after running the signing
 tests, computes the deltas, and asserts they match the number of
 approvals/rejections it saw in its own results. If the numbers agree,
 everything from the WebSocket to the device-visible state is consistent.
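
The before/after comparison described above can be sketched as follows. The `Counters` dataclass and both function names are hypothetical, and the sketch skips the dashboard scraping itself and assumes the two snapshots are already parsed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counters:
    approvals: int
    refusals: int


def counter_delta(before: Counters, after: Counters) -> Counters:
    """Difference between two dashboard snapshots."""
    return Counters(after.approvals - before.approvals,
                    after.refusals - before.refusals)


def counters_consistent(before: Counters, after: Counters,
                        seen_approvals: int, seen_refusals: int) -> bool:
    """True if the dashboard moved by exactly what the harness observed."""
    delta = counter_delta(before, after)
    return (delta.approvals == seen_approvals
            and delta.refusals == seen_refusals)
```

Comparing deltas rather than absolute values means the check still works when the dashboard already shows activity from earlier runs.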
 If this test skips with "could not parse dashboard counters on this
 CKBunker version", the scraper regex didn't find the numbers in your
 CKBunker's HTML. The signing tests already proved correctness — file an
 issue with your CKBunker page source if you'd like regex support.
 ---
 ## The end-to-end picture
 All six tests tell you, in one short run, whether the whole trust model
 is intact:
 | Layer validated                              | Tests that cover it    |
 |----------------------------------------------|------------------------|
 | Network / Tailscale / Cloudflare reachability| 1                      |
 | CKBunker service running, WS protocol intact | 1, 2, 3, 4, 5          |
 | Coldcard reachable, USB passthrough live     | 2                      |
 | Coldcard policy loaded, Rule #2 path active  | 3                      |
 | **Coldcard policy Rule #1 gate enforced**    | **4**                  |
 | TOTP secret in sync between device + holder  | 5                      |
 | Server state tracks device decisions         | 6                      |
 A green run is a strong signal that the HSM is doing its job. A red run
 on test 4 is the kind of finding you'd want to wake up for.
 ---
 ## Running this yourself
 Every capability shown here maps to one test in [the test suite](../tests/).
 To reproduce on your own deployment:
 1. Follow the setup in the top-level [README](../README.md).
 2. Adjust `config.yaml` to match your policy's per-txn and per-period caps.
 3. Craft two PSBTs per [fixtures/README.md](../fixtures/README.md) —
   one below your Rule #2 cap, one above it.
 4. Run `./hsm_validate.py`.
 A passing run should match the flow in this document. A failing run
 should tell you exactly which layer of the HSM contract is broken.
--- a/docs/images/01-message-signed-verified.png
+++ b/docs/images/01-message-signed-verified.png
--- a/docs/images/02-message-signing-ui.png
+++ b/docs/images/02-message-signing-ui.png
--- a/docs/images/03-tx-large-unsigned.png
+++ b/docs/images/03-tx-large-unsigned.png
--- a/docs/images/04-tx-large-rejected.png
+++ b/docs/images/04-tx-large-rejected.png
--- a/docs/images/05-tx-small-signing.png
+++ b/docs/images/05-tx-small-signing.png
--- a/docs/images/06-tx-small-success.png
+++ b/docs/images/06-tx-small-success.png
--- a/docs/images/07-policy-bunker-setup.png
+++ b/docs/images/07-policy-bunker-setup.png
--- a/docs/images/08-cli-sign-small.png
+++ b/docs/images/08-cli-sign-small.png
--- a/docs/images/09-tx-small-broadcast-ready.png
+++ b/docs/images/09-tx-small-broadcast-ready.png
--- a/docs/images/10-cli-sign-totp.png
+++ b/docs/images/10-cli-sign-totp.png
--- a/docs/images/11-tx-large-broadcast-ready.png
+++ b/docs/images/11-tx-large-broadcast-ready.png
--- a/docs/images/12-dashboard-counters.png
+++ b/docs/images/12-dashboard-counters.png
--- a/docs/images/13-coldcard-installed-closeup.jpg
+++ b/docs/images/13-coldcard-installed-closeup.jpg
--- a/docs/images/14-production-rack.jpg
+++ b/docs/images/14-production-rack.jpg
--- a/docs/images/15-coldcard-ports.jpg
+++ b/docs/images/15-coldcard-ports.jpg