multisig-hsm/docs/OPERATOR-MANUAL.md
mineracks 2b14ab6d57 Point demo links to canonical multisighsm.com
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 14:23:42 +10:00

Operator's Manual — mineracks distributed policy-enforced multisig HSM

A 2-of-3 threshold HSM that auto-signs cold→hot refills under on-device policy, with no human in the loop.

This manual is everything you need to set the system up and operate it safely. Read §1–§3 before you touch anything: the safety model is not optional context, it is the reason the design is shaped the way it is, and operating the system without understanding it will get funds lost.

Status: A live reference deployment runs at multisighsm.com on real Bitcoin signet; every spend is a genuine on-chain 2-of-3 co-signature enforced by on-device policy. This manual documents both the reference configuration ([REF]) and the production hardware procedure ([PROD]).
Audience: The operator running the treasury / refill tier (you), and a reviewer evaluating it.
Companion docs: Design & market rationale: README.md · live demo: multisighsm.com
Scope: The cold/warm tier and the cold→hot refill pipe (low-throughput, high-stakes). NOT your high-TPS hot-wallet signer (that's MPC's job; see §1.2).

1. Read this first — the security model

1.1 What the system is

Three independent Coldcard hardware signers, each in HSM mode under its own spending policy, each on a physically separate host (ideally one genuinely offsite). A keyless coordinator watches your hot wallet, builds a refill transaction when it runs low, fans the unsigned PSBT to any two of the three signers, they auto-sign under their on-device policy with no human, and the coordinator combines, finalizes and broadcasts. Any 2 of 3 can move funds; lose any one signer and you keep operating; lose two and funds are frozen safe.

1.2 What it is NOT

It is not a hot-wallet signing engine. Coldcard signing takes seconds per PSBT and is not built for the hundreds to thousands of signatures/hour a busy exchange hot wallet does, and it carries none of the FIPS/PKCS#11 certification an auditor or insurer will expect for the primary hot engine. Keep using an MPC platform for the hot wallet. This system secures the 95% of reserves behind it and the refill pipe between cold and hot, where automation with hardware keys you physically hold beats both a manual 3am ceremony and a custodian.

1.3 The three enforcement layers (and which one you actually trust)

| Layer | Where it runs | Enforces | Trust property |
|---|---|---|---|
| On-device policy | Each Coldcard secure element | per-txn cap · velocity · address whitelist · message-signing paths | Tamper-proof. This is the safety floor. A compromised host cannot lift it. |
| Coordinator global velocity cap | The keyless coordinator (software) | the authoritative total-per-period across all signers | Operational, not safety. Precise day-to-day limit; bypassable if the coordinator is compromised, but bounded by the layer above. |
| Quorum (2-of-3) | The protocol | no single signer can move funds; no single signer outage freezes funds | Structural. |

The golden rule: the coordinator may limit, but only the hardware may bound. Size the on-device limits as your real safety envelope; treat the coordinator cap as a tighter operational convenience.

1.4 What a compromise can and cannot do

Coordinator fully compromised (worst realistic software breach):

  • It holds no keys → cannot sign or forge. Every spend still needs two Coldcards to each pass their own on-device whitelist + cap + velocity.
  • It cannot redirect funds off the whitelist → at worst it prematurely refills your own hot-wallet addresses, never an attacker's address.
  • It cannot exceed the devices' velocity ceilings. That ceiling is the true catastrophic bound.
  • Blast radius = 1.5 × the per-device velocity ceiling V (with any-2-of-3): every spend burns two signatures for one unit of value, so 3 devices × V ÷ 2 sigs = 1.5V extractable before the hardware freezes it. Size accordingly — see §6.3.
  • Residual: a compromised coordinator plus a compromised hot wallet could drain up to 1.5V into the (whitelisted) hot wallet and out. That's why V lives on the secure element, not in software.
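The 1.5V figure can be checked with a short simulation. This is a minimal sketch, assuming three devices that each start the window with the same budget V and an attacker who always signs with the two devices that have the most budget left. The function name and the unit spend size are illustrative and not part of the reference coordinator.

```python
def simulate_worst_case(v: int, spend: int = 1) -> int:
    """Total value an attacker can move before fewer than two devices can sign.

    Every spend is charged in full against BOTH signing devices, so each
    unit of value moved consumes two units of the combined 3*V budget.
    """
    remaining = [v, v, v]          # per-device velocity budget left in the window
    extracted = 0
    while True:
        remaining.sort(reverse=True)
        # The second-richest device is the binding constraint for a 2-of-3 spend.
        if remaining[1] < spend:
            break
        remaining[0] -= spend
        remaining[1] -= spend
        extracted += spend
    return extracted


if __name__ == "__main__":
    print(simulate_worst_case(100))
```

With V = 100 the loop stops once all three budgets are exhausted, having moved 150 units, which is the 3V ÷ 2 bound quoted above.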

One or more signer hosts compromised:

  • Owning one signer host moves nothing (need two; the device still enforces its own policy and holds the key in its secure element — the host can't extract it).
  • Owning two signer hosts: an attacker can produce two signatures, but each device still enforces its policy — so spends are still bounded by per-txn cap + velocity and confined to the whitelist. To steal, an attacker would need to defeat two independent secure elements' policies, not two Linux hosts.

Coordinator offline: fail-safe for safety. No coordinator → no PSBT is built → no spend → the cap trivially holds. You lose the ability to refill (a liveness gap, §1.5), never control of funds.

1.5 No single point of failure — and the one you must engineer around

  • Keys: 2-of-3 across independent failure domains. Survives losing any one signer.
  • Global velocity counter: derived from the blockchain, not a single-host file (see §6.2). Any coordinator replica on any host recomputes the same number from chain history → no single ledger to lose or tamper.
  • Coordinator liveness: a single coordinator is a liveness SPOF (if it's down you can't refill). Run it replicated across the same independent hosts as the signers so any replica can drive a refill. Because the counter is chain-derived and the cap is bounded by hardware, replicas need no shared state and a rogue replica can't exceed the hardware bound.

2. Architecture & topology

        hot wallet (MPC / your existing engine)  ── monitored ──┐
                      ▲                                          │
                      │ signed refill broadcast                  │ "balance below floor"
                      │                                          ▼
   ┌──────────────────┴───────────────────────────────────────────────┐
   │  KEYLESS COORDINATOR  (replicated; holds NO keys)                  │
   │   • watches hot-wallet floor   • builds PSBT from watch-only wallet│
   │   • global velocity cap (chain-derived)  • fans to any 2 of 3      │
   │   • combines → finalizes → broadcasts                              │
   └───────┬───────────────────┬───────────────────┬──────────────────-┘
           │ tailnet           │ tailnet           │ tailnet
   ┌───────▼──────┐    ┌───────▼──────┐    ┌────────▼─────┐
   │ signer host A│    │ signer host B│    │ signer host C│   ← independent failure domains
   │ Coldcard +   │    │ Coldcard +   │    │ Coldcard +   │     (power / switch / site)
   │ signer-agent │    │ signer-agent │    │ signer-agent │     one ideally OFFSITE
   │ HSM policy A │    │ HSM policy B │    │ HSM policy C │
   └──────────────┘    └──────────────┘    └──────────────┘
           keys live ONLY on the secure elements; agents hold none
                              │
                   watch-only 2-of-3 descriptor wallet
                       on a Bitcoin full node

Components:

  • Signer host (×3) — a small machine (NUC/Pi/VM) with a USB-attached Coldcard, running a signer agent: a thin authenticated tailnet service wrapping ckcc-protocol. Receives {psbt, wallet_id}, ensures its device is HSM-started with the 2-of-3 registered, signs, returns partial_psbt or denied(reason). Holds no keys. [REF] the reference deployment runs all three as segregated --mk5 --hsm Coldcard signers, each on its own unix socket.
  • Coordinator — builds PSBTs, enforces the global cap, fans out 2-of-3, combines/broadcasts, exposes the control surface. [REF] orchestrator.py (multisig-orchestrator, :8099).
  • Watch-only wallet — a bitcoind descriptor wallet tracking the 2-of-3 descriptor. Reuse, don't build. [REF] a watch-only wallet on a signet node.
  • Bitcoin node — provides the watch-only wallet, builds PSBTs, broadcasts, and is the source of truth for the velocity counter. [REF] a signet node (RPC over the tailnet).

3. Failure-domain placement (the make-or-break)

This is the single most important deployment decision, and the reasoning is concrete: nominally independent hosts can still fail together — a shared power feed or UPS, a common chassis or drive batch, the same hypervisor or network switch, or correlated hardware faults. If two of three keys sit in the same failure domain, a single event can take both down and freeze the treasury. Therefore:

  • The three signers MUST sit in independent failure domains — different physical hosts, ideally different power circuits / UPS / network switches.
  • At least one signer should be genuinely offsite (a small physical box over Tailscale). A cloud VPS cannot host a USB Coldcard, but a Pi/NUC at a second location can be signer #3.
  • Spread the coordinator replicas across those same domains.
  • Do not co-locate two signers on hosts that share a single point of failure (same PSU, same switch, same rack PDU, same hypervisor). Quorum HA is worthless if one event takes out two keys.

[REF] note: the reference deployment runs all three signers on one host for convenience — that has no failure-domain independence and is for functional validation. Production uses three independent hosts (above); never hold mainnet value on a single host that runs more than one signer.


4. Prerequisites & bill of materials

[PROD] hardware:

  • 3× Coldcard Mk5 (or Q) — the current device; dual secure elements (ATECC608 + DS28C36B). HSM mode + multisig (P2SH/P2WSH) co-signing are supported in firmware.
  • 3× signer hosts in independent failure domains (one offsite), each with a free USB port.
  • Steel backups for 3 seeds + a durable record of the wallet descriptor.

Software / services:

  • A Bitcoin full node (watch-only-capable; descriptor wallets). Mainnet for production; signet for the lab.
  • Tailscale on every signer host + coordinator (signer agents are RPC clients — they don't bind the tailnet IP, so the bind-race gotcha doesn't apply).
  • ckcc-protocol (the ckcc CLI / Python lib) on each signer host.
  • The coordinator + signer-agent software (orchestrator.py is the reference coordinator).
  • (Optional) CKBunker as a human break-glass UI, kept off the automated critical path.

Skills: comfortable with bitcoind RPC, descriptors/PSBT, Coldcard HSM mode, and Linux service ops.


5. Initial setup

⚠️ Do a complete dry-run on signet or testnet first (the lab does exactly this). Only move to mainnet once you have rehearsed a refill, a failover, a policy change, and a full restore from backup.

Step 1 — Generate three independent seeds

Generate a distinct seed on each Coldcard (never clone one seed to three devices — that defeats the whole model). Record each 24-word seed to steel and store the three geographically separated. Note each device's master fingerprint + the BIP-48 account xpub (m/48'/0'/0'/2' mainnet; m/48'/1'/0'/2' signet).

Step 2 — Build the 2-of-3 descriptor + watch-only wallet

Construct wsh(sortedmulti(2, key1, key2, key3)) from the three [fingerprint/48h/0h/0h/2h]xpub keys (receive /0/* and change /1/* branches). Create a watch-only descriptor wallet on the node and import both branches (importdescriptors, private keys disabled, internal=true for the change branch). Record the descriptor durably — see §9, losing it makes recovery painful even with all three seeds.
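As an illustration of the descriptor shape described in this step, the sketch below assembles the receive and change descriptor strings from three cosigner records. The `Cosigner` type and `multisig_descriptors` helper are hypothetical names, and the xpub values in the usage line are placeholders. The strings carry no checksum; one way to obtain it is to pass each string through bitcoind's `getdescriptorinfo` before `importdescriptors`.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Cosigner:
    fingerprint: str  # master fingerprint, 8 hex characters
    xpub: str         # account xpub exported at the BIP-48 path


def multisig_descriptors(
    cosigners: Sequence[Cosigner], threshold: int = 2, coin_type: int = 0
) -> Tuple[str, str]:
    """Return (receive, change) wsh(sortedmulti(...)) descriptors.

    coin_type is 0 for mainnet and 1 for signet/testnet, matching the
    m/48'/0'/0'/2' and m/48'/1'/0'/2' paths noted in Step 1.
    """
    origin = f"48h/{coin_type}h/0h/2h"

    def build(branch: int) -> str:
        keys = ",".join(
            f"[{c.fingerprint.lower()}/{origin}]{c.xpub}/{branch}/*"
            for c in cosigners
        )
        return f"wsh(sortedmulti({threshold},{keys}))"

    return build(0), build(1)  # branch 0 = receive, branch 1 = change


if __name__ == "__main__":
    demo = [Cosigner("AABBCCDD", "xpubA"), Cosigner("11223344", "xpubB"),
            Cosigner("55667788", "xpubC")]
    for descriptor in multisig_descriptors(demo):
        print(descriptor)
```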

Step 3 — Register the 2-of-3 wallet on EACH Coldcard

Each device must have the multisig wallet registered so it recognises change back to the wallet as internal (and doesn't mistake it for an external send that the whitelist/velocity would block). Export the Coldcard multisig config and ckcc upload -m <wallet.txt> to each device (confirm on-device).

Step 4 — Author the HSM policy (per device)

Policy is JSON, versioned in this repo, deployed to each device. Minimum viable policy (see §6 for the full reference and the diverse-policy option):

{
  "must_log": true,
  "period": 60,
  "msg_paths": ["any"],
  "rules": [
    { "max_amount": 8000,
      "per_period": 33333,
      "wallet": "ckms23",
      "whitelist": ["<hot-wallet deposit address 1>", "<...>"] }
  ]
}

(per_period here is the per-device ceiling V; see §6.3 for why V = ⅔ × global cap.)
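Because the policy is versioned JSON, it can be sanity-checked before it is pushed to a device. The sketch below is a hypothetical pre-deployment lint for the minimum viable policy shape shown above. The checks are house rules assumed for illustration (they are not Coldcard firmware validation), and `lint_policy` is not part of the reference tooling.

```python
def lint_policy(policy: dict) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if policy.get("must_log") is not True:
        problems.append("must_log should be true so every decision is logged")
    period = policy.get("period")
    if not isinstance(period, int) or period <= 0:
        problems.append("period must be a positive integer")
    rules = policy.get("rules") or []
    if not rules:
        problems.append("policy has no rules")
    for i, rule in enumerate(rules):
        if not rule.get("wallet"):
            problems.append(f"rule {i}: wallet missing, change would trip the whitelist")
        whitelist = rule.get("whitelist") or []
        if not whitelist:
            problems.append(f"rule {i}: whitelist is empty")
        if any(addr.startswith("<") for addr in whitelist):
            problems.append(f"rule {i}: whitelist still holds a template placeholder")
        cap, ceiling = rule.get("max_amount"), rule.get("per_period")
        if cap is None or ceiling is None:
            problems.append(f"rule {i}: max_amount and per_period are both required")
        elif cap > ceiling:
            problems.append(f"rule {i}: max_amount exceeds per_period")
    return problems
```

Running it against the template as printed would flag the placeholder whitelist entries, which is the point: the template is not deployable until real hot-wallet addresses are filled in.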

Step 4b — Enrol the TOTP "owner" (only if you use the surge tier)

If your policy has a TOTP-gated surge tier (§6.2b — a rule with users:["owner"], min_users:1), enrol the owner TOTP user on each Coldcard before loading the policy. All three signers must hold the same secret (one authenticator code has to work for whichever two devices sign), so the device-picks-its-own-QR path does not apply here — a shared secret has to be generated once and loaded onto all three.

🔑 Who generates it matters (production vs demo). Production: the owner generates the secret (in their authenticator app, or offline) and loads it onto each Coldcard directly over USB during setup, never through the coordinator. A fully-compromised coordinator then can't mint surge codes; at spend time it only relays the owner's live 6-digit code (it never needs the secret). Demo only: the coordinator generates the secret and shows the enrolment QR for convenience. That is acceptable on signet with no real funds, but it raises the coordinator-compromise blast radius from the tier-1 ceiling up to the surge ceiling, so don't do it in prod.

Mechanically: create_user("owner", USER_AUTH_TOTP, <shared-secret>) on each device (standard RFC-6238, so any authenticator app works). The secret must persist so re-arming a rebooted signer re-enrols the same value.
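For reference, this is what a standard RFC-6238 authenticator computes from the shared secret. It is a minimal sketch using only the Python standard library, assuming the usual authenticator defaults (base32 secret, HMAC-SHA1, 30-second step, 6 digits). It is shown to make the "any authenticator app works" point concrete; it belongs on the owner's side only and should never run on the coordinator.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def totp(secret_b32: str, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """RFC-6238 TOTP code for a base32 secret at Unix time `at` (default: now)."""
    padded = secret_b32.upper() + "=" * (-len(secret_b32) % 8)
    key = base64.b32decode(padded)
    counter = int((time.time() if at is None else at) // step)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


if __name__ == "__main__":
    # RFC 6238 Appendix B test secret, base32-encoded.
    rfc_secret = base64.b32encode(b"12345678901234567890").decode()
    print(totp(rfc_secret, at=59))
```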

Step 5 — Load policy + start HSM mode (MIND THE ORDERING)

🪤 Sharp edge (ckbunker issue #12): loading an HSM policy can delete a registered multisig wallet on the Coldcard. Order: register the multisig wallet (Step 3) → enrol the TOTP user (Step 4b, if any) → load the policy → verify the wallet still exists on the device. Re-check hsm_status and the registered-wallet count after every policy load.

Load the policy and enter HSM mode on each device (the on-device approval is two-step: confirm, then a random digit to save). HSM mode is a one-way trip until reboot — design for it.

Step 6 — Configure the coordinator

  • Global velocity cap G — the authoritative total-per-period (chain-derived; §6.2).
  • Per-device ceiling V = ⅔ × G in each device policy (§6.3), and enable round-robin of the signer pair so rotation stays even.
  • Whitelist = your hot-wallet deposit addresses (anonymous/untrusted callers can never add to it).
  • Refill trigger — the hot-wallet floor that starts a refill, and the refill amount.
  • The coordinator's HMAC session secret lives only on the host (chmod 600, never in git).

Step 7 — Verify before funding

Quorum check: ≥2 of 3 signers online and HSM-active and wallet-registered. Then, on signet/testnet: run a within-policy refill (expect sign+broadcast), an over-cap spend (expect on-device refusal), an off-list spend (expect refusal), and a velocity-exceeding burst (expect the coordinator to block at G). Confirm the chain-derived counter increments on broadcast and the on-device refusal reasons read correctly.


6. Policy reference

6.1 The four on-device gates (tamper-proof, per device)

| Gate | Policy field | What it does | Notes |
|---|---|---|---|
| Per-txn cap | max_amount | refuses any single spend over the cap | sats |
| Velocity | per_period + top-level period | refuses once this device's signed total in the window exceeds it | per-device, counted locally; see §6.3 |
| Whitelist | whitelist: [addr…] | external outputs must be on this list; change back to wallet is exempt | the strongest control: confines where funds can go |
| Message paths | msg_paths | which derivation paths may sign messages (["any"] or specific) | proof-of-control without moving funds |
| (binding) | wallet | names the registered multisig so change is recognised as internal | required, or change trips the whitelist |
| (audit) | must_log: true | device writes a log entry per decision | feed into monitoring (§8) |

6.2 The coordinator global velocity cap (the authoritative limit)

On-device velocity counters drift under a rotating "any 2 of 3" (each device only counts what it signed), and a device charges a spend against its velocity budget the moment it signs, even if the PSBT is later dropped and never broadcast. So no single device's counter reflects true global outflow. The coordinator therefore enforces the real cap, and:

  • it counts only real broadcasts — so dropped/refused PSBTs never burn budget;
  • it includes mempool (0-conf) sends — they're counted the instant they broadcast, so a burst of spends inside one ~10-min block interval can't slip past (abandoned/conflicted txns are excluded so a dropped tx doesn't permanently burn budget);
  • it is chain-derived — computed from the watch-only wallet's on-chain send total over the window (listtransactions), not a single-host file, so any replica recomputes it and there is no ledger SPOF.
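The three properties above can be sketched as a pure function over the entries returned by bitcoind's `listtransactions` on the watch-only wallet. This is an illustrative sketch, not the reference coordinator's code: the function name is hypothetical, fees are left out of the total by assumption, and the entry fields used (`category`, `amount`, `confirmations`, `abandoned`, `time`) are the ones that RPC reports.

```python
from decimal import Decimal
from typing import Iterable, Mapping

SATS_PER_BTC = 100_000_000


def spent_in_window(entries: Iterable[Mapping], now: int, window_seconds: int) -> int:
    """Sum wallet sends (in sats) inside the window, including 0-conf sends."""
    cutoff = now - window_seconds
    total_sats = 0
    for entry in entries:
        if entry.get("category") != "send":
            continue                      # receives never count against the cap
        if entry.get("abandoned", False):
            continue                      # abandoned tx: budget is released
        if entry.get("confirmations", 0) < 0:
            continue                      # negative confirmations = conflicted
        if entry.get("time", 0) < cutoff:
            continue                      # outside the velocity window
        # Send amounts are reported as negative BTC; convert exactly to sats.
        total_sats += int(abs(Decimal(str(entry["amount"]))) * SATS_PER_BTC)
    return total_sats
```

Because the input is only wallet history, any replica that can reach the node computes the same number, which is what removes the ledger as a single point of failure.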

6.2b Surge tier — a TOTP-gated higher tier (human in the loop)

For occasional large moves, add a second, ordered rule that permits a higher max_amount + per_period (and/or extra whitelist) but requires the owner's TOTP (users:["owner"], min_users:1). Coldcard evaluates rules first-match, so routine spends match the automated tier-1 and a larger one falls through to tier-2 and is refused unless a valid code is presented. Secure model: the owner reads the 6-digit code from a normal authenticator app (standard RFC-6238); the keyless coordinator only relays it (user_auth) — the secret lives on the devices + the owner's app, never on the coordinator, so a compromised coordinator can't mint codes. The coordinator's global cap gets a matching surge ceiling. Sizing: the surge ceilings raise the §1.4 blast-radius bound when a code is present — set them to the most you'd ever authorise in one human-approved move, and treat entering a code as approving a spend up to the surge cap. (Live for signed-in users on the reference deployment; exercised end-to-end on real signet.)
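The fall-through behaviour can be illustrated with a toy model of ordered, first-match rule evaluation. This is a simplified sketch under stated assumptions, not Coldcard firmware logic: it tracks one shared spent-in-period total, treats TOTP as a single boolean, and the tier-2 numbers and address label are invented for the example.

```python
from typing import Optional, Sequence

# Tier 1 mirrors the minimum policy from Step 4; tier 2 values are hypothetical.
RULES = [
    {"max_amount": 8000, "per_period": 33333, "whitelist": ["hot-addr-1"]},
    {"max_amount": 50000, "per_period": 100000, "whitelist": ["hot-addr-1"],
     "users": ["owner"], "min_users": 1},
]


def first_matching_rule(rules: Sequence[dict], amount: int, dest: str,
                        spent_in_period: int, totp_ok: bool = False) -> Optional[int]:
    """Return the index of the first rule that approves the spend, or None."""
    for index, rule in enumerate(rules):
        if amount > rule["max_amount"]:
            continue                                  # over this tier's per-txn cap
        if spent_in_period + amount > rule["per_period"]:
            continue                                  # would exceed this tier's velocity
        if dest not in rule["whitelist"]:
            continue                                  # destination not allowed
        if rule.get("min_users", 0) > 0 and not totp_ok:
            continue                                  # tier needs a valid owner code
        return index
    return None                                       # no rule matched: refuse
```

In this model a 5,000-sat refill matches tier 1 with no code, a 20,000-sat move is refused without a code and matches tier 2 with one, and an off-list destination is refused in every case.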

6.3 Sizing — the 1.5× rule (do not skip this)

A compromised coordinator that ignores G is bounded by the hardware ceilings. With any-2-of-3, worst-case extractable = 1.5 × V (each spend burns two signatures for one unit of value; 3V ÷ 2). Therefore:

  • Set each device's per_period V = ⅔ × G. Then worst-case 1.5 × ⅔G = G — the hardware bound equals your intended global velocity.
  • Round-robin the signer pair so rotation is even, otherwise busy devices hit ⅔G early and honest refills stall (raising V for liveness margin pushes the worst case back above G).
  • Topology dial:
    • any-2-of-3 auto-signers → worst case 1.5×, but "lose any one, keep running" unattended.
    • 2 fixed auto-signers + 1 human break-glass (offline) → every routine spend needs both online devices, so V ≥ G, but the offline key can't be used unattended → worst case 1.0×, at the cost of unattended failover. Choose deliberately.
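The sizing rule reduces to two small calculations. The helpers below are a sketch with hypothetical names; rounding V down is an assumption made here so that 1.5 × V can never exceed G in integer sats.

```python
from fractions import Fraction


def per_device_ceiling(global_cap_sats: int) -> int:
    """V = two thirds of G, rounded down so that 1.5 * V never exceeds G."""
    return (2 * global_cap_sats) // 3


def worst_case_extractable(v_sats: int, topology: str = "any-2-of-3") -> Fraction:
    """Hardware-bounded worst case if the coordinator ignores G entirely."""
    if topology == "any-2-of-3":
        return Fraction(3 * v_sats, 2)     # three budgets, two signatures per spend
    if topology == "2-fixed-plus-breakglass":
        return Fraction(v_sats)            # both online devices sign every spend
    raise ValueError(f"unknown topology: {topology}")


if __name__ == "__main__":
    g = 50_000
    v = per_device_ceiling(g)
    print(v, float(worst_case_extractable(v)))
```

For G = 50,000 sats this gives V = 33,333, matching the per_period value in the Step 4 example policy, with a worst case just under G.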

7. Day-to-day operations

Automated refill (the normal path, no human): hot wallet drops below the floor → coordinator builds a PSBT to a whitelisted address from the watch-only wallet → checks the global cap → fans to 2 of 3 → devices auto-sign under policy → combine → broadcast. Watch it confirm on your explorer.
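The trigger and cap check at the front of that pipeline can be sketched as a single planning function. The names and the choice to shrink a refill to fit the remaining headroom (rather than skip it) are assumptions for illustration; the reference coordinator may behave differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RefillConfig:
    floor_sats: int        # hot-wallet balance that triggers a refill
    refill_sats: int       # preferred refill size
    max_amount_sats: int   # per-txn cap, mirrored from the device policy
    global_cap_sats: int   # G, the coordinator's per-period cap


def plan_refill(hot_balance_sats: int, spent_in_window_sats: int,
                cfg: RefillConfig) -> int:
    """Return the refill amount in sats, or 0 if no refill should be built."""
    if hot_balance_sats >= cfg.floor_sats:
        return 0                                   # balance is healthy
    headroom = cfg.global_cap_sats - spent_in_window_sats
    # Never ask the devices for more than they would sign, nor exceed G.
    amount = min(cfg.refill_sats, cfg.max_amount_sats, headroom)
    return max(amount, 0)
```

A non-zero result is the amount the coordinator would put into the PSBT before fanning it out to two signers; the devices still apply their own policy independently.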

Change a policy value (cap / velocity / period): edit the versioned policy, push to each device, re-arm all signers. Re-arming restarts the HSM session and resets the per-device velocity counters — expected. Re-verify the registered wallet survived (§5 Step 5 trap).

Add a whitelisted destination: add the address to the policy whitelist, push, re-arm. Only the operator can do this; untrusted callers can never extend the whitelist.

Take a signer down for maintenance: quorum tolerates one down; the other two keep signing. Bring it back, re-arm it (see Boot-to-signing-ready below), confirm quorum is 3 again. Never take a second one down while one is already offline (that freezes spending until one returns).

Boot-to-signing-ready: a Coldcard needs PIN + HSM-mode entry after any power loss. Unattended operation means the signer agent must restore the device to signing-ready automatically after a reboot, and monitoring must confirm it did — a device that silently fails to return erodes quorum. Rehearse: reboot a signer host and confirm the agent re-arms it and quorum self-heals.


8. Monitoring & alerting (non-negotiable for unattended operation)

Wire these into your observability stack (e.g. Loki + Grafana + alerting):

  • Quorum health — are ≥2 of 3 signers online and HSM-active and wallet-registered? Alert the moment it drops to exactly 2 (one more failure = frozen).
  • Velocity near limit — global cap approaching for the period; per-device counters near V.
  • Policy denials — every on-device refusal (the must_log trail) → alert; a spike may signal an attack or a misconfiguration.
  • USB / device health — VMs surviving a host crash can carry latent USB/udev damage; don't repeat-restart, restore from backup.
  • Refill anomalies — refills outside expected cadence/size (a compromised coordinator's tell).
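The quorum-health check in the first bullet can be expressed as a small classifier. This is a sketch with hypothetical type and state names; how each flag is collected from a signer agent is left out.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SignerStatus:
    name: str
    online: bool
    hsm_active: bool
    wallet_registered: bool

    @property
    def ready(self) -> bool:
        # A signer only counts toward quorum if all three conditions hold.
        return self.online and self.hsm_active and self.wallet_registered


def quorum_state(signers: Sequence[SignerStatus], threshold: int = 2) -> str:
    """Classify quorum as 'healthy', 'degraded' (alert) or 'frozen'."""
    ready = sum(1 for s in signers if s.ready)
    if ready < threshold:
        return "frozen"      # spending is impossible until a signer returns
    if ready == threshold:
        return "degraded"    # one more failure freezes spending: alert now
    return "healthy"
```

A signer that is online but lost its registered wallet after a policy load counts as not ready here, which ties this check back to the ordering trap in §5 Step 5.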

9. Backup, recovery & DR

  • 3 seeds, each to steel, geographically separated. 2-of-3 survives losing one seed.
  • The descriptor is load-bearing. Losing it makes recovery painful even with all three seeds — store the wallet descriptor offsite, independently of the seeds (in your vault).
  • Rehearse recovery before funding mainnet: reconstruct the watch-only wallet from the descriptor on a fresh node, recover a signer from seed, re-register the multisig, reload policy, sign a test spend.
  • Coordinator state is disposable — it's chain-derived; a replacement coordinator recomputes the velocity counter from chain history with zero handed-over state.

10. Incident response & break-glass

| Situation | Response |
|---|---|
| One signer down | Operate on the remaining two; restore + re-arm the third; do not drop a second. |
| Two signers down | Spending is frozen (safe). Restore one to resume. This is the design working. |
| Suspected coordinator compromise | Funds are bounded by §1.4 (whitelist + 1.5V). Rotate the coordinator host/secret; the hardware caps already contain the blast radius. Review the policy-denial + refill logs. |
| Large / exception spend needed | Use the human break-glass path (3rd human-held key / CKBunker), outside the automated policy. |
| Suspected key compromise (one device) | One key alone moves nothing. Rotate to a fresh 2-of-3 (new seeds + descriptor), sweep funds under the old policy to the new wallet. |

11. Known sharp edges (read before production)

  • USB passthrough pins a VM to its host — a VM with a physical USB Coldcard cannot live-migrate. These signer VMs deliberately break the "freely migratable" model; the multisig is the HA.
  • HSM + multisig together is advanced / lightly-charted — soak on signet for a long time before mainnet.
  • CKBunker is niche (v0.9.1, "at your own risk") — keep it as the human break-glass surface only; the automated path is the signer-agent over ckcc-protocol, not CKBunker.
  • The #12 ordering trap (§5 Step 5) — register wallet → load policy → verify wallet survived.
  • Velocity counters don't compose across devices — that's the whole reason the authoritative cap is the coordinator's chain-derived one; per-device velocity is the hardware bound, not the operational truth.
  • Post-host-crash latent damage — a signer VM that survived a host hard-crash can carry subclinical FS/USB damage; restore from backup rather than repeat-restarting.

12. Regulatory note

Running this for your own funds (treasury / your own refill tier) is not custody-of-others. Offering it as custody-as-a-service for third parties is regulated activity in AU (AUSTRAC / financial services) — get legal advice before productising. Same flag as a regulated swap service.