# Operator's Manual — mineracks distributed policy-enforced multisig HSM

**A 2-of-3 threshold HSM that auto-signs cold→hot refills under on-device policy, with no human in the loop.**

This manual is everything you need to set the system up and **operate it safely**. Read §1–§3 before you touch anything — the safety model is not optional context, it is the reason the design is shaped the way it is, and operating it without understanding it will get funds lost.

| | |
|---|---|
| **Status** | A **live reference deployment** runs at `multisighsm.com` on real Bitcoin signet — every spend is a genuine on-chain 2-of-3 co-signature enforced by on-device policy. This manual documents both the **reference configuration** (**[REF]**) and the **production hardware procedure** (**[PROD]**). |
| **Audience** | The operator running the treasury / refill tier (you), and a reviewer evaluating it. |
| **Companion docs** | Design & market rationale: [`README.md`](./README.md) · live demo: `multisighsm.com` |
| **Scope** | The **cold/warm tier and the cold→hot refill pipe** — low-throughput, high-stakes. **NOT** your high-TPS hot-wallet signer (that's MPC's job — see §1.2). |

---

## 1. Read this first — the security model

### 1.1 What the system is

Three independent **Coldcard** hardware signers, each in **HSM mode** under its own spending policy, each on a **physically separate host** (ideally one genuinely offsite). A **keyless coordinator** watches your hot wallet, builds a refill transaction when it runs low, fans the unsigned PSBT to **any two** of the three signers, they **auto-sign under their on-device policy with no human**, and the coordinator combines, finalizes and broadcasts. Any 2 of 3 can move funds; lose any one signer and you keep operating; lose two and funds are **frozen safe**.

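The any-two-of-three fan-out described above can be sketched in a few lines of Python. This is a minimal illustration, not the reference coordinator: the `Signer` callable type, the `collect_partials` name, and the string stand-ins for PSBTs are all assumptions made for the example. A real signer call would go to a signer agent over the tailnet.

```python
from typing import Callable, Dict, Optional

# Hypothetical signer interface: takes an unsigned PSBT (base64 string) and
# returns a partially signed PSBT, or None when the device denies the spend.
Signer = Callable[[str], Optional[str]]


def collect_partials(psbt: str, signers: Dict[str, Signer],
                     threshold: int = 2) -> Optional[Dict[str, str]]:
    """Ask signers in order until `threshold` partial signatures are held.

    Returns {signer_name: partial_psbt}, or None when quorum is unreachable
    (the "frozen safe" case: nothing is combined, nothing is broadcast).
    """
    partials: Dict[str, str] = {}
    for name, sign in signers.items():
        if len(partials) == threshold:
            break
        try:
            partial = sign(psbt)
        except ConnectionError:
            partial = None  # an unreachable host counts the same as a denial
        if partial is not None:
            partials[name] = partial
    return partials if len(partials) == threshold else None


def offline(_psbt: str) -> Optional[str]:
    """Stand-in for a signer host that cannot be reached."""
    raise ConnectionError("signer host unreachable")


# Stand-in signers for illustration only: A is down, B and C sign.
demo = {
    "A": offline,
    "B": lambda p: p + "+sigB",
    "C": lambda p: p + "+sigC",
}
result = collect_partials("unsigned-psbt", demo)
```

With one signer down the remaining pair still reaches quorum; with two down the function returns `None` and no spend is attempted. Because a device consumes velocity budget the moment it signs (§6.2), a production coordinator would probably confirm that two signers are reachable before requesting any signature.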
### 1.2 What it is NOT

It is **not a hot-wallet signing engine.** Coldcard signing is seconds-per-PSBT and is not built for the hundreds–thousands of signatures/hour a busy exchange hot wallet does, and it carries no FIPS/PKCS#11 certification an auditor or insurer will expect for the primary hot engine. **Keep using an MPC platform for the hot wallet.** This system secures the **95% of reserves behind it** and the **refill pipe** between cold and hot — where automation with hardware keys you physically hold beats both a manual 3am ceremony and a custodian.

### 1.3 The three enforcement layers (and which one you actually trust)

| Layer | Where it runs | Enforces | Trust property |
|---|---|---|---|
| **On-device policy** | Each Coldcard secure element | per-txn cap · velocity · **address whitelist** · message-signing paths | **Tamper-proof. This is the safety floor.** A compromised host cannot lift it. |
| **Coordinator global velocity cap** | The keyless coordinator (software) | the *authoritative* total-per-period across all signers | **Operational, not safety.** Precise day-to-day limit; bypassable if the coordinator is compromised — but bounded by the layer above. |
| **Quorum (2-of-3)** | The protocol | no single signer can move funds; no single signer outage freezes funds | Structural. |

> **The golden rule: the coordinator may *limit*, but only the hardware may *bound*.** Size the on-device
> limits as your real safety envelope; treat the coordinator cap as a tighter operational convenience.

### 1.4 What a compromise can and cannot do

**Coordinator fully compromised (worst realistic software breach):**

- It **holds no keys** → cannot sign or forge. Every spend still needs **two Coldcards** to each pass their own on-device whitelist + cap + velocity.
- It **cannot redirect funds off the whitelist** → at worst it prematurely refills *your own* hot-wallet addresses, never an attacker's address.
- It **cannot exceed the devices' velocity ceilings.** That ceiling is the true catastrophic bound.
- **Blast radius = 1.5 × the per-device velocity ceiling `V`** (with any-2-of-3): every spend burns two signatures for one unit of value, so 3 devices × `V` ÷ 2 sigs = `1.5V` extractable before the hardware freezes it. **Size accordingly — see §6.3.**
- *Residual:* a compromised coordinator **plus** a compromised hot wallet could drain up to `1.5V` into the (whitelisted) hot wallet and out. That's why `V` lives on the secure element, not in software.

**One or more signer hosts compromised:**

- Owning **one** signer host moves nothing (need two; the device still enforces its own policy and holds the key in its secure element — the host can't extract it).
- Owning **two** signer hosts: an attacker can produce two signatures, but each device **still enforces its policy** — so spends are still bounded by per-txn cap + velocity and confined to the whitelist. To steal, an attacker would need to defeat **two independent secure elements' policies**, not two Linux hosts.

**Coordinator offline:** **fail-safe for safety.** No coordinator → no PSBT is built → no spend → the cap trivially holds. You lose the ability to *refill* (a liveness gap, §1.5), never control of funds.

### 1.5 No single point of failure — and the one you must engineer around

- **Keys:** 2-of-3 across independent failure domains. Survives losing any one signer.
- **Global velocity counter:** **derived from the blockchain**, not a single-host file (see §6.2). Any coordinator replica on any host recomputes the same number from chain history → no single ledger to lose or tamper.
- **Coordinator liveness:** a single coordinator is a *liveness* SPOF (if it's down you can't refill). Run it **replicated across the same independent hosts as the signers** so any replica can drive a refill.
  Because the counter is chain-derived and the cap is bounded by hardware, replicas need no shared state and a rogue replica can't exceed the hardware bound.

---

## 2. Architecture & topology

```
          hot wallet (MPC / your existing engine) ── monitored ──┐
                   ▲                                             │
                   │ signed refill broadcast                     │ "balance below floor"
                   │                                             ▼
┌──────────────────┴────────────────────────────────────────────────┐
│  KEYLESS COORDINATOR (replicated; holds NO keys)                  │
│  • watches hot-wallet floor   • builds PSBT from watch-only wallet│
│  • global velocity cap (chain-derived)  • fans to any 2 of 3      │
│  • combines → finalizes → broadcasts                              │
└───────┬───────────────────┬───────────────────┬───────────────────┘
        │ tailnet           │ tailnet           │ tailnet
┌───────▼──────┐    ┌───────▼──────┐    ┌───────▼──────┐
│ signer host A│    │ signer host B│    │ signer host C│   ← independent failure domains
│ Coldcard +   │    │ Coldcard +   │    │ Coldcard +   │     (power / switch / site)
│ signer-agent │    │ signer-agent │    │ signer-agent │     one ideally OFFSITE
│ HSM policy A │    │ HSM policy B │    │ HSM policy C │
└──────────────┘    └──────────────┘    └──────────────┘
   keys live ONLY on the secure elements; agents hold none
                            │
   watch-only 2-of-3 descriptor wallet on a Bitcoin full node
```

**Components:**

- **Signer host (×3)** — a small machine (NUC/Pi/VM) with a USB-attached Coldcard, running a **signer agent**: a thin authenticated tailnet service wrapping `ckcc-protocol`. Receives `{psbt, wallet_id}`, ensures its device is HSM-started with the 2-of-3 registered, signs, returns `partial_psbt` or `denied(reason)`. **Holds no keys.** **[REF]** the reference deployment runs all three as segregated `--mk5 --hsm` Coldcard signers, each on its own unix socket.
- **Coordinator** — builds PSBTs, enforces the global cap, fans out 2-of-3, combines/broadcasts, exposes the control surface. **[REF]** `orchestrator.py` (`multisig-orchestrator`, `:8099`).
- **Watch-only wallet** — a `bitcoind` descriptor wallet tracking the 2-of-3 descriptor.
  **Reuse, don't build.** **[REF]** a watch-only wallet on a signet node.
- **Bitcoin node** — provides the watch-only wallet, builds PSBTs, broadcasts, and is the source of truth for the velocity counter. **[REF]** a signet node (RPC over the tailnet).

---

## 3. Failure-domain placement (the make-or-break)

This is the single most important deployment decision, and the reasoning is concrete: nominally independent hosts can still fail *together* — a shared power feed or UPS, a common chassis or drive batch, the same hypervisor or network switch, or correlated hardware faults. **If two of three keys sit in the same failure domain, a single event can take both down and freeze the treasury.** Therefore:

- The three signers **MUST** sit in **independent failure domains** — different physical hosts, ideally different power circuits / UPS / network switches.
- **At least one signer should be genuinely offsite** (a small physical box over Tailscale). A cloud VPS cannot host a USB Coldcard, but a Pi/NUC at a second location can be signer #3.
- Spread the coordinator replicas across those same domains.
- Do **not** co-locate two signers on hosts that share a single point of failure (same PSU, same switch, same rack PDU, same hypervisor). Quorum HA is worthless if one event takes out two keys.

> **[REF] note:** the reference deployment runs all three signers on one host for convenience — that has **no**
> failure-domain independence and is for functional validation. Production uses three independent hosts (above);
> never hold mainnet value on a single host that runs more than one signer.

---

## 4. Prerequisites & bill of materials

**[PROD] hardware:**

- **3× Coldcard Mk5** (or Q) — the current device; dual secure elements (ATECC608 + DS28C36B). HSM mode + multisig (P2SH/P2WSH) co-signing are supported in firmware.
- **3× signer hosts** in independent failure domains (one offsite), each with a free USB port.
- **Steel backups** for 3 seeds + a durable record of the wallet descriptor.

**Software / services:**

- A **Bitcoin full node** (watch-only-capable; descriptor wallets). Mainnet for production; signet for the lab.
- **Tailscale** on every signer host + coordinator (signer agents are RPC *clients* — they don't bind the tailnet IP, so the bind-race gotcha doesn't apply).
- `ckcc-protocol` (the `ckcc` CLI / Python lib) on each signer host.
- The coordinator + signer-agent software (`orchestrator.py` is the reference coordinator).
- (Optional) **CKBunker** as a human break-glass UI, kept **off** the automated critical path.

**Skills:** comfortable with bitcoind RPC, descriptors/PSBT, Coldcard HSM mode, and Linux service ops.

---

## 5. Initial setup

> ⚠️ **Do a complete dry-run on signet or testnet first** (the lab does exactly this). Only move to mainnet
> once you have rehearsed a refill, a failover, a policy change, and a full restore from backup.

### Step 1 — Generate three independent seeds

Generate a **distinct** seed on **each** Coldcard (never clone one seed to three devices — that defeats the whole model). Record each 24-word seed to steel and store the three **geographically separated**. Note each device's master fingerprint + the BIP-48 account xpub (`m/48'/0'/0'/2'` mainnet; `m/48'/1'/0'/2'` signet).

### Step 2 — Build the 2-of-3 descriptor + watch-only wallet

Construct `wsh(sortedmulti(2, key1, key2, key3))` from the three `[fingerprint/48h/0h/0h/2h]xpub` keys (receive `/0/*` and change `/1/*` branches). Create a **watch-only** descriptor wallet on the node and import both branches (`importdescriptors`, private keys disabled, internal=true for the change branch). Record the descriptor durably — see §9, losing it makes recovery painful even with all three seeds.

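As a rough illustration of Step 2, the Python sketch below assembles the receive and change descriptors and the request list for `importdescriptors`. The fingerprints and xpub strings are placeholders, and the helper names, the `range` value and the `"now"` timestamp are assumptions chosen for the example. bitcoind expects each descriptor to carry a checksum suffix, which `getdescriptorinfo` reports, so that step is left to the caller.

```python
from typing import Dict, List, Tuple

# Placeholder key material for illustration only; the real fingerprints and
# account xpubs are read from each Coldcard in Step 1.
KEYS: List[Tuple[str, str]] = [
    ("aaaaaaaa", "xpubA"),
    ("bbbbbbbb", "xpubB"),
    ("cccccccc", "xpubC"),
]


def multisig_descriptor(keys: List[Tuple[str, str]], branch: int,
                        coin_type: int = 0, threshold: int = 2) -> str:
    """Build wsh(sortedmulti(...)) for one branch (0 = receive, 1 = change).

    coin_type is 0 for mainnet and 1 for signet/testnet, matching the
    BIP-48 paths given in Step 1.
    """
    origin = f"48h/{coin_type}h/0h/2h"
    parts = [f"[{fp}/{origin}]{xpub}/{branch}/*" for fp, xpub in keys]
    return f"wsh(sortedmulti({threshold},{','.join(parts)}))"


def import_requests(keys: List[Tuple[str, str]], coin_type: int = 0) -> List[Dict]:
    """Request objects for importdescriptors: receive first, then change."""
    requests = []
    for branch, internal in ((0, False), (1, True)):
        requests.append({
            # Append "#<checksum>" from getdescriptorinfo before importing.
            "desc": multisig_descriptor(keys, branch, coin_type),
            "active": True,
            "internal": internal,   # True only for the change branch
            "range": [0, 999],      # assumed lookahead; size to your usage
            "timestamp": "now",     # assumes a brand-new wallet with no history
        })
    return requests


requests = import_requests(KEYS)
```

The wallet itself would be created beforehand with private keys disabled (for example via `createwallet` with `disable_private_keys=true`). When restoring a wallet that already has history, `"timestamp"` would need to be the wallet's birth time rather than `"now"` so the node rescans.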
### Step 3 — Register the 2-of-3 wallet on EACH Coldcard

Each device must have the multisig wallet **registered** so it recognises change back to the wallet as internal (and doesn't mistake it for an external send that the whitelist/velocity would block). Export the Coldcard multisig config and `ckcc upload -m ` to each device (confirm on-device).

### Step 4 — Author the HSM policy (per device)

Policy is **JSON, versioned in this repo**, deployed to each device. Minimum viable policy (see §6 for the full reference and the diverse-policy option):

```json
{
  "must_log": true,
  "period": 60,
  "msg_paths": ["any"],
  "rules": [
    {
      "max_amount": 8000,
      "per_period": 33333,
      "wallet": "ckms23",
      "whitelist": ["", "<...>"]
    }
  ]
}
```

(`per_period` here is the **per-device** ceiling `V`; see §6.3 for why `V = ⅔ × global cap`.)

### Step 4b — Enrol the TOTP "owner" (only if you use the surge tier)

If your policy has a TOTP-gated **surge tier** (§6.2b — a rule with `users:["owner"], min_users:1`), enrol the `owner` TOTP user on **each** Coldcard **before** loading the policy. All three signers must hold the **same** secret (one authenticator code has to work for whichever two devices sign), so the device-picks-its-own-QR path does **not** apply here — a shared secret has to be generated once and loaded onto all three.

> 🔑 **Who generates it matters (production vs demo).** **Production:** *the owner* generates the secret
> (in their authenticator app, or offline) and loads it onto each Coldcard **directly over USB during setup** —
> **never through the coordinator**. Then a fully-compromised coordinator can't mint surge codes; at spend time
> it only **relays** the owner's live 6-digit code (it never needs the secret). **Demo only:** the coordinator
> generates the secret + shows the enrolment QR for convenience — acceptable on signet with no real funds, but
> it raises a coordinator-compromise blast radius from tier-1 up to the surge ceiling, so don't do it in prod.

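To see why all three devices need the same secret, and why whoever holds the secret can mint codes, it helps to look at how a TOTP code is derived. The Python sketch below follows RFC 6238 with HMAC-SHA1, a 30-second step and 6 digits, which are the defaults most authenticator apps use; whether the Coldcard uses exactly these parameters is an assumption here and should be checked against its documentation. The `totp` function name is illustrative.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32: str, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 TOTP using HMAC-SHA1 (assumed parameters, see lead-in)."""
    key = base64.b32decode(secret_b32, casefold=True)
    # The moving factor is the number of whole time steps since the epoch.
    counter = struct.pack(">Q", unix_time // step)
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    # Dynamic truncation from RFC 4226: the low nibble picks a 4-byte window.
    offset = digest[-1] & 0x0F
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# The published RFC 6238 test secret (ASCII "12345678901234567890"),
# base32-encoded the way authenticator apps expect. Never use it for real.
RFC_SECRET = base64.b32encode(b"12345678901234567890").decode()
code = totp(RFC_SECRET, 59)
```

The code depends only on the secret and the current time, so any party holding the secret produces the same six digits. For the RFC test secret at time 59 the result is `287082`, which matches the published test vector truncated to six digits.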
Mechanically: `create_user("owner", USER_AUTH_TOTP, )` on each device (standard RFC-6238, so any authenticator app works). The secret must persist so re-arming a rebooted signer re-enrols the same value.

### Step 5 — Load policy + start HSM mode (MIND THE ORDERING)

> 🪤 **Sharp edge (ckbunker issue #12):** loading an HSM policy can *delete a registered multisig wallet* on
> the Coldcard. **Order: register the multisig wallet (Step 3) → enrol the TOTP user (Step 4b, if any) → load
> the policy → verify the wallet still exists on the device.** Re-check `hsm_status` and the registered-wallet
> count after every policy load.

Load the policy and enter HSM mode on each device (the on-device approval is two-step: confirm, then a random digit to save). HSM mode is a **one-way trip** until reboot — design for it.

### Step 6 — Configure the coordinator

- **Global velocity cap `G`** — the authoritative total-per-period (chain-derived; §6.2).
- **Per-device ceiling `V = ⅔ × G`** in each device policy (§6.3), and enable **round-robin** of the signer pair so rotation stays even.
- **Whitelist** = your hot-wallet deposit addresses (anonymous/untrusted callers can never add to it).
- **Refill trigger** — the hot-wallet floor that starts a refill, and the refill amount.
- The coordinator's HMAC session secret lives **only on the host** (`chmod 600`, **never in git**).

### Step 7 — Verify before funding

Quorum check: **≥2 of 3** signers online **and** HSM-active **and** wallet-registered. Then, on signet/testnet: run a within-policy refill (expect sign+broadcast), an over-cap spend (expect on-device refusal), an off-list spend (expect refusal), and a velocity-exceeding burst (expect the coordinator to block at `G`). Confirm the chain-derived counter increments on broadcast and the on-device refusal reasons read correctly.

---

## 6. Policy reference

### 6.1 The four on-device gates (tamper-proof, per device)

| Gate | Policy field | What it does | Notes |
|---|---|---|---|
| Per-txn cap | `max_amount` | refuses any single spend over the cap | sats |
| Velocity | `per_period` + top-level `period` | refuses once this device's signed total in the window exceeds it | **per-device**, counted locally — see §6.3 |
| Whitelist | `whitelist: [addr…]` | external outputs must be on this list; change back to `wallet` is exempt | the strongest control — confines *where* funds can go |
| Message paths | `msg_paths` | which derivation paths may sign messages (`["any"]` or specific) | proof-of-control without moving funds |
| (binding) | `wallet` | names the registered multisig so change is recognised as internal | required, or change trips the whitelist |
| (audit) | `must_log: true` | device writes a log entry per decision | feed into monitoring (§8) |

### 6.2 The coordinator global velocity cap (the authoritative limit)

On-device velocity counters **drift** under a rotating "any 2 of 3" (each device only counts what *it* signed) and a device decrements its counter the moment it signs **even if the PSBT is later dropped and never broadcast**. So no single device's counter reflects true global outflow. The coordinator therefore enforces the real cap, and:

- it counts **only real broadcasts** — so dropped/refused PSBTs never burn budget;
- it **includes mempool (0-conf) sends** — they're counted the instant they broadcast, so a burst of spends inside one ~10-min block interval can't slip past (abandoned/conflicted txns are excluded so a dropped tx doesn't permanently burn budget);
- it is **chain-derived** — computed from the watch-only wallet's on-chain `send` total over the window (`listtransactions`), not a single-host file, so any replica recomputes it and there is no ledger SPOF.

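The three counting rules above can be sketched as a pure function over `listtransactions` output. This is a minimal Python illustration, not the reference coordinator: the function names and the sample entries are made up, fees are left out of the total, and the caller is assumed to request enough history to cover the whole window. The cap of 50,000 sats is inferred from the Step 4 policy (`per_period` 33333 under the ⅔ rule) and is likewise an assumption.

```python
from decimal import Decimal
from typing import Dict, Iterable

SATS_PER_BTC = Decimal(100_000_000)


def global_outflow_sats(entries: Iterable[Dict], now: int, period_s: int) -> int:
    """Sum wallet sends inside the window, from listtransactions-style dicts."""
    total = 0
    for entry in entries:
        if entry.get("category") != "send":
            continue
        # Abandoned or conflicted (negative confirmations) sends never count.
        if entry.get("abandoned") or entry.get("confirmations", 0) < 0:
            continue
        if entry["time"] <= now - period_s:
            continue  # outside the velocity window
        # Send amounts are negative BTC values; 0-conf entries are included.
        total += int(abs(Decimal(str(entry["amount"]))) * SATS_PER_BTC)
    return total


def within_cap(entries: Iterable[Dict], now: int, period_s: int,
               cap_sats: int, refill_sats: int) -> bool:
    """True if adding refill_sats keeps the window total at or under the cap."""
    return global_outflow_sats(entries, now, period_s) + refill_sats <= cap_sats


# Made-up sample history for illustration.
NOW = 1_700_000_000
sample = [
    {"category": "send", "amount": -0.0002, "time": NOW - 600, "confirmations": 1, "abandoned": False},
    {"category": "send", "amount": -0.0001, "time": NOW - 30, "confirmations": 0, "abandoned": False},
    {"category": "send", "amount": -0.0005, "time": NOW - 60, "confirmations": 0, "abandoned": True},
    {"category": "send", "amount": -0.0003, "time": NOW - 7200, "confirmations": 12, "abandoned": False},
    {"category": "receive", "amount": 0.01, "time": NOW - 100, "confirmations": 3},
]
spent = global_outflow_sats(sample, NOW, 3600)
ok_small = within_cap(sample, NOW, 3600, cap_sats=50_000, refill_sats=8_000)
ok_large = within_cap(sample, NOW, 3600, cap_sats=50_000, refill_sats=25_000)
```

In the sample, only the confirmed send and the mempool send inside the hour are counted (30,000 sats), so an 8,000-sat refill passes and a 25,000-sat refill is blocked. Because the function takes only wallet history and the clock as input, any replica given the same node view computes the same answer.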
### 6.2b Surge tier — a TOTP-gated higher tier (human in the loop)

For occasional large moves, add a **second, ordered rule** that permits a higher `max_amount` + `per_period` (and/or extra whitelist) but requires the owner's **TOTP** (`users:["owner"], min_users:1`). Coldcard evaluates rules **first-match**, so routine spends match the automated tier-1 and a larger one falls through to tier-2 and is refused unless a valid code is presented.

**Secure model:** the owner reads the 6-digit code from a normal authenticator app (standard RFC-6238); the **keyless coordinator only relays it** (`user_auth`) — the secret lives on the devices + the owner's app, never on the coordinator, so a compromised coordinator can't mint codes. The coordinator's global cap gets a matching surge ceiling.

**Sizing:** the surge ceilings raise the §1.4 blast-radius bound *when a code is present* — set them to the most you'd ever authorise in one human-approved move, and treat entering a code as approving a spend up to the surge cap. *(Live for signed-in users on the reference deployment; exercised end-to-end on real signet.)*

### 6.3 Sizing — the 1.5× rule (do not skip this)

A compromised coordinator that ignores `G` is bounded by the hardware ceilings. With any-2-of-3, worst-case extractable = **`1.5 × V`** (each spend burns two signatures for one unit of value; `3V ÷ 2`). Therefore:

- **Set each device's `per_period` `V = ⅔ × G`.** Then worst-case `1.5 × ⅔G = G` — the hardware bound equals your intended global velocity.
- **Round-robin the signer pair** so rotation is even, otherwise busy devices hit `⅔G` early and honest refills stall (raising `V` for liveness margin pushes the worst case back above `G`).
- **Topology dial:**
  - *any-2-of-3 auto-signers* → worst case **1.5×**, but "lose any one, keep running" unattended.
  - *2 fixed auto-signers + 1 human break-glass (offline)* → every routine spend needs both online devices, so `V ≥ G`, but the offline key can't be used unattended → worst case **1.0×**, at the cost of unattended failover. Choose deliberately.

---

## 7. Day-to-day operations

**Automated refill (the normal path, no human):** hot wallet drops below the floor → coordinator builds a PSBT to a whitelisted address from the watch-only wallet → checks the global cap → fans to 2 of 3 → devices auto-sign under policy → combine → broadcast. Watch it confirm on your explorer.

**Change a policy value (cap / velocity / period):** edit the versioned policy, push to each device, **re-arm** all signers. Re-arming restarts the HSM session and **resets the per-device velocity counters** — expected. Re-verify the registered wallet survived (§5 Step 5 trap).

**Add a whitelisted destination:** add the address to the policy `whitelist`, push, re-arm. Only the operator can do this; untrusted callers can never extend the whitelist.

**Take a signer down for maintenance:** quorum tolerates **one** down — the other two keep signing. Bring it back, re-arm it (see **Boot-to-signing-ready** below), confirm quorum is 3 again. **Never** take a second one down while one is already offline (that freezes spending until one returns).

**Boot-to-signing-ready:** a Coldcard needs PIN + HSM-mode entry after any power loss. Unattended operation means the signer agent must restore the device to signing-ready automatically after a reboot, and monitoring must confirm it did — a device that silently fails to return erodes quorum. Rehearse: reboot a signer host and confirm the agent re-arms it and quorum self-heals.

---

## 8. Monitoring & alerting (non-negotiable for unattended operation)

Wire these into your observability stack (e.g. Loki + Grafana + alerting):

- **Quorum health** — are **≥2 of 3** signers online **and** HSM-active **and** wallet-registered?
  **Alert the moment it drops to exactly 2** (one more failure = frozen).
- **Velocity near limit** — global cap approaching for the period; per-device counters near `V`.
- **Policy denials** — every on-device refusal (the `must_log` trail) → alert; a spike may signal an attack or a misconfiguration.
- **USB / device health** — VMs surviving a host crash can carry latent USB/udev damage; don't repeat-restart, restore from backup.
- **Refill anomalies** — refills outside expected cadence/size (a compromised coordinator's tell).

---

## 9. Backup, recovery & DR

- **3 seeds**, each to steel, **geographically separated**. 2-of-3 survives losing **one** seed.
- **The descriptor is load-bearing.** Losing it makes recovery painful **even with all three seeds** — store the wallet descriptor offsite, independently of the seeds (in your vault).
- **Rehearse recovery** before funding mainnet: reconstruct the watch-only wallet from the descriptor on a fresh node, recover a signer from seed, re-register the multisig, reload policy, sign a test spend.
- **Coordinator state is disposable** — it's chain-derived; a replacement coordinator recomputes the velocity counter from chain history with zero handed-over state.

---

## 10. Incident response & break-glass

| Situation | Response |
|---|---|
| One signer down | Operate on the remaining two; restore + re-arm the third; do not drop a second. |
| Two signers down | Spending is **frozen (safe)**. Restore one to resume. This is the design working. |
| Coordinator compromise suspected | Funds are bounded by §1.4 (whitelist + `1.5V`). Rotate the coordinator host/secret; the hardware caps already contain the blast radius. Review the policy-denial + refill logs. |
| Large / exception spend needed | Use the **human break-glass** path (3rd human-held key / CKBunker), outside the automated policy. |
| Suspected key compromise (one device) | One key alone moves nothing. Rotate to a fresh 2-of-3 (new seeds + descriptor), sweep funds under the old policy to the new wallet. |

---

## 11. Known sharp edges (read before production)

- **USB passthrough pins a VM to its host** — a VM with a physical USB Coldcard **cannot live-migrate**. These signer VMs deliberately break the "freely migratable" model; the multisig *is* the HA.
- **HSM + multisig together is advanced / lightly-charted** — soak on signet for a long time before mainnet.
- **CKBunker is niche** (v0.9.1, "at your own risk") — keep it as the human break-glass surface only; the automated path is the signer-agent over `ckcc-protocol`, not CKBunker.
- **The #12 ordering trap** (§5 Step 5) — register wallet → load policy → verify wallet survived.
- **Velocity counters don't compose across devices** — that's the whole reason the authoritative cap is the coordinator's chain-derived one; per-device velocity is the hardware bound, not the operational truth.
- **Post-host-crash latent damage** — a signer VM that survived a host hard-crash can carry subclinical FS/USB damage; restore from backup rather than repeat-restarting.

---

## 12. Regulatory note

Running this for **your own** funds (treasury / your own refill tier) is **not** custody-of-others. Offering it as **custody-as-a-service for third parties** is **regulated activity in AU (AUSTRAC / financial services)** — get legal advice before productising. Same flag as a regulated swap service.

---