multisig-hsm/docs/OPERATOR-MANUAL.md
mineracks 2b14ab6d57 Point demo links to canonical multisighsm.com
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-26 14:23:42 +10:00

# Operator's Manual — mineracks distributed policy-enforced multisig HSM
**A 2-of-3 threshold HSM that auto-signs cold→hot refills under on-device policy, with no human in the loop.**
This manual is everything you need to set the system up and **operate it safely**. Read §1–§3 before you
touch anything — the safety model is not optional context, it is the reason the design is shaped the way it
is, and operating it without understanding it will get funds lost.
| | |
|---|---|
| **Status** | A **live reference deployment** runs at `multisighsm.com` on real Bitcoin signet — every spend is a genuine on-chain 2-of-3 co-signature enforced by on-device policy. This manual documents both the **reference configuration** (**[REF]**) and the **production hardware procedure** (**[PROD]**). |
| **Audience** | The operator running the treasury / refill tier (you), and a reviewer evaluating it. |
| **Companion docs** | Design & market rationale: [`README.md`](./README.md) · live demo: `multisighsm.com` |
| **Scope** | The **cold/warm tier and the cold→hot refill pipe** — low-throughput, high-stakes. **NOT** your high-TPS hot-wallet signer (that's MPC's job — see §1.2). |
---
## 1. Read this first — the security model
### 1.1 What the system is
Three independent **Coldcard** hardware signers, each in **HSM mode** under its own spending policy, each on a
**physically separate host** (ideally one genuinely offsite). A **keyless coordinator** watches your hot
wallet, builds a refill transaction when it runs low, fans the unsigned PSBT to **any two** of the three
signers, they **auto-sign under their on-device policy with no human**, and the coordinator combines,
finalizes and broadcasts. Any 2 of 3 can move funds; lose any one signer and you keep operating; lose two and
funds are **frozen safe**.
### 1.2 What it is NOT
It is **not a hot-wallet signing engine.** Coldcard signing is seconds-per-PSBT and is not built for the
hundreds to thousands of signatures/hour a busy exchange hot wallet does, and it carries no FIPS/PKCS#11
certification an auditor or insurer will expect for the primary hot engine. **Keep using an MPC platform for
the hot wallet.** This system secures the **95% of reserves behind it** and the **refill pipe** between cold
and hot — where automation with hardware keys you physically hold beats both a manual 3am ceremony and a
custodian.
### 1.3 The three enforcement layers (and which one you actually trust)
| Layer | Where it runs | Enforces | Trust property |
|---|---|---|---|
| **On-device policy** | Each Coldcard secure element | per-txn cap · velocity · **address whitelist** · message-signing paths | **Tamper-proof. This is the safety floor.** A compromised host cannot lift it. |
| **Coordinator global velocity cap** | The keyless coordinator (software) | the *authoritative* total-per-period across all signers | **Operational, not safety.** Precise day-to-day limit; bypassable if the coordinator is compromised — but bounded by the layer above. |
| **Quorum (2-of-3)** | The protocol | no single signer can move funds; no single signer outage freezes funds | Structural. |
> **The golden rule: the coordinator may *limit*, but only the hardware may *bound*.** Size the on-device
> limits as your real safety envelope; treat the coordinator cap as a tighter operational convenience.
### 1.4 What a compromise can and cannot do
**Coordinator fully compromised (worst realistic software breach):**
- It **holds no keys** → cannot sign or forge. Every spend still needs **two Coldcards** to each pass their
own on-device whitelist + cap + velocity.
- It **cannot redirect funds off the whitelist** → at worst it prematurely refills *your own* hot-wallet
addresses, never an attacker's address.
- It **cannot exceed the devices' velocity ceilings.** That ceiling is the true catastrophic bound.
- **Blast radius = 1.5 × the per-device velocity ceiling `V`** (with any-2-of-3): every spend burns two
signatures for one unit of value, so 3 devices × `V` ÷ 2 sigs = `1.5V` extractable before the hardware
freezes it. **Size accordingly — see §6.3.**
- *Residual:* a compromised coordinator **plus** a compromised hot wallet could drain up to `1.5V` into the
(whitelisted) hot wallet and out. That's why `V` lives on the secure element, not in software.
**One or more signer hosts compromised:**
- Owning **one** signer host moves nothing (need two; the device still enforces its own policy and holds the
key in its secure element — the host can't extract it).
- Owning **two** signer hosts: an attacker can produce two signatures, but each device **still enforces its
policy** — so spends are still bounded by per-txn cap + velocity and confined to the whitelist. To steal,
an attacker would need to defeat **two independent secure elements' policies**, not two Linux hosts.
**Coordinator offline:** **fail-safe for safety.** No coordinator → no PSBT is built → no spend → the cap
trivially holds. You lose the ability to *refill* (a liveness gap, §1.5), never control of funds.
### 1.5 No single point of failure — and the one you must engineer around
- **Keys:** 2-of-3 across independent failure domains. Survives losing any one signer.
- **Global velocity counter:** **derived from the blockchain**, not a single-host file (see §6.2). Any
coordinator replica on any host recomputes the same number from chain history → no single ledger to lose or
tamper.
- **Coordinator liveness:** a single coordinator is a *liveness* SPOF (if it's down you can't refill). Run it
**replicated across the same independent hosts as the signers** so any replica can drive a refill. Because
the counter is chain-derived and the cap is bounded by hardware, replicas need no shared state and a rogue
replica can't exceed the hardware bound.
---
## 2. Architecture & topology
```
hot wallet (MPC / your existing engine) ── monitored ──┐
                   ▲                                   │
                   │ signed refill broadcast           │ "balance below floor"
                   │                                   ▼
┌──────────────────┴───────────────────────────────────────────────┐
│  KEYLESS COORDINATOR (replicated; holds NO keys)                 │
│ • watches hot-wallet floor   • builds PSBT from watch-only wallet│
│ • global velocity cap (chain-derived)   • fans to any 2 of 3     │
│ • combines → finalizes → broadcasts                              │
└───────┬───────────────────┬───────────────────┬──────────────────┘
        │ tailnet           │ tailnet           │ tailnet
┌───────▼──────┐    ┌───────▼──────┐    ┌───────▼──────┐
│ signer host A│    │ signer host B│    │ signer host C│  ← independent failure domains
│ Coldcard +   │    │ Coldcard +   │    │ Coldcard +   │    (power / switch / site)
│ signer-agent │    │ signer-agent │    │ signer-agent │    one ideally OFFSITE
│ HSM policy A │    │ HSM policy B │    │ HSM policy C │
└──────────────┘    └──────────────┘    └──────────────┘
  keys live ONLY on the secure elements; agents hold none

  watch-only 2-of-3 descriptor wallet
  on a Bitcoin full node
```
**Components:**
- **Signer host (×3)** — a small machine (NUC/Pi/VM) with a USB-attached Coldcard, running a **signer agent**:
a thin authenticated tailnet service wrapping `ckcc-protocol`. Receives `{psbt, wallet_id}`, ensures its
device is HSM-started with the 2-of-3 registered, signs, returns `partial_psbt` or `denied(reason)`.
**Holds no keys.** **[REF]** the reference deployment runs all three as segregated `--mk5 --hsm` Coldcard
signers, each on its own unix socket.
- **Coordinator** — builds PSBTs, enforces the global cap, fans out 2-of-3, combines/broadcasts, exposes the
control surface. **[REF]** `orchestrator.py` (`multisig-orchestrator`, `:8099`).
- **Watch-only wallet** — a `bitcoind` descriptor wallet tracking the 2-of-3 descriptor. **Reuse, don't
build.** **[REF]** a watch-only wallet on a signet node.
- **Bitcoin node** — provides the watch-only wallet, builds PSBTs, broadcasts, and is the source of truth for
the velocity counter. **[REF]** a signet node (RPC over the tailnet).
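
The signer-agent contract described above (`{psbt, wallet_id}` in, `partial_psbt` or `denied(reason)` out) can be sketched as two small message types. This is an illustrative shape only: the class names, `DeviceRefused`, and the injected `device_sign` callable are hypothetical stand-ins, not the reference agent's code or a `ckcc-protocol` API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SignRequest:
    psbt: str       # base64 unsigned PSBT sent by the coordinator
    wallet_id: str  # which registered 2-of-3 wallet to sign under


@dataclass(frozen=True)
class SignResponse:
    partial_psbt: Optional[str] = None   # set when the device signed
    denied_reason: Optional[str] = None  # set when the device refused


class DeviceRefused(Exception):
    """Hypothetical error raised by the device wrapper when policy rejects a PSBT."""


def handle_sign(req: SignRequest,
                device_sign: Callable[[str, str], str]) -> SignResponse:
    # device_sign stands in for the real call into the Coldcard; the agent
    # itself holds no keys and only relays the device's answer.
    try:
        return SignResponse(partial_psbt=device_sign(req.psbt, req.wallet_id))
    except DeviceRefused as exc:
        return SignResponse(denied_reason=str(exc))
```

Keeping the refusal reason in the response lets the coordinator log exactly why a device said no, which is what the monitoring in §8 relies on.
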
---
## 3. Failure-domain placement (the make-or-break)
This is the single most important deployment decision, and the reasoning is concrete: nominally independent
hosts can still fail *together* — a shared power feed or UPS, a common chassis or drive batch, the same
hypervisor or network switch, or correlated hardware faults. **If two of three keys sit in the same failure
domain, a single event can take both down and freeze the treasury.** Therefore:
- The three signers **MUST** sit in **independent failure domains** — different physical hosts, ideally
different power circuits / UPS / network switches.
- **At least one signer should be genuinely offsite** (a small physical box over Tailscale). A cloud VPS
cannot host a USB Coldcard, but a Pi/NUC at a second location can be signer #3.
- Spread the coordinator replicas across those same domains.
- Do **not** co-locate two signers on hosts that share a single point of failure (same PSU, same switch, same
rack PDU, same hypervisor). Quorum HA is worthless if one event takes out two keys.
> **[REF] note:** the reference deployment runs all three signers on one host for convenience — that has **no**
> failure-domain independence and is for functional validation. Production uses three independent hosts (above);
> never hold mainnet value on a single host that runs more than one signer.
---
## 4. Prerequisites & bill of materials
**[PROD] hardware:**
- **3× Coldcard Mk5** (or Q) — the current device; dual secure elements (ATECC608 + DS28C36B). HSM mode +
multisig (P2SH/P2WSH) co-signing are supported in firmware.
- **3× signer hosts** in independent failure domains (one offsite), each with a free USB port.
- **Steel backups** for 3 seeds + a durable record of the wallet descriptor.
**Software / services:**
- A **Bitcoin full node** (watch-only-capable; descriptor wallets). Mainnet for production; signet for the lab.
- **Tailscale** on every signer host + coordinator (signer agents are RPC *clients* — they don't bind the
tailnet IP, so the bind-race gotcha doesn't apply).
- `ckcc-protocol` (the `ckcc` CLI / Python lib) on each signer host.
- The coordinator + signer-agent software (`orchestrator.py` is the reference coordinator).
- (Optional) **CKBunker** as a human break-glass UI, kept **off** the automated critical path.
**Skills:** comfortable with bitcoind RPC, descriptors/PSBT, Coldcard HSM mode, and Linux service ops.
---
## 5. Initial setup
> ⚠️ **Do a complete dry-run on signet or testnet first** (the lab does exactly this). Only move to mainnet
> once you have rehearsed a refill, a failover, a policy change, and a full restore from backup.
### Step 1 — Generate three independent seeds
Generate a **distinct** seed on **each** Coldcard (never clone one seed to three devices — that defeats the
whole model). Record each 24-word seed to steel and store the three **geographically separated**. Note each
device's master fingerprint + the BIP-48 account xpub (`m/48'/0'/0'/2'` mainnet; `m/48'/1'/0'/2'` signet).
### Step 2 — Build the 2-of-3 descriptor + watch-only wallet
Construct `wsh(sortedmulti(2, key1, key2, key3))` from the three `[fingerprint/48h/0h/0h/2h]xpub` keys
(receive `/0/*` and change `/1/*` branches). Create a **watch-only** descriptor wallet on the node and import
both branches (`importdescriptors`, private keys disabled, internal=true for the change branch). Record the
descriptor durably — see §9, losing it makes recovery painful even with all three seeds.
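
A minimal sketch of Step 2, assuming you already hold the three fingerprints and account xpubs. The helper names are hypothetical, and the key values in the usage note are placeholders. `importdescriptors` expects each descriptor to carry its checksum; one way to obtain it is the node's `getdescriptorinfo` RPC, which this sketch does not call.

```python
def multisig_descriptor(keys, branch, threshold=2):
    """Build a wsh(sortedmulti(...)) descriptor string.

    keys   : list of (fingerprint, origin_path, xpub) tuples, one per Coldcard
    branch : 0 for the receive branch, 1 for the change branch
    """
    parts = [f"[{fp}/{path}]{xpub}/{branch}/*" for fp, path, xpub in keys]
    return f"wsh(sortedmulti({threshold},{','.join(parts)}))"


def import_request(receive_desc, change_desc, timestamp="now", addr_range=(0, 999)):
    """Build the request list for bitcoind's importdescriptors RPC.

    Both descriptors must already include their '#checksum' suffix.
    Use an earlier timestamp than "now" if the wallet has prior history.
    """
    common = {"active": True, "range": list(addr_range), "timestamp": timestamp}
    return [
        {"desc": receive_desc, "internal": False, **common},
        {"desc": change_desc, "internal": True, **common},
    ]
```

For example, `multisig_descriptor([("aaaaaaaa", "48h/0h/0h/2h", "xpubA"), ...], 0)` yields the receive-branch descriptor; the same call with `branch=1` yields the change branch. Store both strings with the durable descriptor record.
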
### Step 3 — Register the 2-of-3 wallet on EACH Coldcard
Each device must have the multisig wallet **registered** so it recognises change back to the wallet as
internal (and doesn't mistake it for an external send that the whitelist/velocity would block). Export the
Coldcard multisig config and `ckcc upload -m <wallet.txt>` to each device (confirm on-device).
### Step 4 — Author the HSM policy (per device)
Policy is **JSON, versioned in this repo**, deployed to each device. Minimum viable policy (see §6 for the
full reference and the diverse-policy option):
```json
{
  "must_log": true,
  "period": 60,
  "msg_paths": ["any"],
  "rules": [
    { "max_amount": 8000,
      "per_period": 33333,
      "wallet": "ckms23",
      "whitelist": ["<hot-wallet deposit address 1>", "<...>"] }
  ]
}
```
(`per_period` here is the **per-device** ceiling `V`; see §6.3 for why `V = ⅔ × global cap`.)
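
Because the policy is versioned JSON, it is cheap to lint it before it goes anywhere near a device. The sketch below checks only the fields shown in the example above against this manual's own rules; `lint_policy` is a hypothetical helper, not part of `ckcc-protocol`, and it does not replace the on-device confirmation.

```python
import json


def lint_policy(policy: dict) -> list:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if not policy.get("must_log"):
        problems.append("must_log should be true so every decision is logged")
    if not isinstance(policy.get("period"), int) or policy["period"] <= 0:
        problems.append("period must be a positive integer")
    rules = policy.get("rules") or []
    if not rules:
        problems.append("no rules defined")
    for i, rule in enumerate(rules):
        if not rule.get("wallet"):
            problems.append(f"rule {i}: wallet is required so change is internal")
        whitelist = rule.get("whitelist") or []
        if not whitelist:
            problems.append(f"rule {i}: whitelist is empty")
        if any(addr.startswith("<") for addr in whitelist):
            problems.append(f"rule {i}: whitelist still contains a placeholder")
        max_amount, per_period = rule.get("max_amount"), rule.get("per_period")
        if max_amount is None or per_period is None:
            problems.append(f"rule {i}: max_amount and per_period are both required")
        elif max_amount > per_period:
            problems.append(f"rule {i}: max_amount exceeds per_period")
    return problems
```

Run as written, the example policy above would be flagged for its placeholder whitelist entries, which is the intended reminder to fill in real hot-wallet addresses before deployment.
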
### Step 4b — Enrol the TOTP "owner" (only if you use the surge tier)
If your policy has a TOTP-gated **surge tier** (§6.2b — a rule with `users:["owner"], min_users:1`), enrol the
`owner` TOTP user on **each** Coldcard **before** loading the policy. All three signers must hold the **same**
secret (one authenticator code has to work for whichever two devices sign), so the device-picks-its-own-QR
path does **not** apply here — a shared secret has to be generated once and loaded onto all three.
> 🔑 **Who generates it matters (production vs demo).** **Production:** *the owner* generates the secret
> (in their authenticator app, or offline) and loads it onto each Coldcard **directly over USB during setup** —
> **never through the coordinator**. Then a fully-compromised coordinator can't mint surge codes; at spend time
> it only **relays** the owner's live 6-digit code (it never needs the secret). **Demo only:** the coordinator
> generates the secret + shows the enrolment QR for convenience — acceptable on signet with no real funds, but
> it raises a coordinator-compromise blast radius from tier-1 up to the surge ceiling, so don't do it in prod.
Mechanically: `create_user("owner", USER_AUTH_TOTP, <shared-secret>)` on each device (standard RFC-6238, so any
authenticator app works). The secret must persist so re-arming a rebooted signer re-enrols the same value.
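
To see why one shared secret gives one code that works on whichever two devices sign, here is a standard-library sketch of RFC 6238 TOTP (HMAC-SHA1, 30-second step). It assumes the secret is handled in base32, the usual authenticator-app encoding; how the secret is encoded when loaded onto the Coldcard is not covered here.

```python
import base64
import hashlib
import hmac
import struct


def totp(secret_b32: str, unix_time: int, digits: int = 6, step: int = 30) -> str:
    """Compute an RFC 6238 TOTP code for the given time."""
    key = base64.b32decode(secret_b32.upper())
    counter = struct.pack(">Q", unix_time // step)       # 8-byte big-endian time step
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                            # dynamic truncation (RFC 4226)
    value = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)
```

The code depends only on the secret and the clock, so any party holding the same secret derives the same six digits in the same 30-second window. That is also why the secret must never sit on the coordinator in production.
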
### Step 5 — Load policy + start HSM mode (MIND THE ORDERING)
> 🪤 **Sharp edge (ckbunker issue #12):** loading an HSM policy can *delete a registered multisig wallet* on
> the Coldcard. **Order: register the multisig wallet (Step 3) → enrol the TOTP user (Step 4b, if any) → load
> the policy → verify the wallet still exists on the device.** Re-check `hsm_status` and the registered-wallet
> count after every policy load.
Load the policy and enter HSM mode on each device (the on-device approval is two-step: confirm, then a random
digit to save). HSM mode is a **one-way trip** until reboot — design for it.
### Step 6 — Configure the coordinator
- **Global velocity cap `G`** — the authoritative total-per-period (chain-derived; §6.2).
- **Per-device ceiling `V = ⅔ × G`** in each device policy (§6.3), and enable **round-robin** of the signer
pair so rotation stays even.
- **Whitelist** = your hot-wallet deposit addresses (anonymous/untrusted callers can never add to it).
- **Refill trigger** — the hot-wallet floor that starts a refill, and the refill amount.
- The coordinator's HMAC session secret lives **only on the host** (`chmod 600`, **never in git**).
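
The round-robin of the signer pair can be sketched with `itertools`. This is one simple way to keep rotation even, under the assumption that signers are identified by name; the function names are hypothetical and `orchestrator.py` may do it differently.

```python
from collections import Counter
from itertools import combinations, cycle, islice


def pair_rotation(signers):
    """Cycle forever through every 2-signer combination, in a fixed order."""
    return cycle(combinations(signers, 2))


def next_pair(rotation, online, attempts=3):
    """Advance the rotation until a pair with both signers online is found.

    attempts defaults to 3, the number of distinct pairs among three signers.
    Returns None when no pair is fully online (quorum lost).
    """
    for _ in range(attempts):
        pair = next(rotation)
        if all(name in online for name in pair):
            return pair
    return None


def signing_load(signers, spends):
    """Count signatures contributed by each signer over `spends` refills."""
    rotation = pair_rotation(signers)
    return Counter(name for pair in islice(rotation, spends) for name in pair)
```

Over any three consecutive refills each device signs exactly twice, so all three per-device counters advance at the same rate and none reaches `V` ahead of the others.
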
### Step 7 — Verify before funding
Quorum check: **≥2 of 3** signers online **and** HSM-active **and** wallet-registered. Then, on signet/testnet:
run a within-policy refill (expect sign+broadcast), an over-cap spend (expect on-device refusal), an off-list
spend (expect refusal), and a velocity-exceeding burst (expect the coordinator to block at `G`). Confirm the
chain-derived counter increments on broadcast and the on-device refusal reasons read correctly.
---
## 6. Policy reference
### 6.1 The four on-device gates (tamper-proof, per device)
| Gate | Policy field | What it does | Notes |
|---|---|---|---|
| Per-txn cap | `max_amount` | refuses any single spend over the cap | sats |
| Velocity | `per_period` + top-level `period` | refuses once this device's signed total in the window exceeds it | **per-device**, counted locally — see §6.3 |
| Whitelist | `whitelist: [addr…]` | external outputs must be on this list; change back to `wallet` is exempt | the strongest control — confines *where* funds can go |
| Message paths | `msg_paths` | which derivation paths may sign messages (`["any"]` or specific) | proof-of-control without moving funds |
| (binding) | `wallet` | names the registered multisig so change is recognised as internal | required, or change trips the whitelist |
| (audit) | `must_log: true` | device writes a log entry per decision | feed into monitoring (§8) |
### 6.2 The coordinator global velocity cap (the authoritative limit)
On-device velocity counters **drift** under a rotating "any 2 of 3": each device counts only what *it* signed,
and it charges its counter the moment it signs, **even if the PSBT is later dropped and never broadcast**.
So no single device's counter reflects true global outflow. The coordinator therefore enforces the real cap,
and:
- it counts **only real broadcasts** — so dropped/refused PSBTs never burn budget;
- it **includes mempool (0-conf) sends** — they're counted the instant they broadcast, so a burst of spends
inside one ~10-min block interval can't slip past (abandoned/conflicted txns are excluded so a dropped tx
doesn't permanently burn budget);
- it is **chain-derived** — computed from the watch-only wallet's on-chain `send` total over the window
(`listtransactions`), not a single-host file, so any replica recomputes it and there is no ledger SPOF.
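
The three properties above can be sketched as a pure function over `listtransactions`-shaped entries. It assumes the entries have already been fetched from the watch-only wallet, that the window is supplied in seconds, and that only the `amount` field (not `fee`) counts toward the cap; the function names are hypothetical.

```python
SATS_PER_BTC = 100_000_000


def window_outflow_sats(entries, now, period_seconds):
    """Sum wallet sends inside the window, from listtransactions-style dicts."""
    cutoff = now - period_seconds
    total = 0
    for entry in entries:
        if entry.get("category") != "send":
            continue                      # receives never count against the cap
        if entry.get("abandoned"):
            continue                      # abandoned txns must not burn budget
        if entry.get("confirmations", 0) < 0:
            continue                      # negative confirmations = conflicted
        if entry.get("time", 0) <= cutoff:
            continue                      # older than the velocity window
        # confirmations == 0 (mempool) is deliberately counted
        total += round(abs(entry["amount"]) * SATS_PER_BTC)
    return total


def within_cap(spent_sats, amount_sats, global_cap_sats):
    """True when a new spend keeps the window total at or below G."""
    return spent_sats + amount_sats <= global_cap_sats
```

Because the function takes wallet history as its only state, two coordinator replicas looking at the same node compute the same number without sharing anything.
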
### 6.2b Surge tier — a TOTP-gated higher tier (human in the loop)
For occasional large moves, add a **second, ordered rule** that permits a higher `max_amount` + `per_period`
(and/or extra whitelist) but requires the owner's **TOTP** (`users:["owner"], min_users:1`). Coldcard
evaluates rules **first-match**, so routine spends match the automated tier-1 and a larger one falls through to
tier-2 and is refused unless a valid code is presented. **Secure model:** the owner reads the 6-digit code from
a normal authenticator app (standard RFC-6238); the **keyless coordinator only relays it** (`user_auth`) — the
secret lives on the devices + the owner's app, never on the coordinator, so a compromised coordinator can't
mint codes. The coordinator's global cap gets a matching surge ceiling. **Sizing:** the surge ceilings raise
the §1.4 blast-radius bound *when a code is present* — set them to the most you'd ever authorise in one
human-approved move, and treat entering a code as approving a spend up to the surge cap. *(Live for signed-in
users on the reference deployment; exercised end-to-end on real signet.)*
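
The tiering described above can be modelled in a few lines. This is a toy model of the behaviour as this section describes it, not Coldcard firmware logic: the tier-2 numbers are hypothetical, `totp_ok` stands in for a valid owner code with `min_users: 1`, and velocity is simplified to a single running total.

```python
def first_passing_rule(rules, amount, dest, spent, totp_ok=False):
    """Return the index of the first rule that allows the spend, else None."""
    for index, rule in enumerate(rules):
        if amount > rule["max_amount"]:
            continue                       # over this tier's per-txn cap
        if spent + amount > rule["per_period"]:
            continue                       # would exceed this tier's velocity
        if dest not in rule["whitelist"]:
            continue                       # destination not allowed in this tier
        if rule.get("users") and not totp_ok:
            continue                       # tier needs an owner code
        return index
    return None


HOT = "hot-wallet-address"  # placeholder, not a real address
RULES = [
    {"max_amount": 8000, "per_period": 33333, "whitelist": [HOT]},
    {"max_amount": 100000, "per_period": 200000, "whitelist": [HOT],
     "users": ["owner"], "min_users": 1},
]
```

With these rules a 5,000-sat refill matches tier 1 unattended, while a 50,000-sat move is refused unless the owner's code is presented, in which case it matches tier 2.
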
### 6.3 Sizing — the 1.5× rule (do not skip this)
A compromised coordinator that ignores `G` is bounded by the hardware ceilings. With any-2-of-3, worst-case
extractable = **`1.5 × V`** (each spend burns two signatures for one unit of value; `3V ÷ 2`). Therefore:
- **Set each device's `per_period` `V = ⅔ × G`.** Then worst-case `1.5 × ⅔G = G` — the hardware bound equals
your intended global velocity.
- **Round-robin the signer pair** so rotation is even; otherwise busy devices hit `⅔G` early and honest
refills stall (raising `V` for liveness margin pushes the worst case back above `G`).
- **Topology dial:**
- *any-2-of-3 auto-signers* → worst case **1.5×**, but "lose any one, keep running" unattended.
- *2 fixed auto-signers + 1 human break-glass (offline)* → every routine spend needs both online devices, so
`V ≥ G`, but the offline key can't be used unattended → worst case **1.0×**, at the cost of unattended
failover. Choose deliberately.
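
The sizing arithmetic is small enough to check in code. The sketch below uses a hypothetical `G` of 50,000 sats, chosen only because it reproduces the `per_period` of 33333 in the Step 4 example; the function and topology names are illustrative.

```python
def per_device_ceiling(global_cap_sats: int) -> int:
    """V = two thirds of G, rounded down so the hardware bound never exceeds G."""
    return (2 * global_cap_sats) // 3


def worst_case_extractable(v_sats: int, topology: str = "any-2-of-3") -> float:
    """Most value a coordinator that ignores G can move before hardware stops it."""
    if topology == "any-2-of-3":
        # 3 devices x V of signing budget, 2 signatures burned per spend
        return 3 * v_sats / 2
    if topology == "2-fixed-plus-break-glass":
        # both online devices sign every spend, so the bound is V itself
        return float(v_sats)
    raise ValueError(f"unknown topology: {topology}")
```

With `G = 50000`, `per_device_ceiling` gives `V = 33333` and the any-2-of-3 worst case is 49,999.5 sats, just under `G`. In the fixed-pair topology you would set `V = G` and the worst case is exactly `G`.
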
---
## 7. Day-to-day operations
**Automated refill (the normal path, no human):** hot wallet drops below the floor → coordinator builds a PSBT
to a whitelisted address from the watch-only wallet → checks the global cap → fans to 2 of 3 → devices
auto-sign under policy → combine → broadcast. Watch it confirm on your explorer.
**Change a policy value (cap / velocity / period):** edit the versioned policy, push to each device, **re-arm**
all signers. Re-arming restarts the HSM session and **resets the per-device velocity counters** — expected.
Re-verify the registered wallet survived (§5 Step 5 trap).
**Add a whitelisted destination:** add the address to the policy `whitelist`, push, re-arm. Only the operator
can do this; untrusted callers can never extend the whitelist.
**Take a signer down for maintenance:** quorum tolerates **one** down — the other two keep signing. Bring it
back, re-arm it (see **Boot-to-signing-ready** below), confirm quorum is 3 again. **Never** take a second one down while
one is already offline (that freezes spending until one returns).
**Boot-to-signing-ready:** a Coldcard needs PIN + HSM-mode entry after any power loss. Unattended operation
means the signer agent must restore the device to signing-ready automatically after a reboot, and monitoring
must confirm it did — a device that silently fails to return erodes quorum. Rehearse: reboot a signer host and
confirm the agent re-arms it and quorum self-heals.
---
## 8. Monitoring & alerting (non-negotiable for unattended operation)
Wire these into your observability stack (e.g. Loki + Grafana + alerting):
- **Quorum health** — are **≥2 of 3** signers online **and** HSM-active **and** wallet-registered? **Alert the
moment it drops to exactly 2** (one more failure = frozen).
- **Velocity near limit** — global cap approaching for the period; per-device counters near `V`.
- **Policy denials** — every on-device refusal (the `must_log` trail) → alert; a spike may signal an attack or
a misconfiguration.
- **USB / device health** — VMs surviving a host crash can carry latent USB/udev damage; don't repeat-restart,
restore from backup.
- **Refill anomalies** — refills outside expected cadence/size (a compromised coordinator's tell).
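
The quorum-health check reduces to counting signers that are simultaneously online, HSM-active and wallet-registered. A minimal sketch, assuming each signer agent can report those three booleans; the type, field names and state labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SignerStatus:
    name: str
    online: bool
    hsm_active: bool
    wallet_registered: bool

    @property
    def ready(self) -> bool:
        # A signer only counts toward quorum when all three conditions hold.
        return self.online and self.hsm_active and self.wallet_registered


def quorum_state(statuses) -> str:
    """Classify quorum: 'healthy' (3 ready), 'degraded' (2), 'frozen' (fewer)."""
    ready = sum(1 for s in statuses if s.ready)
    if ready >= 3:
        return "healthy"
    if ready == 2:
        return "degraded"   # still signing, but one more failure freezes spending
    return "frozen"
```

Alert on `degraded` as well as `frozen`: a signer that is online but has lost its HSM session or registered wallet is just as unavailable as one that is powered off.
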
---
## 9. Backup, recovery & DR
- **3 seeds**, each to steel, **geographically separated**. 2-of-3 survives losing **one** seed.
- **The descriptor is load-bearing.** Losing it makes recovery painful **even with all three seeds** — store
the wallet descriptor offsite, independently of the seeds (in your vault).
- **Rehearse recovery** before funding mainnet: reconstruct the watch-only wallet from the descriptor on a
fresh node, recover a signer from seed, re-register the multisig, reload policy, sign a test spend.
- **Coordinator state is disposable** — it's chain-derived; a replacement coordinator recomputes the velocity
counter from chain history with zero handed-over state.
---
## 10. Incident response & break-glass
| Situation | Response |
|---|---|
| One signer down | Operate on the remaining two; restore + re-arm the third; do not drop a second. |
| Two signers down | Spending is **frozen (safe)**. Restore one to resume. This is the design working. |
| Coordinator compromised suspected | Funds are bounded by §1.4 (whitelist + `1.5V`). Rotate the coordinator host/secret; the hardware caps already contain the blast radius. Review the policy-denial + refill logs. |
| Large / exception spend needed | Use the **human break-glass** path (3rd human-held key / CKBunker), outside the automated policy. |
| Suspected key compromise (one device) | One key alone moves nothing. Rotate to a fresh 2-of-3 (new seeds + descriptor), sweep funds under the old policy to the new wallet. |
---
## 11. Known sharp edges (read before production)
- **USB passthrough pins a VM to its host** — a VM with a physical USB Coldcard **cannot live-migrate**. These
signer VMs deliberately break the "freely migratable" model; the multisig *is* the HA.
- **HSM + multisig together is advanced / lightly-charted** — soak on signet for a long time before mainnet.
- **CKBunker is niche** (v0.9.1, "at your own risk") — keep it as the human break-glass surface only; the
automated path is the signer-agent over `ckcc-protocol`, not CKBunker.
- **The #12 ordering trap** (§5 Step 5) — register wallet → load policy → verify wallet survived.
- **Velocity counters don't compose across devices** — that's the whole reason the authoritative cap is the
coordinator's chain-derived one; per-device velocity is the hardware bound, not the operational truth.
- **Post-host-crash latent damage** — a signer VM that survived a host hard-crash can carry subclinical FS/USB
damage; restore from backup rather than repeat-restarting.
---
## 12. Regulatory note
Running this for **your own** funds (treasury / your own refill tier) is **not** custody-of-others. Offering it
as **custody-as-a-service for third parties** is **regulated activity in AU (AUSTRAC / financial services)**;
get legal advice before productising. Same flag as a regulated swap service.
---