multisig-hsm/docs/OPERATOR-MANUAL.md

# Operator's Manual — mineracks distributed policy-enforced multisig HSM

**A 2-of-3 threshold HSM that auto-signs cold→hot refills under on-device policy, with no human in the loop.**

This manual is everything you need to set the system up and **operate it safely**. Read §1–§3 before you
touch anything. The safety model is not optional context: it is the reason the design is shaped the way it
is, and operating the system without understanding it is how funds get lost.

| | |
|---|---|
| **Status** | A **live reference deployment** runs at `multisighsm.com` on real Bitcoin signet — every spend is a genuine on-chain 2-of-3 co-signature enforced by on-device policy. This manual documents both the **reference configuration** (**[REF]**) and the **production hardware procedure** (**[PROD]**). |
| **Audience** | The operator running the treasury / refill tier (you), and a reviewer evaluating it. |
| **Companion docs** | Design & market rationale: [`README.md`](./README.md) · live demo: `multisighsm.com` |
| **Scope** | The **cold/warm tier and the cold→hot refill pipe** — low-throughput, high-stakes. **NOT** your high-TPS hot-wallet signer (that's MPC's job — see §1.2). |

---

## 1. Read this first — the security model

### 1.1 What the system is

Three independent **Coldcard** hardware signers, each in **HSM mode** under its own spending policy, each on a
**physically separate host** (ideally one genuinely offsite). A **keyless coordinator** watches your hot
wallet, builds a refill transaction when it runs low, fans the unsigned PSBT to **any two** of the three
signers, they **auto-sign under their on-device policy with no human**, and the coordinator combines,
finalizes and broadcasts. Any 2 of 3 can move funds; lose any one signer and you keep operating; lose two and
funds are **frozen safe**.

### 1.2 What it is NOT

It is **not a hot-wallet signing engine.** Coldcard signing takes seconds per PSBT and is not built for the
hundreds to thousands of signatures per hour that a busy exchange hot wallet produces, and it carries no FIPS
certification or PKCS#11 interface of the kind an auditor or insurer will expect for the primary hot engine. **Keep using an MPC platform for
the hot wallet.** This system secures the **95% of reserves behind it** and the **refill pipe** between cold
and hot — where automation with hardware keys you physically hold beats both a manual 3am ceremony and a
custodian.

### 1.3 The three enforcement layers (and which one you actually trust)

| Layer | Where it runs | Enforces | Trust property |
|---|---|---|---|
| **On-device policy** | Each Coldcard secure element | per-txn cap · velocity · **address whitelist** · message-signing paths | **Tamper-proof. This is the safety floor.** A compromised host cannot lift it. |
| **Coordinator global velocity cap** | The keyless coordinator (software) | the *authoritative* total-per-period across all signers | **Operational, not safety.** Precise day-to-day limit; bypassable if the coordinator is compromised — but bounded by the layer above. |
| **Quorum (2-of-3)** | The protocol | no single signer can move funds; no single signer outage freezes funds | Structural. |

> **The golden rule: the coordinator may *limit*, but only the hardware may *bound*.** Size the on-device
> limits as your real safety envelope; treat the coordinator cap as a tighter operational convenience.

### 1.4 What a compromise can and cannot do

**Coordinator fully compromised (worst realistic software breach):**
- It **holds no keys** → cannot sign or forge. Every spend still needs **two Coldcards** to each pass their
  own on-device whitelist + cap + velocity.
- It **cannot redirect funds off the whitelist** → at worst it prematurely refills *your own* hot-wallet
  addresses, never an attacker's address.
- It **cannot exceed the devices' velocity ceilings.** That ceiling is the true catastrophic bound.
- **Blast radius = 1.5 × the per-device velocity ceiling `V`** (with any-2-of-3): every spend burns two
  signatures for one unit of value, so 3 devices × `V` ÷ 2 sigs = `1.5V` extractable before the hardware
  freezes it. **Size accordingly — see §6.3.**
- *Residual:* a compromised coordinator **plus** a compromised hot wallet could drain up to `1.5V` into the
  (whitelisted) hot wallet and out. That's why `V` lives on the secure element, not in software.
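
The `1.5V` figure can be checked with a small simulation. This is a model under stated assumptions, not firmware behaviour: each device tracks only the value it co-signed, every spend debits exactly two devices, and fees are ignored. The name `max_extractable` is illustrative.

```python
def max_extractable(ceilings, unit=1):
    """Drain value in `unit` steps, always using the two devices with the most budget left.

    `ceilings` holds each device's per-period ceiling in sats.
    Returns the total value moved before fewer than two devices can still sign.
    """
    left = list(ceilings)
    total = 0
    while True:
        # Rank devices by remaining budget, largest first.
        order = sorted(range(len(left)), key=lambda i: left[i], reverse=True)
        a, b = order[0], order[1]
        if left[b] < unit:
            break  # the second-best device is exhausted, so no pair can sign
        left[a] -= unit
        left[b] -= unit
        total += unit
    return total


V = 1000
print(max_extractable([V, V, V]))  # three auto-signers, any 2 of 3
print(max_extractable([V, V, 0]))  # two fixed auto-signers, third key offline
```

With three equal ceilings the attacker can keep rotating pairs until all three budgets are spent, which yields `3V / 2`. With a fixed pair the total stops at `V`.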

**One or more signer hosts compromised:**
- Owning **one** signer host moves nothing (need two; the device still enforces its own policy and holds the
  key in its secure element — the host can't extract it).
- Owning **two** signer hosts: an attacker can produce two signatures, but each device **still enforces its
  policy** — so spends are still bounded by per-txn cap + velocity and confined to the whitelist. To steal,
  an attacker would need to defeat **two independent secure elements' policies**, not two Linux hosts.

**Coordinator offline:** **fail-safe for safety.** No coordinator → no PSBT is built → no spend → the cap
trivially holds. You lose the ability to *refill* (a liveness gap, §1.5), never control of funds.

### 1.5 No single point of failure — and the one you must engineer around

- **Keys:** 2-of-3 across independent failure domains. Survives losing any one signer.
- **Global velocity counter:** **derived from the blockchain**, not a single-host file (see §6.2). Any
  coordinator replica on any host recomputes the same number from chain history → no single ledger to lose or
  tamper.
- **Coordinator liveness:** a single coordinator is a *liveness* SPOF (if it's down you can't refill). Run it
  **replicated across the same independent hosts as the signers** so any replica can drive a refill. Because
  the counter is chain-derived and the cap is bounded by hardware, replicas need no shared state and a rogue
  replica can't exceed the hardware bound.

---

## 2. Architecture & topology

```
        hot wallet (MPC / your existing engine)  ── monitored ──┐
                      ▲                                          │
                      │ signed refill broadcast                  │ "balance below floor"
                      │                                          ▼
   ┌──────────────────┴───────────────────────────────────────────────┐
   │  KEYLESS COORDINATOR  (replicated; holds NO keys)                  │
   │   • watches hot-wallet floor   • builds PSBT from watch-only wallet│
   │   • global velocity cap (chain-derived)  • fans to any 2 of 3      │
   │   • combines → finalizes → broadcasts                              │
   └───────┬───────────────────┬───────────────────┬───────────────────┘
           │ tailnet           │ tailnet           │ tailnet
   ┌───────▼──────┐    ┌───────▼──────┐    ┌────────▼─────┐
   │ signer host A│    │ signer host B│    │ signer host C│   ← independent failure domains
   │ Coldcard +   │    │ Coldcard +   │    │ Coldcard +   │     (power / switch / site)
   │ signer-agent │    │ signer-agent │    │ signer-agent │     one ideally OFFSITE
   │ HSM policy A │    │ HSM policy B │    │ HSM policy C │
   └──────────────┘    └──────────────┘    └──────────────┘
           keys live ONLY on the secure elements; agents hold none
                              │
                   watch-only 2-of-3 descriptor wallet
                       on a Bitcoin full node
```

**Components:**
- **Signer host (×3)** — a small machine (NUC/Pi/VM) with a USB-attached Coldcard, running a **signer agent**:
  a thin authenticated tailnet service wrapping `ckcc-protocol`. Receives `{psbt, wallet_id}`, ensures its
  device is HSM-started with the 2-of-3 registered, signs, returns `partial_psbt` or `denied(reason)`.
  **Holds no keys.** **[REF]** the reference deployment runs all three as segregated `--mk5 --hsm` Coldcard
  signers, each on its own unix socket.
- **Coordinator** — builds PSBTs, enforces the global cap, fans out 2-of-3, combines/broadcasts, exposes the
  control surface. **[REF]** `orchestrator.py` (`multisig-orchestrator`, `:8099`).
- **Watch-only wallet** — a `bitcoind` descriptor wallet tracking the 2-of-3 descriptor. **Reuse, don't
  build.** **[REF]** a watch-only wallet on a signet node.
- **Bitcoin node** — provides the watch-only wallet, builds PSBTs, broadcasts, and is the source of truth for
  the velocity counter. **[REF]** a signet node (RPC over the tailnet).

---

## 3. Failure-domain placement (the make-or-break)

This is the single most important deployment decision, and the reasoning is concrete: nominally independent
hosts can still fail *together* — a shared power feed or UPS, a common chassis or drive batch, the same
hypervisor or network switch, or correlated hardware faults. **If two of three keys sit in the same failure
domain, a single event can take both down and freeze the treasury.** Therefore:

- The three signers **MUST** sit in **independent failure domains** — different physical hosts, ideally
  different power circuits / UPS / network switches.
- **At least one signer should be genuinely offsite** (a small physical box over Tailscale). A cloud VPS
  cannot host a USB Coldcard, but a Pi/NUC at a second location can be signer #3.
- Spread the coordinator replicas across those same domains.
- Do **not** co-locate two signers on hosts that share a single point of failure (same PSU, same switch, same
  rack PDU, same hypervisor). Quorum HA is worthless if one event takes out two keys.

> **[REF] note:** the reference deployment runs all three signers on one host for convenience — that has **no**
> failure-domain independence and is for functional validation. Production uses three independent hosts (above);
> never hold mainnet value on a single host that runs more than one signer.

---

## 4. Prerequisites & bill of materials

**[PROD] hardware:**
- **3× Coldcard Mk5** (or Q) — the current device; dual secure elements (ATECC608 + DS28C36B). HSM mode +
  multisig (P2SH/P2WSH) co-signing are supported in firmware.
- **3× signer hosts** in independent failure domains (one offsite), each with a free USB port.
- **Steel backups** for 3 seeds + a durable record of the wallet descriptor.

**Software / services:**
- A **Bitcoin full node** (watch-only-capable; descriptor wallets). Mainnet for production; signet for the lab.
- **Tailscale** on every signer host + coordinator (signer agents are RPC *clients* — they don't bind the
  tailnet IP, so the bind-race gotcha doesn't apply).
- `ckcc-protocol` (the `ckcc` CLI / Python lib) on each signer host.
- The coordinator + signer-agent software (`orchestrator.py` is the reference coordinator).
- (Optional) **CKBunker** as a human break-glass UI, kept **off** the automated critical path.

**Skills:** comfortable with bitcoind RPC, descriptors/PSBT, Coldcard HSM mode, and Linux service ops.

---

## 5. Initial setup

> ⚠️ **Do a complete dry-run on signet or testnet first** (the lab does exactly this). Only move to mainnet
> once you have rehearsed a refill, a failover, a policy change, and a full restore from backup.

### Step 1 — Generate three independent seeds
Generate a **distinct** seed on **each** Coldcard (never clone one seed to three devices — that defeats the
whole model). Record each 24-word seed to steel and store the three **geographically separated**. Note each
device's master fingerprint + the BIP-48 account xpub (`m/48'/0'/0'/2'` mainnet; `m/48'/1'/0'/2'` signet).

### Step 2 — Build the 2-of-3 descriptor + watch-only wallet
Construct `wsh(sortedmulti(2, key1, key2, key3))` from the three `[fingerprint/48h/0h/0h/2h]xpub` keys
(receive `/0/*` and change `/1/*` branches). Create a **watch-only** descriptor wallet on the node and import
both branches (`importdescriptors`, private keys disabled, internal=true for the change branch). Record the
descriptor durably — see §9, losing it makes recovery painful even with all three seeds.
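
A minimal sketch of assembling the two descriptor strings and an `importdescriptors` payload. The fingerprints and xpubs below are placeholders, the helper names are hypothetical, and the sketch does not compute the descriptor checksum: Bitcoin Core expects one, which `getdescriptorinfo` can supply before import.

```python
def multisig_descriptors(keys, threshold=2, coin_type=0):
    """Build receive and change descriptors for a sortedmulti P2WSH wallet.

    `keys` is a list of (master_fingerprint_hex, account_xpub) tuples.
    `coin_type` is 0 for mainnet and 1 for signet/testnet (BIP-48).
    """
    origin = f"48h/{coin_type}h/0h/2h"
    descriptors = {}
    for name, branch in (("receive", 0), ("change", 1)):
        parts = ",".join(f"[{fp}/{origin}]{xpub}/{branch}/*" for fp, xpub in keys)
        descriptors[name] = f"wsh(sortedmulti({threshold},{parts}))"
    return descriptors


def import_requests(checksummed, scan_from="now", key_range=(0, 999)):
    """Shape the request list for the `importdescriptors` RPC.

    `checksummed` maps "receive"/"change" to descriptors that already carry
    their `#checksum` suffix. `scan_from="now"` skips the rescan, which only
    suits a brand-new wallet with no history.
    """
    return [
        {
            "desc": checksummed[name],
            "active": True,
            "internal": name == "change",
            "timestamp": scan_from,
            "range": list(key_range),
        }
        for name in ("receive", "change")
    ]


# Placeholder keys: substitute the real fingerprint and xpub from each Coldcard.
keys = [("aaaaaaaa", "xpubA"), ("bbbbbbbb", "xpubB"), ("cccccccc", "xpubC")]
for name, desc in multisig_descriptors(keys).items():
    print(name, desc)
```

The watch-only wallet itself would be created with private keys disabled before these requests are sent.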

### Step 3 — Register the 2-of-3 wallet on EACH Coldcard
Each device must have the multisig wallet **registered** so it recognises change back to the wallet as
internal (and doesn't mistake it for an external send that the whitelist/velocity would block). Export the
Coldcard multisig config and `ckcc upload -m <wallet.txt>` to each device (confirm on-device).

### Step 4 — Author the HSM policy (per device)
Policy is **JSON, versioned in this repo**, deployed to each device. Minimum viable policy (see §6 for the
full reference and the diverse-policy option):

```json
{
  "must_log": true,
  "period": 60,
  "msg_paths": ["any"],
  "rules": [
    { "max_amount": 8000,
      "per_period": 33333,
      "wallet": "ckms23",
      "whitelist": ["<hot-wallet deposit address 1>", "<...>"] }
  ]
}
```
(`per_period` here is the **per-device** ceiling `V`; see §6.3 for why `V = ⅔ × global cap`.)
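
Because the policy is versioned JSON, it can be linted before it is pushed to a device. The checker below is a hedged sketch for the minimum viable single-rule shape shown above. The function name and the specific checks are assumptions, not part of the Coldcard tooling, and a surge-tier rule may legitimately need different checks.

```python
REQUIRED_RULE_KEYS = ("max_amount", "per_period", "wallet", "whitelist")


def lint_policy(policy):
    """Return a list of human-readable problems; an empty list means no findings."""
    problems = []
    if policy.get("must_log") is not True:
        problems.append("must_log should be true so every decision is logged")
    period = policy.get("period")
    if not isinstance(period, int) or period <= 0:
        problems.append("period must be a positive integer")
    rules = policy.get("rules") or []
    if not rules:
        problems.append("policy has no rules")
    for i, rule in enumerate(rules):
        missing = [k for k in REQUIRED_RULE_KEYS if k not in rule]
        for key in missing:
            problems.append(f"rule {i}: missing {key}")
        if "whitelist" in rule and not rule["whitelist"]:
            problems.append(f"rule {i}: whitelist is empty")
        if "max_amount" in rule and "per_period" in rule:
            if rule["max_amount"] > rule["per_period"]:
                problems.append(f"rule {i}: max_amount exceeds per_period")
    return problems


example = {
    "must_log": True,
    "period": 60,
    "msg_paths": ["any"],
    "rules": [
        {"max_amount": 8000, "per_period": 33333, "wallet": "ckms23",
         "whitelist": ["<hot-wallet deposit address 1>"]}
    ],
}
print(lint_policy(example))
```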

### Step 4b — Enrol the TOTP "owner" (only if you use the surge tier)
If your policy has a TOTP-gated **surge tier** (§6.2b — a rule with `users:["owner"], min_users:1`), enrol the
`owner` TOTP user on **each** Coldcard **before** loading the policy. All three signers must hold the **same**
secret (one authenticator code has to work for whichever two devices sign), so the device-picks-its-own-QR
path does **not** apply here — a shared secret has to be generated once and loaded onto all three.

> 🔑 **Who generates it matters (production vs demo).** **Production:** *the owner* generates the secret
> (in their authenticator app, or offline) and loads it onto each Coldcard **directly over USB during setup** —
> **never through the coordinator**. Then a fully-compromised coordinator can't mint surge codes; at spend time
> it only **relays** the owner's live 6-digit code (it never needs the secret). **Demo only:** the coordinator
> generates the secret + shows the enrolment QR for convenience — acceptable on signet with no real funds, but
> it raises the coordinator-compromise blast radius from the tier-1 ceiling up to the surge ceiling, so don't do it in prod.

Mechanically: `create_user("owner", USER_AUTH_TOTP, <shared-secret>)` on each device (standard RFC-6238, so any
authenticator app works). The secret must persist so re-arming a rebooted signer re-enrols the same value.
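
To illustrate why one shared secret yields the same code everywhere, here is a minimal RFC-6238 TOTP sketch (HMAC-SHA1, 30-second step). It is a generic illustration of the standard, not the Coldcard's implementation. Authenticator apps usually exchange the secret base32-encoded, whereas this sketch takes raw bytes.

```python
import hashlib
import hmac
import struct


def totp(secret, unix_time, step=30, digits=6):
    """Compute an RFC-6238 TOTP code for `secret` (bytes) at `unix_time` (seconds)."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks the offset.
    offset = mac[-1] & 0x0F
    value = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(value % 10 ** digits).zfill(digits)


# RFC 6238 Appendix B test secret; never use it for a real enrolment.
shared_secret = b"12345678901234567890"
print(totp(shared_secret, 59))
```

Any party holding the secret computes the same code for the same 30-second window, which is why the secret belongs on the devices and in the owner's app only.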

### Step 5 — Load policy + start HSM mode (MIND THE ORDERING)
> 🪤 **Sharp edge (ckbunker issue #12):** loading an HSM policy can *delete a registered multisig wallet* on
> the Coldcard. **Order: register the multisig wallet (Step 3) → enrol the TOTP user (Step 4b, if any) → load
> the policy → verify the wallet still exists on the device.** Re-check `hsm_status` and the registered-wallet
> count after every policy load.

Load the policy and enter HSM mode on each device (the on-device approval is two-step: confirm, then a random
digit to save). HSM mode is a **one-way trip** until reboot — design for it.

### Step 6 — Configure the coordinator
- **Global velocity cap `G`** — the authoritative total-per-period (chain-derived; §6.2).
- **Per-device ceiling `V = ⅔ × G`** in each device policy (§6.3), and enable **round-robin** of the signer
  pair so rotation stays even.
- **Whitelist** = your hot-wallet deposit addresses (anonymous/untrusted callers can never add to it).
- **Refill trigger** — the hot-wallet floor that starts a refill, and the refill amount.
- The coordinator's HMAC session secret lives **only on the host** (`chmod 600`, **never in git**).

### Step 7 — Verify before funding
Quorum check: **≥2 of 3** signers online **and** HSM-active **and** wallet-registered. Then, on signet/testnet:
run a within-policy refill (expect sign+broadcast), an over-cap spend (expect on-device refusal), an off-list
spend (expect refusal), and a velocity-exceeding burst (expect the coordinator to block at `G`). Confirm the
chain-derived counter increments on broadcast and the on-device refusal reasons read correctly.

---

## 6. Policy reference

### 6.1 The four on-device gates (tamper-proof, per device)
| Gate | Policy field | What it does | Notes |
|---|---|---|---|
| Per-txn cap | `max_amount` | refuses any single spend over the cap | sats |
| Velocity | `per_period` + top-level `period` | refuses once this device's signed total in the window exceeds it | **per-device**, counted locally — see §6.3 |
| Whitelist | `whitelist: [addr…]` | external outputs must be on this list; change back to `wallet` is exempt | the strongest control — confines *where* funds can go |
| Message paths | `msg_paths` | which derivation paths may sign messages (`["any"]` or specific) | proof-of-control without moving funds |
| (binding) | `wallet` | names the registered multisig so change is recognised as internal | required, or change trips the whitelist |
| (audit) | `must_log: true` | device writes a log entry per decision | feed into monitoring (§8) |
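
The gates in the table can be modelled in a few lines for use in tests and tabletop exercises. This is a simplified model, not the firmware: the class name and refusal strings are hypothetical, fees are ignored, and the order in which a real device evaluates the gates is an assumption.

```python
from dataclasses import dataclass


@dataclass
class DeviceModel:
    """Toy model of one signer's on-device rule (amounts in sats)."""
    max_amount: int
    per_period: int
    whitelist: frozenset
    spent_in_period: int = 0

    def check(self, outputs):
        """`outputs` is a list of (address, sats, is_change) tuples."""
        # Change back to the registered wallet is exempt from the whitelist.
        external = [(addr, sats) for addr, sats, is_change in outputs if not is_change]
        for addr, _ in external:
            if addr not in self.whitelist:
                return "denied: destination not whitelisted"
        amount = sum(sats for _, sats in external)
        if amount > self.max_amount:
            return "denied: over per-txn cap"
        if self.spent_in_period + amount > self.per_period:
            return "denied: velocity exceeded"
        # Budget is charged at signing time, whether or not the tx is ever broadcast.
        self.spent_in_period += amount
        return "signed"


device = DeviceModel(max_amount=8000, per_period=33333, whitelist=frozenset({"hot-1"}))
print(device.check([("hot-1", 8000, False), ("change-addr", 1200, True)]))
print(device.check([("elsewhere", 100, False)]))
```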

### 6.2 The coordinator global velocity cap (the authoritative limit)
On-device velocity counters **drift** under a rotating "any 2 of 3": each device counts only what *it* signed,
and a device charges its velocity budget the moment it signs **even if the PSBT is later dropped and never
broadcast**. So no single device's counter reflects true global outflow. The coordinator therefore enforces
the real cap, and:
- it counts **only real broadcasts** — so dropped/refused PSBTs never burn budget;
- it **includes mempool (0-conf) sends** — they're counted the instant they broadcast, so a burst of spends
  inside one ~10-min block interval can't slip past (abandoned/conflicted txns are excluded so a dropped tx
  doesn't permanently burn budget);
- it is **chain-derived** — computed from the watch-only wallet's on-chain `send` total over the window
  (`listtransactions`), not a single-host file, so any replica recomputes it and there is no ledger SPOF.
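
The three properties above can be sketched as a pure function over `listtransactions` output. This is a hedged illustration, not the reference coordinator's code: the function name is hypothetical, fees are not counted, and the caller is assumed to have requested enough entries (the RPC's `count` argument) to cover the whole window.

```python
from decimal import Decimal

SATS_PER_BTC = Decimal(100_000_000)


def global_outflow_sats(entries, now, window_seconds):
    """Sum wallet sends inside the window from `listtransactions` entries."""
    total = 0
    for entry in entries:
        if entry.get("category") != "send":
            continue  # receives and other categories are not outflow
        if entry.get("abandoned"):
            continue  # abandoned sends must not burn budget
        if entry.get("confirmations", 0) < 0:
            continue  # negative confirmations mark a conflicted transaction
        if entry["time"] < now - window_seconds:
            continue  # outside the velocity window
        # Send amounts are reported as negative BTC; 0-conf sends are included.
        total += int(-Decimal(str(entry["amount"])) * SATS_PER_BTC)
    return total


now = 1_700_000_000
sample = [
    {"category": "send", "amount": -0.00008, "confirmations": 3, "time": now - 100},
    {"category": "send", "amount": -0.00005, "confirmations": 0, "time": now - 10},
    {"category": "send", "amount": -0.00007, "confirmations": 0, "abandoned": True, "time": now - 20},
    {"category": "send", "amount": -0.00006, "confirmations": -1, "time": now - 30},
    {"category": "receive", "amount": 0.001, "confirmations": 5, "time": now - 40},
    {"category": "send", "amount": -0.00008, "confirmations": 50, "time": now - 7200},
]
print(global_outflow_sats(sample, now, window_seconds=3600))
```

Because the input comes from the node rather than a local file, any replica that runs the same query gets the same total.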

### 6.2b Surge tier — a TOTP-gated higher tier (human in the loop)
For occasional large moves, add a **second, ordered rule** that permits a higher `max_amount` + `per_period`
(and/or extra whitelist) but requires the owner's **TOTP** (`users:["owner"], min_users:1`). Coldcard
evaluates rules **first-match**, so routine spends match the automated tier-1 and a larger one falls through to
tier-2 and is refused unless a valid code is presented. **Secure model:** the owner reads the 6-digit code from
a normal authenticator app (standard RFC-6238); the **keyless coordinator only relays it** (`user_auth`) — the
secret lives on the devices + the owner's app, never on the coordinator, so a compromised coordinator can't
mint codes. The coordinator's global cap gets a matching surge ceiling. **Sizing:** the surge ceilings raise
the §1.4 blast-radius bound *when a code is present* — set them to the most you'd ever authorise in one
human-approved move, and treat entering a code as approving a spend up to the surge cap. *(Live for signed-in
users on the reference deployment; exercised end-to-end on real signet.)*
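
A sketch of what the two ordered rules might look like when generated from one place, so both tiers stay bound to the same wallet and whitelist. The helper name and the surge numbers are illustrative assumptions, and the field names are the ones used in this manual.

```python
import json


def two_tier_policy(period, wallet, whitelist, tier1, surge):
    """Build a policy dict with an automated tier-1 rule followed by a TOTP-gated surge rule."""
    def rule(limits, **extra):
        return {
            "max_amount": limits["max_amount"],
            "per_period": limits["per_period"],
            "wallet": wallet,
            "whitelist": list(whitelist),
            **extra,
        }

    return {
        "must_log": True,
        "period": period,
        "msg_paths": ["any"],
        # Order matters: the automated rule comes first, the surge rule second.
        "rules": [
            rule(tier1),
            rule(surge, users=["owner"], min_users=1),
        ],
    }


policy = two_tier_policy(
    period=60,
    wallet="ckms23",
    whitelist=["<hot-wallet deposit address 1>"],
    tier1={"max_amount": 8000, "per_period": 33333},
    surge={"max_amount": 80000, "per_period": 160000},  # illustrative values only
)
print(json.dumps(policy, indent=2))
```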

### 6.3 Sizing — the 1.5× rule (do not skip this)
A compromised coordinator that ignores `G` is bounded by the hardware ceilings. With any-2-of-3, worst-case
extractable = **`1.5 × V`** (each spend burns two signatures for one unit of value; `3V ÷ 2`). Therefore:

- **Set each device's `per_period` `V = ⅔ × G`.** Then worst-case `1.5 × ⅔G = G` — the hardware bound equals
  your intended global velocity.
- **Round-robin the signer pair** so rotation is even, otherwise busy devices hit `⅔G` early and honest
  refills stall (raising `V` for liveness margin pushes the worst case back above `G`).
- **Topology dial:**
  - *any-2-of-3 auto-signers* → worst case **1.5×**, but "lose any one, keep running" unattended.
  - *2 fixed auto-signers + 1 human break-glass (offline)* → every routine spend needs both online devices, so
    `V ≥ G`, but the offline key can't be used unattended → worst case **1.0×**, at the cost of unattended
    failover. Choose deliberately.
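
A minimal sketch of the sizing arithmetic in integer sats. The function name and topology labels are hypothetical. Rounding down keeps the hardware bound at or below `G`.

```python
def size_limits(global_cap_sats, topology="any-2-of-3"):
    """Return the per-device ceiling V and the worst-case extractable amount for a global cap G."""
    if topology == "any-2-of-3":
        v = (2 * global_cap_sats) // 3  # V = two thirds of G, rounded down
        worst_case = (3 * v) // 2       # three budgets, two signatures per spend
    elif topology == "fixed-pair":
        v = global_cap_sats             # both online devices sign every spend
        worst_case = v                  # the offline key cannot be used unattended
    else:
        raise ValueError(f"unknown topology: {topology}")
    return {"per_device_ceiling": v, "worst_case": worst_case}


print(size_limits(50_000))
print(size_limits(50_000, topology="fixed-pair"))
```

For `G = 50,000` sats the any-2-of-3 ceiling comes out at 33,333 sats, which is the `per_period` value used in the example policy in §5 Step 4.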

---

## 7. Day-to-day operations

**Automated refill (the normal path, no human):** hot wallet drops below the floor → coordinator builds a PSBT
to a whitelisted address from the watch-only wallet → checks the global cap → fans to 2 of 3 → devices
auto-sign under policy → combine → broadcast. Watch it confirm on your explorer.
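
The fan-out step can be sketched as follows. This is a hedged illustration rather than the reference coordinator's logic: each signer is modelled as a callable that returns a partial PSBT or `None` (offline or denied), and the pair rotation scheme is an assumption.

```python
from itertools import combinations

# The three possible signer pairs: (0, 1), (0, 2), (1, 2).
PAIRS = list(combinations(range(3), 2))


def collect_two_partials(psbt, signers, refill_index):
    """Ask a rotating pair of signers first, then fall back to the third.

    Returns a list of two (signer_index, partial_psbt) tuples, or None when
    fewer than two signers produce a signature.
    """
    first_pair = PAIRS[refill_index % len(PAIRS)]
    order = list(first_pair) + [i for i in range(3) if i not in first_pair]
    partials = []
    for i in order:
        result = signers[i](psbt)
        if result is not None:
            partials.append((i, result))
        if len(partials) == 2:
            return partials  # quorum reached; remaining signers are not asked
    return None


def fake_signer(name):
    """Stand-in for a signer agent call, used only for this example."""
    return lambda psbt: f"{psbt}+sig{name}"


signers = [fake_signer("A"), fake_signer("B"), fake_signer("C")]
print(collect_two_partials("psbt", signers, refill_index=0))
print(collect_two_partials("psbt", signers, refill_index=1))
```

Cycling through the three pairs means each device signs two of every three refills, which keeps the per-device velocity budgets draining evenly.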

**Change a policy value (cap / velocity / period):** edit the versioned policy, push to each device, **re-arm**
all signers. Re-arming restarts the HSM session and **resets the per-device velocity counters** — expected.
Re-verify the registered wallet survived (§5 Step 5 trap).

**Add a whitelisted destination:** add the address to the policy `whitelist`, push, re-arm. Only the operator
can do this; untrusted callers can never extend the whitelist.

**Take a signer down for maintenance:** quorum tolerates **one** down; the other two keep signing. Bring it
back, re-arm it (see **Boot-to-signing-ready** below), and confirm all three signers are in quorum again.
**Never** take a second one down while one is already offline (that freezes spending until one returns).

**Boot-to-signing-ready:** a Coldcard needs PIN + HSM-mode entry after any power loss. Unattended operation
means the signer agent must restore the device to signing-ready automatically after a reboot, and monitoring
must confirm it did — a device that silently fails to return erodes quorum. Rehearse: reboot a signer host and
confirm the agent re-arms it and quorum self-heals.

---

## 8. Monitoring & alerting (non-negotiable for unattended operation)

Wire these into your observability stack (e.g. Loki + Grafana + alerting):
- **Quorum health** — are **≥2 of 3** signers online **and** HSM-active **and** wallet-registered? **Alert the
  moment it drops to exactly 2** (one more failure = frozen).
- **Velocity near limit** — global cap approaching for the period; per-device counters near `V`.
- **Policy denials** — every on-device refusal (the `must_log` trail) → alert; a spike may signal an attack or
  a misconfiguration.
- **USB / device health** — VMs surviving a host crash can carry latent USB/udev damage; don't repeat-restart,
  restore from backup.
- **Refill anomalies** — refills outside expected cadence/size (a compromised coordinator's tell).
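
The quorum-health check can be reduced to a small classifier that an alert rule calls. The status field names (`online`, `hsm_active`, `wallet_registered`) and the state labels are hypothetical. Map them to whatever your signer agents actually report.

```python
def quorum_state(statuses):
    """Classify quorum from per-signer status dicts.

    A signer counts as ready only when it is online, HSM-active and still has
    the multisig wallet registered.
    """
    ready = sum(
        1
        for s in statuses
        if s.get("online") and s.get("hsm_active") and s.get("wallet_registered")
    )
    if ready >= 3:
        return "ok"
    if ready == 2:
        return "degraded"  # still signing, but one more failure freezes spending
    return "frozen"


healthy = {"online": True, "hsm_active": True, "wallet_registered": True}
lost_wallet = {"online": True, "hsm_active": True, "wallet_registered": False}
print(quorum_state([healthy, healthy, lost_wallet]))
```

Alerting on `degraded` rather than waiting for `frozen` gives the operator time to restore the third signer while refills still work.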

---

## 9. Backup, recovery & DR

- **3 seeds**, each to steel, **geographically separated**. 2-of-3 survives losing **one** seed.
- **The descriptor is load-bearing.** Losing it makes recovery painful **even with all three seeds** — store
  the wallet descriptor offsite, independently of the seeds (in your vault).
- **Rehearse recovery** before funding mainnet: reconstruct the watch-only wallet from the descriptor on a
  fresh node, recover a signer from seed, re-register the multisig, reload policy, sign a test spend.
- **Coordinator state is disposable** — it's chain-derived; a replacement coordinator recomputes the velocity
  counter from chain history with zero handed-over state.

---

## 10. Incident response & break-glass

| Situation | Response |
|---|---|
| One signer down | Operate on the remaining two; restore + re-arm the third; do not drop a second. |
| Two signers down | Spending is **frozen (safe)**. Restore one to resume. This is the design working. |
| Suspected coordinator compromise | Funds are bounded by §1.4 (whitelist + `1.5V`). Rotate the coordinator host/secret; the hardware caps already contain the blast radius. Review the policy-denial + refill logs. |
| Large / exception spend needed | Use the **human break-glass** path (3rd human-held key / CKBunker), outside the automated policy. |
| Suspected key compromise (one device) | One key alone moves nothing. Rotate to a fresh 2-of-3 (new seeds + descriptor), sweep funds under the old policy to the new wallet. |

---

## 11. Known sharp edges (read before production)

- **USB passthrough pins a VM to its host** — a VM with a physical USB Coldcard **cannot live-migrate**. These
  signer VMs deliberately break the "freely migratable" model; the multisig *is* the HA.
- **HSM + multisig together is advanced / lightly-charted** — soak on signet for a long time before mainnet.
- **CKBunker is niche** (v0.9.1, "at your own risk") — keep it as the human break-glass surface only; the
  automated path is the signer-agent over `ckcc-protocol`, not CKBunker.
- **The #12 ordering trap** (§5 Step 5) — register wallet → load policy → verify wallet survived.
- **Velocity counters don't compose across devices** — that's the whole reason the authoritative cap is the
  coordinator's chain-derived one; per-device velocity is the hardware bound, not the operational truth.
- **Post-host-crash latent damage** — a signer VM that survived a host hard-crash can carry subclinical FS/USB
  damage; restore from backup rather than repeat-restarting.

---

## 12. Regulatory note

Running this for **your own** funds (treasury / your own refill tier) is **not** custody-of-others. Offering it
as **custody-as-a-service for third parties** is **regulated activity in AU (AUSTRAC / financial services)** —
get legal advice before productising. Same flag as a regulated swap service.

---