joshp123 280744ce0c infra: slim clawdinators aws footprint

What:
- bound CLAWDINATOR image artifact retention with S3 lifecycle, AMI pruning, and import provenance tags
- reduce the AWS fleet to Babelfish-only and make GitHub credentials opt-in per host
- disable the AMI build, nix-openclaw bump, and release workflows by moving them out of .github/workflows/
- update operator docs for the new explicit build and deploy model

Why:
- stop unbounded S3 and snapshot growth from image builds
- remove unattended resurrection paths and shut down the unused t3.large instances
- keep the remaining Babelfish host running without GitHub App credentials or sync timers

Tests:
- `nix shell nixpkgs#shellcheck nixpkgs#shfmt -c bash scripts/lint-shell.sh` (pass)
- `nix build .#nixosConfigurations.clawdinator-babelfish.config.system.build.toplevel .#nixosConfigurations.clawdinator-1.config.system.build.toplevel .#nixosConfigurations.clawdinator-2.config.system.build.toplevel` (pass)
- `AWS_PROFILE=homelab-admin TF_VAR_aws_region=eu-central-1 TF_VAR_ami_id=ami-0a9abe17feeee0079 TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)" nix shell nixpkgs#opentofu -c sh -lc 'tofu fmt -check && tofu validate'` (pass)
- live AWS apply: destroyed `clawdinator-1` and `clawdinator-2`, replaced Babelfish, and verified only `Fleet Deploy` remains active in GitHub Actions

2026-04-03 15:38:57 +02:00

7.8 KiB

Raw Permalink Blame History

Control Plane

Goal: manage CLAWDINATOR host lifecycle (create/recreate/replace) from CLAWDINATOR chat (Telegram/Discord) using an out‑of‑band control API. CLAWDINATOR agents can edit IaC, but deploys run OOB with no AWS creds inside agents.

Goals

Plane‑safe control from CLAWDINATOR chat (chat‑only).
OOB execution (no CLAWDINATOR agent has infra creds).
Repo is the source of truth for fleet state.
Static fleet (Discord token pool constraint).
Simple, auditable deploy flow.

Non‑Goals

Task routing, agent scheduling, or tool execution.
Elastic scaling (no arbitrary cattle instances).
Runtime config changes (agents handle their own work).

Constraints

Each CLAWDINATOR instance requires a unique Discord bot token.
Fleet size == token pool size (static list).
Persistent changes must land in repo + AMI.
Infra state must be out‑of‑band and locked.

Control Plane Components (KISS)

Control API (AWS Lambda)
- Authenticated by a shared control token.
- Dispatches GitHub Actions workflows (deploy only).
Fleet status
- Fetched locally via AWS CLI using control invoker credentials.
Fleet Control Skill (runs inside CLAWDINATOR)
- Calls the Control API via scripts/fleet-control.sh (AWS IAM invoke).
- Enforces policy (no self‑deploy) before calling.
GitHub Actions (execution)
- Runs OpenTofu apply.
OpenTofu (infra state)
- Remote state in S3 + Dynamo lock table.
Instance Registry (desired state)
- nix/instances.json (authoritative map).
Bootstrap + Secrets
- S3 bootstrap prefix per instance.
- Agenix secrets per instance token.

Control API Auth

Shared control token stored as clawdinator-control-token.age.
Control API is invoked via AWS IAM using a minimal invoker key:
- clawdinator-control-aws-access-key-id.age
- clawdinator-control-aws-secret-access-key.age
Token is injected into instances via bootstrap and read from /run/agenix/clawdinator-control-token.

Control API Env (Lambda)

CONTROL_API_TOKEN
GITHUB_TOKEN
GITHUB_REPO (default openclaw/clawdinators)
GITHUB_WORKFLOW (default fleet-deploy.yml)
GITHUB_REF (default main)

Desired State (Fleet Registry)

nix/instances.json is the fleet map (single source of truth for infra + host configs).

Example:

{
  "clawdinator-1": {
    "host": "clawdinator-1",
    "instanceType": "t3.large",
    "bootstrapPrefix": "bootstrap/clawdinator-1",
    "discordTokenSecret": "clawdinator-discord-token-1"
  },
  "clawdinator-2": {
    "host": "clawdinator-2",
    "instanceType": "t3.large",
    "bootstrapPrefix": "bootstrap/clawdinator-2",
    "discordTokenSecret": "clawdinator-discord-token-2"
  }
}

Command Semantics (Minimal)

`/fleet deploy <target>`

Target required (no implicit default): all or <id>.
Always runs tofu apply.
all: replace all instances using latest successful AMI.
<id>: replace only that instance using latest successful AMI.
Also creates new instances if present in desired state.

`/fleet status`

Returns live fleet status via AWS CLI (EC2 describe by tag).

Access Control (Policy)

Shared control token authorizes calls to the Control API.
Policy enforced by the fleet-control skill:
- Humans: deploy any target (including all).
- Bots: deploy only the other instance (no self‑deploy).
Control API also rejects target == caller when caller is provided.

Lifecycle Flows

Add a new instance (static token pool)

Create Discord bot token → clawdinator-discord-token-2.age.
Add entry to nix/instances.json.
Add host file nix/hosts/clawdinator-2.nix.
Run /fleet deploy all or /fleet deploy clawdinator-2.
Host boots, pulls its bootstrap prefix, starts CLAWDINATOR.

Recreate a single instance

/fleet deploy clawdinator-2 (forces replace for that host).

Roll the fleet

/fleet deploy all replaces every host with latest AMI.
Old AMI history is intentionally bounded. Normal operations keep the currently used fleet AMI plus a small recent rollback window; deeper rollback requires an explicit preserved AMI id.

Self‑Recycle (Out‑of‑Band)

Agents call the Control API (no AWS creds) via the fleet-control skill.
Control API dispatches GitHub Actions; AWS creds live in CI only.

State + Audit

Desired state: Git repo (nix/instances.json).
Actual state: OpenTofu S3 backend.
Audit trail: Git + Actions logs.

AMI Selection (KISS)

Use latest AMI tagged clawdinator=true.
Optional override via workflow input ami_override for rollback.
Automatic retention keeps the newest few tagged AMIs plus any AMI still backing a live CLAWDINATOR instance.

Deploy Execution (Workflow)

Single workflow fleet-deploy.yml.
Inputs: target, ami_override (optional).
Concurrency group fleet-deploy (no overlaps).
target=all runs tofu apply normally.
target=<id> runs tofu apply -replace aws_instance.clawdinator["<id>"] (implementation detail).

Bootstrap (Per‑Instance)

Upload per instance:
- bootstrap/clawdinator-1
- bootstrap/clawdinator-2
Each bundle contains only that instance’s Discord token.

EC2 User-Data (Instance Boot)

OpenTofu renders a per-instance user‑data script.
Script writes /etc/clawdinator/bootstrap-prefix.
Script writes /etc/clawdinator/control-api-url.
Script starts clawdinator-bootstrap.service + clawdinator-repo-seed.service.
Script runs nixos-rebuild switch --flake /var/lib/clawd/repos/clawdinators#<host>.

Plane Ops Runbook (Chat‑only)

Preflight (before flight)

Control API Lambda exists; URL is written to /etc/clawdinator/control-api-url.
Control secrets exist in nix-secrets and are in bootstrap bundles:
- clawdinator-control-token.age
- clawdinator-control-aws-access-key-id.age
- clawdinator-control-aws-secret-access-key.age
GitHub Action fleet-deploy.yml exists and can be dispatched.
nix/instances.json includes all desired instances.
Discord tokens are encrypted in nix-secrets and synced to S3 age-secrets/.
Latest AMI build succeeded (tagged clawdinator=true).
/fleet status returns the current fleet.

On the plane

/fleet status → verify fleet + AMI.
/fleet deploy clawdinator-2 → bring up new host.
/fleet deploy all → roll the fleet to latest AMI.
If rollback needed: rerun deploy with ami_override (exact AMI id).
If the exact rollback AMI is older than the bounded retention window, preserve it intentionally before relying on it.

Implementation Checklist (From Design → Works)

Add nix/instances.json (clawdinator‑1 + clawdinator‑2).
Add nix/hosts/clawdinator-2.nix and wire host configs to read registry values.
Update OpenTofu:
- multi‑instance for_each using nix/instances.json.
- S3 backend + Dynamo lock table.
- Control API Lambda.
- Control invoker IAM user (lambda invoke only).
Add control secrets to nix-secrets and include in bootstrap bundles:
- clawdinator-control-token.age
- clawdinator-control-aws-access-key-id.age
- clawdinator-control-aws-secret-access-key.age
Add workflow fleet-deploy.yml:
- inputs: target, ami_override (optional).
- resolves latest AMI by tag when override not set.
- runs tofu apply (replace when target != all).
Add fleet-control skill + script (scripts/fleet-control.sh).
Validate:
- /fleet status
- /fleet deploy clawdinator-2
- verify new host in AWS + CLAWDINATOR service active.

Decisions

Control endpoint: AWS Lambda (Function URL).
OpenTofu state: S3 backend + Dynamo lock table.
Control auth: shared bearer token (clawdinator-control-token.age).
Plane ops: CLAWDINATOR chat → fleet-control skill → Control API.
Deploy command requires explicit target.

7.8 KiB Raw Permalink Blame History Unescape Escape