Control Plane
Goal: manage CLAWDINATOR host lifecycle (create/recreate/replace) from CLAWDINATOR chat (Telegram/Discord) using an out‑of‑band control API. CLAWDINATOR agents can edit IaC, but deploys run OOB with no AWS creds inside agents.
Goals
- Plane‑safe control from CLAWDINATOR chat (chat‑only).
- OOB execution (no CLAWDINATOR agent has infra creds).
- Repo is the source of truth for fleet state.
- Static fleet (Discord token pool constraint).
- Simple, auditable deploy flow.
Non‑Goals
- Task routing, agent scheduling, or tool execution.
- Elastic scaling (no arbitrary cattle instances).
- Runtime config changes (agents handle their own work).
Constraints
- Each CLAWDINATOR instance requires a unique Discord bot token.
- Fleet size == token pool size (static list).
- Persistent changes must land in repo + AMI.
- Infra state must be out‑of‑band and locked.
Control Plane Components (KISS)
- Control API (AWS Lambda)
  - Authenticated by a shared control token.
  - Dispatches GitHub Actions workflows (deploy only).
- Fleet status
  - Fetched locally via AWS CLI using control invoker credentials.
- Fleet Control Skill (runs inside CLAWDINATOR)
  - Calls the Control API via scripts/fleet-control.sh (AWS IAM invoke).
  - Enforces policy (no self‑deploy) before calling.
- GitHub Actions (execution)
  - Runs OpenTofu apply.
- OpenTofu (infra state)
  - Remote state in S3 + Dynamo lock table.
- Instance Registry (desired state)
  - nix/instances.json (authoritative map).
- Bootstrap + Secrets
  - S3 bootstrap prefix per instance.
  - Agenix secrets per instance token.
Control API Auth
- Shared control token stored as clawdinator-control-token.age.
- Control API is invoked via AWS IAM using a minimal invoker key:
  - clawdinator-control-aws-access-key-id.age
  - clawdinator-control-aws-secret-access-key.age
- Token is injected into instances via bootstrap and read from /run/agenix/clawdinator-control-token.
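A minimal sketch of how a caller might read the injected token and how the receiving side might compare it. The file path comes from the text above; the function names are hypothetical, and the constant-time comparison is an assumption about good practice, not a description of the actual Lambda code.

```python
import hmac
from pathlib import Path

# Path taken from the doc; agenix decrypts the secret here at runtime.
TOKEN_PATH = Path("/run/agenix/clawdinator-control-token")


def read_control_token(path=TOKEN_PATH):
    """Read the shared control token, dropping the trailing newline."""
    return Path(path).read_text(encoding="utf-8").strip()


def token_matches(expected, presented):
    """Compare tokens in constant time; an empty token never matches."""
    if not expected or not presented:
        return False
    return hmac.compare_digest(expected.encode("utf-8"), presented.encode("utf-8"))
```

Using hmac.compare_digest avoids leaking how many leading characters of a guessed token were correct.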
Control API Env (Lambda)
- CONTROL_API_TOKEN
- GITHUB_TOKEN
- GITHUB_REPO (default openclaw/clawdinators)
- GITHUB_WORKFLOW (default fleet-deploy.yml)
- GITHUB_REF (default main)
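As an illustration, the environment above could be loaded like this. The variable names and defaults are the ones listed; treating both tokens as required, and the ControlApiConfig and load_config names, are assumptions made for the sketch.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ControlApiConfig:
    control_api_token: str
    github_token: str
    github_repo: str
    github_workflow: str
    github_ref: str


def load_config(env=None):
    """Build the config from an environment mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: the two tokens have no sensible default, so fail fast.
    missing = [k for k in ("CONTROL_API_TOKEN", "GITHUB_TOKEN") if not env.get(k)]
    if missing:
        raise RuntimeError("missing required env: " + ", ".join(missing))
    return ControlApiConfig(
        control_api_token=env["CONTROL_API_TOKEN"],
        github_token=env["GITHUB_TOKEN"],
        github_repo=env.get("GITHUB_REPO", "openclaw/clawdinators"),
        github_workflow=env.get("GITHUB_WORKFLOW", "fleet-deploy.yml"),
        github_ref=env.get("GITHUB_REF", "main"),
    )
```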
Desired State (Fleet Registry)
nix/instances.json is the fleet map (single source of truth for infra + host configs).
Example:
{
  "clawdinator-1": {
    "host": "clawdinator-1",
    "instanceType": "t3.large",
    "bootstrapPrefix": "bootstrap/clawdinator-1",
    "discordTokenSecret": "clawdinator-discord-token-1"
  },
  "clawdinator-2": {
    "host": "clawdinator-2",
    "instanceType": "t3.large",
    "bootstrapPrefix": "bootstrap/clawdinator-2",
    "discordTokenSecret": "clawdinator-discord-token-2"
  }
}
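A registry in this shape is easy to sanity-check before a deploy. The sketch below is a hypothetical validator, not part of the repo: it checks that each entry carries the four fields shown in the example and that no two instances share a Discord token secret, per the constraint that every instance needs a unique token.

```python
import json

REQUIRED_KEYS = ("host", "instanceType", "bootstrapPrefix", "discordTokenSecret")


def validate_registry(registry):
    """Return a list of problems; an empty list means the map looks sane."""
    problems = []
    token_owner = {}
    for instance_id, entry in registry.items():
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"{instance_id}: missing {key}")
        token = entry.get("discordTokenSecret")
        if token:
            if token in token_owner:
                problems.append(
                    f"{instance_id}: discordTokenSecret {token} "
                    f"already used by {token_owner[token]}"
                )
            else:
                token_owner[token] = instance_id
    return problems


EXAMPLE = json.loads("""
{
  "clawdinator-1": {
    "host": "clawdinator-1",
    "instanceType": "t3.large",
    "bootstrapPrefix": "bootstrap/clawdinator-1",
    "discordTokenSecret": "clawdinator-discord-token-1"
  },
  "clawdinator-2": {
    "host": "clawdinator-2",
    "instanceType": "t3.large",
    "bootstrapPrefix": "bootstrap/clawdinator-2",
    "discordTokenSecret": "clawdinator-discord-token-2"
  }
}
""")

print(validate_registry(EXAMPLE))
```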
Command Semantics (Minimal)
/fleet deploy <target>
- Target required (no implicit default): all or <id>.
- Always runs tofu apply.
- all: replace all instances using latest successful AMI.
- <id>: replace only that instance using latest successful AMI.
- Also creates new instances if present in desired state.
/fleet status
- Returns live fleet status via AWS CLI (EC2 describe by tag).
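A rough sketch of turning the AWS CLI response into a chat-sized status list. It assumes the JSON shape returned by aws ec2 describe-instances (Reservations → Instances, each with InstanceId, ImageId, State.Name and Tags). Which tag the real script filters on is not stated here, so using the Name tag for display is an assumption, as is the summarize_fleet name.

```python
def summarize_fleet(describe_output, name_tag="Name"):
    """Flatten describe-instances JSON into one row per instance, sorted by name."""
    rows = []
    for reservation in describe_output.get("Reservations", []):
        for inst in reservation.get("Instances", []):
            # Tags arrive as a list of {"Key": ..., "Value": ...} pairs.
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            rows.append(
                {
                    "name": tags.get(name_tag, "?"),
                    "instance_id": inst["InstanceId"],
                    "state": inst["State"]["Name"],
                    "ami": inst["ImageId"],
                }
            )
    return sorted(rows, key=lambda r: r["name"])


def format_status(rows):
    """Render rows as plain text lines suitable for a chat reply."""
    return "\n".join(
        f"{r['name']}: {r['state']} ({r['instance_id']}, {r['ami']})" for r in rows
    )
```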
Access Control (Policy)
- Shared control token authorizes calls to the Control API.
- Policy enforced by the fleet-control skill:
  - Humans: deploy any target (including all).
  - Bots: deploy only the other instance (no self‑deploy).
- Control API also rejects target == caller when caller is provided.
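The two policy layers above can be sketched as a pair of small predicates. Function names and the "human"/"bot" labels are hypothetical. Denying all to bots is an assumption drawn from the no self‑deploy rule, since all would include the caller's own host.

```python
def skill_allows(caller_kind, caller_id, target):
    """Skill-side policy: humans may deploy anything, bots only another host."""
    if caller_kind == "human":
        return True
    if caller_kind == "bot":
        # Assumption: "all" includes the caller, so it counts as self-deploy.
        return target != "all" and target != caller_id
    return False  # unknown caller kinds are denied


def api_rejects(target, caller=None):
    """API-side backstop: refuse target == caller when a caller is provided."""
    return caller is not None and target == caller
```

Keeping the check in both places means a misbehaving skill still cannot replace the host it runs on, as long as it reports its caller id.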
Lifecycle Flows
Add a new instance (static token pool)
- Create Discord bot token → clawdinator-discord-token-2.age.
- Add entry to nix/instances.json.
- Add host file nix/hosts/clawdinator-2.nix.
- Run /fleet deploy all or /fleet deploy clawdinator-2.
- Host boots, pulls its bootstrap prefix, starts CLAWDINATOR.
Recreate a single instance
- /fleet deploy clawdinator-2 (forces replace for that host).
Roll the fleet
- /fleet deploy all replaces every host with latest AMI.
- Old AMI history is intentionally bounded. Normal operations keep the currently used fleet AMI plus a small recent rollback window; deeper rollback requires an explicit preserved AMI id.
Self‑Recycle (Out‑of‑Band)
- Agents call the Control API (no AWS creds) via the fleet-control skill.
- Control API dispatches GitHub Actions; AWS creds live in CI only.
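For illustration, the dispatch step could build a GitHub REST workflow_dispatch request like the one below. The endpoint and body shape (ref plus inputs) follow the GitHub Actions REST API; the function name is hypothetical, and the defaults mirror the Lambda environment defaults in this document. The sketch only constructs the request, it does not send it.

```python
import json
import urllib.request


def build_dispatch_request(
    token,
    target,
    ami_override=None,
    repo="openclaw/clawdinators",
    workflow="fleet-deploy.yml",
    ref="main",
):
    """Build (but do not send) a workflow_dispatch request for a fleet deploy."""
    inputs = {"target": target}
    if ami_override:
        inputs["ami_override"] = ami_override
    url = (
        f"https://api.github.com/repos/{repo}"
        f"/actions/workflows/{workflow}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to urllib.request.urlopen would perform the dispatch; GitHub answers a successful dispatch with an empty 204 response.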
State + Audit
- Desired state: Git repo (nix/instances.json).
- Actual state: OpenTofu S3 backend.
- Audit trail: Git + Actions logs.
AMI Selection (KISS)
- Use latest AMI tagged clawdinator=true.
- Optional override via workflow input ami_override for rollback.
- Automatic retention keeps the newest few tagged AMIs plus any AMI still backing a live CLAWDINATOR instance.
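The selection and retention rules can be expressed as two pure functions. This is a sketch under stated assumptions: the simplified image records (image_id, creation_date, tags) are a hypothetical shape, creation dates are ISO 8601 strings that sort lexicographically, and keep_recent=3 stands in for "the newest few", which the text does not quantify.

```python
def _tagged(images):
    """Keep only images carrying the clawdinator=true tag."""
    return [i for i in images if i.get("tags", {}).get("clawdinator") == "true"]


def select_ami(images, ami_override=None):
    """Pick the AMI for a deploy: explicit override first, else newest tagged."""
    if ami_override:
        return ami_override
    candidates = _tagged(images)
    if not candidates:
        raise LookupError("no AMI tagged clawdinator=true")
    return max(candidates, key=lambda i: i["creation_date"])["image_id"]


def amis_to_keep(images, in_use, keep_recent=3):
    """Newest tagged AMIs plus anything still backing a live instance."""
    newest_first = sorted(
        _tagged(images), key=lambda i: i["creation_date"], reverse=True
    )
    recent = {i["image_id"] for i in newest_first[:keep_recent]}
    return recent | set(in_use)
```

Anything tagged but outside amis_to_keep would be a pruning candidate, which is why a deeper rollback target has to be preserved explicitly.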
Deploy Execution (Workflow)
- Single workflow fleet-deploy.yml.
- Inputs: target, ami_override (optional).
- Concurrency group fleet-deploy (no overlaps).
- target=all runs tofu apply normally.
- target=<id> runs tofu apply -replace aws_instance.clawdinator["<id>"] (implementation detail).
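A small sketch of how the workflow inputs might map to an OpenTofu command line. The resource address aws_instance.clawdinator["<id>"] is taken from the text above; the -auto-approve flag, passing the AMI through an ami_id variable, and the target-name pattern are assumptions about how the real workflow is wired.

```python
import re

# Assumption: instance ids look like "clawdinator-2" (lowercase, digits, dashes).
TARGET_PATTERN = re.compile(r"[a-z0-9-]+")


def tofu_apply_args(target, ami_override=None):
    """Translate workflow inputs into an argv list for `tofu apply`."""
    if not target or not TARGET_PATTERN.fullmatch(target):
        raise ValueError("target is required: 'all' or an instance id")
    args = ["tofu", "apply", "-auto-approve"]
    if target != "all":
        # Force replacement of just this instance.
        args.append(f'-replace=aws_instance.clawdinator["{target}"]')
    if ami_override:
        args += ["-var", f"ami_id={ami_override}"]
    return args


print(tofu_apply_args("clawdinator-2"))
```

Returning an argv list rather than a shell string sidesteps quoting problems with the brackets and double quotes in the resource address.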
Bootstrap (Per‑Instance)
- Upload per instance:
  - bootstrap/clawdinator-1
  - bootstrap/clawdinator-2
- Each bundle contains only that instance’s Discord token.
EC2 User-Data (Instance Boot)
- OpenTofu renders a per-instance user‑data script.
- Script writes /etc/clawdinator/bootstrap-prefix.
- Script writes /etc/clawdinator/control-api-url.
- Script starts clawdinator-bootstrap.service + clawdinator-repo-seed.service.
- Script runs nixos-rebuild switch --flake /var/lib/clawd/repos/clawdinators#<host>.
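To make the boot sequence concrete, here is a hypothetical renderer that produces a user‑data script with the steps listed above. In the real setup OpenTofu does the rendering; Python is used here only for illustration. The exact shell commands (mkdir, printf, systemctl start) are assumptions; only the file paths, unit names, and the nixos-rebuild invocation come from the text.

```python
import shlex


def render_user_data(host, bootstrap_prefix, control_api_url):
    """Render a per-instance boot script as a string."""
    q = shlex.quote  # quote values so odd characters cannot break the script
    flake_ref = "/var/lib/clawd/repos/clawdinators#" + host
    lines = [
        "#!/usr/bin/env bash",
        "set -euo pipefail",
        "mkdir -p /etc/clawdinator",
        f"printf '%s\\n' {q(bootstrap_prefix)} > /etc/clawdinator/bootstrap-prefix",
        f"printf '%s\\n' {q(control_api_url)} > /etc/clawdinator/control-api-url",
        "systemctl start clawdinator-bootstrap.service clawdinator-repo-seed.service",
        f"nixos-rebuild switch --flake {q(flake_ref)}",
    ]
    return "\n".join(lines) + "\n"


print(render_user_data(
    "clawdinator-1",
    "bootstrap/clawdinator-1",
    "https://control.example.invalid/",  # placeholder URL
))
```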
Plane Ops Runbook (Chat‑only)
Preflight (before flight)
- Control API Lambda exists; URL is written to /etc/clawdinator/control-api-url.
- Control secrets exist in nix-secrets and are in bootstrap bundles:
  - clawdinator-control-token.age
  - clawdinator-control-aws-access-key-id.age
  - clawdinator-control-aws-secret-access-key.age
- GitHub Action fleet-deploy.yml exists and can be dispatched.
- nix/instances.json includes all desired instances.
- Discord tokens are encrypted in nix-secrets and synced to S3 age-secrets/.
- Latest AMI build succeeded (tagged clawdinator=true).
- /fleet status returns the current fleet.
On the plane
- /fleet status → verify fleet + AMI.
- /fleet deploy clawdinator-2 → bring up new host.
- /fleet deploy all → roll the fleet to latest AMI.
- If rollback needed: rerun deploy with ami_override (exact AMI id).
- If the exact rollback AMI is older than the bounded retention window, preserve it intentionally before relying on it.
Implementation Checklist (From Design → Works)
- Add nix/instances.json (clawdinator‑1 + clawdinator‑2).
- Add nix/hosts/clawdinator-2.nix and wire host configs to read registry values.
- Update OpenTofu:
  - multi‑instance for_each using nix/instances.json.
  - S3 backend + Dynamo lock table.
  - Control API Lambda.
  - Control invoker IAM user (lambda invoke only).
- Add control secrets to nix-secrets and include in bootstrap bundles:
  - clawdinator-control-token.age
  - clawdinator-control-aws-access-key-id.age
  - clawdinator-control-aws-secret-access-key.age
- Add workflow fleet-deploy.yml:
  - inputs: target, ami_override (optional).
  - resolves latest AMI by tag when override not set.
  - runs tofu apply (replace when target != all).
- Add fleet-control skill + script (scripts/fleet-control.sh).
- Validate:
  - /fleet status
  - /fleet deploy clawdinator-2
  - verify new host in AWS + CLAWDINATOR service active.
Decisions
- Control endpoint: AWS Lambda (Function URL).
- OpenTofu state: S3 backend + Dynamo lock table.
- Control auth: shared bearer token (clawdinator-control-token.age).
- Plane ops: CLAWDINATOR chat → fleet-control skill → Control API.
- Deploy command requires explicit target.