joshp123 280744ce0c infra: slim clawdinators aws footprint

What:
- bound CLAWDINATOR image artifact retention with S3 lifecycle, AMI pruning, and import provenance tags
- reduce the AWS fleet to Babelfish-only and make GitHub credentials opt-in per host
- disable the AMI build, nix-openclaw bump, and release workflows by moving them out of .github/workflows/
- update operator docs for the new explicit build and deploy model

Why:
- stop unbounded S3 and snapshot growth from image builds
- remove unattended resurrection paths and shut down the unused t3.large instances
- keep the remaining Babelfish host running without GitHub App credentials or sync timers

Tests:
- `nix shell nixpkgs#shellcheck nixpkgs#shfmt -c bash scripts/lint-shell.sh` (pass)
- `nix build .#nixosConfigurations.clawdinator-babelfish.config.system.build.toplevel .#nixosConfigurations.clawdinator-1.config.system.build.toplevel .#nixosConfigurations.clawdinator-2.config.system.build.toplevel` (pass)
- `AWS_PROFILE=homelab-admin TF_VAR_aws_region=eu-central-1 TF_VAR_ami_id=ami-0a9abe17feeee0079 TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)" nix shell nixpkgs#opentofu -c sh -lc 'tofu fmt -check && tofu validate'` (pass)
- live AWS apply: destroyed `clawdinator-1` and `clawdinator-2`, replaced Babelfish, and verified only `Fleet Deploy` remains active in GitHub Actions

2026-04-03 15:38:57 +02:00


CLAWDINATOR Agent Notes

Read these before acting:

docs/PHILOSOPHY.md
docs/ARCHITECTURE.md
docs/SHARED_MEMORY.md
docs/SECRETS.md
docs/POC.md
BOOTSTRAP.md
IDENTITY.md
SOUL.md
TOOLS.md
USER.md

Memory references:

For project goals, read memory/project.md
For architecture decisions, read memory/architecture.md
For ops runbook, read memory/ops.md
For Discord context, also read memory/discord.md

Repo rule: no inline scripting languages (Python/Node/etc.) in Nix or shell blocks; put logic in script files and call them.

System ownership (3 repos):

openclaw: upstream runtime and behavior.
nix-openclaw: packaging/build fixes for clawbot.
clawdinators: infra, NixOS config, secrets wiring, deployment flow.

Maintainer role:

Monitor issues + PRs and keep an inventory of what needs human attention.
Surface priorities and context; do not file issues or modify code unless asked.
Track running versions (openclaw/nix-openclaw/clawdinators) and note them in memory/ops.md.

Toolchain workflow (repo source of truth):

Add/remove tools in nix/tools/clawdinator-tools.nix (packages + descriptions).
Tools list is rendered into /etc/clawdinator/tools.md by Nix and appended to workspace TOOLS.md at seed time.
Keep clawdinator/workspace/TOOLS.md aligned with upstream template; do not hardcode tool lists there.
When you add a new tool, verify it appears in /etc/clawdinator/tools.md and in the workspace TOOLS.md after seed.
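
The verification in the last step can be sketched as a small script-file helper. This is a minimal sketch, not the repo's own check: `tool_listed` and `verify_tool` are hypothetical names, and the workspace TOOLS.md path is passed in by the caller because these notes do not state where the workspace lives on the host.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a tool name appears in a rendered tools list.
# grep -F does a fixed-string substring match, so "jq" also matches "jqp".
tool_listed() {
  file=$1
  tool=$2
  [ -r "$file" ] && grep -qF -- "$tool" "$file"
}

# Check both places named above.
# $1: tool name
# $2: path to the workspace TOOLS.md (assumption: supplied by the caller)
# $3: rendered list, defaulting to /etc/clawdinator/tools.md
verify_tool() {
  tool=$1
  workspace_tools=$2
  rendered=${3:-/etc/clawdinator/tools.md}
  rc=0
  for f in "$rendered" "$workspace_tools"; do
    if tool_listed "$f" "$tool"; then
      echo "ok: $tool in $f"
    else
      echo "missing: $tool in $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Called as `verify_tool <tool> <workspace TOOLS.md path>` after seed, it exits non-zero if either file lacks the tool.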

The Zen of ~~Python~~ Moltbot, shamelessly stolen from Tim Peters:

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than right now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Deploy flow (automation-first):

Use devenv.nix for tooling (nixos-generators, awscli2).
Build a bootstrap NixOS image with nixos-generators (raw) and upload it to S3.
- Use nix/hosts/clawdinator-1-image.nix for image builds.
The old CI AMI/update/release workflows are intentionally disabled under .github/workflows-disabled/; AMI builds and deploys now require an explicit code change or a local operator run.
Image history is bounded on purpose: raw clawdinator-nixos-* uploads expire automatically, and old CLAWDINATOR AMIs/snapshots are pruned after successful builds while keeping the live fleet AMI plus a short rollback window.
Resume AMI pipeline work immediately if it stalls; do not use rsync as a workaround. Host edits are allowed but must be committed and baked into a new AMI to persist.
CI must provide CLAWDINATOR_AGE_KEY to build + upload the runtime bootstrap bundle to S3.
Bootstrap bundle location: s3://${S3_BUCKET}/bootstrap/<instance>/ (secrets + repo seeds).
Bootstrap S3 bucket + scoped IAM user + VM Import role with infra/opentofu/aws (use homelab-admin creds).
Bootstrap AWS instances from the AMI with infra/opentofu/aws (set TF_VAR_ami_id).
Import the image into AWS as an AMI (snapshot import + register image).
Ensure secrets are encrypted to the baked agenix key (see ../nix/nix-secrets/secrets.nix).
Ensure required secrets exist: clawdinator-github-app.pem, clawdinator-discord-token-<n>, clawdinator-control-token, clawdinator-control-aws-*, clawdinator-anthropic-api-key.
Update nix/hosts/<host>.nix (Discord allowlist, GitHub App installationId, identity name).
Discord must use messages.queue.byChannel.discord = "interrupt"; the "queue" setting delays replies until the heartbeat and makes the bot appear dead.
Ensure /var/lib/clawd/repos/clawdinators contains this repo (self-update requires it).
Verify systemd services: clawdinator; clawdinator-github-app-token only on hosts that explicitly enable GitHub App auth.
Commit and push changes; repo is the source of truth.
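
The image-history rule above (keep the live fleet AMI plus a short rollback window, prune the rest) comes down to a selection step. The sketch below shows only that selection logic and is not the repo's pruning script: `select_prunable_amis` is a hypothetical name, and the size of the rollback window is an assumed parameter, since these notes do not give a number.

```shell
#!/bin/sh
# Hypothetical sketch: choose which AMIs to prune.
# stdin: one AMI ID per line, oldest first (for example sorted by CreationDate).
# $1: AMI ID the live fleet runs on (never printed, so never pruned).
# $2: number of newest AMIs to keep as a rollback window (assumed value).
select_prunable_amis() {
  live_ami=$1
  keep_newest=$2
  awk -v live="$live_ami" -v keep="$keep_newest" '
    NF { ids[++n] = $1 }               # collect non-empty lines in order
    END {
      # everything older than the newest "keep" entries is a candidate
      for (i = 1; i <= n - keep; i++)
        if (ids[i] != live) print ids[i]
    }
  '
}
```

The printed IDs are candidates only. Deregistering an AMI does not delete its backing snapshot, so an actual prune step would have to handle both.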

Bootstrap (local):

Agenix identity is ~/.ssh/id_ed25519 (primary SSH key).
Decrypt homelab admin creds:
- RULES=../nix/nix-secrets/secrets.nix agenix -d homelab-admin.age -i ~/.ssh/id_ed25519
OpenTofu env:
- TF_VAR_aws_region=eu-central-1
- TF_VAR_ami_id=ami-... (empty string skips instance creation)
- TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)" (required when ami_id is set)
- TF_VAR_root_volume_size_gb=40 (bump if Nix store runs out of space)
Run tofu init + tofu apply in infra/opentofu/aws.
After apply, update CI secrets from outputs:
- tofu output -raw access_key_id → clawdinator-image-uploader-access-key-id.age
- tofu output -raw secret_access_key → clawdinator-image-uploader-secret-access-key.age
- tofu output -raw bucket_name → clawdinator-image-bucket-name.age
- tofu output -raw aws_region → clawdinator-image-bucket-region.age
- Then gh secret set for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, S3_BUCKET.
Get the latest AMI ID:
- aws ec2 describe-images --region eu-central-1 --owners self --filters "Name=tag:clawdinator,Values=true" --query "Images | sort_by(@,&CreationDate)[-1].[ImageId,Name,CreationDate]" --output text
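
The OpenTofu environment rules above (an empty ami_id skips instance creation; ssh_public_key is required once ami_id is set) can be checked before running tofu apply. This is a minimal sketch with a hypothetical function name, `check_tofu_env`. It treats TF_VAR_aws_region as required, which is an assumption; the module may define a default.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the OpenTofu variables listed above.
check_tofu_env() {
  # Assumption: the region must be set explicitly.
  if [ -z "${TF_VAR_aws_region:-}" ]; then
    echo "TF_VAR_aws_region is not set" >&2
    return 1
  fi
  # An empty or unset ami_id is valid: it skips instance creation.
  if [ -n "${TF_VAR_ami_id:-}" ] && [ -z "${TF_VAR_ssh_public_key:-}" ]; then
    echo "TF_VAR_ssh_public_key is required when TF_VAR_ami_id is set" >&2
    return 1
  fi
  return 0
}
```

Running `check_tofu_env && tofu apply` in infra/opentofu/aws would stop early on a missing key instead of failing partway through the apply.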

End-to-end SDLC (local → AMI → host) (verified):

Decrypt AWS creds (homelab admin) and export:
- cd ~/code/nix/nix-secrets
- RULES=./secrets.nix agenix -d homelab-admin.age -i ~/.ssh/id_ed25519 > /tmp/homelab-admin.env
- set -a; source /tmp/homelab-admin.env; set +a
- Cleanup: trash /tmp/homelab-admin.env
Build/import a new AMI explicitly. The old GitHub Actions build/deploy paths are disabled under .github/workflows-disabled/.
Redeploy from the new AMI (instance replacement):
- devenv shell -- bash -lc "cd infra/opentofu/aws && TF_VAR_ami_id=<AMI_ID> TF_VAR_ssh_public_key=\"$(cat ~/.ssh/id_ed25519.pub)\" TF_VAR_aws_region=eu-central-1 tofu apply -auto-approve"
New IP:
- tofu output -json instance_public_ips | jq -r '."clawdinator-1"'
- ssh -o StrictHostKeyChecking=accept-new root@<ip>
Post-deploy sanity:
- systemctl is-active clawdinator
- systemctl is-active clawdinator-github-app-token.timer only if the target host explicitly enables githubApp
- GH_CONFIG_DIR=/var/lib/clawd/gh gh auth status -h github.com only if the target host explicitly enables GitHub auth
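
The post-deploy sanity checks above can be gathered into one function that runs on the host. This is a sketch under stated assumptions: `post_deploy_check` is a hypothetical name, and the caller says whether the host explicitly enables GitHub App auth, because that is decided per host in nix/hosts/<host>.nix and cannot be inferred here.

```shell
#!/bin/sh
# Hypothetical post-deploy check, intended to run on the target host.
# $1: "yes" if this host explicitly enables GitHub App auth; anything else skips the timer.
post_deploy_check() {
  github_app=${1:-no}
  units="clawdinator"
  if [ "$github_app" = "yes" ]; then
    units="$units clawdinator-github-app-token.timer"
  fi
  rc=0
  for unit in $units; do
    # is-active --quiet prints nothing and reports state through its exit code
    if systemctl is-active --quiet "$unit"; then
      echo "active: $unit"
    else
      echo "NOT active: $unit" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

On a Babelfish-style host without GitHub credentials, `post_deploy_check no` checks only the clawdinator service; `post_deploy_check yes` also requires the token timer to be active.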

Important:

Repo/workspace on host is seeded from the AMI snapshot. git pull is ephemeral; rebuild AMI for persistent changes.
Any manual host fix is triage-only; always rebuild the AMI and redeploy before calling it done.
If SSH access is lost, use SSM (instance profile is attached via OpenTofu) to re-add /root/.ssh/authorized_keys.

Key principle: mental notes don't survive restarts; write them to a file.

Cattle vs pets: hosts are disposable. Prefer re-provisioning from OpenTofu + NixOS configs over in-place manual fixes. One way only: AWS AMI pipeline via S3 + VM Import. This is a greenfield repo. Do not reference alternate paths anywhere in code or docs.
