clawdinators/AGENTS.md
joshp123 280744ce0c infra: slim clawdinators aws footprint
What:
- bound CLAWDINATOR image artifact retention with S3 lifecycle, AMI pruning, and import provenance tags
- reduce the AWS fleet to Babelfish-only and make GitHub credentials opt-in per host
- disable the AMI build, nix-openclaw bump, and release workflows by moving them out of .github/workflows/
- update operator docs for the new explicit build and deploy model

Why:
- stop unbounded S3 and snapshot growth from image builds
- remove unattended resurrection paths and shut down the unused t3.large instances
- keep the remaining Babelfish host running without GitHub App credentials or sync timers

Tests:
- `nix shell nixpkgs#shellcheck nixpkgs#shfmt -c bash scripts/lint-shell.sh` (pass)
- `nix build .#nixosConfigurations.clawdinator-babelfish.config.system.build.toplevel .#nixosConfigurations.clawdinator-1.config.system.build.toplevel .#nixosConfigurations.clawdinator-2.config.system.build.toplevel` (pass)
- `AWS_PROFILE=homelab-admin TF_VAR_aws_region=eu-central-1 TF_VAR_ami_id=ami-0a9abe17feeee0079 TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)" nix shell nixpkgs#opentofu -c sh -lc 'tofu fmt -check && tofu validate'` (pass)
- live AWS apply: destroyed `clawdinator-1` and `clawdinator-2`, replaced Babelfish, and verified only `Fleet Deploy` remains active in GitHub Actions
2026-04-03 15:38:57 +02:00

CLAWDINATOR Agent Notes

Read these before acting:

  • docs/PHILOSOPHY.md
  • docs/ARCHITECTURE.md
  • docs/SHARED_MEMORY.md
  • docs/SECRETS.md
  • docs/POC.md
  • BOOTSTRAP.md
  • IDENTITY.md
  • SOUL.md
  • TOOLS.md
  • USER.md

Memory references:

  • For project goals, read memory/project.md
  • For architecture decisions, read memory/architecture.md
  • For ops runbook, read memory/ops.md
  • For Discord context, also read memory/discord.md

Repo rule: no inline scripting languages (Python/Node/etc.) in Nix or shell blocks; put logic in script files and call them.

System ownership (3 repos):

  • openclaw: upstream runtime and behavior.
  • nix-openclaw: packaging/build fixes for clawbot.
  • clawdinators: infra, NixOS config, secrets wiring, deployment flow.

Maintainer role:

  • Monitor issues + PRs and keep an inventory of what needs human attention.
  • Surface priorities and context; do not file issues or modify code unless asked.
  • Track running versions (openclaw/nix-openclaw/clawdinators) and note them in memory/ops.md.
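Version tracking is easier to diff later if each check appends one dated line to memory/ops.md. This is a minimal sketch. It assumes a hypothetical `record_versions` helper and an entry format invented here for illustration. The repo defines neither.

```bash
# Hypothetical helper (not part of this repo): append the running
# openclaw / nix-openclaw / clawdinators revisions to memory/ops.md
# as one dated line. The entry format is illustrative only.
record_versions() {
  local openclaw="${1:-}" nix_openclaw="${2:-}" clawdinators="${3:-}"
  local ops_file="${4:-memory/ops.md}"
  local today

  if [ -z "$openclaw" ] || [ -z "$nix_openclaw" ] || [ -z "$clawdinators" ]; then
    echo "usage: record_versions OPENCLAW NIX_OPENCLAW CLAWDINATORS [OPS_FILE]" >&2
    return 2
  fi

  # UTC keeps entries comparable regardless of host timezone.
  today="$(date -u +%Y-%m-%d)"
  printf '%s\n' "- $today running: openclaw=$openclaw nix-openclaw=$nix_openclaw clawdinators=$clawdinators" >>"$ops_file"
}
```

For example, running `record_versions <openclaw-rev> <nix-openclaw-rev> <clawdinators-rev>` from the workspace root appends one line to memory/ops.md. How you obtain each revision is left open.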

Toolchain workflow (repo source of truth):

  • Add/remove tools in nix/tools/clawdinator-tools.nix (packages + descriptions).
  • Nix renders the tools list into /etc/clawdinator/tools.md, and seeding appends it to the workspace TOOLS.md.
  • Keep clawdinator/workspace/TOOLS.md aligned with the upstream template; do not hardcode tool lists there.
  • When you add a new tool, verify that it appears in /etc/clawdinator/tools.md and, after seeding, in the workspace TOOLS.md.
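The verification step in the last bullet can be scripted as a grep over both rendered files. This sketch assumes a hypothetical helper named `verify_tool_listed`. It also assumes that the tool's name appears as a whole word in each list. Pass the workspace TOOLS.md path explicitly, because this document does not state where that file lives on the host.

```bash
# Hypothetical helper: check that TOOL is listed in every FILE given,
# e.g. /etc/clawdinator/tools.md and the seeded workspace TOOLS.md.
# Returns 1 and names each file that lacks the tool; 2 on bad usage.
verify_tool_listed() {
  local tool="${1:-}" file missing=0

  if [ -z "$tool" ] || [ "$#" -lt 2 ]; then
    echo "usage: verify_tool_listed TOOL FILE..." >&2
    return 2
  fi
  shift

  for file in "$@"; do
    # -F: literal match; -w: whole word, so "gh" does not match "ghq".
    # A missing or unreadable file also counts as "not listed".
    if ! grep -qFw -- "$tool" "$file" 2>/dev/null; then
      echo "missing: $tool not found in $file" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

`verify_tool_listed jq /etc/clawdinator/tools.md <workspace>/TOOLS.md` returns non-zero and reports each file that lacks the tool.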

The Zen of Moltbot, shamelessly stolen from Tim Peters's Zen of Python:

  • Beautiful is better than ugly.
  • Explicit is better than implicit.
  • Simple is better than complex.
  • Complex is better than complicated.
  • Flat is better than nested.
  • Sparse is better than dense.
  • Readability counts.
  • Special cases aren't special enough to break the rules.
  • Although practicality beats purity.
  • Errors should never pass silently.
  • Unless explicitly silenced.
  • In the face of ambiguity, refuse the temptation to guess.
  • There should be one-- and preferably only one --obvious way to do it.
  • Although that way may not be obvious at first unless you're Dutch.
  • Now is better than never.
  • Although never is often better than right now.
  • If the implementation is hard to explain, it's a bad idea.
  • If the implementation is easy to explain, it may be a good idea.
  • Namespaces are one honking great idea -- let's do more of those!

Deploy flow (automation-first):

  • Use devenv.nix for tooling (nixos-generators, awscli2).
  • Build a bootstrap NixOS image with nixos-generators (raw) and upload it to S3.
    • Use nix/hosts/clawdinator-1-image.nix for image builds.
  • The old CI AMI/update/release workflows are intentionally disabled under .github/workflows-disabled/; AMI builds and deploys now require an explicit code change or a local operator run.
  • Image history is bounded on purpose: raw clawdinator-nixos-* uploads expire automatically, and old CLAWDINATOR AMIs/snapshots are pruned after successful builds while keeping the live fleet AMI plus a short rollback window.
  • Resume AMI pipeline work immediately if it stalls; do not use rsync as a workaround. Host edits are allowed but must be committed and baked into a new AMI to persist.
  • CI must provide CLAWDINATOR_AGE_KEY to build + upload the runtime bootstrap bundle to S3.
  • Bootstrap bundle location: s3://${S3_BUCKET}/bootstrap/<instance>/ (secrets + repo seeds).
  • Bootstrap S3 bucket + scoped IAM user + VM Import role with infra/opentofu/aws (use homelab-admin creds).
  • Bootstrap AWS instances from the AMI with infra/opentofu/aws (set TF_VAR_ami_id).
  • Import the image into AWS as an AMI (snapshot import + register image).
  • Ensure secrets are encrypted to the baked agenix key (see ../nix/nix-secrets/secrets.nix).
  • Ensure required secrets exist: clawdinator-github-app.pem, clawdinator-discord-token-<n>, clawdinator-control-token, clawdinator-control-aws-*, clawdinator-anthropic-api-key.
  • Update nix/hosts/<host>.nix (Discord allowlist, GitHub App installationId, identity name).
  • Discord must use messages.queue.byChannel.discord = "interrupt". The queue mode delays replies until the heartbeat, which makes the bot appear dead.
  • Ensure /var/lib/clawd/repos/clawdinators contains this repo (self-update requires it).
  • Verify systemd services: clawdinator; clawdinator-github-app-token only on hosts that explicitly enable GitHub App auth.
  • Commit and push changes; repo is the source of truth.
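The bounded image history rule keeps the live fleet AMI plus a short rollback window. That rule reduces to a selection over AMIs sorted by creation date. This sketch covers the selection only and deletes nothing. It makes three assumptions:

- The rollback window is the newest KEEP images.
- KEEP defaults to 3, an arbitrary value.
- The helper is a hypothetical `ami_prune_candidates`.

It is not the repo's actual pruning code.

```bash
# Hypothetical helper: print AMI IDs that fall outside the retention rule.
# Keeps the live fleet AMI and the newest KEEP images (rollback window);
# KEEP defaults to 3, an arbitrary value chosen for this sketch.
# Only selects IDs: it does not deregister AMIs or delete snapshots.
#
# stdin: "CreationDate ImageId" per line (tab or space separated), e.g.
#   aws ec2 describe-images --region eu-central-1 --owners self \
#     --filters "Name=tag:clawdinator,Values=true" \
#     --query 'Images[].[CreationDate,ImageId]' --output text
ami_prune_candidates() {
  local live_ami="${1:-}" keep="${2:-3}"

  if [ -z "$live_ami" ]; then
    echo "usage: ami_prune_candidates LIVE_AMI_ID [KEEP] < images.txt" >&2
    return 2
  fi

  # ISO 8601 timestamps sort correctly as plain strings, so a reverse
  # byte-wise sort lists the newest image first.
  LC_ALL=C sort -r | {
    rank=0
    while read -r _created ami; do
      [ -n "$ami" ] || continue
      rank=$((rank + 1))
      if [ "$ami" = "$live_ami" ]; then
        continue # never prune the AMI the fleet is running
      fi
      if [ "$rank" -le "$keep" ]; then
        continue # inside the rollback window
      fi
      printf '%s\n' "$ami"
    done
  }
}
```

Deregistering an AMI does not delete its backing EBS snapshots. A real pruner therefore has to remove both the AMI and its snapshots, and this sketch does neither.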

Bootstrap (local):

  • Agenix identity is ~/.ssh/id_ed25519 (primary SSH key).
  • Decrypt homelab admin creds:
    • RULES=../nix/nix-secrets/secrets.nix agenix -d homelab-admin.age -i ~/.ssh/id_ed25519
  • OpenTofu env:
    • TF_VAR_aws_region=eu-central-1
    • TF_VAR_ami_id=ami-... (empty string skips instance creation)
    • TF_VAR_ssh_public_key="$(cat ~/.ssh/id_ed25519.pub)" (required when ami_id is set)
    • TF_VAR_root_volume_size_gb=40 (bump if Nix store runs out of space)
  • Run tofu init + tofu apply in infra/opentofu/aws.
  • After apply, update CI secrets from outputs:
    • tofu output -raw access_key_id → clawdinator-image-uploader-access-key-id.age
    • tofu output -raw secret_access_key → clawdinator-image-uploader-secret-access-key.age
    • tofu output -raw bucket_name → clawdinator-image-bucket-name.age
    • tofu output -raw aws_region → clawdinator-image-bucket-region.age
    • Then run gh secret set for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, and S3_BUCKET.
  • Get the latest AMI ID:
    • aws ec2 describe-images --region eu-central-1 --owners self --filters "Name=tag:clawdinator,Values=true" --query "Images | sort_by(@,&CreationDate)[-1].[ImageId,Name,CreationDate]" --output text
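The TF_VAR_* rules above can be checked before running tofu apply. An empty ami_id skips instance creation, and ssh_public_key is required once ami_id is set. This is a sketch with a hypothetical `check_tofu_env` helper. The `ami-` prefix check is an extra sanity test added here; infra/opentofu/aws is not known to enforce it.

```bash
# Hypothetical pre-flight check for the OpenTofu env described above.
# Enforces the documented rules plus an "ami-" prefix sanity check
# (the prefix check is an addition for this sketch).
check_tofu_env() {
  local ami_id="${TF_VAR_ami_id:-}" ssh_key="${TF_VAR_ssh_public_key:-}"

  # Empty TF_VAR_ami_id is valid: it skips instance creation.
  if [ -z "$ami_id" ]; then
    return 0
  fi

  case "$ami_id" in
    ami-*) ;;
    *)
      echo "TF_VAR_ami_id does not look like an AMI ID: $ami_id" >&2
      return 1
      ;;
  esac

  if [ -z "$ssh_key" ]; then
    echo "TF_VAR_ssh_public_key is required when TF_VAR_ami_id is set" >&2
    return 1
  fi
  return 0
}
```

Typical use: `check_tofu_env && (cd infra/opentofu/aws && tofu apply)`.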

End-to-end SDLC (local → AMI → host) (verified):

  1. Decrypt AWS creds (homelab admin) and export:
    • cd ~/code/nix/nix-secrets
    • RULES=./secrets.nix agenix -d homelab-admin.age -i ~/.ssh/id_ed25519 > /tmp/homelab-admin.env
    • set -a; source /tmp/homelab-admin.env; set +a
    • Cleanup: trash /tmp/homelab-admin.env
  2. Build/import a new AMI explicitly. The old GitHub Actions build/deploy paths are disabled under .github/workflows-disabled/.
  3. Redeploy from the new AMI (instance replacement):
    • devenv shell -- bash -lc "cd infra/opentofu/aws && TF_VAR_ami_id=<AMI_ID> TF_VAR_ssh_public_key=\"$(cat ~/.ssh/id_ed25519.pub)\" TF_VAR_aws_region=eu-central-1 tofu apply -auto-approve"
  4. New IP:
    • tofu output -json instance_public_ips | jq -r '."<host>"'
    • ssh -o StrictHostKeyChecking=accept-new root@<ip>
  5. Post-deploy sanity:
    • systemctl is-active clawdinator
    • systemctl is-active clawdinator-github-app-token.timer only if the target host explicitly enables githubApp
    • GH_CONFIG_DIR=/var/lib/clawd/gh gh auth status -h github.com only if the target host explicitly enables GitHub auth
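The step 5 checks can be folded into one function whose exit status says whether the host is healthy. This sketch assumes it runs as root on the host. It also assumes the operator passes `github` only for hosts whose nix/hosts/<host>.nix explicitly enables GitHub App auth. The function name and the argument convention are hypothetical; nothing here reads the Nix config.

```bash
# Hypothetical post-deploy sanity check, run as root on the host.
# Pass "github" only when the host explicitly enables GitHub App auth;
# this argument convention is an assumption, not read from Nix config.
post_deploy_sanity() {
  local github="${1:-}" failed=0

  if ! systemctl is-active --quiet clawdinator; then
    echo "FAIL: clawdinator is not active" >&2
    failed=1
  fi

  if [ "$github" = "github" ]; then
    if ! systemctl is-active --quiet clawdinator-github-app-token.timer; then
      echo "FAIL: clawdinator-github-app-token.timer is not active" >&2
      failed=1
    fi
    if ! GH_CONFIG_DIR=/var/lib/clawd/gh gh auth status -h github.com >/dev/null 2>&1; then
      echo "FAIL: gh is not authenticated for github.com" >&2
      failed=1
    fi
  fi

  return "$failed"
}
```

Every check runs even after an earlier failure, so a single invocation reports all problems at once.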

Important:

  • Repo/workspace on host is seeded from the AMI snapshot. git pull is ephemeral; rebuild AMI for persistent changes.
  • Any manual host fix is triage-only; always rebuild the AMI and redeploy before calling it done.
  • If SSH access is lost, use SSM (instance profile is attached via OpenTofu) to re-add /root/.ssh/authorized_keys.
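One way to use SSM for this recovery is SSM Run Command with the AWS-RunShellScript document. This sketch makes four assumptions:

- The SSM agent is running on the instance.
- The caller's credentials allow ssm:SendCommand.
- The region defaults to eu-central-1.
- The helper name is hypothetical.

```bash
# Hypothetical helper: re-add a public key to /root/.ssh/authorized_keys via
# SSM Run Command (AWS-RunShellScript) when SSH access is lost.
# Assumes the SSM agent runs on the instance and the caller may call
# ssm:SendCommand; region defaults to eu-central-1 as elsewhere in this doc.
ssm_readd_root_key() {
  local instance_id="${1:-}" pubkey_file="${2:-$HOME/.ssh/id_ed25519.pub}"
  local region="${AWS_REGION:-eu-central-1}"
  local key cmd

  if [ -z "$instance_id" ]; then
    echo "usage: ssm_readd_root_key INSTANCE_ID [PUBKEY_FILE]" >&2
    return 2
  fi

  key="$(head -n 1 "$pubkey_file")" || return 1

  # The key is wrapped in single quotes inside a JSON string, so refuse
  # anything that would break either layer of quoting.
  case "$key" in
    "" | *"'"* | *'"'* | *'\'*)
      echo "refusing to send empty key or key containing quotes/backslashes" >&2
      return 1
      ;;
  esac

  cmd="mkdir -p /root/.ssh && chmod 700 /root/.ssh && echo '$key' >> /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys"

  aws ssm send-command \
    --region "$region" \
    --instance-ids "$instance_id" \
    --document-name AWS-RunShellScript \
    --parameters "{\"commands\":[\"$cmd\"]}"
}
```

To check whether the command ran, use `aws ssm list-command-invocations --command-id <id> --details`.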

Key principle: mental notes don't survive restarts, so write them to a file.

Cattle vs pets: hosts are disposable. Prefer re-provisioning from OpenTofu + NixOS configs over in-place manual fixes. One way only: AWS AMI pipeline via S3 + VM Import. This is a greenfield repo. Do not reference alternate paths anywhere in code or docs.