discrawl/docs/guides/git-snapshots.md
2026-05-08 08:29:38 +01:00

3.9 KiB

Git-backed snapshots

Discrawl can publish the SQLite archive as sharded, compressed NDJSON snapshots in a private Git repo, then auto-import that repo before local read commands. This gives readers org memory without Discord credentials.

Snapshot packing/import and git mirror mechanics are shared through crawlkit. Discrawl still owns Discord-specific privacy policy: @me direct messages, wiretap sync state, and local-only desktop rows are excluded from published snapshots and are preserved locally on import.

Publisher

discrawl publish --remote https://github.com/example/discord-archive.git --push
discrawl publish --readme path/to/discord-backup/README.md --push

The publisher uses your existing bot-synced archive. It exports non-DM tables only.

Subscriber

discrawl subscribe https://github.com/example/discord-archive.git
discrawl search "launch checklist"
discrawl messages --channel general --hours 24

subscribe is the Git-only setup path. It writes a config with discord.token_source = "none", imports the snapshot, and does not require a Discord bot token. sync and tail remain disabled in this mode because they need live Discord access.

Auto-update

Once share.remote is configured, read commands auto-fetch and import when the local share import is older than share.stale_after (default 15m):

discrawl subscribe --stale-after 15m https://github.com/example/discord-archive.git
discrawl subscribe --no-auto-update https://github.com/example/discord-archive.git

discrawl update forces the same pull/import step manually. Snapshot imports are delta-planned from crawlkit shard fingerprints. Older manifests without those fields fall back to Git blob identity, so the common publish shape only imports the changed message tail shard plus small cursor tables. Unsafe table-shape changes still fall back to a full import.

discrawl sync does not auto-import the share unless --update=auto or --update=force is provided, so routine live refreshes stay fast.

Hybrid mode

Keep normal Discord credentials configured and set share.remote:

discrawl sync --update=auto       # import snapshot delta first, then live deltas
discrawl messages --sync          # blocking pre-query sync for matched scope
discrawl sync --all-channels      # broader live repair
discrawl sync --full              # historical backfill

What is published

  • non-DM archive tables (DM @me rows are always excluded)
  • README activity block - latest update time, latest archived message, archive totals, day/week/month activity
  • embedding_jobs is never exported

Backing up vectors

discrawl publish --with-embeddings --push
discrawl subscribe --with-embeddings https://github.com/example/discord-archive.git
discrawl update --with-embeddings

Stored under embeddings/<provider>/<model>/<input_version>/.... Import only restores matching identities; Ollama/nomic subscribers do not accidentally pick up OpenAI/text-embedding vectors. Publishing without --with-embeddings omits embedding manifests instead of carrying forward an older bundle.

CI

The Docker smoke test installs discrawl in a clean Go container, subscribes to a Git snapshot repo, then checks search, messages, sql, and report:

DISCRAWL_DOCKER_TEST=1 go test ./internal/cli -run TestDockerGitSourceSmoke -count=1

The backup workflows restore and save .discrawl-ci/discrawl.db with actions/cache. On a warm runner cache, scheduled publishers skip the pre-sync snapshot import and go straight to the live latest-message delta before publishing. Cache misses still import the latest published snapshot first so --latest-only has channel cursors to resume from.

See also