discrawl/docs/commands/sync.md
2026-05-08 08:29:38 +01:00

83 lines
4.6 KiB
Markdown

# `sync`
Refreshes SQLite from one or both archive sources.
By default, `sync` runs both live/local sources and does **not** import the Git snapshot first:
- Discord bot-token sync for bot-visible guild data
- local Discord Desktop cache import for classifiable cached messages and proven DMs
Use [`update`](update.html) when you want to pull/import the shared Git snapshot. Snapshot imports normally use changed-shard deltas, but unsafe table changes fall back to a full import. If you intentionally want a sync run to import the snapshot before live deltas, pass `--update=auto` (only when stale) or `--update=force` (always). `--no-update` is accepted as an explicit no-op alias for the default.
Run one explicit `--full` pass when you want a complete historical guild archive. Use plain `sync` afterward for frequent latest-message and desktop-cache refreshes.
## Usage
```bash
discrawl sync
discrawl sync --update=auto
discrawl sync --update=force
discrawl sync --no-update
discrawl sync --full
discrawl sync --full --all
discrawl sync --guild 123456789012345678
discrawl sync --guilds 123,456 --concurrency 8
discrawl sync --source both # default: bot API + desktop cache
discrawl sync --source discord # bot API only; aliases: key, bot, api
discrawl sync --source wiretap # desktop cache only; aliases: desktop, cache
discrawl sync --guild 123456789012345678 --all-channels
discrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z
discrawl sync --with-embeddings
```
## Sources
| Source | Reads from | Stores |
| --- | --- | --- |
| `both` | Discord bot API and local Discord Desktop cache | bot-visible guild data plus classifiable cached desktop messages |
| `discord` / `key` | Discord bot API | guilds, channels, threads, members, and messages the bot can access |
| `wiretap` | local Discord Desktop cache files | classifiable cached messages; proven DMs are stored under `@me` |
## Bot sync modes
| Command | Use when | Behavior |
| --- | --- | --- |
| `discrawl sync` | routine refresh | skips member refreshes, checks live top-level channels plus active threads, only fetches new messages for channels with a stored cursor |
| `discrawl sync --update=auto` | hybrid Git/live refresh | imports a stale Git snapshot first, then runs the routine live refresh |
| `discrawl sync --all-channels` | repair pass | broad incremental sweep across every stored channel/thread, including archived threads |
| `discrawl sync --full` | historical backfill | crawls older history until channels are complete |
## Flags
- `--source <both|discord|wiretap>` - which archive sources to read
- `--update <auto|force|none>` - whether to import the Git snapshot before live deltas
- `--full` - historical backfill (slow on large guilds)
- `--all-channels` - broader incremental sweep across every stored channel/thread
- `--latest-only` - explicit latest-only run (also the default for untargeted `sync`)
- `--all` - ignore `default_guild_id` and fan out across every discovered guild
- `--guild <id>` / `--guilds <id,id>` - target specific guilds
- `--channels <id,id>` - target specific channels (forum ids expand to threads)
- `--since <RFC3339>` - limit initial history and `--full` backfill to messages at or after this timestamp
- `--concurrency <n>` - override worker count (default auto-sized: floor 8, cap 32)
- `--skip-members` - refresh guild/channel/message data without crawling members
- `--with-embeddings` - also enqueue changed messages into `embedding_jobs`
## Notes
- `--latest-only` is the default for untargeted `sync`. Use `--all-channels` to opt out without doing a full historical crawl.
- `--since` does not mark older history as complete, so a later `sync --full` without `--since` can continue the backfill.
- Long runs emit periodic progress logs to stderr.
- Heartbeat logs (`message sync waiting`) name the oldest active channel and per-channel page activity if in-flight channels stop completing for a while.
- Every run ends with a `message sync finished` summary.
- Each channel crawl has a bounded runtime budget; pathological channels are deferred and retried next sync.
- Retryable failures and unavailable-channel markers are tracked per channel; stale unavailable markers are cleared after a later successful crawl.
- Marker cleanup is best-effort, so one missing local sync-state row cannot crash the run.
- Full sync member refresh is best-effort and gives up after five minutes without a caller-supplied deadline.
- When the archive is already complete, `sync --full` reuses backlog markers and limits steady-state refresh to live top-level channels plus active threads.
## See also
- [Sync sources](../guides/sync-sources.html)
- [`tail`](tail.html)
- [`update`](update.html)