[BREAKGLASS] cli for Discord with sqlite backend https://discrawl.sh

discrawl 🛰️ — Mirror Discord into SQLite; search server history locally

discrawl mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search.

It is a bot-token crawler. No user-token hacks. Data stays local.

What It Does

  • discovers every guild the configured bot can access
  • syncs channels, threads, members, and message history into SQLite
  • maintains FTS5 search indexes for fast local text search
  • builds an offline member directory from archived profile payloads
  • extracts small text-like attachments into the local search index
  • records structured user and role mentions for direct querying
  • tails Gateway events for live updates, with periodic repair syncs
  • exposes read-only SQL for ad hoc analysis
  • keeps schema multi-guild ready while preserving a simple single-guild default UX

Search defaults to all guilds. sync and tail default to the configured default guild when one exists, otherwise they fan out to all discovered guilds.
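
The scoping rule for sync and tail can be sketched in Go. The helper name targetGuilds and its parameters are hypothetical and not taken from the discrawl source; the sketch only illustrates the fallback described above.

```go
package main

import "fmt"

// targetGuilds is a hypothetical helper: it returns the guilds a sync or tail
// run would cover. A non-empty default guild wins; otherwise the run fans out
// to every discovered guild.
func targetGuilds(defaultGuildID string, discovered []string) []string {
	if defaultGuildID != "" {
		return []string{defaultGuildID}
	}
	// Copy so callers cannot mutate the discovered list through the result.
	return append([]string(nil), discovered...)
}

func main() {
	fmt.Println(targetGuilds("123", []string{"123", "456"})) // [123]
	fmt.Println(targetGuilds("", []string{"123", "456"}))    // [123 456]
}
```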

Requirements

  • Go 1.26+
  • for publishing/syncing: a Discord bot token for a bot that can read the target guilds
  • for read-only Git-backed access: access to a private snapshot repo, no Discord credentials required
  • bot permissions for the channels you want archived when running sync or tail

Discord Bot Setup

discrawl needs a real bot token. Not a user token.

Minimum practical setup:

  1. Create or reuse a Discord application in the Discord developer portal.
  2. Add a bot user to that application.
  3. Invite the bot to the target guilds.
  4. Enable these intents for the bot:
    • Server Members Intent
    • Message Content Intent
  5. Ensure the bot can at least:
    • view channels
    • read message history

Without those intents/permissions, sync, tail, member snapshots, or message content archiving will be partial or fail.

Bot Token Sources

Token resolution:

  1. OpenClaw config, if discord.token_source is not env
  2. DISCORD_BOT_TOKEN or the configured discord.token_env

discrawl accepts either the raw token text or a value prefixed with "Bot " (the word Bot followed by a space). It normalizes both forms automatically.
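
A minimal sketch of what such normalization could look like. The function normalizeToken is hypothetical, and the README does not say which form discrawl normalizes to; this sketch assumes the bare token is the canonical form.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeToken is a hypothetical helper that accepts both "Bot <token>" and
// a bare token, and returns the bare token with surrounding whitespace removed.
func normalizeToken(raw string) string {
	token := strings.TrimSpace(raw)
	token = strings.TrimPrefix(token, "Bot ")
	return strings.TrimSpace(token)
}

func main() {
	fmt.Println(normalizeToken("Bot abc.def")) // abc.def
	fmt.Println(normalizeToken("  abc.def\n")) // abc.def
}
```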

Fastest env-only path:

export DISCORD_BOT_TOKEN="your-bot-token"
discrawl doctor
discrawl init

If you keep shell secrets in ~/.profile, add:

export DISCORD_BOT_TOKEN="your-bot-token"

Then reload your shell before running discrawl.

If you already use OpenClaw, discrawl can reuse the Discord token from ~/.openclaw/openclaw.json by default.

Default runtime paths:

  • config: ~/.discrawl/config.toml
  • database: ~/.discrawl/discrawl.db
  • cache: ~/.discrawl/cache/
  • logs: ~/.discrawl/logs/

Install

Homebrew (recommended):

brew install steipete/tap/discrawl  # auto-taps steipete/tap
discrawl --version

Build from source:

git clone https://github.com/steipete/discrawl.git
cd discrawl
go build -o bin/discrawl ./cmd/discrawl
./bin/discrawl --version

Examples below assume discrawl is on PATH. If you built from source without installing it, replace discrawl with ./bin/discrawl.

Quick Start

Reuse an existing OpenClaw Discord bot config:

discrawl init --from-openclaw ~/.openclaw/openclaw.json
discrawl doctor
discrawl sync --full
discrawl search "panic: nil pointer"
discrawl tail

Multi-account OpenClaw setup:

discrawl init --from-openclaw ~/.openclaw/openclaw.json --account atlas

Env-only setup:

export DISCORD_BOT_TOKEN="..."
discrawl doctor
discrawl init
discrawl sync --full

init discovers accessible guilds and writes ~/.discrawl/config.toml. If exactly one guild is available, that guild becomes the default automatically.

doctor is the fastest sanity check:

  • confirms config can be loaded
  • shows where the token was resolved from
  • verifies bot auth
  • shows how many guilds the bot can access
  • verifies DB + FTS wiring

Commands

init

Creates the local config and discovers accessible guilds.

discrawl init
discrawl init --from-openclaw ~/.openclaw/openclaw.json
discrawl init --from-openclaw ~/.openclaw/openclaw.json --account atlas
discrawl init --guild 123456789012345678
discrawl init --db ~/data/discrawl.db

When OpenClaw config tokens use ${ENV_VAR} placeholders, init and doctor resolve them before auth.
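
Placeholder resolution of this kind can be sketched with the Go standard library. The helper resolvePlaceholders is hypothetical, and os.Expand also expands the bare $NAME form, which may be broader than what discrawl actually does.

```go
package main

import (
	"fmt"
	"os"
)

// resolvePlaceholders expands ${NAME} references using the supplied lookup.
// Passing os.Getenv as the lookup reads from the process environment.
func resolvePlaceholders(value string, lookup func(string) string) string {
	return os.Expand(value, lookup)
}

func main() {
	// A fixed map stands in for the environment so the example is deterministic.
	env := map[string]string{"DISCORD_BOT_TOKEN": "example-token"}
	lookup := func(name string) string { return env[name] }
	fmt.Println(resolvePlaceholders("${DISCORD_BOT_TOKEN}", lookup)) // example-token
}
```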

sync

Backfills guild state into SQLite.

discrawl sync --full
discrawl sync --full --all
discrawl sync --guild 123456789012345678
discrawl sync --guilds 123,456 --concurrency 8
discrawl sync --guild 123456789012345678 --skip-members --latest-only
discrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z

sync uses parallel channel workers. --concurrency overrides the default, which is auto-sized from GOMAXPROCS with a floor of 8 and a cap of 32.

Notes:

  • --all ignores default_guild_id and fans out across every discovered guild the bot can access
  • --skip-members refreshes guild/channel/message data without crawling the full member list, which is useful for frequent Git snapshot publishers that only need the latest messages
  • --latest-only skips message bootstrapping for channels without a stored latest cursor, so Git-backed publisher jobs only fill deltas on already-archived channels
  • when --channels includes a forum channel id, discrawl expands that forum's threads and syncs their messages as part of the targeted run
  • --since limits initial history/bootstrap and full-history backfill to messages at or after the given RFC3339 timestamp; it does not mark older history as complete, so a later sync --full without --since can continue the backfill
  • long runs emit periodic progress logs to stderr so large backfills do not look hung
  • if in-flight channels stop completing for a while, discrawl emits "message sync waiting" heartbeat logs with the oldest active channel, per-channel page activity, and skip/defer counters; every run ends with a "message sync finished" summary
  • each channel crawl has a bounded runtime budget, so a pathological channel is deferred and retried on the next sync instead of pinning a worker forever
  • full sync member refresh is best-effort and currently gives up after five minutes when the caller supplies no deadline, so message sync completion is not held up by a slow guild member crawl
  • when the archive is already complete, sync --full reuses the stored backlog markers and limits steady-state refresh to live top-level channels plus active threads instead of revisiting every stored archived thread
  • if a guild already has a local member snapshot, routine syncs reuse it and skip another full member crawl until that snapshot ages out

tail

Runs the live Gateway tail and periodic repair loop.

discrawl tail
discrawl tail --guild 123456789012345678
discrawl tail --repair-every 30m

search

Runs FTS search over archived messages.

discrawl search "panic: nil pointer"
discrawl search --guild 123456789012345678 "payment failed"
discrawl search --channel billing --author steipete --limit 50 "invoice"
discrawl search --include-empty "GitHub"
discrawl --json search "websocket closed"

By default, search skips rows with no searchable content. Attachment text, attachment filenames, embeds, and replies still count as content. Use --include-empty to include the skipped rows. Search returns the newest matching messages first so large local archives stay responsive.

messages

Lists exact message slices by channel, author, and time range.

discrawl messages --channel maintainers --days 7 --all
discrawl messages --channel maintainers --hours 6 --all
discrawl messages --channel "#maintainers" --since 2026-03-01T00:00:00Z
discrawl messages --channel 1456744319972282449 --author steipete --limit 50
discrawl messages --channel maintainers --last 100 --sync
discrawl messages --channel maintainers --days 7 --all --include-empty
discrawl --json messages --channel maintainers --days 3

Notes:

  • --channel accepts a channel id, exact name, #name, or partial name match
  • --hours is shorthand for "since now minus N hours"
  • --days is shorthand for "since now minus N days"
  • --last returns the newest N matching messages, then prints them oldest-to-newest
  • --all removes the safety limit; default is 200
  • --sync runs a blocking pre-query sync for the matching channel or guild scope before reading the local DB
  • rows with no displayable/searchable content are skipped by default; --include-empty opts back in
  • at least one filter is required
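
The time-window shorthands and the --last ordering can be sketched as follows. Both helpers are hypothetical illustrations, not discrawl's code; sinceCutoff treats a day as exactly 24 hours, which is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// sinceCutoff turns --hours / --days style values into a lower time bound:
// "now minus N hours" plus "now minus N days" (a day is assumed to be 24h).
func sinceCutoff(now time.Time, hours, days int) time.Time {
	return now.Add(-time.Duration(hours)*time.Hour - time.Duration(days)*24*time.Hour)
}

// lastN takes rows sorted oldest-to-newest and keeps the newest n of them,
// preserving oldest-to-newest order for printing.
func lastN(rows []string, n int) []string {
	if n < 0 {
		n = 0
	}
	if n > len(rows) {
		n = len(rows)
	}
	return rows[len(rows)-n:]
}

func main() {
	now := time.Date(2026, 3, 8, 12, 0, 0, 0, time.UTC)
	fmt.Println(sinceCutoff(now, 6, 0).Format(time.RFC3339)) // 2026-03-08T06:00:00Z
	fmt.Println(lastN([]string{"m1", "m2", "m3", "m4"}, 2))  // [m3 m4]
}
```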

mentions

Lists structured user and role mentions.

discrawl mentions --channel maintainers --days 7
discrawl mentions --target steipete --type user --limit 50
discrawl mentions --target 1456406468898197625
discrawl --json mentions --type role --days 1

Notes:

  • --target accepts an id, exact name, or partial name match
  • --type can be user or role
  • same guild/time filters as messages

sql

Runs read-only SQL against the local database.

discrawl sql 'select count(*) as messages from messages'
echo 'select guild_id, count(*) from messages group by guild_id' | discrawl sql -

members

discrawl members list
discrawl members show 123456789012345678
discrawl members show --messages 10 steipete
discrawl members search "peter"
discrawl members search "github"
discrawl members search "https://github.com/steipete"

Notes:

  • search matches names plus any offline profile fields present in the archived member payload
  • show accepts a user id or query; if it resolves to one member, it also shows recent messages
  • extracted profile fields may include bio, pronouns, location, website, x, github, and discovered URLs
  • if the bot cannot see a field from Discord, discrawl cannot invent it; this is strictly archive-based offline data

Typical workflow:

discrawl sync --full
discrawl members search "design engineer"
discrawl members search "github"
discrawl members show --messages 25 steipete
discrawl messages --author steipete --days 30 --all

Typical members show output:

guild=1456350064065904867
user=37658261826043904
username=steipete
display=Peter Steinberger
joined=2026-03-08T16:03:14Z
bot=false
x=steipete
github=steipete
website=https://steipete.me
bio=Builds native apps and tooling.
urls=https://steipete.me, https://github.com/steipete
message_count=1284
first_message=2026-02-01T09:00:00Z
last_message=2026-03-08T15:59:58Z

Searchable member data comes from:

  • Discord member/user payload fields archived into members.raw_json
  • explicit profile fields when Discord exposes them
  • URLs and social handles inferred from archived profile text
  • current member snapshot data such as names, nick, roles, and join time

channels

discrawl channels list
discrawl channels show 123456789012345678

status

Shows local archive status.

discrawl status

Git-backed sharing

discrawl can publish the SQLite archive as sharded, compressed NDJSON snapshots in a private Git repo, then auto-import that repo before local read commands.

Publisher:

discrawl publish --remote https://github.com/openclaw/discord-backup.git --push
discrawl publish --readme path/to/discord-backup/README.md --push

Subscriber:

discrawl subscribe https://github.com/openclaw/discord-backup.git
discrawl search "launch checklist"
discrawl messages --channel general --hours 24

subscribe is the Git-only setup path. It writes a config with discord.token_source = "none", imports the snapshot, and does not require a Discord bot token. sync and tail remain disabled in this mode because they need live Discord access.

Configure freshness:

discrawl subscribe --stale-after 15m https://github.com/openclaw/discord-backup.git
discrawl subscribe --no-auto-update https://github.com/openclaw/discord-backup.git

Once share.remote is configured, read commands auto-fetch and import when the local share import is older than share.stale_after (default 15m). discrawl update forces the same pull/import step manually.
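
The freshness check amounts to comparing the age of the last import against share.stale_after. A minimal sketch, assuming timestamps are available as Unix seconds and that stale_after uses Go duration syntax such as "15m"; the function needsUpdate is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// needsUpdate reports whether the last share import is older than staleAfter.
// Timestamps are Unix seconds to keep the sketch small.
func needsUpdate(lastImportUnix, nowUnix int64, staleAfter string) (bool, error) {
	limit, err := time.ParseDuration(staleAfter)
	if err != nil {
		return false, fmt.Errorf("parse stale_after %q: %w", staleAfter, err)
	}
	age := time.Duration(nowUnix-lastImportUnix) * time.Second
	return age > limit, nil
}

func main() {
	now := time.Now().Unix()
	// An import from 20 minutes ago is stale under a 15m threshold.
	stale, err := needsUpdate(now-20*60, now, "15m")
	fmt.Println(stale, err) // true <nil>
}
```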

Hybrid mode is supported too: keep normal Discord credentials configured and set share.remote. discrawl sync and discrawl messages --sync import the Git snapshot first, then use live Discord only to fill anything newer or missing. This keeps day-to-day sync fast while preserving live repair behavior.

The Docker smoke test installs discrawl in a clean Go container, subscribes to a Git snapshot repo, then checks search, messages, sql, and report:

DISCRAWL_DOCKER_TEST=1 go test ./internal/cli -run TestDockerGitSourceSmoke -count=1

report

Generates the Markdown activity block used by the shared backup repo README.

discrawl report
discrawl report --readme path/to/discord-backup/README.md

Every scheduled snapshot publish updates deterministic README stats: latest update time, latest archived message, archive totals, and day/week/month activity.

The backup README field notes are intentionally a separate daily workflow, not part of discrawl report, so model latency or quota cannot block the 15-minute data publish path. .github/workflows/discord-backup-report.yml installs openclaw@latest, runs openclaw agent --local with OpenAI, and inserts a separate discrawl-field-notes block with:

  • what people seem to love
  • what people complain about
  • complaint topics correlated with recent GitHub issue and PR clusters
  • the likely best PR to watch

Configure OPENAI_API_KEY in the discrawl repo secrets to enable agent-written field notes. DISCORD_BACKUP_TOKEN still needs write access to openclaw/discord-backup. If the GitHub repo used for issue/PR correlation is private, also set DISCORD_FIELD_NOTES_GITHUB_TOKEN with read access to that repo.

doctor

Checks config, auth, DB, and FTS wiring.

discrawl doctor

Configuration

init writes a complete config, so most users should not hand-edit anything initially.

Typical config shape:

version = 1
default_guild_id = ""
guild_ids = []
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"

[discord]
token_source = "openclaw" # use "none" for Git-only read access
openclaw_config = "~/.openclaw/openclaw.json"
account = "default"
token_env = "DISCORD_BOT_TOKEN"

[sync]
concurrency = 16
repair_every = "6h"
full_history = true
attachment_text = true

[search]
default_mode = "fts"

[search.embeddings]
enabled = false
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64

[share]
remote = ""
repo_path = "~/.discrawl/share"
branch = "main"
auto_update = true
stale_after = "15m"

The sync.concurrency value above is an example. init writes an auto-sized default based on the host: min(32, max(8, GOMAXPROCS*2)).
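
The sizing formula can be written out in Go as a small clamp. The function name defaultConcurrency is hypothetical; only the formula comes from this README.

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultConcurrency applies min(32, max(8, procs*2)).
func defaultConcurrency(procs int) int {
	n := procs * 2
	if n < 8 {
		n = 8 // floor
	}
	if n > 32 {
		n = 32 // cap
	}
	return n
}

func main() {
	fmt.Println(defaultConcurrency(2))  // 8
	fmt.Println(defaultConcurrency(8))  // 16
	fmt.Println(defaultConcurrency(64)) // 32
	// Value for the current host; varies by machine.
	fmt.Println(defaultConcurrency(runtime.GOMAXPROCS(0)))
}
```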

Config override rules:

  • --config beats everything
  • DISCRAWL_CONFIG overrides the default config path
  • discord.token_source = "env" forces env-only token lookup
  • DISCRAWL_NO_AUTO_UPDATE=1 disables Git snapshot auto-update for read commands in one process, useful for report jobs that already imported a fresh backup.
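
The config path precedence in the first two rules can be sketched as a simple fallthrough. The function resolveConfigPath and its parameters are hypothetical; the default path is the one listed under default runtime paths.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// resolveConfigPath picks the config file location: the --config flag wins,
// then the DISCRAWL_CONFIG environment value, then the default under home.
func resolveConfigPath(flagValue, envValue, homeDir string) string {
	switch {
	case flagValue != "":
		return flagValue
	case envValue != "":
		return envValue
	default:
		return filepath.Join(homeDir, ".discrawl", "config.toml")
	}
}

func main() {
	fmt.Println(resolveConfigPath("/tmp/a.toml", "/tmp/b.toml", "/home/u"))
	fmt.Println(resolveConfigPath("", "/tmp/b.toml", "/home/u"))
	fmt.Println(resolveConfigPath("", "", "/home/u"))
}
```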

Embeddings

Embeddings are optional. FTS is the default search path and the primary verification target.

If enabled, embeddings are intended to enrich recall in background batches, not block the hot sync path.

export OPENAI_API_KEY="..."
discrawl init --with-embeddings
discrawl sync --with-embeddings

Data Stored Locally

  • guild metadata
  • channels and threads in one table
  • current member snapshot
  • canonical message rows
  • append-only message event records
  • FTS index rows
  • optional embedding backlog metadata

SQLite schema migrations are versioned with PRAGMA user_version. Startup fails fast when the local DB schema is newer than the version the binary supports.
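
The fail-fast rule reduces to one comparison once PRAGMA user_version has been read. A minimal sketch; the function checkSchemaVersion and the version numbers are hypothetical, and reading the pragma from SQLite is left out to keep the example dependency-free.

```go
package main

import "fmt"

// checkSchemaVersion returns an error when the database reports a schema
// version newer than the binary supports. Older versions are allowed because
// they can be migrated forward.
func checkSchemaVersion(dbVersion, supported int) error {
	if dbVersion > supported {
		return fmt.Errorf("database schema version %d is newer than supported version %d", dbVersion, supported)
	}
	return nil
}

func main() {
	const supported = 7 // hypothetical value for illustration
	fmt.Println(checkSchemaVersion(7, supported)) // <nil>
	fmt.Println(checkSchemaVersion(8, supported))
}
```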

Attachment binaries are not stored in SQLite.

Set sync.attachment_text = false if you want to keep attachment metadata and filenames but disable attachment body fetches for text indexing.

Security

  • do not commit bot tokens or API keys
  • default config lives in your home directory, not inside the repo
  • CI runs secret scanning with gitleaks
  • doctor reports token source, not token contents

Development

Local gate:

go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.1 run
go test ./... -coverprofile=/tmp/discrawl.cover
go tool cover -func=/tmp/discrawl.cover | tail -n 1
go build ./cmd/discrawl

Target coverage is >= 80%.

CI runs:

  • golangci-lint
  • go test with coverage threshold enforcement
  • go build ./cmd/discrawl
  • gitleaks against git history and the working tree

Notes

  • the schema is multi-guild ready even when the common UX stays single-guild simple
  • threads are stored as channels because that matches the Discord model
  • archived threads are part of the sync surface
  • live sync is resumable; large guilds still take time because Discord rate limits history backfill

License

MIT. See LICENSE.