diff --git a/README.md b/README.md
new file mode 100644
index 0000000..9356a5a
--- /dev/null
+++ b/README.md
@@ -0,0 +1,256 @@
+# discrawl
+
+`discrawl` mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search.
+
+It is a bot-token crawler. No user-token hacks. Data stays local.
+
+## What It Does
+
+- discovers every guild the configured bot can access
+- syncs channels, threads, members, and message history into SQLite
+- maintains FTS5 search indexes for fast local text search
+- tails Gateway events for live updates, with periodic repair syncs
+- exposes read-only SQL for ad hoc analysis
+- keeps schema multi-guild ready while preserving a simple single-guild default UX
+
+Search defaults to all guilds. `sync` and `tail` default to the configured default guild when one exists, otherwise they fan out to all discovered guilds.
+
+## Requirements
+
+- Go `1.26+`
+- a Discord bot token the bot can use to read the target guilds
+- bot permissions for the channels you want archived
+
+Token resolution:
+
+1. OpenClaw config, if `discord.token_source` is not `env`
+2. `DISCORD_BOT_TOKEN` or the configured `discord.token_env`
+
+Default runtime paths:
+
+- config: `~/.discrawl/config.toml`
+- database: `~/.discrawl/discrawl.db`
+- cache: `~/.discrawl/cache/`
+- logs: `~/.discrawl/logs/`
+
+## Install
+
+Build from source:
+
+```bash
+git clone https://github.com/steipete/discrawl.git
+cd discrawl
+go build -o bin/discrawl ./cmd/discrawl
+./bin/discrawl --version
+```
+
+## Quick Start
+
+Reuse an existing OpenClaw Discord bot config:
+
+```bash
+bin/discrawl init --from-openclaw ~/.openclaw/openclaw.json
+bin/discrawl doctor
+bin/discrawl sync --full
+bin/discrawl search "panic: nil pointer"
+bin/discrawl tail
+```
+
+Env-only setup:
+
+```bash
+export DISCORD_BOT_TOKEN="..."
+bin/discrawl init
+bin/discrawl sync --full
+```
+
+`init` discovers accessible guilds and writes `~/.discrawl/config.toml`. If exactly one guild is available, that guild becomes the default automatically.
+
+## Commands
+
+### `init`
+
+Creates the local config and discovers accessible guilds.
+
+```bash
+bin/discrawl init
+bin/discrawl init --from-openclaw ~/.openclaw/openclaw.json
+bin/discrawl init --guild 123456789012345678
+bin/discrawl init --db ~/data/discrawl.db
+```
+
+### `sync`
+
+Backfills guild state into SQLite.
+
+```bash
+bin/discrawl sync --full
+bin/discrawl sync --guild 123456789012345678
+bin/discrawl sync --guilds 123,456 --concurrency 8
+bin/discrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z
+```
+
+### `tail`
+
+Runs the live Gateway tail and periodic repair loop.
+
+```bash
+bin/discrawl tail
+bin/discrawl tail --guild 123456789012345678
+bin/discrawl tail --repair-every 30m
+```
+
+### `search`
+
+Runs FTS search over archived messages.
+
+```bash
+bin/discrawl search "panic: nil pointer"
+bin/discrawl search --guild 123456789012345678 "payment failed"
+bin/discrawl search --channel billing --author steipete --limit 50 "invoice"
+bin/discrawl --json search "websocket closed"
+```
+
+### `sql`
+
+Runs read-only SQL against the local database.
+
+```bash
+bin/discrawl sql 'select count(*) as messages from messages'
+echo 'select guild_id, count(*) from messages group by guild_id' | bin/discrawl sql -
+```
+
+### `members`
+
+```bash
+bin/discrawl members list
+bin/discrawl members show 123456789012345678
+bin/discrawl members search "peter"
+```
+
+### `channels`
+
+```bash
+bin/discrawl channels list
+bin/discrawl channels show 123456789012345678
+```
+
+### `status`
+
+Shows local archive status.
+
+```bash
+bin/discrawl status
+```
+
+### `doctor`
+
+Checks config, auth, DB, and FTS wiring.
+
+```bash
+bin/discrawl doctor
+```
+
+## Configuration
+
+`init` writes a complete config, so most users should not hand-edit anything initially.
+
+Typical config shape:
+
+```toml
+version = 1
+default_guild_id = ""
+guild_ids = []
+db_path = "~/.discrawl/discrawl.db"
+cache_dir = "~/.discrawl/cache"
+log_dir = "~/.discrawl/logs"
+
+[discord]
+token_source = "openclaw"
+openclaw_config = "~/.openclaw/openclaw.json"
+account = "default"
+token_env = "DISCORD_BOT_TOKEN"
+
+[sync]
+concurrency = 4
+repair_every = "6h"
+full_history = true
+
+[search]
+default_mode = "fts"
+
+[search.embeddings]
+enabled = false
+provider = "openai"
+model = "text-embedding-3-small"
+api_key_env = "OPENAI_API_KEY"
+batch_size = 64
+```
+
+Config override rules:
+
+- `--config` beats everything
+- `DISCRAWL_CONFIG` overrides the default config path
+- `discord.token_source = "env"` forces env-only token lookup
+
+## Embeddings
+
+Embeddings are optional. FTS is the default search path and the primary verification target.
+
+If enabled, embeddings are intended to enrich recall in background batches, not block the hot sync path.
+
+```bash
+export OPENAI_API_KEY="..."
+bin/discrawl init --with-embeddings
+bin/discrawl sync --with-embeddings
+```
+
+## Data Stored Locally
+
+- guild metadata
+- channels and threads in one table
+- current member snapshot
+- canonical message rows
+- append-only message event records
+- FTS index rows
+- optional embedding backlog metadata
+
+Attachment binaries are not stored in SQLite.
+
+## Security
+
+- do not commit bot tokens or API keys
+- default config lives in your home directory, not inside the repo
+- CI runs secret scanning with `gitleaks`
+- `doctor` reports token source, not token contents
+
+## Development
+
+Local gate:
+
+```bash
+go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.1 run
+go test ./... -coverprofile=/tmp/discrawl.cover
+go tool cover -func=/tmp/discrawl.cover | tail -n 1
+go build ./cmd/discrawl
+```
+
+Target coverage is `>= 80%`.
+
+CI runs:
+
+- `golangci-lint`
+- `go test` with coverage threshold enforcement
+- `go build ./cmd/discrawl`
+- `gitleaks` against git history and the working tree
+
+## Notes
+
+- the schema is multi-guild ready even when the common UX stays single-guild simple
+- threads are stored as channels because that matches the Discord model
+- archived threads are part of the sync surface
+- live sync is resumable; large guilds still take time because Discord rate limits history backfill
+
+## License
+
+MIT. See [LICENSE](LICENSE).
diff --git a/SPEC.md b/SPEC.md
new file mode 100644
index 0000000..ba088bb
--- /dev/null
+++ b/SPEC.md
@@ -0,0 +1,783 @@
+# discrawl Spec
+
+This file is the build contract for an AI agent working in this repo.
+
+Goal:
+
+- build a local-first Discord guild crawler
+- mirror all guild data the configured bot can access
+- store it in SQLite
+- support fast text search, semantic search, and raw SQL
+- support one-shot backfill and long-running live sync
+
+This spec is intentionally detailed so an agent can keep shipping without re-asking foundational questions.
+
+## Product Summary
+
+`discrawl` is a Go CLI that mirrors Discord guild data into local SQLite.
+
+V1 scope:
+
+- one guild at a time
+- all accessible text channels
+- all accessible announcement channels
+- all accessible forum channels and their posts
+- all accessible public threads
+- all accessible private threads
+- archived thread coverage
+- full message history
+- current member snapshot
+- FTS5 search
+- optional OpenAI embeddings with local vector search
+- raw SQL access
+
+Out of scope for V1:
+
+- personal-account DMs
+- reactions as primary indexed entities
+- attachment blob downloads by default
+- cross-guild unified sync UX
+- write-back or moderation actions
+
+## Requirements Already Chosen
+
+These are settled unless the user explicitly changes them:
+
+- config format: `TOML`
+- config location: `~/.discrawl/config.toml`
+- DB location: `~/.discrawl/discrawl.db`
+- cache dir: `~/.discrawl/cache/`
+- log dir: `~/.discrawl/logs/`
+- token source: reuse Molty / existing OpenClaw Discord bot config
+- guild model: one guild in CLI UX, multi-guild-ready schema
+- search: hybrid, with FTS first and embeddings optional
+- embedding provider: OpenAI
+- API key source: `OPENAI_API_KEY` from shell env
+- message retention: current canonical row + append-only event log
+- member retention: current snapshot only
+- files: metadata only in DB, fetch binaries later on demand
+- reactions: not important for V1
+- polls: flatten into text during normalization
+
+## Local Environment Contract
+
+An agent should assume:
+
+- repo path: `~/Projects/discrawl`
+- shell: `zsh`
+- Go is installed and modern
+- user is Peter
+- user keeps many secrets in `~/.profile`
+- an existing OpenClaw install may already contain usable Discord bot config
+
+### Key file paths
+
+- `~/.discrawl/config.toml`
+- `~/.discrawl/discrawl.db`
+- `~/.profile`
+- `~/.openclaw/openclaw.json`
+- `~/.openclaw/openclaw.json.bak*`
+
+### Existing bot config
+
+The current bot token source is expected in:
+
+- `~/.openclaw/openclaw.json`
+
+Expected path inside JSON:
+
+- `channels.discord.token`
+
+Expected guild selection path:
+
+- `channels.discord.guilds`
+
+The current intended default mode is:
+
+- `discrawl init --from-openclaw ~/.openclaw/openclaw.json`
+
+### OpenAI embeddings key
+
+Do not store raw API keys in repo files.
+
+Expected source:
+
+- env var `OPENAI_API_KEY`
+
+Typical place to discover it locally:
+
+- `~/.profile`
+
+The code should read the env var at runtime, not copy the value into config by default.
+
+## Discord Data Model Notes
+
+Important Discord facts that drive the schema:
+
+- channels and threads are closely related; threads should be stored as channels
+- forum posts are threads under a forum parent
+- message history is paginated and must be backfilled incrementally
+- live updates come from Gateway events, not from polling alone
+- archived public and private threads must be enumerated explicitly
+- private archived thread access may require elevated bot perms like `Manage Threads`
+
+### Entities to mirror
+
+- guild
+- categories
+- channels
+- threads
+- members
+- messages
+- message lifecycle events
+
+### Channel kinds worth preserving
+
+- category
+- text
+- announcement
+- forum
+- thread public
+- thread private
+- thread announcement
+
+Voice channels can be mirrored as metadata rows, but there is no need to crawl message history because there is none.
+
+## Database Design
+
+Use SQLite.
+
+Requirements:
+
+- WAL mode
+- foreign keys on
+- FTS5 enabled
+- vector extension optional
+
+### Tables
+
+At minimum:
+
+- `guilds`
+- `channels`
+- `members`
+- `messages`
+- `message_events`
+- `sync_state`
+- `embedding_jobs`
+- `message_fts`
+
+Optional once vectors are wired:
+
+- `message_embeddings`
+
+### `guilds`
+
+Suggested shape:
+
+```sql
+create table guilds (
+  id text primary key,
+  name text not null,
+  icon text,
+  raw_json text not null,
+  updated_at text not null
+);
+```
+
+### `channels`
+
+Threads should live in the same table.
+
+Suggested shape:
+
+```sql
+create table channels (
+  id text primary key,
+  guild_id text not null,
+  parent_id text,
+  kind text not null,
+  name text not null,
+  topic text,
+  position integer,
+  is_nsfw integer not null default 0,
+  is_archived integer not null default 0,
+  is_locked integer not null default 0,
+  is_private_thread integer not null default 0,
+  thread_parent_id text,
+  archive_timestamp text,
+  raw_json text not null,
+  updated_at text not null
+);
+```
+
+### `members`
+
+Suggested shape:
+
+```sql
+create table members (
+  guild_id text not null,
+  user_id text not null,
+  username text not null,
+  global_name text,
+  display_name text,
+  nick text,
+  discriminator text,
+  avatar text,
+  bot integer not null default 0,
+  joined_at text,
+  role_ids_json text not null,
+  raw_json text not null,
+  updated_at text not null,
+  primary key (guild_id, user_id)
+);
+```
+
+### `messages`
+
+Suggested shape:
+
+```sql
+create table messages (
+  id text primary key,
+  guild_id text not null,
+  channel_id text not null,
+  author_id text,
+  message_type integer not null,
+  created_at text not null,
+  edited_at text,
+  deleted_at text,
+  content text not null,
+  normalized_content text not null,
+  reply_to_message_id text,
+  pinned integer not null default 0,
+  has_attachments integer not null default 0,
+  raw_json text not null,
+  updated_at text not null
+);
+```
+
+### `message_events`
+
+Suggested shape:
+
+```sql
+create table message_events (
+  event_id integer primary key autoincrement,
+  guild_id text not null,
+  channel_id text not null,
+  message_id text not null,
+  event_type text not null,
+  event_at text not null,
+  payload_json text not null
+);
+```
+
+### `sync_state`
+
+Suggested shape:
+
+```sql
+create table sync_state (
+  scope text primary key,
+  cursor text,
+  updated_at text not null
+);
+```
+
+Examples of `scope`:
+
+- `guild::members`
+- `channel::messages`
+- `tail:`
+
+### `embedding_jobs`
+
+Suggested shape:
+
+```sql
+create table embedding_jobs (
+  message_id text primary key,
+  state text not null,
+  attempts integer not null default 0,
+  updated_at text not null
+);
+```
+
+### FTS
+
+Recommended pattern:
+
+- content table = `messages`
+- FTS virtual table = `message_fts`
+- keep it updated explicitly, not by fragile magic
+
+Suggested columns:
+
+- `message_id`
+- `guild_id`
+- `channel_id`
+- `author_id`
+- `author_name`
+- `channel_name`
+- `content`
+
+## Search Design
+
+### Modes
+
+Support three modes:
+
+- `fts`
+- `semantic`
+- `hybrid`
+
+Default:
+
+- `hybrid` when embeddings are enabled
+- `fts` otherwise
+
+### FTS behavior
+
+FTS is mandatory.
+
+It should be good enough that the tool is useful before embeddings exist.
+
+Expected use cases:
+
+- exact terms
+- commands
+- stack traces
+- URLs
+- model names
+- channel names
+- user names
+
+### Semantic behavior
+
+Embeddings are optional but planned from day one.
+
+Recommended provider:
+
+- OpenAI `text-embedding-3-small`
+
+Implementation guidance:
+
+- batch embedding jobs
+- keep embedding generation out of the hot sync path
+- store vectors locally
+- semantic search should degrade gracefully when vectors are absent
+
+### Vector store choice
+
+Prefer SQLite-local vector search so the whole product stays portable.
+
+Recommended direction:
+
+- `sqlite-vec`
+
+This can be wired after the base crawler and FTS system work.
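The mode defaults in the Search Design section (`hybrid` when embeddings are enabled, `fts` otherwise) and the "degrade gracefully" guidance can be sketched in Go roughly as follows. This is only an illustration: `ResolveMode` and its parameters are hypothetical names, and treating "degrade gracefully" as a fallback to `fts` is an assumption, not something the spec states.

```go
package main

import "fmt"

// Mode names taken from the spec's Search Design section.
const (
	ModeFTS      = "fts"
	ModeSemantic = "semantic"
	ModeHybrid   = "hybrid"
)

// ResolveMode is a hypothetical helper. requested is the --mode value,
// or "" when the flag was not given.
func ResolveMode(requested string, embeddingsEnabled, vectorsPresent bool) (string, error) {
	mode := requested
	if mode == "" {
		// Spec default: hybrid when embeddings are enabled, fts otherwise.
		if embeddingsEnabled {
			mode = ModeHybrid
		} else {
			mode = ModeFTS
		}
	}

	switch mode {
	case ModeFTS:
		return ModeFTS, nil
	case ModeSemantic, ModeHybrid:
		// Assumption: degrading gracefully means falling back to FTS
		// when embeddings are off or no vectors have been stored yet.
		if !embeddingsEnabled || !vectorsPresent {
			return ModeFTS, nil
		}
		return mode, nil
	default:
		return "", fmt.Errorf("unknown search mode %q", mode)
	}
}

func main() {
	mode, err := ResolveMode("", true, false)
	fmt.Println(mode, err)
}
```

Keeping the decision in one small function means `search` and `doctor` could report the same effective mode, though whether to fall back silently or warn the user is a design choice the spec leaves open.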
+
+## CLI Spec
+
+Design goals:
+
+- simple for humans
+- composable for scripts
+- obvious nouns and verbs
+- no secrets in flags
+
+Usage:
+
+```text
+discrawl [global flags] [args]
+```
+
+### Global flags
+
+- `-h, --help`
+- `--version`
+- `--config `
+- `--json`
+- `--plain`
+- `-q, --quiet`
+- `-v, --verbose`
+- `--no-color`
+
+### Commands
+
+- `init`
+- `sync`
+- `tail`
+- `search`
+- `sql`
+- `members`
+- `channels`
+- `status`
+- `doctor`
+
+### `init`
+
+Purpose:
+
+- create `~/.discrawl/config.toml`
+- import defaults from OpenClaw
+- persist guild id and DB path
+
+Expected flags:
+
+- `--from-openclaw `
+- `--guild `
+- `--db `
+- `--with-embeddings`
+
+### `sync`
+
+Purpose:
+
+- one-shot crawl
+
+Expected flags:
+
+- `--full`
+- `--since `
+- `--concurrency `
+- `--with-embeddings`
+
+Requirements:
+
+- idempotent
+- restart-safe
+- shows progress on stderr
+
+### `tail`
+
+Purpose:
+
+- live sync from Gateway
+
+Expected flags:
+
+- `--repair-every `
+- `--with-embeddings`
+
+Requirements:
+
+- reconnect automatically
+- write checkpoints
+- periodic repair sync
+
+### `search`
+
+Purpose:
+
+- query mirrored messages
+
+Expected flags:
+
+- `--mode fts|semantic|hybrid`
+- `--channel `
+- `--author `
+- `--limit `
+- `--json`
+- `--plain`
+
+### `sql`
+
+Purpose:
+
+- run read-only SQL
+
+Requirements:
+
+- support query arg or stdin
+- block non-read-only statements by default
+
+### `members`
+
+Subcommands:
+
+- `list`
+- `show `
+- `search `
+
+### `channels`
+
+Subcommands:
+
+- `list`
+- `show `
+
+### `status`
+
+Must show:
+
+- guild id
+- guild name if known
+- db path
+- total channels
+- total threads
+- total messages
+- total members
+- last sync time
+- last tail event time
+- embedding backlog
+
+### `doctor`
+
+Must check:
+
+- config file readable
+- OpenClaw token source readable
+- Discord auth valid
+- guild reachable
+- DB openable
+- FTS present
+- vector extension present if configured
+
+## Config Spec
+
+Format:
+
+- TOML
+
+Location:
+
+- `~/.discrawl/config.toml`
+
+Suggested shape:
+
+```toml
+version = 1
+guild_id = "1456350064065904867"
+db_path = "~/.discrawl/discrawl.db"
+cache_dir = "~/.discrawl/cache"
+log_dir = "~/.discrawl/logs"
+
+[discord]
+token_source = "openclaw"
+openclaw_config = "~/.openclaw/openclaw.json"
+channel_account = "discord"
+
+[sync]
+concurrency = 4
+repair_every = "6h"
+full_history = true
+
+[search]
+default_mode = "hybrid"
+
+[search.embeddings]
+enabled = true
+provider = "openai"
+model = "text-embedding-3-small"
+api_key_env = "OPENAI_API_KEY"
+batch_size = 64
+```
+
+Config precedence:
+
+1. flags
+2. environment
+3. config file
+
+Environment variables:
+
+- `DISCRAWL_CONFIG`
+- `OPENAI_API_KEY`
+
+## Token Handling Rules
+
+Do not:
+
+- put bot tokens in git
+- put API keys in git
+- print secrets in normal logs
+
+Do:
+
+- load bot token from OpenClaw config path
+- load OpenAI key from env
+- redact secrets in debug and doctor output
+
+## Discord Sync Algorithm
+
+### Initial full sync
+
+1. load config
+2. resolve token
+3. fetch bot identity
+4. fetch guild metadata
+5. fetch guild channels
+6. fetch active threads
+7. enumerate archived public threads per parent channel
+8. enumerate archived private threads per parent channel
+9. fetch member snapshot
+10. backfill messages for every crawlable channel and thread
+11. normalize message content
+12. upsert `messages`
+13. append `message_events` where relevant
+14. update FTS rows
+15. enqueue embedding jobs
+16. write checkpoints
+
+### Message crawl strategy
+
+Use REST pagination with `before`.
+
+Rules:
+
+- fetch newest page first for incremental runs
+- fetch oldest via repeated `before` paging for full runs
+- stop when no messages remain
+- handle rate limits centrally
+
+### Live tail strategy
+
+Use Gateway events for:
+
+- new messages
+- edited messages
+- deleted messages
+- channel updates
+- thread updates
+- member updates
+
+Tail should:
+
+- upsert live state
+- append lifecycle events
+- keep retrying on disconnect
+- periodically run repair sync
+
+## Message Normalization
+
+`normalized_content` should flatten Discord payloads into searchable text.
+
+Include:
+
+- message content
+- embed titles and descriptions where helpful
+- poll question and answers
+- attachment filenames
+- referenced message hints if available
+
+Do not overcomplicate:
+
+- reactions can be ignored
+- attachment binary contents are not indexed in V1
+
+## Member Query Design
+
+Members matter for AI workflows.
+
+Expected use cases:
+
+- “who is this user”
+- “find messages by this person”
+- “find maintainers”
+- “find everyone with a display name containing X”
+
+At minimum, store:
+
+- user id
+- username
+- display name
+- nick
+- roles
+- bot flag
+
+## Recommended Go Package Layout
+
+```text
+cmd/discrawl/
+internal/cli/
+internal/config/
+internal/discord/
+internal/store/
+internal/search/
+internal/syncer/
+internal/embed/
+```
+
+Responsibilities:
+
+- `internal/cli`: command wiring, output modes
+- `internal/config`: parse and validate config
+- `internal/discord`: REST + Gateway client wrappers
+- `internal/store`: SQLite schema, migrations, queries
+- `internal/search`: FTS and result ranking
+- `internal/syncer`: full sync and repair orchestration
+- `internal/embed`: embedding queue and provider integration
+
+## Recommended Dependencies
+
+Reasonable picks:
+
+- Discord client: `github.com/bwmarrin/discordgo`
+- TOML parser: something small and maintained
+- SQLite driver: pick one path and stay consistent
+- vector search: `sqlite-vec`
+
+Guidance:
+
+- keep dependency count low
+- prefer boring stable libraries
+- avoid frameworks
+
+## Milestones
+
+### Milestone 1
+
+- config loader
+- `init`
+- `status`
+- DB open + migrations
+
+### Milestone 2
+
+- guild metadata sync
+- channel sync
+- member sync
+
+### Milestone 3
+
+- full message backfill
+- incremental checkpoints
+- FTS indexing
+
+### Milestone 4
+
+- `search`
+- `sql`
+- `members`
+- `channels`
+
+### Milestone 5
+
+- `tail`
+- reconnect logic
+- repair loop
+
+### Milestone 6
+
+- embedding queue
+- vector search
+- hybrid ranking
+
+## What The Repo Must Eventually Contain
+
+For an AI agent to finish the product without external memory, this repo should contain:
+
+- this spec
+- README with user-facing overview
+- schema and migration files
+- config sample
+- CLI contract
+- implementation package layout
+- token discovery rules
+- API key discovery rules
+- milestone order
+
+This file is the authoritative engineering spec for now.