docs: add usage guide and build spec

Peter Steinberger 2026-03-07 14:44:51 +00:00
parent 26f9d03705
commit d2b5c7e668
2 changed files with 1039 additions and 0 deletions

256
README.md Normal file

@@ -0,0 +1,256 @@
# discrawl
`discrawl` mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search.
It is a bot-token crawler. No user-token hacks. Data stays local.
## What It Does
- discovers every guild the configured bot can access
- syncs channels, threads, members, and message history into SQLite
- maintains FTS5 search indexes for fast local text search
- tails Gateway events for live updates, with periodic repair syncs
- exposes read-only SQL for ad hoc analysis
- keeps schema multi-guild ready while preserving a simple single-guild default UX
Search defaults to all guilds. `sync` and `tail` default to the configured default guild when one exists; otherwise they fan out to all discovered guilds.
## Requirements
- Go `1.26+`
- a Discord bot token for a bot that is a member of the target guilds
- bot permissions for the channels you want archived
Token resolution:
1. OpenClaw config, if `discord.token_source` is not `env`
2. `DISCORD_BOT_TOKEN` or the configured `discord.token_env`
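The resolution order above can be sketched in Go as follows. This is a minimal illustration, not the actual implementation: `TokenConfig`, `resolveToken`, and the injected `fromOpenClaw` lookup are hypothetical names, and falling through to the environment when OpenClaw yields no token is an assumption based on the ordering of the list.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// TokenConfig mirrors the [discord] config keys relevant to token lookup.
// The struct and its field names are hypothetical.
type TokenConfig struct {
	TokenSource string // "openclaw" or "env"
	TokenEnv    string // env var name; empty means DISCORD_BOT_TOKEN
}

// resolveToken tries the OpenClaw config first unless token_source is "env",
// then falls back to the environment variable (assumed fallback behavior).
// fromOpenClaw is injected so the sketch stays self-contained.
func resolveToken(cfg TokenConfig, fromOpenClaw func() (string, bool), getenv func(string) string) (token, source string, err error) {
	if cfg.TokenSource != "env" {
		if tok, ok := fromOpenClaw(); ok && tok != "" {
			return tok, "openclaw", nil
		}
	}
	name := cfg.TokenEnv
	if name == "" {
		name = "DISCORD_BOT_TOKEN"
	}
	if tok := getenv(name); tok != "" {
		return tok, "env:" + name, nil
	}
	return "", "", errors.New("no Discord bot token found")
}

func main() {
	noOpenClaw := func() (string, bool) { return "", false }
	_, source, err := resolveToken(TokenConfig{TokenSource: "env"}, noOpenClaw, os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Report only where the token came from, never the token itself.
	fmt.Println("token source:", source)
}
```

Returning the source alongside the token makes it easy for a command such as `doctor` to report where the token came from without printing its contents.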
Default runtime paths:
- config: `~/.discrawl/config.toml`
- database: `~/.discrawl/discrawl.db`
- cache: `~/.discrawl/cache/`
- logs: `~/.discrawl/logs/`
## Install
Build from source:
```bash
git clone https://github.com/steipete/discrawl.git
cd discrawl
go build -o bin/discrawl ./cmd/discrawl
./bin/discrawl --version
```
## Quick Start
Reuse an existing OpenClaw Discord bot config:
```bash
bin/discrawl init --from-openclaw ~/.openclaw/openclaw.json
bin/discrawl doctor
bin/discrawl sync --full
bin/discrawl search "panic: nil pointer"
bin/discrawl tail
```
Env-only setup:
```bash
export DISCORD_BOT_TOKEN="..."
bin/discrawl init
bin/discrawl sync --full
```
`init` discovers accessible guilds and writes `~/.discrawl/config.toml`. If exactly one guild is available, that guild becomes the default automatically.
## Commands
### `init`
Creates the local config and discovers accessible guilds.
```bash
bin/discrawl init
bin/discrawl init --from-openclaw ~/.openclaw/openclaw.json
bin/discrawl init --guild 123456789012345678
bin/discrawl init --db ~/data/discrawl.db
```
### `sync`
Backfills guild state into SQLite.
```bash
bin/discrawl sync --full
bin/discrawl sync --guild 123456789012345678
bin/discrawl sync --guilds 123,456 --concurrency 8
bin/discrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z
```
### `tail`
Runs the live Gateway tail and periodic repair loop.
```bash
bin/discrawl tail
bin/discrawl tail --guild 123456789012345678
bin/discrawl tail --repair-every 30m
```
### `search`
Runs FTS search over archived messages.
```bash
bin/discrawl search "panic: nil pointer"
bin/discrawl search --guild 123456789012345678 "payment failed"
bin/discrawl search --channel billing --author steipete --limit 50 "invoice"
bin/discrawl --json search "websocket closed"
```
### `sql`
Runs read-only SQL against the local database.
```bash
bin/discrawl sql 'select count(*) as messages from messages'
echo 'select guild_id, count(*) from messages group by guild_id' | bin/discrawl sql -
```
### `members`
```bash
bin/discrawl members list
bin/discrawl members show 123456789012345678
bin/discrawl members search "peter"
```
### `channels`
```bash
bin/discrawl channels list
bin/discrawl channels show 123456789012345678
```
### `status`
Shows local archive status.
```bash
bin/discrawl status
```
### `doctor`
Checks config, auth, DB, and FTS wiring.
```bash
bin/discrawl doctor
```
## Configuration
`init` writes a complete config, so most users should not need to hand-edit anything initially.
Typical config shape:
```toml
version = 1
default_guild_id = ""
guild_ids = []
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"
[discord]
token_source = "openclaw"
openclaw_config = "~/.openclaw/openclaw.json"
account = "default"
token_env = "DISCORD_BOT_TOKEN"
[sync]
concurrency = 4
repair_every = "6h"
full_history = true
[search]
default_mode = "fts"
[search.embeddings]
enabled = false
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64
```
Config override rules:
- `--config` takes precedence over `DISCRAWL_CONFIG` and the default config path
- `DISCRAWL_CONFIG` overrides the default config path
- `discord.token_source = "env"` forces env-only token lookup
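As a rough sketch of the config path rules above: the flag value wins, then `DISCRAWL_CONFIG`, then the default path. The function name `resolveConfigPath` and its signature are hypothetical and shown only to illustrate the ordering.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveConfigPath picks the config file path: the --config flag value
// first, then DISCRAWL_CONFIG, then the default under the home directory.
// The name and signature are hypothetical.
func resolveConfigPath(flagValue string, getenv func(string) string, home string) string {
	if flagValue != "" {
		return flagValue
	}
	if env := getenv("DISCRAWL_CONFIG"); env != "" {
		return env
	}
	return filepath.Join(home, ".discrawl", "config.toml")
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(resolveConfigPath("", os.Getenv, home))
}
```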
## Embeddings
Embeddings are optional. FTS is the default search path and the primary verification target.
When enabled, embeddings are generated in background batches to improve recall; they are not meant to block the hot sync path.
```bash
export OPENAI_API_KEY="..."
bin/discrawl init --with-embeddings
bin/discrawl sync --with-embeddings
```
## Data Stored Locally
- guild metadata
- channels and threads in one table
- current member snapshot
- canonical message rows
- append-only message event records
- FTS index rows
- optional embedding backlog metadata
Attachment binaries are not stored in SQLite.
## Security
- do not commit bot tokens or API keys
- default config lives in your home directory, not inside the repo
- CI runs secret scanning with `gitleaks`
- `doctor` reports token source, not token contents
## Development
Local gate:
```bash
go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.1 run
go test ./... -coverprofile=/tmp/discrawl.cover
go tool cover -func=/tmp/discrawl.cover | tail -n 1
go build ./cmd/discrawl
```
Target coverage is `>= 80%`.
CI runs:
- `golangci-lint`
- `go test` with coverage threshold enforcement
- `go build ./cmd/discrawl`
- `gitleaks` against git history and the working tree
## Notes
- the schema is multi-guild ready even when the common UX stays single-guild simple
- threads are stored as channels because that matches the Discord model
- archived threads are part of the sync surface
- sync is resumable; large guilds still take time because Discord rate-limits history backfill
## License
MIT. See [LICENSE](LICENSE).

783
SPEC.md Normal file

@@ -0,0 +1,783 @@
# discrawl Spec
This file is the build contract for an AI agent working in this repo.
Goal:
- build a local-first Discord guild crawler
- mirror all guild data the configured bot can access
- store it in SQLite
- support fast text search, semantic search, and raw SQL
- support one-shot backfill and long-running live sync
This spec is intentionally detailed so an agent can keep shipping without re-asking foundational questions.
## Product Summary
`discrawl` is a Go CLI that mirrors Discord guild data into local SQLite.
V1 scope:
- one guild at a time
- all accessible text channels
- all accessible announcement channels
- all accessible forum channels and their posts
- all accessible public threads
- all accessible private threads
- archived thread coverage
- full message history
- current member snapshot
- FTS5 search
- optional OpenAI embeddings with local vector search
- raw SQL access
Out of scope for V1:
- personal-account DMs
- reactions as primary indexed entities
- attachment blob downloads by default
- cross-guild unified sync UX
- write-back or moderation actions
## Requirements Already Chosen
These are settled unless the user explicitly changes them:
- config format: `TOML`
- config location: `~/.discrawl/config.toml`
- DB location: `~/.discrawl/discrawl.db`
- cache dir: `~/.discrawl/cache/`
- log dir: `~/.discrawl/logs/`
- token source: reuse Molty / existing OpenClaw Discord bot config
- guild model: one guild in CLI UX, multi-guild-ready schema
- search: hybrid, with FTS first and embeddings optional
- embedding provider: OpenAI
- API key source: `OPENAI_API_KEY` from shell env
- message retention: current canonical row + append-only event log
- member retention: current snapshot only
- files: metadata only in DB, fetch binaries later on demand
- reactions: not important for V1
- polls: flatten into text during normalization
## Local Environment Contract
An agent should assume:
- repo path: `~/Projects/discrawl`
- shell: `zsh`
- Go is installed and recent (the README requires `1.26+`)
- user is Peter
- user keeps many secrets in `~/.profile`
- an existing OpenClaw install may already contain usable Discord bot config
### Key file paths
- `~/.discrawl/config.toml`
- `~/.discrawl/discrawl.db`
- `~/.profile`
- `~/.openclaw/openclaw.json`
- `~/.openclaw/openclaw.json.bak*`
### Existing bot config
The current bot token source is expected in:
- `~/.openclaw/openclaw.json`
Expected path inside JSON:
- `channels.discord.token`
Expected guild selection path:
- `channels.discord.guilds`
The current intended default mode is:
- `discrawl init --from-openclaw ~/.openclaw/openclaw.json`
### OpenAI embeddings key
Do not store raw API keys in repo files.
Expected source:
- env var `OPENAI_API_KEY`
Typical place to discover it locally:
- `~/.profile`
The code should read the env var at runtime, not copy the value into config by default.
## Discord Data Model Notes
Important Discord facts that drive the schema:
- channels and threads are closely related; threads should be stored as channels
- forum posts are threads under a forum parent
- message history is paginated and must be backfilled incrementally
- live updates come from Gateway events, not from polling alone
- archived public and private threads must be enumerated explicitly
- private archived thread access may require elevated bot permissions such as `Manage Threads`
### Entities to mirror
- guild
- categories
- channels
- threads
- members
- messages
- message lifecycle events
### Channel kinds worth preserving
- category
- text
- announcement
- forum
- thread public
- thread private
- thread announcement
Voice channels can be mirrored as metadata rows. Crawling their message history is not a V1 goal; Discord voice channels can carry text chat, so this is a scope choice rather than a platform limit.
## Database Design
Use SQLite.
Requirements:
- WAL mode
- foreign keys on
- FTS5 enabled
- vector extension optional
### Tables
At minimum:
- `guilds`
- `channels`
- `members`
- `messages`
- `message_events`
- `sync_state`
- `embedding_jobs`
- `message_fts`
Optional once vectors are wired:
- `message_embeddings`
### `guilds`
Suggested shape:
```sql
create table guilds (
id text primary key,
name text not null,
icon text,
raw_json text not null,
updated_at text not null
);
```
### `channels`
Threads should live in the same table.
Suggested shape:
```sql
create table channels (
id text primary key,
guild_id text not null,
parent_id text,
kind text not null,
name text not null,
topic text,
position integer,
is_nsfw integer not null default 0,
is_archived integer not null default 0,
is_locked integer not null default 0,
is_private_thread integer not null default 0,
thread_parent_id text,
archive_timestamp text,
raw_json text not null,
updated_at text not null
);
```
### `members`
Suggested shape:
```sql
create table members (
guild_id text not null,
user_id text not null,
username text not null,
global_name text,
display_name text,
nick text,
discriminator text,
avatar text,
bot integer not null default 0,
joined_at text,
role_ids_json text not null,
raw_json text not null,
updated_at text not null,
primary key (guild_id, user_id)
);
```
### `messages`
Suggested shape:
```sql
create table messages (
id text primary key,
guild_id text not null,
channel_id text not null,
author_id text,
message_type integer not null,
created_at text not null,
edited_at text,
deleted_at text,
content text not null,
normalized_content text not null,
reply_to_message_id text,
pinned integer not null default 0,
has_attachments integer not null default 0,
raw_json text not null,
updated_at text not null
);
```
### `message_events`
Suggested shape:
```sql
create table message_events (
event_id integer primary key autoincrement,
guild_id text not null,
channel_id text not null,
message_id text not null,
event_type text not null,
event_at text not null,
payload_json text not null
);
```
### `sync_state`
Suggested shape:
```sql
create table sync_state (
scope text primary key,
cursor text,
updated_at text not null
);
```
Examples of `scope`:
- `guild:<guild_id>:members`
- `channel:<channel_id>:messages`
- `tail:<guild_id>`
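To keep scope keys consistent across the codebase, they can be built by small helpers rather than formatted ad hoc. The sketch below assumes hypothetical function names; only the key formats come from the examples above.

```go
package main

import "fmt"

// Scope builders for sync_state keys. The function names are hypothetical;
// the key formats follow the examples in this spec.
func membersScope(guildID string) string {
	return "guild:" + guildID + ":members"
}

func messagesScope(channelID string) string {
	return "channel:" + channelID + ":messages"
}

func tailScope(guildID string) string {
	return "tail:" + guildID
}

func main() {
	fmt.Println(membersScope("123"))  // guild:123:members
	fmt.Println(messagesScope("456")) // channel:456:messages
	fmt.Println(tailScope("123"))     // tail:123
}
```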
### `embedding_jobs`
Suggested shape:
```sql
create table embedding_jobs (
message_id text primary key,
state text not null,
attempts integer not null default 0,
updated_at text not null
);
```
### FTS
Recommended pattern:
- content table = `messages`
- FTS virtual table = `message_fts`
- keep it updated explicitly in application code, not through implicit mechanisms that are easy to break (for example, hidden triggers)
Suggested columns:
- `message_id`
- `guild_id`
- `channel_id`
- `author_id`
- `author_name`
- `channel_name`
- `content`
## Search Design
### Modes
Support three modes:
- `fts`
- `semantic`
- `hybrid`
Default:
- `hybrid` when embeddings are enabled
- `fts` otherwise
### FTS behavior
FTS is mandatory.
It should be good enough that the tool is useful before embeddings exist.
Expected use cases:
- exact terms
- commands
- stack traces
- URLs
- model names
- channel names
- user names
### Semantic behavior
Embeddings are optional but planned from day one.
Recommended provider:
- OpenAI `text-embedding-3-small`
Implementation guidance:
- batch embedding jobs
- keep embedding generation out of the hot sync path
- store vectors locally
- semantic search should degrade gracefully when vectors are absent
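One way to combine the default-mode rule with graceful degradation is sketched below. `resolveMode` is a hypothetical helper, and falling back to `fts` when vectors are missing is only one possible reading of "degrade gracefully"; the spec does not fix the exact behavior.

```go
package main

import "fmt"

// resolveMode picks the effective search mode.
// requested is the --mode value ("" when the flag is not given).
// Falling back to "fts" when embeddings are disabled or vectors are absent
// is an assumption, not documented behavior.
func resolveMode(requested string, embeddingsEnabled, vectorsPresent bool) (string, error) {
	mode := requested
	if mode == "" {
		if embeddingsEnabled {
			mode = "hybrid"
		} else {
			mode = "fts"
		}
	}
	switch mode {
	case "fts":
		return "fts", nil
	case "semantic", "hybrid":
		if !embeddingsEnabled || !vectorsPresent {
			return "fts", nil
		}
		return mode, nil
	default:
		return "", fmt.Errorf("unknown search mode %q", mode)
	}
}

func main() {
	mode, err := resolveMode("", true, false)
	fmt.Println(mode, err)
}
```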
### Vector store choice
Prefer SQLite-local vector search so the whole product stays portable.
Recommended direction:
- `sqlite-vec`
This can be wired after the base crawler and FTS system work.
## CLI Spec
Design goals:
- simple for humans
- composable for scripts
- obvious nouns and verbs
- no secrets in flags
Usage:
```text
discrawl [global flags] <command> [args]
```
### Global flags
- `-h, --help`
- `--version`
- `--config <path>`
- `--json`
- `--plain`
- `-q, --quiet`
- `-v, --verbose`
- `--no-color`
### Commands
- `init`
- `sync`
- `tail`
- `search`
- `sql`
- `members`
- `channels`
- `status`
- `doctor`
### `init`
Purpose:
- create `~/.discrawl/config.toml`
- import defaults from OpenClaw
- persist guild id and DB path
Expected flags:
- `--from-openclaw <path>`
- `--guild <id>`
- `--db <path>`
- `--with-embeddings`
### `sync`
Purpose:
- one-shot crawl
Expected flags:
- `--full`
- `--since <timestamp>`
- `--concurrency <n>`
- `--with-embeddings`
Requirements:
- idempotent
- restart-safe
- shows progress on stderr
### `tail`
Purpose:
- live sync from Gateway
Expected flags:
- `--repair-every <duration>`
- `--with-embeddings`
Requirements:
- reconnect automatically
- write checkpoints
- periodic repair sync
### `search`
Purpose:
- query mirrored messages
Expected flags:
- `--mode fts|semantic|hybrid`
- `--channel <name-or-id>`
- `--author <name-or-id>`
- `--limit <n>`
- `--json`
- `--plain`
### `sql`
Purpose:
- run read-only SQL
Requirements:
- support query arg or stdin
- block non-read-only statements by default
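A coarse guard for the read-only requirement could look like the sketch below. `looksReadOnly` is a hypothetical helper and deliberately conservative: it is a first-keyword check, not a SQL parser, and it rejects `WITH` because SQLite allows a `WITH` clause in front of `INSERT`, `UPDATE`, and `DELETE`. Opening the database read-only (for example via the SQLite `mode=ro` URI parameter or `PRAGMA query_only`) is likely a more robust primary defense, with a check like this only improving the error message.

```go
package main

import (
	"fmt"
	"strings"
)

// looksReadOnly accepts a single statement that starts with SELECT or
// EXPLAIN and rejects everything else, including multi-statement input.
// A semicolon inside a string literal is also rejected, which is an
// intentional trade-off for this sketch.
func looksReadOnly(query string) bool {
	q := strings.TrimSpace(query)
	q = strings.TrimSuffix(q, ";")
	if strings.Contains(q, ";") {
		return false
	}
	fields := strings.Fields(q)
	if len(fields) == 0 {
		return false
	}
	switch strings.ToLower(fields[0]) {
	case "select", "explain":
		return true
	}
	return false
}

func main() {
	fmt.Println(looksReadOnly("select count(*) as messages from messages"))
	fmt.Println(looksReadOnly("delete from messages"))
}
```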
### `members`
Subcommands:
- `list`
- `show <user-id>`
- `search <query>`
### `channels`
Subcommands:
- `list`
- `show <channel-id>`
### `status`
Must show:
- guild id
- guild name if known
- db path
- total channels
- total threads
- total messages
- total members
- last sync time
- last tail event time
- embedding backlog
### `doctor`
Must check:
- config file readable
- OpenClaw token source readable
- Discord auth valid
- guild reachable
- DB openable
- FTS present
- vector extension present if configured
## Config Spec
Format:
- TOML
Location:
- `~/.discrawl/config.toml`
Suggested shape:
```toml
version = 1
guild_id = "1456350064065904867"
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"
[discord]
token_source = "openclaw"
openclaw_config = "~/.openclaw/openclaw.json"
channel_account = "discord"
[sync]
concurrency = 4
repair_every = "6h"
full_history = true
[search]
default_mode = "hybrid"
[search.embeddings]
enabled = true
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64
```
Config precedence:
1. flags
2. environment
3. config file
Environment variables:
- `DISCRAWL_CONFIG`
- `OPENAI_API_KEY`
## Token Handling Rules
Do not:
- put bot tokens in git
- put API keys in git
- print secrets in normal logs
Do:
- load bot token from OpenClaw config path
- load OpenAI key from env
- redact secrets in debug and doctor output
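The redaction rule can be illustrated with a small helper that scrubs known secret values from a line before it is logged. This is a sketch under the assumption that the secrets are available as strings at log time; `redact` is a hypothetical name.

```go
package main

import (
	"fmt"
	"strings"
)

// redact replaces every occurrence of each known secret with a fixed
// placeholder. Empty secrets are skipped so an unset token does not turn
// the whole line into placeholders.
func redact(line string, secrets ...string) string {
	for _, s := range secrets {
		if s == "" {
			continue
		}
		line = strings.ReplaceAll(line, s, "[redacted]")
	}
	return line
}

func main() {
	token := "example-token-value"
	fmt.Println(redact("auth failed for token example-token-value", token))
	// prints: auth failed for token [redacted]
}
```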
## Discord Sync Algorithm
### Initial full sync
1. load config
2. resolve token
3. fetch bot identity
4. fetch guild metadata
5. fetch guild channels
6. fetch active threads
7. enumerate archived public threads per parent channel
8. enumerate archived private threads per parent channel
9. fetch member snapshot
10. backfill messages for every crawlable channel and thread
11. normalize message content
12. upsert `messages`
13. append `message_events` where relevant
14. update FTS rows
15. enqueue embedding jobs
16. write checkpoints
### Message crawl strategy
Use REST pagination with `before`.
Rules:
- fetch newest page first for incremental runs
- reach the oldest messages via repeated `before` paging for full runs
- stop when no messages remain
- handle rate limits centrally
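The full-run paging loop can be sketched as below. The types and names (`Message`, `PageFetcher`, `backfill`, `fakeChannel`) are hypothetical, and the fetcher is an in-memory stand-in rather than a real Discord client; it assumes pages arrive newest first, so the last element of each page supplies the next `before` value. Rate limiting and checkpointing are left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"strconv"
)

// Message is a pared-down stand-in for a Discord message.
type Message struct {
	ID string
}

// PageFetcher returns up to limit messages older than before, newest first.
// An empty before means "start from the newest message".
type PageFetcher func(before string, limit int) ([]Message, error)

// backfill walks a channel from newest to oldest using before-paging and
// calls handle for every page. It stops when a page comes back empty.
func backfill(fetch PageFetcher, limit int, handle func([]Message) error) error {
	before := ""
	for {
		page, err := fetch(before, limit)
		if err != nil {
			return err
		}
		if len(page) == 0 {
			return nil
		}
		if err := handle(page); err != nil {
			return err
		}
		// The last element is the oldest message in this page.
		before = page[len(page)-1].ID
	}
}

// fakeChannel serves messages with IDs n..1 from memory for the demo.
func fakeChannel(n int) PageFetcher {
	return func(before string, limit int) ([]Message, error) {
		start := n
		if before != "" {
			b, err := strconv.Atoi(before)
			if err != nil {
				return nil, err
			}
			start = b - 1
		}
		var page []Message
		for id := start; id >= 1 && len(page) < limit; id-- {
			page = append(page, Message{ID: strconv.Itoa(id)})
		}
		return page, nil
	}
}

func main() {
	total := 0
	err := backfill(fakeChannel(250), 100, func(p []Message) error {
		total += len(p)
		return nil
	})
	fmt.Println(total, err) // prints: 250 <nil>
}
```

Passing the page handler as a callback keeps storage concerns (upserts, FTS updates, checkpoints) out of the paging loop, which makes the loop easy to test against a fake fetcher.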
### Live tail strategy
Use Gateway events for:
- new messages
- edited messages
- deleted messages
- channel updates
- thread updates
- member updates
Tail should:
- upsert live state
- append lifecycle events
- keep retrying on disconnect
- periodically run repair sync
## Message Normalization
`normalized_content` should flatten Discord payloads into searchable text.
Include:
- message content
- embed titles and descriptions where helpful
- poll question and answers
- attachment filenames
- referenced message hints if available
Do not overcomplicate:
- reactions can be ignored
- attachment binary contents are not indexed in V1
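A minimal sketch of the flattening step follows. The `RawMessage`, `Embed`, and `Poll` types are simplified, hypothetical stand-ins for the Discord payload, and joining parts with newlines is an arbitrary choice; referenced-message hints are omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Embed, Poll, and RawMessage are simplified, hypothetical shapes.
type Embed struct {
	Title       string
	Description string
}

type Poll struct {
	Question string
	Answers  []string
}

type RawMessage struct {
	Content         string
	Embeds          []Embed
	Poll            *Poll
	AttachmentNames []string
}

// normalize flattens the searchable parts of a message into one string:
// content, embed titles and descriptions, poll text, attachment filenames.
// Blank parts are skipped.
func normalize(m RawMessage) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(m.Content)
	for _, e := range m.Embeds {
		add(e.Title)
		add(e.Description)
	}
	if m.Poll != nil {
		add(m.Poll.Question)
		for _, a := range m.Poll.Answers {
			add(a)
		}
	}
	for _, name := range m.AttachmentNames {
		add(name)
	}
	return strings.Join(parts, "\n")
}

func main() {
	msg := RawMessage{
		Content:         "release is out",
		Poll:            &Poll{Question: "Ship it?", Answers: []string{"yes", "no"}},
		AttachmentNames: []string{"changelog.txt"},
	}
	fmt.Println(normalize(msg))
}
```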
## Member Query Design
Members matter for AI workflows.
Expected use cases:
- “who is this user”
- “find messages by this person”
- “find maintainers”
- “find everyone with a display name containing X”
At minimum, store:
- user id
- username
- display name
- nick
- roles
- bot flag
## Recommended Go Package Layout
```text
cmd/discrawl/
internal/cli/
internal/config/
internal/discord/
internal/store/
internal/search/
internal/syncer/
internal/embed/
```
Responsibilities:
- `internal/cli`: command wiring, output modes
- `internal/config`: parse and validate config
- `internal/discord`: REST + Gateway client wrappers
- `internal/store`: SQLite schema, migrations, queries
- `internal/search`: FTS and result ranking
- `internal/syncer`: full sync and repair orchestration
- `internal/embed`: embedding queue and provider integration
## Recommended Dependencies
Reasonable picks:
- Discord client: `github.com/bwmarrin/discordgo`
- TOML parser: something small and maintained
- SQLite driver: pick one path and stay consistent
- vector search: `sqlite-vec`
Guidance:
- keep dependency count low
- prefer boring stable libraries
- avoid frameworks
## Milestones
### Milestone 1
- config loader
- `init`
- `status`
- DB open + migrations
### Milestone 2
- guild metadata sync
- channel sync
- member sync
### Milestone 3
- full message backfill
- incremental checkpoints
- FTS indexing
### Milestone 4
- `search`
- `sql`
- `members`
- `channels`
### Milestone 5
- `tail`
- reconnect logic
- repair loop
### Milestone 6
- embedding queue
- vector search
- hybrid ranking
## What The Repo Must Eventually Contain
For an AI agent to finish the product without external memory, this repo should contain:
- this spec
- README with user-facing overview
- schema and migration files
- config sample
- CLI contract
- implementation package layout
- token discovery rules
- API key discovery rules
- milestone order
This file is the authoritative engineering spec for now.