docs: add usage guide and build spec
parent
26f9d03705
commit
d2b5c7e668
256
README.md
Normal file
@@ -0,0 +1,256 @@
# discrawl

`discrawl` mirrors Discord guild data into local SQLite so you can search, inspect, and query server history without depending on Discord search.

It is a bot-token crawler. No user-token hacks. Data stays local.

## What It Does

- discovers every guild the configured bot can access
- syncs channels, threads, members, and message history into SQLite
- maintains FTS5 search indexes for fast local text search
- tails Gateway events for live updates, with periodic repair syncs
- exposes read-only SQL for ad hoc analysis
- keeps schema multi-guild ready while preserving a simple single-guild default UX

Search defaults to all guilds. `sync` and `tail` default to the configured default guild when one exists; otherwise they fan out to all discovered guilds.

## Requirements

- Go `1.26+`
- a Discord bot token for a bot that has access to the target guilds
- bot permissions for the channels you want archived

Token resolution:

1. OpenClaw config, if `discord.token_source` is not `env`
2. `DISCORD_BOT_TOKEN` or the configured `discord.token_env`
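
The resolution order above can be sketched in Go. This is a minimal illustration, not the project's actual code: `resolveToken`, its parameters, and the fall-through to the environment when the OpenClaw token is empty are assumptions made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// resolveToken is a hypothetical helper that follows the documented order.
// tokenSource, openclawToken, and tokenEnv stand in for values that would
// come from config.toml and the OpenClaw JSON file.
func resolveToken(tokenSource, openclawToken, tokenEnv string, getenv func(string) string) (token, origin string, err error) {
	// 1. OpenClaw config, unless the source is forced to "env".
	if tokenSource != "env" && openclawToken != "" {
		return openclawToken, "openclaw", nil
	}
	// 2. Environment: the configured variable name, else DISCORD_BOT_TOKEN.
	if tokenEnv == "" {
		tokenEnv = "DISCORD_BOT_TOKEN"
	}
	if v := getenv(tokenEnv); v != "" {
		return v, "env:" + tokenEnv, nil
	}
	return "", "", errors.New("no Discord bot token found")
}

func main() {
	_, origin, err := resolveToken("env", "", "", os.Getenv)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Report only where the token came from, never the token itself.
	fmt.Println("token source:", origin)
}
```

Passing `getenv` as a parameter keeps the lookup easy to test without touching the real environment.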

Default runtime paths:

- config: `~/.discrawl/config.toml`
- database: `~/.discrawl/discrawl.db`
- cache: `~/.discrawl/cache/`
- logs: `~/.discrawl/logs/`

## Install

Build from source:

```bash
git clone https://github.com/steipete/discrawl.git
cd discrawl
go build -o bin/discrawl ./cmd/discrawl
./bin/discrawl --version
```

## Quick Start

Reuse an existing OpenClaw Discord bot config:

```bash
bin/discrawl init --from-openclaw ~/.openclaw/openclaw.json
bin/discrawl doctor
bin/discrawl sync --full
bin/discrawl search "panic: nil pointer"
bin/discrawl tail
```

Env-only setup:

```bash
export DISCORD_BOT_TOKEN="..."
bin/discrawl init
bin/discrawl sync --full
```

`init` discovers accessible guilds and writes `~/.discrawl/config.toml`. If exactly one guild is available, that guild becomes the default automatically.

## Commands

### `init`

Creates the local config and discovers accessible guilds.

```bash
bin/discrawl init
bin/discrawl init --from-openclaw ~/.openclaw/openclaw.json
bin/discrawl init --guild 123456789012345678
bin/discrawl init --db ~/data/discrawl.db
```

### `sync`

Backfills guild state into SQLite.

```bash
bin/discrawl sync --full
bin/discrawl sync --guild 123456789012345678
bin/discrawl sync --guilds 123,456 --concurrency 8
bin/discrawl sync --channels 111,222 --since 2026-03-01T00:00:00Z
```

### `tail`

Runs the live Gateway tail and periodic repair loop.

```bash
bin/discrawl tail
bin/discrawl tail --guild 123456789012345678
bin/discrawl tail --repair-every 30m
```

### `search`

Runs FTS search over archived messages.

```bash
bin/discrawl search "panic: nil pointer"
bin/discrawl search --guild 123456789012345678 "payment failed"
bin/discrawl search --channel billing --author steipete --limit 50 "invoice"
bin/discrawl --json search "websocket closed"
```

### `sql`

Runs read-only SQL against the local database.

```bash
bin/discrawl sql 'select count(*) as messages from messages'
echo 'select guild_id, count(*) from messages group by guild_id' | bin/discrawl sql -
```

### `members`

```bash
bin/discrawl members list
bin/discrawl members show 123456789012345678
bin/discrawl members search "peter"
```

### `channels`

```bash
bin/discrawl channels list
bin/discrawl channels show 123456789012345678
```

### `status`

Shows local archive status.

```bash
bin/discrawl status
```

### `doctor`

Checks config, auth, DB, and FTS wiring.

```bash
bin/discrawl doctor
```

## Configuration

`init` writes a complete config, so most users should not hand-edit anything initially.

Typical config shape:

```toml
version = 1
default_guild_id = ""
guild_ids = []
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"

[discord]
token_source = "openclaw"
openclaw_config = "~/.openclaw/openclaw.json"
account = "default"
token_env = "DISCORD_BOT_TOKEN"

[sync]
concurrency = 4
repair_every = "6h"
full_history = true

[search]
default_mode = "fts"

[search.embeddings]
enabled = false
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64
```

Config override rules:

- `--config` beats everything
- `DISCRAWL_CONFIG` overrides the default config path
- `discord.token_source = "env"` forces env-only token lookup
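
The first two override rules amount to a simple lookup chain for the config path. A minimal Go sketch follows; `configPath` is a hypothetical helper, and `~` expansion is deliberately left out.

```go
package main

import (
	"fmt"
	"os"
)

// defaultConfigPath is the documented default location.
const defaultConfigPath = "~/.discrawl/config.toml"

// configPath is a hypothetical helper: the --config flag value wins, then
// DISCRAWL_CONFIG, then the default path. It does not expand "~".
func configPath(flagValue string, getenv func(string) string) string {
	if flagValue != "" {
		return flagValue
	}
	if v := getenv("DISCRAWL_CONFIG"); v != "" {
		return v
	}
	return defaultConfigPath
}

func main() {
	// With no flag set, the result depends on DISCRAWL_CONFIG in the environment.
	fmt.Println(configPath("", os.Getenv))
}
```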

## Embeddings

Embeddings are optional. FTS is the default search path and the primary verification target.

If enabled, embeddings are intended to enrich recall in background batches, not block the hot sync path.

```bash
export OPENAI_API_KEY="..."
bin/discrawl init --with-embeddings
bin/discrawl sync --with-embeddings
```

## Data Stored Locally

- guild metadata
- channels and threads in one table
- current member snapshot
- canonical message rows
- append-only message event records
- FTS index rows
- optional embedding backlog metadata

Attachment binaries are not stored in SQLite.

## Security

- do not commit bot tokens or API keys
- default config lives in your home directory, not inside the repo
- CI runs secret scanning with `gitleaks`
- `doctor` reports token source, not token contents

## Development

Local gate:

```bash
go run github.com/golangci/golangci-lint/v2/cmd/golangci-lint@v2.11.1 run
go test ./... -coverprofile=/tmp/discrawl.cover
go tool cover -func=/tmp/discrawl.cover | tail -n 1
go build ./cmd/discrawl
```

Target coverage is `>= 80%`.

CI runs:

- `golangci-lint`
- `go test` with coverage threshold enforcement
- `go build ./cmd/discrawl`
- `gitleaks` against git history and the working tree

## Notes

- the schema is multi-guild ready even when the common UX stays single-guild simple
- threads are stored as channels because that matches the Discord model
- archived threads are part of the sync surface
- live sync is resumable; large guilds still take time because Discord rate limits history backfill

## License

MIT. See [LICENSE](LICENSE).
783
SPEC.md
Normal file
@@ -0,0 +1,783 @@
# discrawl Spec

This file is the build contract for an AI agent working in this repo.

Goal:

- build a local-first Discord guild crawler
- mirror all guild data the configured bot can access
- store it in SQLite
- support fast text search, semantic search, and raw SQL
- support one-shot backfill and long-running live sync

This spec is intentionally detailed so an agent can keep shipping without re-asking foundational questions.

## Product Summary

`discrawl` is a Go CLI that mirrors Discord guild data into local SQLite.

V1 scope:

- one guild at a time
- all accessible text channels
- all accessible announcement channels
- all accessible forum channels and their posts
- all accessible public threads
- all accessible private threads
- archived thread coverage
- full message history
- current member snapshot
- FTS5 search
- optional OpenAI embeddings with local vector search
- raw SQL access

Out of scope for V1:

- personal-account DMs
- reactions as primary indexed entities
- attachment blob downloads by default
- cross-guild unified sync UX
- write-back or moderation actions

## Requirements Already Chosen

These are settled unless the user explicitly changes them:

- config format: `TOML`
- config location: `~/.discrawl/config.toml`
- DB location: `~/.discrawl/discrawl.db`
- cache dir: `~/.discrawl/cache/`
- log dir: `~/.discrawl/logs/`
- token source: reuse Molty / existing OpenClaw Discord bot config
- guild model: one guild in CLI UX, multi-guild-ready schema
- search: hybrid, with FTS first and embeddings optional
- embedding provider: OpenAI
- API key source: `OPENAI_API_KEY` from shell env
- message retention: current canonical row + append-only event log
- member retention: current snapshot only
- files: metadata only in DB, fetch binaries later on demand
- reactions: not important for V1
- polls: flatten into text during normalization

## Local Environment Contract

An agent should assume:

- repo path: `~/Projects/discrawl`
- shell: `zsh`
- Go is installed and modern
- user is Peter
- user keeps many secrets in `~/.profile`
- an existing OpenClaw install may already contain usable Discord bot config

### Key file paths

- `~/.discrawl/config.toml`
- `~/.discrawl/discrawl.db`
- `~/.profile`
- `~/.openclaw/openclaw.json`
- `~/.openclaw/openclaw.json.bak*`

### Existing bot config

The current bot token source is expected in:

- `~/.openclaw/openclaw.json`

Expected path inside JSON:

- `channels.discord.token`

Expected guild selection path:

- `channels.discord.guilds`

The current intended default mode is:

- `discrawl init --from-openclaw ~/.openclaw/openclaw.json`

### OpenAI embeddings key

Do not store raw API keys in repo files.

Expected source:

- env var `OPENAI_API_KEY`

Typical place to discover it locally:

- `~/.profile`

The code should read the env var at runtime, not copy the value into config by default.

## Discord Data Model Notes

Important Discord facts that drive the schema:

- channels and threads are closely related; threads should be stored as channels
- forum posts are threads under a forum parent
- message history is paginated and must be backfilled incrementally
- live updates come from Gateway events, not from polling alone
- archived public and private threads must be enumerated explicitly
- private archived thread access may require elevated bot permissions such as `Manage Threads`

### Entities to mirror

- guild
- categories
- channels
- threads
- members
- messages
- message lifecycle events

### Channel kinds worth preserving

- category
- text
- announcement
- forum
- thread public
- thread private
- thread announcement

Voice channels can be mirrored as metadata rows; V1 does not crawl their message history.

## Database Design

Use SQLite.

Requirements:

- WAL mode
- foreign keys on
- FTS5 enabled
- vector extension optional

### Tables

At minimum:

- `guilds`
- `channels`
- `members`
- `messages`
- `message_events`
- `sync_state`
- `embedding_jobs`
- `message_fts`

Optional once vectors are wired:

- `message_embeddings`

### `guilds`

Suggested shape:

```sql
create table guilds (
  id text primary key,
  name text not null,
  icon text,
  raw_json text not null,
  updated_at text not null
);
```

### `channels`

Threads should live in the same table.

Suggested shape:

```sql
create table channels (
  id text primary key,
  guild_id text not null,
  parent_id text,
  kind text not null,
  name text not null,
  topic text,
  position integer,
  is_nsfw integer not null default 0,
  is_archived integer not null default 0,
  is_locked integer not null default 0,
  is_private_thread integer not null default 0,
  thread_parent_id text,
  archive_timestamp text,
  raw_json text not null,
  updated_at text not null
);
```

### `members`

Suggested shape:

```sql
create table members (
  guild_id text not null,
  user_id text not null,
  username text not null,
  global_name text,
  display_name text,
  nick text,
  discriminator text,
  avatar text,
  bot integer not null default 0,
  joined_at text,
  role_ids_json text not null,
  raw_json text not null,
  updated_at text not null,
  primary key (guild_id, user_id)
);
```

### `messages`

Suggested shape:

```sql
create table messages (
  id text primary key,
  guild_id text not null,
  channel_id text not null,
  author_id text,
  message_type integer not null,
  created_at text not null,
  edited_at text,
  deleted_at text,
  content text not null,
  normalized_content text not null,
  reply_to_message_id text,
  pinned integer not null default 0,
  has_attachments integer not null default 0,
  raw_json text not null,
  updated_at text not null
);
```

### `message_events`

Suggested shape:

```sql
create table message_events (
  event_id integer primary key autoincrement,
  guild_id text not null,
  channel_id text not null,
  message_id text not null,
  event_type text not null,
  event_at text not null,
  payload_json text not null
);
```

### `sync_state`

Suggested shape:

```sql
create table sync_state (
  scope text primary key,
  cursor text,
  updated_at text not null
);
```

Examples of `scope`:

- `guild:<guild_id>:members`
- `channel:<channel_id>:messages`
- `tail:<guild_id>`
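
To keep scope strings consistent between writers and readers, the key formats above could be built in one place. A small Go sketch; the helper names are illustrative and not part of the spec.

```go
package main

import "fmt"

// Hypothetical helpers that build sync_state scope keys in the formats
// listed above, so callers never assemble them by hand.
func membersScope(guildID string) string    { return "guild:" + guildID + ":members" }
func messagesScope(channelID string) string { return "channel:" + channelID + ":messages" }
func tailScope(guildID string) string       { return "tail:" + guildID }

func main() {
	fmt.Println(membersScope("123"))
	fmt.Println(messagesScope("456"))
	fmt.Println(tailScope("123"))
}
```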

### `embedding_jobs`

Suggested shape:

```sql
create table embedding_jobs (
  message_id text primary key,
  state text not null,
  attempts integer not null default 0,
  updated_at text not null
);
```

### FTS

Recommended pattern:

- content table = `messages`
- FTS virtual table = `message_fts`
- keep it updated explicitly, not by fragile magic

Suggested columns:

- `message_id`
- `guild_id`
- `channel_id`
- `author_id`
- `author_name`
- `channel_name`
- `content`

## Search Design

### Modes

Support three modes:

- `fts`
- `semantic`
- `hybrid`

Default:

- `hybrid` when embeddings are enabled
- `fts` otherwise

### FTS behavior

FTS is mandatory.

It should be good enough that the tool is useful before embeddings exist.

Expected use cases:

- exact terms
- commands
- stack traces
- URLs
- model names
- channel names
- user names

### Semantic behavior

Embeddings are optional but planned from day one.

Recommended provider:

- OpenAI `text-embedding-3-small`

Implementation guidance:

- batch embedding jobs
- keep embedding generation out of the hot sync path
- store vectors locally
- semantic search should degrade gracefully when vectors are absent

### Vector store choice

Prefer SQLite-local vector search so the whole product stays portable.

Recommended direction:

- `sqlite-vec`

This can be wired after the base crawler and FTS system work.

## CLI Spec

Design goals:

- simple for humans
- composable for scripts
- obvious nouns and verbs
- no secrets in flags

Usage:

```text
discrawl [global flags] <command> [args]
```

### Global flags

- `-h, --help`
- `--version`
- `--config <path>`
- `--json`
- `--plain`
- `-q, --quiet`
- `-v, --verbose`
- `--no-color`

### Commands

- `init`
- `sync`
- `tail`
- `search`
- `sql`
- `members`
- `channels`
- `status`
- `doctor`

### `init`

Purpose:

- create `~/.discrawl/config.toml`
- import defaults from OpenClaw
- persist guild id and DB path

Expected flags:

- `--from-openclaw <path>`
- `--guild <id>`
- `--db <path>`
- `--with-embeddings`

### `sync`

Purpose:

- one-shot crawl

Expected flags:

- `--full`
- `--since <timestamp>`
- `--concurrency <n>`
- `--with-embeddings`

Requirements:

- idempotent
- restart-safe
- shows progress on stderr

### `tail`

Purpose:

- live sync from Gateway

Expected flags:

- `--repair-every <duration>`
- `--with-embeddings`

Requirements:

- reconnect automatically
- write checkpoints
- periodic repair sync

### `search`

Purpose:

- query mirrored messages

Expected flags:

- `--mode fts|semantic|hybrid`
- `--channel <name-or-id>`
- `--author <name-or-id>`
- `--limit <n>`
- `--json`
- `--plain`

### `sql`

Purpose:

- run read-only SQL

Requirements:

- support query arg or stdin
- block non-read-only statements by default
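
One way to approach the read-only requirement is a conservative pre-check before the query reaches SQLite. The Go sketch below is an assumption, not the specified mechanism: `isReadOnly` is a hypothetical helper that accepts only a single statement starting with `select` or `explain`, and it will reject some valid read-only queries (for example CTEs or queries containing `;` inside a string literal). Opening the database read-only, for instance with SQLite's `mode=ro` URI parameter, would give a stronger guarantee and could be combined with this check.

```go
package main

import (
	"fmt"
	"strings"
)

// isReadOnly is a deliberately strict, hypothetical pre-check. It allows a
// single statement whose first keyword is SELECT or EXPLAIN and rejects
// everything else, including multi-statement input.
func isReadOnly(query string) bool {
	q := strings.TrimSpace(query)
	q = strings.TrimSpace(strings.TrimSuffix(q, ";")) // allow one trailing semicolon
	if q == "" || strings.Contains(q, ";") {
		return false
	}
	first := strings.ToLower(strings.Fields(q)[0])
	return first == "select" || first == "explain"
}

func main() {
	for _, q := range []string{
		"select count(*) as messages from messages",
		"delete from messages",
		"select 1; drop table messages",
	} {
		fmt.Printf("%-45q read-only=%v\n", q, isReadOnly(q))
	}
}
```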

### `members`

Subcommands:

- `list`
- `show <user-id>`
- `search <query>`

### `channels`

Subcommands:

- `list`
- `show <channel-id>`

### `status`

Must show:

- guild id
- guild name if known
- db path
- total channels
- total threads
- total messages
- total members
- last sync time
- last tail event time
- embedding backlog

### `doctor`

Must check:

- config file readable
- OpenClaw token source readable
- Discord auth valid
- guild reachable
- DB openable
- FTS present
- vector extension present if configured

## Config Spec

Format:

- TOML

Location:

- `~/.discrawl/config.toml`

Suggested shape:

```toml
version = 1
guild_id = "1456350064065904867"
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"

[discord]
token_source = "openclaw"
openclaw_config = "~/.openclaw/openclaw.json"
channel_account = "discord"

[sync]
concurrency = 4
repair_every = "6h"
full_history = true

[search]
default_mode = "hybrid"

[search.embeddings]
enabled = true
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64
```

Config precedence:

1. flags
2. environment
3. config file

Environment variables:

- `DISCRAWL_CONFIG`
- `OPENAI_API_KEY`

## Token Handling Rules

Do not:

- put bot tokens in git
- put API keys in git
- print secrets in normal logs

Do:

- load bot token from OpenClaw config path
- load OpenAI key from env
- redact secrets in debug and doctor output

## Discord Sync Algorithm

### Initial full sync

1. load config
2. resolve token
3. fetch bot identity
4. fetch guild metadata
5. fetch guild channels
6. fetch active threads
7. enumerate archived public threads per parent channel
8. enumerate archived private threads per parent channel
9. fetch member snapshot
10. backfill messages for every crawlable channel and thread
11. normalize message content
12. upsert `messages`
13. append `message_events` where relevant
14. update FTS rows
15. enqueue embedding jobs
16. write checkpoints

### Message crawl strategy

Use REST pagination with `before`.

Rules:

- fetch newest page first for incremental runs
- fetch oldest via repeated `before` paging for full runs
- stop when no messages remain
- handle rate limits centrally
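
The full-run paging rule can be sketched as a loop that keeps asking for the page before the oldest message seen so far. This is an illustration only: `Message`, `PageFetcher`, and `backfill` are hypothetical names, the fetcher is assumed to return messages newest first, and rate limiting and checkpointing are left out.

```go
package main

import "fmt"

// Message is a minimal stand-in for a Discord message.
type Message struct{ ID string }

// PageFetcher is a hypothetical abstraction over the REST call. It returns up
// to limit messages older than before, newest first; an empty before means
// "start from the latest message".
type PageFetcher func(before string, limit int) ([]Message, error)

// backfill walks a channel from newest to oldest and stops on the first
// empty page. It returns the number of messages visited.
func backfill(fetch PageFetcher, limit int, visit func(Message)) (int, error) {
	before, total := "", 0
	for {
		page, err := fetch(before, limit)
		if err != nil {
			return total, err
		}
		if len(page) == 0 {
			return total, nil // no messages remain
		}
		for _, m := range page {
			visit(m)
			total++
		}
		before = page[len(page)-1].ID // oldest message in this page
	}
}

func main() {
	// In-memory stand-in for a channel history, newest message first.
	history := []Message{{ID: "30"}, {ID: "20"}, {ID: "10"}}
	fetch := func(before string, limit int) ([]Message, error) {
		start := 0
		for i, m := range history {
			if m.ID == before {
				start = i + 1
			}
		}
		end := start + limit
		if end > len(history) {
			end = len(history)
		}
		return history[start:end], nil
	}
	n, err := backfill(fetch, 2, func(m Message) { fmt.Println("stored", m.ID) })
	fmt.Println("total:", n, "err:", err)
}
```

Persisting `before` in `sync_state` after each page would be one way to make the loop restart-safe.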

### Live tail strategy

Use Gateway events for:

- new messages
- edited messages
- deleted messages
- channel updates
- thread updates
- member updates

Tail should:

- upsert live state
- append lifecycle events
- keep retrying on disconnect
- periodically run repair sync

## Message Normalization

`normalized_content` should flatten Discord payloads into searchable text.

Include:

- message content
- embed titles and descriptions where helpful
- poll question and answers
- attachment filenames
- referenced message hints if available

Do not overcomplicate:

- reactions can be ignored
- attachment binary contents are not indexed in V1
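
A rough Go sketch of the flattening described above. The `Message`, `Embed`, and `Poll` types are simplified stand-ins rather than the real Discord payload structs, and joining parts with newlines is an assumption; referenced message hints are omitted for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified, hypothetical payload types for illustration.
type Embed struct{ Title, Description string }

type Poll struct {
	Question string
	Answers  []string
}

type Message struct {
	Content     string
	Embeds      []Embed
	Poll        *Poll
	Attachments []string // filenames only; binary contents are not indexed
}

// normalize flattens a message into one searchable string, skipping empty parts.
func normalize(m Message) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(m.Content)
	for _, e := range m.Embeds {
		add(e.Title)
		add(e.Description)
	}
	if m.Poll != nil {
		add(m.Poll.Question)
		for _, a := range m.Poll.Answers {
			add(a)
		}
	}
	for _, name := range m.Attachments {
		add(name)
	}
	return strings.Join(parts, "\n")
}

func main() {
	msg := Message{
		Content:     "deploy failed again",
		Embeds:      []Embed{{Title: "CI run", Description: "exit code 1"}},
		Attachments: []string{"build.log"},
	}
	fmt.Printf("%q\n", normalize(msg))
}
```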

## Member Query Design

Members matter for AI workflows.

Expected use cases:

- “who is this user”
- “find messages by this person”
- “find maintainers”
- “find everyone with a display name containing X”

At minimum, store:

- user id
- username
- display name
- nick
- roles
- bot flag

## Recommended Go Package Layout

```text
cmd/discrawl/
internal/cli/
internal/config/
internal/discord/
internal/store/
internal/search/
internal/syncer/
internal/embed/
```

Responsibilities:

- `internal/cli`: command wiring, output modes
- `internal/config`: parse and validate config
- `internal/discord`: REST + Gateway client wrappers
- `internal/store`: SQLite schema, migrations, queries
- `internal/search`: FTS and result ranking
- `internal/syncer`: full sync and repair orchestration
- `internal/embed`: embedding queue and provider integration

## Recommended Dependencies

Reasonable picks:

- Discord client: `github.com/bwmarrin/discordgo`
- TOML parser: something small and maintained
- SQLite driver: pick one path and stay consistent
- vector search: `sqlite-vec`

Guidance:

- keep dependency count low
- prefer boring stable libraries
- avoid frameworks

## Milestones

### Milestone 1

- config loader
- `init`
- `status`
- DB open + migrations

### Milestone 2

- guild metadata sync
- channel sync
- member sync

### Milestone 3

- full message backfill
- incremental checkpoints
- FTS indexing

### Milestone 4

- `search`
- `sql`
- `members`
- `channels`

### Milestone 5

- `tail`
- reconnect logic
- repair loop

### Milestone 6

- embedding queue
- vector search
- hybrid ranking

## What The Repo Must Eventually Contain

For an AI agent to finish the product without external memory, this repo should contain:

- this spec
- README with user-facing overview
- schema and migration files
- config sample
- CLI contract
- implementation package layout
- token discovery rules
- API key discovery rules
- milestone order

This file is the authoritative engineering spec for now.