docs(search): document semantic and hybrid search

MrBrain 2026-04-22 20:59:05 +08:00 committed by Peter Steinberger
parent 3ea1d4aa7f
commit 4437514537
2 changed files with 36 additions and 3 deletions

@ -12,6 +12,7 @@ All notable changes to `discrawl` will be documented in this file.
- Git-backed snapshots now keep embedding queue state and generated vectors local to each archive, so subscribers no longer inherit misleading embedding backlog metadata. (#38) Thanks @GaosCode.
- semantic message search now ranks across the full compatible local vector set instead of only the newest candidate window. (#36) Thanks @GaosCode.
- hybrid message search now fuses FTS with local semantic vectors while avoiding embedding-provider calls when no local vectors exist. (#37) Thanks @GaosCode.
- docs now cover semantic and hybrid search setup, embedding privacy, Git snapshot behavior, and local vector rebuilds. (#39) Thanks @GaosCode.
## 0.3.0 - 2026-04-21

@ -200,10 +200,13 @@ discrawl tail --repair-every 30m
### `search`
Runs FTS search over archived messages.
Searches archived messages. FTS is the default mode and works without embeddings.
```bash
discrawl search "panic: nil pointer"
discrawl search --mode fts "panic: nil pointer"
discrawl search --mode semantic "missing launch checklist"
discrawl search --mode hybrid "database timeout"
discrawl search --guild 123456789012345678 "payment failed"
discrawl search --channel billing --author steipete --limit 50 "invoice"
discrawl search --include-empty "GitHub"
@ -211,7 +214,14 @@ discrawl --json search "websocket closed"
```
By default, `search` skips rows with no searchable content. Attachment text, attachment filenames, embeds, and replies still count as content. Use `--include-empty` to include those empty rows again.
Search returns the newest matching messages first so large local archives stay responsive.
Modes:
- `fts` searches the local FTS index and returns the newest matching messages first.
- `semantic` embeds the query, searches locally stored message vectors, and returns a clear error if embeddings are disabled or no compatible vectors exist.
- `hybrid` runs FTS and semantic search, deduplicates by message id, and falls back to FTS when semantic search is unavailable.
Semantic search, and the semantic half of hybrid search, require `[search.embeddings]` plus local `message_embeddings` rows for the configured provider, model, and input version. Run `discrawl sync --with-embeddings` to enqueue changed messages, then `discrawl embed` to generate vectors.
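The hybrid behavior described above (run both searches, deduplicate by message id, fall back to FTS) can be sketched in Go. This is a minimal illustration, not the `discrawl` implementation: the `Hit` type, the `mergeHybrid` function, and the "FTS hits first, then unseen semantic hits" ordering are all assumptions made for the example. The real fusion and ranking may differ.

```go
package main

import "fmt"

// Hit is a hypothetical search result; the real discrawl types may differ.
type Hit struct {
	MessageID string
	Source    string // "fts" or "semantic"
}

// mergeHybrid concatenates FTS and semantic hits and keeps only the first
// occurrence of each message id. Passing a nil semantic slice models the
// case where semantic search is unavailable: the FTS hits come back as-is.
// The ordering used here is an assumption, not the project's ranking.
func mergeHybrid(fts, semantic []Hit) []Hit {
	seen := make(map[string]bool, len(fts)+len(semantic))
	merged := make([]Hit, 0, len(fts)+len(semantic))
	for _, group := range [][]Hit{fts, semantic} {
		for _, h := range group {
			if seen[h.MessageID] {
				continue // already returned by an earlier source
			}
			seen[h.MessageID] = true
			merged = append(merged, h)
		}
	}
	return merged
}

func main() {
	fts := []Hit{{"101", "fts"}, {"102", "fts"}}
	semantic := []Hit{{"102", "semantic"}, {"103", "semantic"}}
	for _, h := range mergeHybrid(fts, semantic) {
		fmt.Println(h.MessageID, h.Source)
	}
}
```

Message `102` appears in both inputs but only once in the merged result, and calling `mergeHybrid(fts, nil)` returns the FTS hits unchanged.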
### `messages`
@ -365,7 +375,7 @@ Once `share.remote` is configured, read commands auto-fetch and import when the
Hybrid mode is supported too: keep normal Discord credentials configured and set `share.remote`. `discrawl sync` and `discrawl messages --sync` import the Git snapshot first, then use live Discord only to fill anything newer or missing. This keeps day-to-day sync fast while preserving live repair behavior.
Git snapshots publish archive tables only. Embedding queue state and generated vectors stay local to each machine.
Git snapshots publish archive tables only. Embedding queue state and generated vectors stay local to each machine. Git-only readers can use FTS immediately. To use semantic search, or hybrid search with semantic recall, configure a local embedding provider and run `discrawl embed --rebuild`. Until local vectors exist, hybrid search falls back to FTS.
The Docker smoke test installs `discrawl` in a clean Go container, subscribes to a Git snapshot repo, then checks `search`, `messages`, `sql`, and `report`:
@ -466,8 +476,30 @@ If enabled, embeddings are intended to enrich recall in background batches, not
export OPENAI_API_KEY="..."
discrawl init --with-embeddings
discrawl sync --with-embeddings
discrawl embed --limit 1000
discrawl search --mode semantic "launch checklist"
discrawl search --mode hybrid "launch checklist"
```
`sync --with-embeddings` only queues changed messages. It does not call the embedding provider. `discrawl embed` drains that queue explicitly, using the configured provider and model.
Use `--rebuild` after changing the provider, model, or input settings to regenerate vectors for the existing archive:
```bash
discrawl embed --rebuild --limit 1000
```
Local providers can keep message and query embedding on the same machine:
```toml
[search.embeddings]
enabled = true
provider = "ollama"
model = "nomic-embed-text"
```
With remote providers, message text is sent during `discrawl embed`, and search query text is sent when using `--mode semantic` or `--mode hybrid`. Stored message text is not sent during local vector scoring.
## Data Stored Locally
- guild metadata