Embeddings
Embeddings are optional. FTS is the default search path and the primary verification target. Embeddings enrich recall in background batches; they do not block the hot sync path.
Quick path
export OPENAI_API_KEY="..."
discrawl init --with-embeddings
discrawl sync --with-embeddings
discrawl embed --limit 1000
discrawl search --mode semantic "launch checklist"
discrawl search --mode hybrid "launch checklist"
Two-phase pipeline
- Queue - sync --with-embeddings writes embedding_jobs rows for new messages, changed normalized text, and messages without an existing job. The embedding provider is not called in this phase.
- Drain - discrawl embed claims pending jobs with a short lock so overlapping runs do not process the same batch. It calls the configured provider and writes vectors to message_embeddings with provider, model, input version, dimensions, and binary vector data.
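The queue and claim steps can be sketched with an in-memory SQLite table. This is only an illustration of the idea, not discrawl's actual schema or code: the table name embedding_jobs comes from the text above, while the columns (state, locked_until), the 60-second lock, and the function names are assumptions made for the example.

```python
import sqlite3
import time

LOCK_SECONDS = 60  # assumed value; the text only says the lock is "short"


def open_db():
    # Hypothetical minimal schema; the real embedding_jobs table has more columns.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE embedding_jobs ("
        " message_id TEXT PRIMARY KEY,"
        " state TEXT NOT NULL DEFAULT 'pending',"
        " locked_until REAL NOT NULL DEFAULT 0)"
    )
    return db


def enqueue(db, message_ids):
    # Queue phase: record work only. No provider call happens here.
    db.executemany(
        "INSERT OR IGNORE INTO embedding_jobs (message_id) VALUES (?)",
        [(m,) for m in message_ids],
    )
    db.commit()


def claim(db, limit, now=None):
    # Drain phase, first step: pick pending jobs whose lock has expired and
    # stamp them with a fresh lock so an overlapping run skips them.
    # With several processes, the select and update would need to share one
    # write transaction (for example BEGIN IMMEDIATE).
    now = time.time() if now is None else now
    with db:
        rows = db.execute(
            "SELECT message_id FROM embedding_jobs"
            " WHERE state = 'pending' AND locked_until <= ?"
            " ORDER BY message_id LIMIT ?",
            (now, limit),
        ).fetchall()
        ids = [r[0] for r in rows]
        db.executemany(
            "UPDATE embedding_jobs SET locked_until = ? WHERE message_id = ?",
            [(now + LOCK_SECONDS, i) for i in ids],
        )
    return ids
```

Because the lock is a timestamp and not a permanent flag, a run that crashes mid-batch does not strand its jobs: they become claimable again once the lock expires.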
Behavior during drain:
- rate limits requeue the batch and stop that drain run cleanly
- provider or validation failures are retried up to three times before the job is marked failed
- messages with no normalized text are marked done and any stale vector for that message is removed
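The three outcomes above can be modeled as a small state machine. This is a simplified sketch under stated assumptions, not the tool's implementation: it handles jobs one at a time where the real drain works on batches, the exception classes and the Job fields are hypothetical, and it assumes a rate limit does not count as an attempt, which the text does not specify.

```python
from dataclasses import dataclass

MAX_ATTEMPTS = 3  # from the text: up to three attempts


class RateLimited(Exception):
    """Hypothetical signal that the provider asked the caller to slow down."""


class ProviderError(Exception):
    """Hypothetical provider or validation failure."""


@dataclass
class Job:
    message_id: str
    text: str
    state: str = "pending"
    attempts: int = 0


def drain(jobs, embed, vectors):
    """Process pending jobs in order. Returns False if a rate limit ended the run."""
    for job in jobs:
        if job.state != "pending":
            continue
        if not job.text.strip():
            # Nothing to embed: finish the job and drop any stale vector.
            vectors.pop(job.message_id, None)
            job.state = "done"
            continue
        try:
            vectors[job.message_id] = embed(job.text)
            job.state = "done"
        except RateLimited:
            # Leave the job pending (requeued) and stop this run cleanly.
            return False
        except ProviderError:
            job.attempts += 1
            if job.attempts >= MAX_ATTEMPTS:
                job.state = "failed"
    return True
```

A failed job stays pending until its third failure, so a later drain run picks it up again without any extra bookkeeping.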
Identity (provider, model, input version)
The identity is stored on each job and vector. If you change provider or model:
- pending jobs are retargeted to the new identity
- prior attempts are reset
- existing vectors for another identity remain in SQLite but are not used for semantic search
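One way to picture these rules is to treat the identity as a tuple and compare it for equality. The sketch below is illustrative only: the dictionary shapes, the function names, and the "v1" input version are assumptions, and only the three identity fields come from the text.

```python
from typing import NamedTuple


class Identity(NamedTuple):
    provider: str
    model: str
    input_version: str  # assumed to be a string; the text does not give its type


def retarget(jobs, current):
    # Pending jobs follow the configured identity, and their attempt counter
    # starts over because earlier failures belonged to the old identity.
    for job in jobs:
        if job["state"] == "pending" and job["identity"] != current:
            job["identity"] = current
            job["attempts"] = 0
    return jobs


def usable_vectors(vectors, current):
    # Vectors under another identity stay stored but are skipped at query time.
    return [v for v in vectors if v["identity"] == current]
```

Comparing the whole tuple means a change to any one field, including the input version alone, is enough to sideline the old vectors.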
Use --rebuild when you want to regenerate vectors for the existing archive after a config change:
discrawl embed --rebuild --limit 1000
Local provider example
[search.embeddings]
enabled = true
provider = "ollama"
model = "nomic-embed-text"
With local providers, message and query embedding both happen on the same machine. With remote providers, message text is sent during discrawl embed, and search query text is sent during --mode semantic or --mode hybrid calls.
Git snapshot interaction
By default, publish does not export embeddings. Pass --with-embeddings to include them when publishing, subscribing, or updating:
discrawl publish --with-embeddings --push
discrawl subscribe --with-embeddings https://github.com/example/discord-archive.git
discrawl update --with-embeddings
The snapshot stores vectors under embeddings/<provider>/<model>/<input_version>/... and records that identity in manifest.json. Only vectors for non-DM messages are exported. Import only restores matching embedding manifests, so an Ollama/nomic subscriber does not accidentally import OpenAI/text-embedding vectors. embedding_jobs is never exported; subscribers that want fresh local vectors run discrawl embed --rebuild. Publishing without --with-embeddings omits embedding manifests instead of carrying forward an older bundle.
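The import-side matching can be sketched as a filter over manifest entries. The layout of manifest.json is not given here, so the "embeddings" key, the entry fields, and the "v1" input version in the usage below are hypothetical. Only the embeddings/<provider>/<model>/<input_version> prefix comes from the text, and the files beneath it are deliberately left out.

```python
from pathlib import PurePosixPath


def bundle_prefix(provider, model, input_version):
    # Directory prefix described in the text; what lives under it is not modeled.
    return PurePosixPath("embeddings") / provider / model / str(input_version)


def importable_bundles(manifest, local):
    # Keep only manifest entries whose identity equals the local configuration.
    wanted = (local["provider"], local["model"], local["input_version"])
    return [
        bundle_prefix(*wanted)
        for entry in manifest.get("embeddings", [])
        if (entry["provider"], entry["model"], entry["input_version"]) == wanted
    ]
```

A snapshot published without --with-embeddings would simply carry no entries in this model, so the filter returns an empty list and nothing is imported.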