docs: define crawlkit app boundary

Vincent Koc 2026-05-05 17:16:57 -07:00
parent 43454a8af2
commit 59c0033fc7
4 changed files with 122 additions and 0 deletions


@@ -11,6 +11,9 @@ It is not a provider crawler. Keep Slack, Discord, Notion, GitHub, and other
provider-specific behavior in the downstream apps unless the abstraction is
clearly reusable across at least two apps.
Use `docs/boundary.md` as the working ownership map when deciding whether a
feature belongs in `crawlkit` or a downstream crawl app.
## Development Rules
- Keep public package nouns stable and small: `config`, `store`, `snapshot`,


@@ -3,6 +3,8 @@
## Unreleased
- Initial `crawlkit` module scaffold.
- Document the `crawlkit` versus crawl-app boundary for embeddings, search,
inference, sync state, snapshots, SQLite, and git mirrors.
- Add `tui`, a shared Bubble Tea terminal archive browser used by the crawl apps for consistent `tui` command behavior.
- Improve `tui` rows with compact column rendering, pane-specific scrolling, and full-height pane borders.
- Tune `tui` pane colors and mouse-wheel buffering to better match the `gitcrawl` terminal browser feel.


@@ -16,6 +16,7 @@ go get github.com/vincentkoc/crawlkit@latest
Go packages are published by tagging this repository. There is no separate
package registry step. See `docs/publishing.md` for the release commands.
See `docs/boundary.md` for the crawlkit-versus-app ownership boundary.
## Packages

docs/boundary.md (new file, 116 lines)

@@ -0,0 +1,116 @@
# crawlkit boundary
`crawlkit` is the shared mechanics layer for local-first crawler archives. It
should make each crawl app smaller and more uniform without turning into a
generic Slack, Discord, Notion, or GitHub crawler.
The rule is simple: move behavior into `crawlkit` only when it is
provider-neutral, is reusable by at least two apps, and preserves each app's
existing database and CLI contracts. Keep provider schemas, auth, API clients,
cache parsers, and product-specific ranking in the apps.
## owns
`crawlkit` should own these surfaces:
- Config paths, TOML loading defaults, runtime directories, and token
diagnostics that are the same across apps.
- SQLite connection hygiene: read-only opens, busy timeouts, WAL pragmas,
schema-version checks, transactions, safe identifier quoting, and generic
query helpers.
- Snapshot packing: manifest format, JSONL/Gzip shards, table filters,
import progress, sidecar registration, backward-compatible manifest reads,
and import callbacks.
- Git mirror mechanics: clone/init, pull, origin management, path-scoped
commits, push retry behavior, and portable SQLite checkout cleanup.
- Sync freshness semantics: cursor/freshness records, stale checks, manifest
import state, and adapters for legacy table shapes.
- Embedding provider clients and vector math once extracted: OpenAI-compatible,
Ollama, llama.cpp, probe diagnostics, cosine search, top-k selection,
reciprocal-rank fusion, vector encoding, and dimension validation.
- FTS utilities that do not know app schemas: query escaping, snippets,
rebuild/optimize helpers, deferred refresh orchestration, and progress logs.
- Terminal archive browsing primitives: pane layout, sorting, focus, mouse
actions, menus, detail rendering primitives, and local/remote status chrome.
- Safe read-only desktop-cache snapshot helpers. The provider-specific parsing
of those snapshots stays in the apps.
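As an illustration of the schema-agnostic helpers listed above, here is a minimal sketch of safe identifier quoting and FTS query escaping. The names `QuoteIdent` and `EscapeFTSQuery` are hypothetical, not the `crawlkit` API, and the query escaping assumes SQLite FTS5 string quoting, which this document does not confirm the apps use.

```go
package main

import (
	"fmt"
	"strings"
)

// QuoteIdent wraps a SQLite identifier in double quotes and doubles any
// embedded double quote. Hypothetical helper name, not the crawlkit API.
func QuoteIdent(name string) string {
	return `"` + strings.ReplaceAll(name, `"`, `""`) + `"`
}

// EscapeFTSQuery quotes each whitespace-separated term so the FTS engine
// treats it as a literal string instead of query syntax.
// Assumption: FTS5-style string quoting with doubled double quotes.
func EscapeFTSQuery(input string) string {
	terms := strings.Fields(input)
	for i, t := range terms {
		terms[i] = `"` + strings.ReplaceAll(t, `"`, `""`) + `"`
	}
	return strings.Join(terms, " ")
}

func main() {
	// Table names cannot be bound as SQL parameters, so they are quoted.
	fmt.Println("SELECT count(*) FROM " + QuoteIdent(`sync"state`))
	// User input with operators and quotes becomes plain literal terms.
	fmt.Println(EscapeFTSQuery(`fix: "crash" OR hang`))
}
```

Neither helper needs to know an app's table or column names, which is the property that would make this kind of code a candidate for `crawlkit`.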
## does not own
`crawlkit` should not own these surfaces:
- Slack, Discord, Notion, GitHub, or future provider API clients.
- App-specific auth flows, token scopes, rate-limit policy, and provider
object normalization.
- App database schemas for messages, pages, threads, issues, members, blocks,
comments, channels, guilds, or workspaces.
- Provider desktop-cache parsing such as Slack LevelDB records, Discord cache
rows, or Notion SQLite object trees.
- App-specific FTS bodies and ranking, such as Notion display-tree ordering,
Slack mention normalization, Discord member search, and GitHub issue/PR
syntax.
- Summarization, clustering, triage inference, or prompts until the same
behavior exists in more than one app.
- App CLI command contracts. Shared helpers can format JSON/text/log output,
but the apps decide command names, flags, backward-compatible aliases, and
deprecation behavior.
## current app seams
| app | embeddings/search/inference | sync state | snapshot, sqlite, remote |
| --- | --- | --- | --- |
| `gitcrawl` | Has the richest inference path: OpenAI-only embeddings, local thread vectors, exact cosine neighbors, durable clusters, and GitHub thread/document FTS. The vector math and portable embedding client should move to `crawlkit`; GitHub thread task construction, clustering, and prompts stay app-owned. | Uses app-owned repo sync and portable metadata. Do not force it into the shared `sync_state` table. | Has the most mature portable-store git behavior: clone/pull, dirty checkout recovery, SQLite sidecar cleanup, and portable payload pruning. The generic git/SQLite checkout pieces belong in `mirror`; GitHub portable schema pruning stays app-owned. |
| `discrawl` | Has the best reusable embedding provider surface: OpenAI, OpenAI-compatible, Ollama, llama.cpp, probe checks, float32 blobs, semantic search, hybrid search, and RRF. Provider clients, vector encoding, cosine, top-k, and RRF should be extracted. Discord message/member FTS and privacy boundaries stay app-owned. | Uses a single `scope -> cursor` table with local-only scopes such as `wiretap:*`. Shared state should adapt to this shape, not migrate it. | Uses `snapshot` and `mirror`, with important app filters for DMs and local-only sync state. Embedding bundles are sidecars today; generic sidecar/binary-vector mechanics should move to `snapshot`, while DM exclusion remains in `discrawl`. |
| `slacrawl` | Has Slack FTS and Slack text/mention normalization. Embeddings are only reserved placeholders. Slack normalization and message FTS stay app-owned. | Closest to `crawlkit/state`: `source_name`, `entity_type`, `entity_id`, `value`, `updated_at`. It is the first app that can consume shared state directly. | Uses `snapshot` and `mirror` cleanly. Its remaining share logic is mostly table lists, search-index rebuilds, and import freshness. |
| `notcrawl` | Has page/comment FTS, display-tree page bodies, deferred FTS refresh, and maintain/rebuild commands. No embeddings yet. Deferred FTS orchestration can become shared; Notion page/comment FTS content stays app-owned. | Uses `source`, `entity_type`, `entity_id`, `cursor`, `synced_at`. Shared state needs column mapping or adapters before this can be de-duped safely. | Still carries custom manifest, JSONL/Gzip, Markdown sidecars, generated-path commits, and origin update behavior. The snapshot sidecar model and mirror path-scoped commit/origin helpers should let this converge without changing the Notion DB schema. |
## extraction order
1. Harden `mirror` first.
Add origin update semantics for existing checkouts, path-scoped commits so
publish never stages unrelated files, existing-origin pull for update flows,
and portable SQLite sidecar cleanup. This is the lowest-risk de-dupe because
every app already shells out to git in similar ways.
2. Expand `snapshot` sidecars.
Keep table export/import generic, but add first-class sidecar/bundle helpers
for Markdown pages and embedding JSONL/Gzip bundles. Apps still provide
filters, table lists, delete callbacks, FTS rebuild callbacks, and privacy
rules.
3. Add state adapters instead of one forced schema.
Keep the current source/entity/value schema as the canonical new shape, but
add adapters for `scope -> cursor` and `source/entity/cursor/synced_at`
stores. This avoids risky migrations while making freshness and stale checks
uniform.
4. Extract embeddings and vector search.
Start from `discrawl/internal/embed` for provider clients and from
`gitcrawl/internal/vector` plus discrawl search helpers for cosine, top-k,
vector encoding, and reciprocal-rank fusion. Apps keep task selection,
content hashing policy, provider config placement, and result persistence.
5. Add generic FTS helpers.
Provide query escaping, snippets, rebuild/optimize wrappers, deferred refresh
orchestration, and progress logging. Do not move entity-specific FTS schemas
or ranking into `crawlkit`.
6. Keep inference app-owned until there are two implementations.
`gitcrawl` clustering and summary-oriented work should not be generalized
yet. Extract only the provider/vector primitives it shares with chat/document
crawlers.
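For step 4, the vector primitives named above (dimension validation, cosine similarity, top-k selection, and reciprocal-rank fusion) can be sketched in a few functions. This is an illustrative sketch, not code from `discrawl/internal/embed` or `gitcrawl/internal/vector`. All names are hypothetical, and the constant 60 passed to `RRF` is a commonly used default rather than a value taken from either app.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Scored pairs an item ID with a similarity or fusion score.
type Scored struct {
	ID    string
	Score float64
}

// Cosine validates dimensions and returns the cosine similarity.
// Zero-length vectors score 0 instead of dividing by zero.
func Cosine(a, b []float32) (float64, error) {
	if len(a) == 0 || len(a) != len(b) {
		return 0, fmt.Errorf("dimension mismatch: %d vs %d", len(a), len(b))
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0, nil
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb)), nil
}

// TopK returns the k highest-scoring items, breaking ties by ID so the
// order is deterministic. The input slice is not modified.
func TopK(items []Scored, k int) []Scored {
	out := append([]Scored(nil), items...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].Score != out[j].Score {
			return out[i].Score > out[j].Score
		}
		return out[i].ID < out[j].ID
	})
	if k < 0 {
		k = 0
	}
	if k < len(out) {
		out = out[:k]
	}
	return out
}

// RRF fuses ranked ID lists: each list adds 1/(k+rank) per ID, rank from 1.
func RRF(k float64, rankings ...[]string) []Scored {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1 / (k + float64(i+1))
		}
	}
	out := make([]Scored, 0, len(scores))
	for id, s := range scores {
		out = append(out, Scored{ID: id, Score: s})
	}
	return TopK(out, len(out))
}

func main() {
	s, err := Cosine([]float32{1, 0}, []float32{0.6, 0.8})
	if err != nil {
		panic(err)
	}
	fmt.Printf("cosine=%.2f\n", s)
	// Fuse an FTS ranking with a semantic ranking (IDs are made up).
	fmt.Println(RRF(60, []string{"a", "b"}, []string{"b", "c"}))
}
```

Nothing here refers to threads, messages, or pages, which is why these pieces can move while task selection and result persistence stay in the apps.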
## compatibility gates
Every extraction must keep these constraints:
- Do not change existing app table shapes unless the app migration is explicitly
backward-compatible and tested against old fixtures.
- Do not change app command names, flags, JSON shape, or deprecated aliases
unless the downstream app changelog calls it out.
- Do not touch live stores during tests. Use temp homes, temp configs, and temp
SQLite files.
- Use `GOWORK=off` when proving the public `crawlkit` API so local workspaces
do not hide missing release tags.
- Keep privacy filters in the app layer. `crawlkit` can run a filter callback;
it should not know what a Discord DM or Slack private channel means.
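The last gate can be sketched as a callback boundary: the shared layer runs a filter it does not understand, and the app supplies the privacy rule. `Row`, `RowFilter`, `ExportRows`, and the `is_dm` column are hypothetical names chosen for illustration, not the `snapshot` package API or the `discrawl` schema.

```go
package main

import "fmt"

// Row is a generic table row as the shared layer might see it.
type Row map[string]any

// RowFilter reports whether a row may be exported. Hypothetical type.
type RowFilter func(table string, row Row) bool

// ExportRows keeps only rows accepted by the filter. It has no knowledge
// of what the filter checks. A nil filter keeps every row.
func ExportRows(table string, rows []Row, keep RowFilter) []Row {
	out := make([]Row, 0, len(rows))
	for _, r := range rows {
		if keep == nil || keep(table, r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	// App-side rule: drop direct messages. The `is_dm` column is an
	// assumption for this example.
	excludeDMs := func(table string, row Row) bool {
		if table != "messages" {
			return true
		}
		isDM, _ := row["is_dm"].(bool)
		return !isDM
	}

	rows := []Row{
		{"id": 1, "is_dm": false},
		{"id": 2, "is_dm": true},
	}
	fmt.Println(len(ExportRows("messages", rows, excludeDMs)))
}
```

Because the rule lives in the closure, changing what counts as private only requires an app change, not a `crawlkit` release.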