From 59c0033fc7edc80a2f63f77e933f52888e6bb9c7 Mon Sep 17 00:00:00 2001 From: Vincent Koc Date: Tue, 5 May 2026 17:16:57 -0700 Subject: [PATCH] docs: define crawlkit app boundary --- AGENTS.md | 3 ++ CHANGELOG.md | 2 + README.md | 1 + docs/boundary.md | 116 +++++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 122 insertions(+) create mode 100644 docs/boundary.md diff --git a/AGENTS.md b/AGENTS.md index b7743f8..e17e409 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -11,6 +11,9 @@ It is not a provider crawler. Keep Slack, Discord, Notion, GitHub, and other provider-specific behavior in the downstream apps unless the abstraction is clearly reusable across at least two apps. +Use `docs/boundary.md` as the working ownership map when deciding whether a +feature belongs in `crawlkit` or a downstream crawl app. + ## Development Rules - Keep public package nouns stable and small: `config`, `store`, `snapshot`, diff --git a/CHANGELOG.md b/CHANGELOG.md index 1b48f9c..d288d55 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -3,6 +3,8 @@ ## Unreleased - Initial `crawlkit` module scaffold. +- Document the `crawlkit` versus crawl-app boundary for embeddings, search, + inference, sync state, snapshots, SQLite, and git mirrors. - Add `tui`, a shared Bubble Tea terminal archive browser used by the crawl apps for consistent `tui` command behavior. - Improve `tui` rows with compact column rendering, pane-specific scrolling, and full-height pane borders. - Tune `tui` pane colors and mouse-wheel buffering to better match the `gitcrawl` terminal browser feel. diff --git a/README.md b/README.md index 5e456d3..b321b1f 100644 --- a/README.md +++ b/README.md @@ -16,6 +16,7 @@ go get github.com/vincentkoc/crawlkit@latest Go packages are published by tagging this repository. There is no separate package registry step. See `docs/publishing.md` for the release commands. +See `docs/boundary.md` for the crawlkit-versus-app ownership boundary. ## Packages diff --git a/docs/boundary.md b/docs/boundary.md new file mode 100644 index 0000000..4fe9068 --- /dev/null +++ b/docs/boundary.md @@ -0,0 +1,116 @@ +# crawlkit boundary + +`crawlkit` is the shared mechanics layer for local-first crawler archives. It +should make each crawl app smaller and more uniform without turning into a +generic Slack, Discord, Notion, or GitHub crawler. + +The rule is simple: move behavior into `crawlkit` only when it is provider +neutral, reusable by at least two apps, and can preserve the app's existing +database and CLI contracts. Keep provider schemas, auth, API clients, cache +parsers, and product-specific ranking in the apps. + +## owns + +`crawlkit` should own these surfaces: + +- Config paths, TOML loading defaults, runtime directories, and token + diagnostics that are the same across apps. +- SQLite connection hygiene: read-only opens, busy timeouts, WAL pragmas, + schema-version checks, transactions, safe identifier quoting, and generic + query helpers. +- Snapshot packing: manifest format, JSONL/Gzip shards, table filters, + import progress, sidecar registration, backward-compatible manifest reads, + and import callbacks. +- Git mirror mechanics: clone/init, pull, origin management, path-scoped + commits, push retry behavior, and portable SQLite checkout cleanup. +- Sync freshness semantics: cursor/freshness records, stale checks, manifest + import state, and adapters for legacy table shapes. +- Embedding provider clients and vector math once extracted: OpenAI-compatible, + Ollama, llama.cpp, probe diagnostics, cosine search, top-k selection, + reciprocal-rank fusion, vector encoding, and dimension validation. +- FTS utilities that do not know app schemas: query escaping, snippets, + rebuild/optimize helpers, deferred refresh orchestration, and progress logs. +- Terminal archive browsing primitives: pane layout, sorting, focus, mouse + actions, menus, detail rendering primitives, and local/remote status chrome. +- Safe read-only desktop-cache snapshot helpers. The provider-specific parsing + of those snapshots stays in the apps. + +## does not own + +`crawlkit` should not own these surfaces: + +- Slack, Discord, Notion, GitHub, or future provider API clients. +- App-specific auth flows, token scopes, rate-limit policy, and provider + object normalization. +- App database schemas for messages, pages, threads, issues, members, blocks, + comments, channels, guilds, or workspaces. +- Provider desktop-cache parsing such as Slack LevelDB records, Discord cache + rows, or Notion SQLite object trees. +- App-specific FTS bodies and ranking, such as Notion display-tree ordering, + Slack mention normalization, Discord member search, and GitHub issue/PR + syntax. +- Summarization, clustering, triage inference, or prompts until the same + behavior exists in more than one app. +- App CLI command contracts. Shared helpers can format JSON/text/log output, + but the apps decide command names, flags, backward-compatible aliases, and + deprecation behavior. + +## current app seams + +| app | embeddings/search/inference | sync state | snapshot, sqlite, remote | +| --- | --- | --- | --- | +| `gitcrawl` | Has the richest inference path: OpenAI-only embeddings, local thread vectors, exact cosine neighbors, durable clusters, and GitHub thread/document FTS. The vector math and portable embedding client should move to `crawlkit`; GitHub thread task construction, clustering, and prompts stay app-owned. | Uses app-owned repo sync and portable metadata. Do not force it into the shared `sync_state` table. | Has the most mature portable-store git behavior: clone/pull, dirty checkout recovery, SQLite sidecar cleanup, and portable payload pruning. The generic git/SQLite checkout pieces belong in `mirror`; GitHub portable schema pruning stays app-owned. | +| `discrawl` | Has the best reusable embedding provider surface: OpenAI, OpenAI-compatible, Ollama, llama.cpp, probe checks, float32 blobs, semantic search, hybrid search, and RRF. Provider clients, vector encoding, cosine, top-k, and RRF should be extracted. Discord message/member FTS and privacy boundaries stay app-owned. | Uses a single `scope -> cursor` table with local-only scopes such as `wiretap:*`. Shared state should adapt to this shape, not migrate it. | Uses `snapshot` and `mirror`, with important app filters for DMs and local-only sync state. Embedding bundles are sidecars today; generic sidecar/binary-vector mechanics should move to `snapshot`, while DM exclusion remains in `discrawl`. | +| `slacrawl` | Has Slack FTS and Slack text/mention normalization. Embeddings are only reserved placeholders. Slack normalization and message FTS stay app-owned. | Closest to `crawlkit/state`: `source_name`, `entity_type`, `entity_id`, `value`, `updated_at`. It is the first app that can consume shared state directly. | Uses `snapshot` and `mirror` cleanly. Its remaining share logic is mostly table lists, search-index rebuilds, and import freshness. | +| `notcrawl` | Has page/comment FTS, display-tree page bodies, deferred FTS refresh, and maintain/rebuild commands. No embeddings yet. Deferred FTS orchestration can become shared; Notion page/comment FTS content stays app-owned. | Uses `source`, `entity_type`, `entity_id`, `cursor`, `synced_at`. Shared state needs column mapping or adapters before this can be de-duped safely. | Still carries custom manifest, JSONL/Gzip, Markdown sidecars, generated-path commits, and origin update behavior. The snapshot sidecar model and mirror path-scoped commit/origin helpers should let this converge without changing the Notion DB schema. | + +## extraction order + +1. Harden `mirror` first. + Add origin update semantics for existing checkouts, path-scoped commits so + publish never stages unrelated files, existing-origin pull for update flows, + and portable SQLite sidecar cleanup. This is the lowest-risk de-dupe because + every app already shells out to git in similar ways. + +2. Expand `snapshot` sidecars. + Keep table export/import generic, but add first-class sidecar/bundle helpers + for Markdown pages and embedding JSONL/Gzip bundles. Apps still provide + filters, table lists, delete callbacks, FTS rebuild callbacks, and privacy + rules. + +3. Add state adapters instead of one forced schema. + Keep the current source/entity/value schema as the canonical new shape, but + add adapters for `scope -> cursor` and `source/entity/cursor/synced_at` + stores. This avoids risky migrations while making freshness and stale checks + uniform. + +4. Extract embeddings and vector search. + Start from `discrawl/internal/embed` for provider clients and from + `gitcrawl/internal/vector` plus discrawl search helpers for cosine, top-k, + vector encoding, and reciprocal-rank fusion. Apps keep task selection, + content hashing policy, provider config placement, and result persistence. + +5. Add generic FTS helpers. + Provide query escaping, snippets, rebuild/optimize wrappers, deferred refresh + orchestration, and progress logging. Do not move entity-specific FTS schemas + or ranking into `crawlkit`. + +6. Keep inference app-owned until there are two implementations. + `gitcrawl` clustering and summary-oriented work should not be generalized + yet. Extract only the provider/vector primitives it shares with chat/document + crawlers. + +## compatibility gates + +Every extraction must keep these constraints: + +- Do not change existing app table shapes unless the app migration is explicitly + backward-compatible and tested against old fixtures. +- Do not change app command names, flags, JSON shape, or deprecated aliases + unless the downstream app changelog calls it out. +- Do not touch live stores during tests. Use temp homes, temp configs, and temp + SQLite files. +- Use `GOWORK=off` when proving the public `crawlkit` API so local workspaces + do not hide missing release tags. +- Keep privacy filters in the app layer. `crawlkit` can run a filter callback; + it should not know what a Discord DM or Slack private channel means.