commit f13029e3d13900b36acbea3c8c1565970baa1d50
Author: Vincent Koc
Date: Sun Apr 26 22:58:48 2026 -0700

    docs: define gitcrawl scope

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..5f72c12
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,20 @@
+.DS_Store
+.env
+.env.local
+
+bin/
+dist/
+coverage/
+
+*.db
+*.db-shm
+*.db-wal
+*.sqlite
+*.sqlite-shm
+*.sqlite-wal
+
+data/
+tmp/
+cache/
+logs/
+vectors/
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..f06b860
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2026 OpenClaw
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..faa41c6
--- /dev/null
+++ b/README.md
@@ -0,0 +1,45 @@
+# gitcrawl
+
+`gitcrawl` is a local-first GitHub issue and pull request crawler for maintainer triage.
+
+It is the Go implementation of the `ghcrawl` product contract, minus the local HTTP API. Data stays local in SQLite. The primary runtime surfaces are the CLI, JSON command output, and a future TUI.
+
+## Status
+
+Early bootstrap. The implementation is being built in small commits.
+
+## Planned Commands
+
+```bash
+gitcrawl init
+gitcrawl doctor
+gitcrawl sync owner/repo
+gitcrawl refresh owner/repo
+gitcrawl clusters owner/repo --json
+gitcrawl cluster-detail owner/repo --id 123 --json
+gitcrawl search owner/repo --query "download stalls" --json
+gitcrawl tui owner/repo
+```
+
+`serve` is intentionally not part of `gitcrawl`.
+
+## Local Defaults
+
+- config: `~/.config/gitcrawl/config.toml`
+- database: `~/.config/gitcrawl/gitcrawl.db`
+- cache: `~/.config/gitcrawl/cache`
+- vectors: `~/.config/gitcrawl/vectors`
+- logs: `~/.config/gitcrawl/logs`
+
+## Requirements
+
+- Go 1.26+
+- a GitHub token for sync commands
+- an OpenAI API key only for summary and embedding commands
+
+## Development
+
+```bash
+go test ./...
+go build ./cmd/gitcrawl
+```
diff --git a/SPEC.md b/SPEC.md
new file mode 100644
index 0000000..f13f538
--- /dev/null
+++ b/SPEC.md
@@ -0,0 +1,110 @@
+# gitcrawl Spec
+
+## Product Contract
+
+`gitcrawl` is a Go implementation of `ghcrawl` for local-first GitHub maintainer triage.
+
+The target is functional parity with `ghcrawl` except that `gitcrawl` does not expose a local HTTP API.
+
+## In Scope
+
+- local SQLite storage
+- metadata-first GitHub sync for open issues and pull requests
+- optional comment, review, review-comment, and PR code hydration
+- canonical thread document building
+- FTS search
+- OpenAI summaries and embeddings
+- deterministic fingerprints
+- vector search
+- clustering and durable cluster governance
+- portable sync export/import
+- CLI JSON surfaces for automation and agents
+- TUI browsing after core JSON contracts settle
+
+## Out Of Scope
+
+- local HTTP API
+- hosted service runtime
+- browser web UI
+- GitHub write-back actions
+
+## Architecture
+
+- `cmd/gitcrawl`: executable entrypoint
+- `internal/cli`: command parsing and output
+- `internal/config`: config and env resolution
+- `internal/store`: SQLite schema and persistence
+- `internal/github`: GitHub API client
+- `internal/syncer`: repository sync workflows
+- `internal/documents`: canonical document generation
+- `internal/openai`: OpenAI summaries and embeddings
+- `internal/vector`: vector search abstraction
+- `internal/cluster`: similarity and durable cluster governance
+- `internal/search`: keyword, semantic, and hybrid search
+- `internal/portable`: compact sync export/import
+- `internal/tui`: terminal UI
+
+## Command Surface
+
+No `serve` command.
+
+Planned public commands:
+
+- `init`
+- `doctor`
+- `configure`
+- `version`
+- `sync`
+- `refresh`
+- `summarize`
+- `key-summaries`
+- `embed`
+- `cluster`
+- `threads`
+- `runs`
+- `clusters`
+- `durable-clusters`
+- `cluster-detail`
+- `cluster-explain`
+- `neighbors`
+- `search`
+- `close-thread`
+- `close-cluster`
+- `exclude-cluster-member`
+- `include-cluster-member`
+- `set-cluster-canonical`
+- `merge-clusters`
+- `split-cluster`
+- `export-sync`
+- `import-sync`
+- `validate-sync`
+- `portable-size`
+- `sync-status`
+- `optimize`
+- `tui`
+- `completion`
+
+## Config
+
+Default config path:
+
+```text
+~/.config/gitcrawl/config.toml
+```
+
+Default database path:
+
+```text
+~/.config/gitcrawl/gitcrawl.db
+```
+
+Primary environment variables:
+
+- `GITCRAWL_CONFIG`
+- `GITHUB_TOKEN`
+- `OPENAI_API_KEY`
+- `GITCRAWL_DB_PATH`
+- `GITCRAWL_SUMMARY_MODEL`
+- `GITCRAWL_EMBED_MODEL`
+
+Legacy `GHCRAWL_*` aliases should be supported where the compatibility cost is low.
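
Review note: the Config section of SPEC.md lists default paths, environment variables, and legacy `GHCRAWL_*` aliases, but this commit contains no code yet. One possible shape for the resolution logic in `internal/config` is sketched below. Everything in it is an assumption for illustration, not part of the commit: the `Paths` type, the `resolvePaths` and `firstSet` helpers, the alias names `GHCRAWL_CONFIG` and `GHCRAWL_DB_PATH`, and the precedence order (new variable, then legacy alias, then default). Reading values from `config.toml` itself is left out.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// Paths holds the resolved file locations (hypothetical type).
type Paths struct {
	Config   string
	Database string
}

// lookupFunc has the same signature as os.LookupEnv, so a fake
// environment can be injected in tests.
type lookupFunc func(key string) (string, bool)

// firstSet returns the first non-empty value among keys, checked in order.
func firstSet(lookup lookupFunc, keys ...string) (string, bool) {
	for _, k := range keys {
		if v, ok := lookup(k); ok && v != "" {
			return v, true
		}
	}
	return "", false
}

// resolvePaths starts from the defaults under <home>/.config/gitcrawl and
// applies environment overrides. The GHCRAWL_* names are assumed aliases.
func resolvePaths(lookup lookupFunc, home string) Paths {
	base := filepath.Join(home, ".config", "gitcrawl")
	p := Paths{
		Config:   filepath.Join(base, "config.toml"),
		Database: filepath.Join(base, "gitcrawl.db"),
	}
	if v, ok := firstSet(lookup, "GITCRAWL_CONFIG", "GHCRAWL_CONFIG"); ok {
		p.Config = v
	}
	if v, ok := firstSet(lookup, "GITCRAWL_DB_PATH", "GHCRAWL_DB_PATH"); ok {
		p.Database = v
	}
	return p
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot determine home directory:", err)
		os.Exit(1)
	}
	p := resolvePaths(os.LookupEnv, home)
	fmt.Println("config:  ", p.Config)
	fmt.Println("database:", p.Database)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the precedence rules testable without mutating the process environment.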