docs: define gitcrawl scope

This commit is contained in:
Vincent Koc 2026-04-26 22:58:48 -07:00
commit f13029e3d1
No known key found for this signature in database
4 changed files with 196 additions and 0 deletions

20
.gitignore vendored Normal file
View File

@ -0,0 +1,20 @@
.DS_Store
.env
.env.local
bin/
dist/
coverage/
*.db
*.db-shm
*.db-wal
*.sqlite
*.sqlite-shm
*.sqlite-wal
data/
tmp/
cache/
logs/
vectors/

21
LICENSE Normal file
View File

@ -0,0 +1,21 @@
MIT License
Copyright (c) 2026 OpenClaw
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

45
README.md Normal file
View File

@ -0,0 +1,45 @@
# gitcrawl
`gitcrawl` is a local-first GitHub issue and pull request crawler for maintainer triage.
It is the Go implementation of the `ghcrawl` product contract, minus the local HTTP API. Data stays local in SQLite. The primary runtime surfaces are the CLI, JSON command output, and a future TUI.
## Status
Early bootstrap. The implementation is being built in small commits.
## Planned Commands
```bash
gitcrawl init
gitcrawl doctor
gitcrawl sync owner/repo
gitcrawl refresh owner/repo
gitcrawl clusters owner/repo --json
gitcrawl cluster-detail owner/repo --id 123 --json
gitcrawl search owner/repo --query "download stalls" --json
gitcrawl tui owner/repo
```
`serve` is intentionally not part of `gitcrawl`.
## Local Defaults
- config: `~/.config/gitcrawl/config.toml`
- database: `~/.config/gitcrawl/gitcrawl.db`
- cache: `~/.config/gitcrawl/cache`
- vectors: `~/.config/gitcrawl/vectors`
- logs: `~/.config/gitcrawl/logs`
## Requirements
- Go 1.26+
- a GitHub token for sync commands
- an OpenAI API key only for summary and embedding commands
## Development
```bash
go test ./...
go build ./cmd/gitcrawl
```

110
SPEC.md Normal file
View File

@ -0,0 +1,110 @@
# gitcrawl Spec
## Product Contract
`gitcrawl` is a Go implementation of `ghcrawl` for local-first GitHub maintainer triage.
The target is functional parity with `ghcrawl` except that `gitcrawl` does not expose a local HTTP API.
## In Scope
- local SQLite storage
- metadata-first GitHub sync for open issues and pull requests
- optional comment, review, review-comment, and PR code hydration
- canonical thread document building
- FTS search
- OpenAI summaries and embeddings
- deterministic fingerprints
- vector search
- clustering and durable cluster governance
- portable sync export/import
- CLI JSON surfaces for automation and agents
- TUI browsing after core JSON contracts settle
## Out Of Scope
- local HTTP API
- hosted service runtime
- browser web UI
- GitHub write-back actions
## Architecture
- `cmd/gitcrawl`: executable entrypoint
- `internal/cli`: command parsing and output
- `internal/config`: config and env resolution
- `internal/store`: SQLite schema and persistence
- `internal/github`: GitHub API client
- `internal/syncer`: repository sync workflows
- `internal/documents`: canonical document generation
- `internal/openai`: OpenAI summaries and embeddings
- `internal/vector`: vector search abstraction
- `internal/cluster`: similarity and durable cluster governance
- `internal/search`: keyword, semantic, and hybrid search
- `internal/portable`: compact sync export/import
- `internal/tui`: terminal UI
## Command Surface
No `serve` command.
Planned public commands:
- `init`
- `doctor`
- `configure`
- `version`
- `sync`
- `refresh`
- `summarize`
- `key-summaries`
- `embed`
- `cluster`
- `threads`
- `runs`
- `clusters`
- `durable-clusters`
- `cluster-detail`
- `cluster-explain`
- `neighbors`
- `search`
- `close-thread`
- `close-cluster`
- `exclude-cluster-member`
- `include-cluster-member`
- `set-cluster-canonical`
- `merge-clusters`
- `split-cluster`
- `export-sync`
- `import-sync`
- `validate-sync`
- `portable-size`
- `sync-status`
- `optimize`
- `tui`
- `completion`
## Config
Default config path:
```text
~/.config/gitcrawl/config.toml
```
Default database path:
```text
~/.config/gitcrawl/gitcrawl.db
```
Primary environment variables:
- `GITCRAWL_CONFIG`
- `GITHUB_TOKEN`
- `OPENAI_API_KEY`
- `GITCRAWL_DB_PATH`
- `GITCRAWL_SUMMARY_MODEL`
- `GITCRAWL_EMBED_MODEL`
Legacy `GHCRAWL_*` aliases should be supported where the compatibility cost is low.