docs: add gitcrawl.sh site
parent
126059701c
commit
fc12f81b6a
64
.github/workflows/pages.yml
vendored
Normal file
@ -0,0 +1,64 @@
name: Pages

on:
  push:
    branches:
      - main
    paths:
      - "docs/**"
      - ".github/workflows/pages.yml"
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    name: Build site
    runs-on: ubuntu-latest
    steps:
      - name: Check out
        uses: actions/checkout@v6

      - name: Set up Ruby
        uses: ruby/setup-ruby@v1
        with:
          ruby-version: "3.3"
          bundler-cache: true
          working-directory: docs

      - name: Configure Pages
        id: pages
        uses: actions/configure-pages@v5

      - name: Build with Jekyll
        working-directory: docs
        run: |
          bundle exec jekyll build \
            --baseurl "${{ steps.pages.outputs.base_path }}" \
            --destination ../_site
        env:
          JEKYLL_ENV: production

      - name: Upload artifact
        uses: actions/upload-pages-artifact@v3
        with:
          path: _site

  deploy:
    name: Deploy to gitcrawl.sh
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - name: Deploy to GitHub Pages
        id: deployment
        uses: actions/deploy-pages@v4
@ -4,6 +4,8 @@
`gitcrawl` is a local-first GitHub issue and pull request crawler for maintainer triage. Data stays local in SQLite. The primary runtime surfaces are the CLI, JSON command output, and the terminal UI. There is no local HTTP API.

+Full documentation: [gitcrawl.sh](https://gitcrawl.sh)
+
## Status

Early bootstrap. The implementation is being built in small commits.
@ -50,7 +52,7 @@ gitcrawl tui owner/repo
Pass `--numbers` to refresh exact issue or pull request rows without relying on list ordering or updated-time windows.
Pass `--with pr-details` or `--include-pr-details` to hydrate pull request files, commits, checks, and workflow runs for local review. The `gh` shim can also auto-hydrate one exact PR on a PR-detail miss, then retry locally.
`gitcrawl search issues|prs` accepts the common `gh search` shape (`<query> -R owner/repo --state open --json fields --limit N`) and answers from the local SQLite cache. It is intended for discovery without spending GitHub REST search quota; use `gh` for final live verification and GitHub write actions. Pass `--sync-if-stale 5m` to perform one metadata sync before the cached search when the local repository mirror is older than that duration.
-`gitcrawl gh` is a gh-compatible shim for agent workflows. It answers broad `gh search issues|prs`, `gh issue/pr list`, supported `gh issue/pr view --json` fields, hydrated `gh pr checks`, and hydrated `gh run list/view` from local SQLite, then falls through to the real GitHub CLI for unsupported commands. Local `gh issue/pr list` supports common filters such as `--author`, `--assignee`, and repeated `--label`. Read-only fallthroughs such as `gh pr diff`, `gh repo view/list`, `gh label list`, and GET-only `gh api` calls use a short persistent cache under `cache/gh-shim`; `gh pr diff` entries are keyed by the cached PR head SHA when available. Mutating commands pass through, increment write counters, and clear that cache. `gh xcache stats|keys|gc|flush` inspects, garbage-collects, or clears the fallthrough cache. Set `GITCRAWL_GH_PATH` to choose the backend `gh`, and symlink or install the binary as `gh`/`gitcrawl-gh` to run the shim directly.
+`gitcrawl gh` is a gh-compatible shim for agent workflows. It answers broad `gh search issues|prs`, `gh issue/pr list`, supported `gh issue/pr view --json` fields, hydrated `gh pr checks`, and hydrated `gh run list/view` from local SQLite, then falls through to the real GitHub CLI for unsupported commands. Local `gh issue/pr list` supports common filters such as `--author`, `--assignee`, and repeated `--label`. Read-only fallthroughs such as `gh pr diff`, `gh repo view/list`, `gh release list/view`, `gh workflow list/view`, `gh secret list`, `gh variable get/list`, `gh label list`, read-only `gh search` kinds, and GET-only `gh api` calls use a short persistent cache under `cache/gh-shim`; `gh pr diff` entries are keyed by the cached PR head SHA when available. Repeat read failures are cached by default so agents do not rediscover the same missing release or workflow; set `GITCRAWL_GH_CACHE_ERRORS=0` to disable that behavior. Mutating commands pass through, increment write counters, and clear that cache. `gh xcache stats|keys|gc|flush` inspects, garbage-collects, or clears the fallthrough cache. Set `GITCRAWL_GH_PATH` to choose the backend `gh`, and symlink or install the binary as `gh`/`gitcrawl-gh` to run the shim directly.
The TUI starts at `--min-size 5` and `--sort size`, like ghcrawl's saved default, so the first screen is the useful cluster workload instead of singleton noise. Pass `--min-size 1` when you intentionally want singleton clusters. Mouse support is built in: click rows, wheel panes, and right-click for copy, sort, filter, jump, link, neighbor, local close/reopen, and member triage actions. Press `a` to open the same action menu from the keyboard, `#` to jump directly to an issue or PR number, `p` to switch between repositories already present in the local store, or `n` to load neighbors for the selected issue or PR. Enter from the members pane also loads neighbors before opening detail. The TUI quietly refreshes from the local store every 15 seconds.

## Local Defaults
1
docs/CNAME
Normal file
@ -0,0 +1 @@
gitcrawl.sh
12
docs/Gemfile
Normal file
@ -0,0 +1,12 @@
source "https://rubygems.org"

# GitHub Pages dependencies for local preview.
# `bundle exec jekyll serve` reproduces the gitcrawl.sh site.
gem "github-pages", group: :jekyll_plugins
gem "jekyll-remote-theme"
gem "jekyll-seo-tag"
gem "jekyll-sitemap"
gem "just-the-docs"

# Required for Ruby 3+ on macOS.
gem "webrick"
324
docs/Gemfile.lock
Normal file
@ -0,0 +1,324 @@
GEM
  remote: https://rubygems.org/
  specs:
    activesupport (8.1.3)
      base64
      bigdecimal
      concurrent-ruby (~> 1.0, >= 1.3.1)
      connection_pool (>= 2.2.5)
      drb
      i18n (>= 1.6, < 2)
      json
      logger (>= 1.4.2)
      minitest (>= 5.1)
      securerandom (>= 0.3)
      tzinfo (~> 2.0, >= 2.0.5)
      uri (>= 0.13.1)
    addressable (2.9.0)
      public_suffix (>= 2.0.2, < 8.0)
    base64 (0.3.0)
    bigdecimal (4.1.2)
    coffee-script (2.4.1)
      coffee-script-source
      execjs
    coffee-script-source (1.12.2)
    colorator (1.1.0)
    commonmarker (0.23.12)
    concurrent-ruby (1.3.6)
    connection_pool (3.0.2)
    csv (3.3.5)
    dnsruby (1.73.1)
      base64 (>= 0.2)
      logger (~> 1.6)
      simpleidn (~> 0.2.1)
    drb (2.2.3)
    em-websocket (0.5.3)
      eventmachine (>= 0.12.9)
      http_parser.rb (~> 0)
    ethon (0.18.0)
      ffi (>= 1.15.0)
      logger
    eventmachine (1.2.7)
    execjs (2.10.1)
    faraday (2.14.1)
      faraday-net_http (>= 2.0, < 3.5)
      json
      logger
    faraday-net_http (3.4.2)
      net-http (~> 0.5)
    ffi (1.17.4-aarch64-linux-gnu)
    ffi (1.17.4-aarch64-linux-musl)
    ffi (1.17.4-arm-linux-gnu)
    ffi (1.17.4-arm-linux-musl)
    ffi (1.17.4-arm64-darwin)
    ffi (1.17.4-x86_64-darwin)
    ffi (1.17.4-x86_64-linux-gnu)
    ffi (1.17.4-x86_64-linux-musl)
    forwardable-extended (2.6.0)
    gemoji (4.1.0)
    github-pages (232)
      github-pages-health-check (= 1.18.2)
      jekyll (= 3.10.0)
      jekyll-avatar (= 0.8.0)
      jekyll-coffeescript (= 1.2.2)
      jekyll-commonmark-ghpages (= 0.5.1)
      jekyll-default-layout (= 0.1.5)
      jekyll-feed (= 0.17.0)
      jekyll-gist (= 1.5.0)
      jekyll-github-metadata (= 2.16.1)
      jekyll-include-cache (= 0.2.1)
      jekyll-mentions (= 1.6.0)
      jekyll-optional-front-matter (= 0.3.2)
      jekyll-paginate (= 1.1.0)
      jekyll-readme-index (= 0.3.0)
      jekyll-redirect-from (= 0.16.0)
      jekyll-relative-links (= 0.6.1)
      jekyll-remote-theme (= 0.4.3)
      jekyll-sass-converter (= 1.5.2)
      jekyll-seo-tag (= 2.8.0)
      jekyll-sitemap (= 1.4.0)
      jekyll-swiss (= 1.0.0)
      jekyll-theme-architect (= 0.2.0)
      jekyll-theme-cayman (= 0.2.0)
      jekyll-theme-dinky (= 0.2.0)
      jekyll-theme-hacker (= 0.2.0)
      jekyll-theme-leap-day (= 0.2.0)
      jekyll-theme-merlot (= 0.2.0)
      jekyll-theme-midnight (= 0.2.0)
      jekyll-theme-minimal (= 0.2.0)
      jekyll-theme-modernist (= 0.2.0)
      jekyll-theme-primer (= 0.6.0)
      jekyll-theme-slate (= 0.2.0)
      jekyll-theme-tactile (= 0.2.0)
      jekyll-theme-time-machine (= 0.2.0)
      jekyll-titles-from-headings (= 0.5.3)
      jemoji (= 0.13.0)
      kramdown (= 2.4.0)
      kramdown-parser-gfm (= 1.1.0)
      liquid (= 4.0.4)
      mercenary (~> 0.3)
      minima (= 2.5.1)
      nokogiri (>= 1.16.2, < 2.0)
      rouge (= 3.30.0)
      terminal-table (~> 1.4)
      webrick (~> 1.8)
    github-pages-health-check (1.18.2)
      addressable (~> 2.3)
      dnsruby (~> 1.60)
      octokit (>= 4, < 8)
      public_suffix (>= 3.0, < 6.0)
      typhoeus (~> 1.3)
    html-pipeline (2.14.3)
      activesupport (>= 2)
      nokogiri (>= 1.4)
    http_parser.rb (0.8.1)
    i18n (1.14.8)
      concurrent-ruby (~> 1.0)
    jekyll (3.10.0)
      addressable (~> 2.4)
      colorator (~> 1.0)
      csv (~> 3.0)
      em-websocket (~> 0.5)
      i18n (>= 0.7, < 2)
      jekyll-sass-converter (~> 1.0)
      jekyll-watch (~> 2.0)
      kramdown (>= 1.17, < 3)
      liquid (~> 4.0)
      mercenary (~> 0.3.3)
      pathutil (~> 0.9)
      rouge (>= 1.7, < 4)
      safe_yaml (~> 1.0)
      webrick (>= 1.0)
    jekyll-avatar (0.8.0)
      jekyll (>= 3.0, < 5.0)
    jekyll-coffeescript (1.2.2)
      coffee-script (~> 2.2)
      coffee-script-source (~> 1.12)
    jekyll-commonmark (1.4.0)
      commonmarker (~> 0.22)
    jekyll-commonmark-ghpages (0.5.1)
      commonmarker (>= 0.23.7, < 1.1.0)
      jekyll (>= 3.9, < 4.0)
      jekyll-commonmark (~> 1.4.0)
      rouge (>= 2.0, < 5.0)
    jekyll-default-layout (0.1.5)
      jekyll (>= 3.0, < 5.0)
    jekyll-feed (0.17.0)
      jekyll (>= 3.7, < 5.0)
    jekyll-gist (1.5.0)
      octokit (~> 4.2)
    jekyll-github-metadata (2.16.1)
      jekyll (>= 3.4, < 5.0)
      octokit (>= 4, < 7, != 4.4.0)
    jekyll-include-cache (0.2.1)
      jekyll (>= 3.7, < 5.0)
    jekyll-mentions (1.6.0)
      html-pipeline (~> 2.3)
      jekyll (>= 3.7, < 5.0)
    jekyll-optional-front-matter (0.3.2)
      jekyll (>= 3.0, < 5.0)
    jekyll-paginate (1.1.0)
    jekyll-readme-index (0.3.0)
      jekyll (>= 3.0, < 5.0)
    jekyll-redirect-from (0.16.0)
      jekyll (>= 3.3, < 5.0)
    jekyll-relative-links (0.6.1)
      jekyll (>= 3.3, < 5.0)
    jekyll-remote-theme (0.4.3)
      addressable (~> 2.0)
      jekyll (>= 3.5, < 5.0)
      jekyll-sass-converter (>= 1.0, <= 3.0.0, != 2.0.0)
      rubyzip (>= 1.3.0, < 3.0)
    jekyll-sass-converter (1.5.2)
      sass (~> 3.4)
    jekyll-seo-tag (2.8.0)
      jekyll (>= 3.8, < 5.0)
    jekyll-sitemap (1.4.0)
      jekyll (>= 3.7, < 5.0)
    jekyll-swiss (1.0.0)
    jekyll-theme-architect (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-cayman (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-dinky (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-hacker (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-leap-day (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-merlot (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-midnight (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-minimal (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-modernist (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-primer (0.6.0)
      jekyll (> 3.5, < 5.0)
      jekyll-github-metadata (~> 2.9)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-slate (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-tactile (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-theme-time-machine (0.2.0)
      jekyll (> 3.5, < 5.0)
      jekyll-seo-tag (~> 2.0)
    jekyll-titles-from-headings (0.5.3)
      jekyll (>= 3.3, < 5.0)
    jekyll-watch (2.2.1)
      listen (~> 3.0)
    jemoji (0.13.0)
      gemoji (>= 3, < 5)
      html-pipeline (~> 2.2)
      jekyll (>= 3.0, < 5.0)
    json (2.19.5)
    just-the-docs (0.12.0)
      jekyll (>= 3.8.5)
      jekyll-include-cache
      jekyll-seo-tag (>= 2.0)
      rake (>= 12.3.1)
    kramdown (2.4.0)
      rexml
    kramdown-parser-gfm (1.1.0)
      kramdown (~> 2.0)
    liquid (4.0.4)
    listen (3.10.0)
      logger
      rb-fsevent (~> 0.10, >= 0.10.3)
      rb-inotify (~> 0.9, >= 0.9.10)
    logger (1.7.0)
    mercenary (0.3.6)
    minima (2.5.1)
      jekyll (>= 3.5, < 5.0)
      jekyll-feed (~> 0.9)
      jekyll-seo-tag (~> 2.1)
    minitest (6.0.6)
      drb (~> 2.0)
      prism (~> 1.5)
    net-http (0.9.1)
      uri (>= 0.11.1)
    nokogiri (1.19.3-aarch64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.19.3-aarch64-linux-musl)
      racc (~> 1.4)
    nokogiri (1.19.3-arm-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.19.3-arm-linux-musl)
      racc (~> 1.4)
    nokogiri (1.19.3-arm64-darwin)
      racc (~> 1.4)
    nokogiri (1.19.3-x86_64-darwin)
      racc (~> 1.4)
    nokogiri (1.19.3-x86_64-linux-gnu)
      racc (~> 1.4)
    nokogiri (1.19.3-x86_64-linux-musl)
      racc (~> 1.4)
    octokit (4.25.1)
      faraday (>= 1, < 3)
      sawyer (~> 0.9)
    pathutil (0.16.2)
      forwardable-extended (~> 2.6)
    prism (1.9.0)
    public_suffix (5.1.1)
    racc (1.8.1)
    rake (13.4.2)
    rb-fsevent (0.11.2)
    rb-inotify (0.11.1)
      ffi (~> 1.0)
    rexml (3.4.4)
    rouge (3.30.0)
    rubyzip (2.4.1)
    safe_yaml (1.0.5)
    sass (3.7.4)
      sass-listen (~> 4.0.0)
    sass-listen (4.0.0)
      rb-fsevent (~> 0.9, >= 0.9.4)
      rb-inotify (~> 0.9, >= 0.9.7)
    sawyer (0.9.3)
      addressable (>= 2.3.5)
      faraday (>= 0.17.3, < 3)
    securerandom (0.4.1)
    simpleidn (0.2.3)
    terminal-table (1.8.0)
      unicode-display_width (~> 1.1, >= 1.1.1)
    typhoeus (1.6.0)
      ethon (>= 0.18.0)
    tzinfo (2.0.6)
      concurrent-ruby (~> 1.0)
    unicode-display_width (1.8.0)
    uri (1.1.1)
    webrick (1.9.2)

PLATFORMS
  aarch64-linux-gnu
  aarch64-linux-musl
  arm-linux-gnu
  arm-linux-musl
  arm64-darwin
  x86_64-darwin
  x86_64-linux-gnu
  x86_64-linux-musl

DEPENDENCIES
  github-pages
  jekyll-remote-theme
  jekyll-seo-tag
  jekyll-sitemap
  just-the-docs
  webrick

BUNDLED WITH
   2.5.22
51
docs/_config.yml
Normal file
@ -0,0 +1,51 @@
title: gitcrawl
description: Local-first GitHub issue and pull request crawler for maintainer triage. Sync, search, cluster, and review related issues and PRs from a SQLite cache that lives on your machine.
url: https://gitcrawl.sh
baseurl: ""

remote_theme: just-the-docs/just-the-docs

color_scheme: dark
search_enabled: true
heading_anchors: true
permalink: pretty

aux_links:
  GitHub: https://github.com/openclaw/gitcrawl
  Releases: https://github.com/openclaw/gitcrawl/releases
  Issues: https://github.com/openclaw/gitcrawl/issues

aux_links_new_tab: true

footer_content: "<a href='https://github.com/openclaw/gitcrawl'>gitcrawl</a> is open source under the <a href='https://github.com/openclaw/gitcrawl/blob/main/LICENSE'>MIT license</a>."

mermaid:
  version: "10.6.1"

callouts_level: quiet
callouts:
  note:
    title: Note
    color: blue
  tip:
    title: Tip
    color: green
  warning:
    title: Warning
    color: yellow
  important:
    title: Important
    color: purple

plugins:
  - jekyll-remote-theme
  - jekyll-seo-tag
  - jekyll-sitemap

exclude:
  - Gemfile
  - Gemfile.lock
  - vendor/
  - .bundle/
  - node_modules/
  - README.md
175
docs/automation.md
Normal file
@ -0,0 +1,175 @@
---
title: Automation
nav_order: 14
permalink: /automation/
---

# Automation
{: .no_toc }

Stable JSON contracts, agent recipes, and patterns for keeping the local mirror warm without manual ceremony.
{: .fs-6 .fw-300 }

1. TOC
{:toc}

## JSON output is first class

Every command supports `--json` (or the global `--format json`). The resulting payload is pretty-printed with stable field names so you can pipe it directly into `jq` or feed it to an agent as structured context.

```bash
gitcrawl sync owner/repo --json | jq '{run_id, inserted, updated}'
gitcrawl clusters owner/repo --json --sort size --min-size 5 \
  | jq '.clusters[] | {id, members: .member_count, latest: .latest_thread_number}'
```

For the full per-command JSON shapes, see the individual feature pages and the [Commands reference](./commands).

## Exit codes

- `0`: success
- non-zero: usage error, command not implemented, or runtime error

On failure, stderr carries a human-readable error message; stdout is reserved for the requested output (text or JSON), so you can pipe stdout to `jq` without mixing diagnostics into the payload.
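
The stdout/stderr split above is what makes a thin wrapper safe to write. The following is a minimal Python sketch, not part of gitcrawl: `run_json` and `CommandError` are hypothetical names, and the demo uses the Python interpreter as a stand-in command so it runs without gitcrawl installed.

```python
import json
import subprocess
import sys


class CommandError(RuntimeError):
    """Raised when the wrapped command exits non-zero."""


def run_json(argv):
    """Run a command, parse stdout as JSON, and surface stderr on failure."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    if proc.returncode != 0:
        # stderr carries the human-readable diagnostic; stdout is not parsed.
        raise CommandError(f"exit {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


if __name__ == "__main__":
    # Stand-in for something like ["gitcrawl", "sync", "owner/repo", "--json"].
    fake = [sys.executable, "-c", "import json; print(json.dumps({'run_id': 1}))"]
    print(run_json(fake))
```

An agent harness would pass the real argv instead of the stand-in and treat `CommandError` as the signal to show the stderr text to a human.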

## Keeping the mirror fresh

Three patterns, in increasing order of automation:

### On-demand staleness check

Use `--sync-if-stale` on `gitcrawl search` (or the gh shim's search):

```bash
gitcrawl search issues "manifest cache" \
  -R owner/repo \
  --sync-if-stale 5m \
  --json number,title,url
```

Best for ad-hoc agent tools that should bound staleness but minimize sync calls.
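
The decision behind a staleness window can be sketched in a few lines of Python. This is an illustration under assumptions, not gitcrawl's code: `parse_duration` and `is_stale` are hypothetical helpers, the duration grammar is limited to a single `s`/`m`/`h` unit, and treating a never-synced mirror as stale is a guess at reasonable behavior.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text):
    """Parse a simple duration such as '90s', '5m', or '2h' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def is_stale(last_sync, max_age, now=None):
    """Return True when the mirror never synced or is older than max_age."""
    if last_sync is None:
        return True  # assumption: no recorded sync counts as stale
    now = now or datetime.now(timezone.utc)
    return now - last_sync > max_age
```

With a window of `5m`, a mirror synced four minutes ago is served from cache, while one synced six minutes ago triggers a single sync first.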

### Auto-hydration via the gh shim

Symlink the gitcrawl binary as `gh` (or `gitcrawl-gh`) and let the shim pull a single PR's detail when an agent calls `gh pr view` or `gh pr checks` against an unhydrated PR. See [gh shim → auto-hydration](./gh-shim#auto-hydration).

This is the lowest-overhead pattern for fleets of agents: no scheduling is required.

### Periodic background refresh

Run `gitcrawl refresh owner/repo` on a cron, systemd timer, or `launchd` agent every few minutes per repo. Combine with the gh shim and your agents almost never have to wait on GitHub.

```cron
# Every 5 minutes, refresh the active repos.
# stderr goes to a separate log (illustrative path) so the JSON file stays parseable.
*/5 * * * * /usr/local/bin/gitcrawl refresh openclaw/gitcrawl --json > /tmp/gitcrawl.openclaw.json 2>> /tmp/gitcrawl.openclaw.log
```

For multiple repos, loop in a small shell script; gitcrawl is happy to run sequentially against a shared SQLite file.

## Agent recipes

### "Look up an issue without burning quota"

```bash
gh issue view 123 -R owner/repo --json number,title,state,body,labels,author
```

With the shim symlinked as `gh`, this answers from local SQLite if the issue is cached. Auto-hydration covers PR-detail fields. The agent prompt does not change.

### "Find candidates, hydrate them, summarize"

```bash
NUMS=$(gh search issues "checksum mismatch" -R owner/repo \
  --json number --limit 30 \
  | jq -r '[.[].number] | join(",")')

gitcrawl sync owner/repo --numbers "$NUMS" --include-comments --with pr-details

gitcrawl cluster-detail owner/repo --id "$(gitcrawl clusters owner/repo --json \
  | jq '.clusters[0].id')"
```

Search is local; the targeted sync brings exactly the rows you need; cluster-detail returns the structured triage view.

### "Find duplicates around a new bug report"

```bash
NUM=789
gitcrawl sync owner/repo --numbers "$NUM" --include-comments
gitcrawl embed owner/repo --number "$NUM"
gitcrawl neighbors owner/repo --number "$NUM" --limit 10 --json
```

### "Triage a cluster end to end"

```bash
ID=42

# Read.
gitcrawl cluster-detail owner/repo --id "$ID" --body-chars 600 --json

# Decide canonical, then close locally.
gitcrawl set-cluster-canonical owner/repo --id "$ID" --number 123
gitcrawl close-cluster owner/repo --id "$ID" --reason "consolidated under #123"

# Comment upstream via real gh.
gh issue comment 456 -R owner/repo --body "Duplicate of #123"
```

### "Prove the shim is paying off"

```bash
# Periodically log cache stats and watch local_hits climb relative to backend_misses.
gitcrawl gh xcache stats --json \
  | jq '{local: .counters.local_hits, fallback: .counters.fallback_hits, github: .counters.backend_misses}'
```
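
The three counters in the recipe above can be folded into one ratio for a dashboard or log line. A hedged Python sketch follows: `shim_summary` is a hypothetical helper, the counter names are taken from the `jq` filter above, and reading `fallback_hits` as "served from the fallthrough cache" and `backend_misses` as "reached GitHub" is an interpretation, not something the docs state outright.

```python
def shim_summary(counters):
    """Summarise how many reads stayed local versus reaching GitHub."""
    local = counters.get("local_hits", 0)
    fallback = counters.get("fallback_hits", 0)
    github = counters.get("backend_misses", 0)
    total = local + fallback + github
    # Share of reads answered without a live GitHub call (assumed meaning).
    share = (local + fallback) / total if total else 0.0
    return {"total": total, "served_without_github": round(share, 3)}
```

Feeding it the `.counters` object from `gitcrawl gh xcache stats --json` on a schedule gives a single number to trend over time.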

## Multi-repo automation

A single `gitcrawl.db` can hold many repositories. Loop in shell:

```bash
for repo in openclaw/gitcrawl steipete/repo-a octocat/repo-b; do
  # --arg passes the repo name to jq safely instead of splicing it into the filter.
  gitcrawl refresh "$repo" --json | jq --arg repo "$repo" '{repo: $repo, sync: .sync, embed: .embed}'
done
```

Or maintain a small script that reads a list of repos from a file and runs them on a schedule.
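
One way to structure such a script is sketched below in Python. The file format (one `owner/repo` per line, `#` comments allowed) and the helper names `read_repos` and `refresh_commands` are assumptions made for illustration; only the `gitcrawl refresh <repo> --json` invocation comes from this page.

```python
def read_repos(lines):
    """Yield owner/repo slugs, skipping blank lines and '#' comments."""
    for raw in lines:
        slug = raw.split("#", 1)[0].strip()
        if slug:
            yield slug


def refresh_commands(repos, binary="gitcrawl"):
    """Build one `refresh --json` argv per repository, in file order."""
    return [[binary, "refresh", repo, "--json"] for repo in repos]
```

A scheduler entry would open the list file, pass it to `read_repos`, and hand each argv to `subprocess.run` one at a time, which matches the sequential access pattern described above.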

## Output formats

| Format | When to use |
| --- | --- |
| `text` (default) | Humans at a terminal |
| `json` (or `--json`) | Pipelines, scripts, agents |
| `log` | Internal logging output; structured key/value pairs |

Force a format globally with `--format json` or per-command with `--json`. The `log` format is mostly used internally and is subject to change.

## CI integration

Run gitcrawl in CI to validate a portable store's freshness, sanity-check cluster shapes, or produce a triage report:

```yaml
- name: Refresh and snapshot clusters
  run: |
    gitcrawl init --portable-store "$PORTABLE_STORE_URL"
    gitcrawl refresh openclaw/gitcrawl --json > sync.json
    gitcrawl clusters openclaw/gitcrawl --json --sort size --min-size 5 > clusters.json
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- uses: actions/upload-artifact@v4
  with: { name: triage, path: "*.json" }
```

The artifact gives reviewers a structured view of what changed and how the cluster graph looks today.
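
A reviewer-facing summary can be derived from the `clusters.json` artifact with a short script. The Python sketch below assumes the payload has a top-level `clusters` array whose items carry `id`, `member_count`, and `latest_thread_number`, which are the fields used in the `jq` example near the top of this page; `summarize_clusters` itself is a hypothetical helper.

```python
def summarize_clusters(payload, top=5):
    """Return one-line summaries for the largest clusters in a payload."""
    clusters = sorted(
        payload.get("clusters", []),
        key=lambda c: c.get("member_count", 0),
        reverse=True,
    )
    return [
        f"cluster {c['id']}: {c.get('member_count', 0)} members, "
        f"latest #{c.get('latest_thread_number')}"
        for c in clusters[:top]
    ]
```

In a follow-up CI step, load the file with `json.load` and print the returned lines into the job summary.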

## Best practices

- **Set both tokens in a single place.** Use either environment variables or `[env]` in `config.toml`. Mixing sources tends to confuse `doctor` reports.
- **Bound the staleness window.** `--sync-if-stale` on every agent-driven search is cheaper than a hot cron loop.
- **Monitor `xcache stats`.** If `backend_misses` dwarfs `local_hits`, you are not yet getting the shim's benefit; this usually means agents are calling `gh` directly without going through the symlink.
- **Re-cluster after a backfill.** A large `--state all` sync should be followed by `gitcrawl refresh --no-sync` (or just `gitcrawl embed && gitcrawl cluster`) so the durable graph reflects the new content.
- **Pin the `gh` binary.** Set `GITCRAWL_GH_PATH` explicitly so the shim does not accidentally invoke itself.
160
docs/clustering.md
Normal file
@ -0,0 +1,160 @@
---
title: Clustering
nav_order: 9
permalink: /clustering/
---

# Clustering
{: .no_toc }

Group related issues and pull requests using vector similarity, hardened with deterministic GitHub reference evidence and cross-kind safeguards.
{: .fs-6 .fw-300 }

1. TOC
{:toc}

## How it works

Clustering builds a sparse nearest-neighbor graph over the local vector store. For each thread, gitcrawl picks the top `k` most similar threads (default 16). Edges below the cosine threshold (default 0.80) are dropped. The remaining graph is split into connected components capped at `--max-cluster-size` members.
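
The pipeline in the paragraph above (top-`k` neighbors, threshold, capped components) can be sketched in plain Python. This is an illustration, not gitcrawl's implementation: the brute-force neighbor search and the helper names are assumptions, and the cap is enforced here by refusing merges that would exceed it, since the docs do not say how gitcrawl applies `--max-cluster-size`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def knn_edges(vectors, k=16, threshold=0.80):
    """Return {(i, j): score} for each thread's top-k neighbors above threshold."""
    edges = {}
    ids = sorted(vectors)
    for i in ids:
        scored = sorted(
            ((cosine(vectors[i], vectors[j]), j) for j in ids if j != i),
            reverse=True,
        )
        for score, j in scored[:k]:
            if score >= threshold:
                edges[(min(i, j), max(i, j))] = score
    return edges


def components(ids, edges, max_size=40):
    """Union-find over edges, strongest first, refusing merges past max_size."""
    parent = {i: i for i in ids}
    size = {i: 1 for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), _ in sorted(edges.items(), key=lambda kv: kv[1], reverse=True):
        ra, rb = find(a), find(b)
        if ra != rb and size[ra] + size[rb] <= max_size:
            parent[rb] = ra
            size[ra] += size[rb]
    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Processing the strongest edges first means that, when the cap bites, the most similar threads are the ones that end up together.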

Two safeguards keep mega-clusters from forming:

- **Title-token overlap.** A weak embedding edge needs concrete shared title tokens (4+ char alphanumeric tokens) unless its similarity is already high (≥ 0.90) or there is direct GitHub reference evidence (`#123`, `pull/123`, `issues/123`).
- **Cross-kind pruning.** Edges connecting issues to pull requests need a higher floor (`--cross-kind-threshold`, default 0.93) than issue↔issue or PR↔PR edges.
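
The two safeguards above amount to a per-edge acceptance test. The Python sketch below is a hedged reading of them: the thread shape (`kind`, `title`), the function names, and the order of checks are assumptions. In particular, whether reference evidence can override the cross-kind floor is not stated, so this sketch applies the cross-kind floor first.

```python
import re

TOKEN = re.compile(r"[a-z0-9]{4,}")


def title_tokens(title):
    """Lowercased alphanumeric tokens of four or more characters."""
    return set(TOKEN.findall(title.lower()))


def keep_edge(a, b, score, has_reference=False,
              threshold=0.80, strong=0.90, cross_kind_threshold=0.93):
    """Decide whether a candidate edge between threads a and b survives."""
    if score < threshold:
        return False
    if a["kind"] != b["kind"] and score < cross_kind_threshold:
        return False  # issue<->PR edges need the higher floor
    if score >= strong or has_reference:
        return True  # strong similarity or reference evidence is enough
    # Weak edge: require at least one shared title token.
    return bool(title_tokens(a["title"]) & title_tokens(b["title"]))
```

For example, two issues scoring 0.82 are linked only if their titles share a token such as `checksum`, while the same pair at 0.91 is linked regardless.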

GitHub references found in titles or in the first ~240 characters of bodies generate **deterministic reference edges** with score 0.94. Body-only references later in the document are treated as weak evidence (need title-token overlap or other support). Single-digit numbers in prose are ignored as ambiguous; references must be at least two digits or use a fully qualified form.
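
Reference extraction as described above can be approximated with a regular expression. The sketch below is illustrative only: the pattern, the helper names, and the choice to treat `pull/N` and `issues/N` as the "fully qualified" forms (so they are accepted even with one digit) are assumptions about behavior the docs only summarize.

```python
import re

# `#123` needs two or more digits; `pull/N` and `issues/N` are taken as qualified.
REFERENCE = re.compile(r"(?:(?<![\w/])#(\d{2,})\b|\b(?:pull|issues)/(\d+)\b)")
LEAD_CHARS = 240       # "first ~240 characters" of the body
REFERENCE_SCORE = 0.94  # score given to deterministic reference edges


def _number(match):
    return int(match.group(1) or match.group(2))


def find_references(title, body):
    """Map referenced thread numbers to 'strong' or 'weak' evidence."""
    found = {}
    for match in REFERENCE.finditer(title):
        found[_number(match)] = "strong"
    for match in REFERENCE.finditer(body):
        strength = "strong" if match.start() < LEAD_CHARS else "weak"
        number = _number(match)
        if found.get(number) != "strong":
            found[number] = strength
    return found
```

Strong entries would become edges scored at `REFERENCE_SCORE`; weak entries would still need title-token overlap or other support before linking two threads.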

The result is written to two tables that survive across runs:

- `durable_clusters`: cluster rows with stable IDs derived from the representative thread
- `durable_cluster_members`: thread-to-cluster mappings with override metadata

## Generate clusters

```bash
gitcrawl cluster owner/repo
```

The defaults match ghcrawl's tuning so the output is comparable across tools:

| Flag | Default | Description |
| --- | --- | --- |
| `--threshold <float>` | `0.80` | Minimum cosine score for an edge |
| `--cross-kind-threshold <float>` | `0.93` | Minimum cosine score for issue↔PR edges |
| `--min-size <n>` | `1` | Minimum members per emitted cluster |
| `--max-cluster-size <n>` | `40` | Hard cap on cluster size |
| `--k <n>` | `16` | Nearest-neighbor fanout per thread |
| `--limit <n>` | _(no limit)_ | Maximum vector rows to consider |
| `--model <name>` | _(config)_ | Embedding model override |
| `--basis <name>` | _(config)_ | Embedding basis override |
| `--include-closed` | _(off)_ | Include closed threads |

Every active vector-backed thread is represented in the result: singleton clusters use `kind = singleton_orphan`, multi-member clusters use `kind = duplicate_candidate`.

## List clusters

```bash
gitcrawl clusters owner/repo
gitcrawl clusters owner/repo --sort size --min-size 5
gitcrawl clusters owner/repo --sort recent
gitcrawl clusters owner/repo --hide-closed
```

| Flag | Default | Description |
| --- | --- | --- |
| `--sort recent\|oldest\|size` | `size` | Ordering |
| `--min-size <n>` | _(none)_ | Minimum active member count |
| `--limit <n>` | _(no limit)_ | Maximum cluster rows |
| `--hide-closed` | _(off)_ | Hide locally closed clusters |
| `--include-closed` | _(deprecated)_ | Closed clusters are included by default |

`gitcrawl clusters` shows the latest raw run's clusters first and merges closed durable rows in as historical context. For a strict durable-only audit view (no merging with the latest run), use:

```bash
gitcrawl durable-clusters owner/repo --include-closed
```

GitHub-closed members are hidden from latest-run cluster summaries by default; pass `--include-closed` to see the full historical view.

## Inspect a cluster

```bash
gitcrawl cluster-detail owner/repo --id 123
gitcrawl cluster-explain owner/repo --id 123  # alias
```

| Flag | Default | Description |
| --- | --- | --- |
| `--id <n>` | _(required)_ | Cluster ID |
| `--member-limit <n>` | _(no limit)_ | Maximum members to return |
| `--body-chars <n>` | `280` | Body snippet length per member |
| `--include-closed` | _(off)_ | Include closed members |

`cluster-explain` is the same command; it exists so the verb reads naturally in agent prompts ("explain why these things ended up together").

## Find similar threads (neighbors)

```bash
gitcrawl neighbors owner/repo --number 123 --limit 10
```

| Flag | Default | Description |
| --- | --- | --- |
| `--number <n>` | _(required)_ | Source issue/PR |
| `--limit <n>` | `10` | Maximum neighbors |
| `--threshold <float>` | `0.2` | Minimum cosine score |

Useful for "what else looks like this?" without committing to a cluster. The TUI's `n` shortcut and "Enter on a member" both call this path.
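
Conceptually, a neighbor lookup is a ranked cosine scan with a floor and a limit. The Python sketch below shows that idea with the documented defaults (`--limit 10`, `--threshold 0.2`); the in-memory `vectors` mapping and the function names are assumptions for illustration, not gitcrawl's storage layout or API.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def neighbors(vectors, number, limit=10, threshold=0.2):
    """Return [(other_number, score)] sorted by descending cosine score."""
    source = vectors[number]
    scored = [
        (other, cosine(source, vec))
        for other, vec in vectors.items()
        if other != number
    ]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:limit]
```

Because the floor is much lower than the clustering threshold, this view surfaces loosely related threads that would never be merged into a cluster.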
|
||||
|
||||
## Tuning recipes
|
||||
|
||||
### My clusters are too greedy
|
||||
|
||||
Symptom: unrelated bug reports merged together.
|
||||
|
||||
```bash
|
||||
gitcrawl cluster owner/repo --threshold 0.85 --cross-kind-threshold 0.95
|
||||
```
|
||||
|
||||
Tighter thresholds drop more weak edges. The `--cross-kind-threshold` raise specifically helps when an issue and a PR keep getting glued together because of shared boilerplate.
|
||||
|
||||
### My clusters are too sparse
|
||||
|
||||
Symptom: clear duplicates landing in separate clusters.
|
||||
|
||||
```bash
|
||||
gitcrawl cluster owner/repo --threshold 0.75 --k 24
|
||||
```
|
||||
|
||||
Lower threshold + higher fanout. Watch for false merges via `cluster-detail`.
|
||||
|
||||
### Make a single big cluster smaller
|
||||
|
||||
Symptom: one cluster has 40 members and is incoherent.
|
||||
|
||||
```bash
|
||||
gitcrawl cluster owner/repo --max-cluster-size 20
|
||||
```
|
||||
|
||||
Or slice it manually:
|
||||
|
||||
```bash
|
||||
gitcrawl exclude-cluster-member owner/repo --id 12 --number 456 --reason "different repro"
|
||||
```
|
||||
|
||||
See [Governance](./governance) for the full override workflow.
|
||||
|
||||
## Re-clustering and stable IDs
|
||||
|
||||
Durable cluster IDs are derived from the representative thread, so they survive re-runs of `cluster` and `refresh`. This means:
|
||||
|
||||
- Local closes (`close-cluster`), exclusions, and canonical member overrides persist across re-clustering
|
||||
- You can safely re-cluster after every refresh without losing maintainer state
|
||||
|
||||
Cluster runs are recorded in `run_records` and visible via `gitcrawl runs owner/repo --kind cluster`.
|
||||
|
||||
## See also
|
||||
|
||||
- [Governance](./governance) — close clusters, exclude members, set canonical
|
||||
- [TUI](./tui) — the interactive cluster browser
|
||||
- [Concepts](./concepts#cluster) — durable clusters and cluster kinds
|
||||
120
docs/commands.md
Normal file
@ -0,0 +1,120 @@
|
||||
---
|
||||
title: Commands reference
|
||||
nav_order: 15
|
||||
permalink: /commands/
|
||||
---
|
||||
|
||||
# Commands reference
|
||||
{: .no_toc }
|
||||
|
||||
Complete CLI surface, one row per command. Use as a lookup table; deep documentation lives in the feature pages.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Global flags
|
||||
|
||||
These work on every command.
|
||||
|
||||
| Flag | Default | Description |
|
||||
| --- | --- | --- |
|
||||
| `--config <path>` | `$GITCRAWL_CONFIG` or default | Override config path |
|
||||
| `--format text\|json\|log` | `text` | Output format |
|
||||
| `--json` | _(off)_ | Shorthand for `--format json` |
|
||||
| `--no-color` | _(off)_ | Suppress ANSI color |
|
||||
| `--version` | _(off)_ | Print version and exit (global only) |
|
||||
| `--help` / `-h` | — | Print usage |
|
||||
|
||||
## Setup
|
||||
|
||||
| Command | Purpose | Detailed docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl init [--db --portable-store --portable-db --store-dir --json]` | Create config, database, runtime directories; optionally clone a portable store | [Installation](./installation), [Portable stores](./portable-stores) |
|
||||
| `gitcrawl doctor [--json]` | Health check for config, database, credentials, model selection, repo/thread counts | [Configuration](./configuration#gitcrawl-doctor) |
|
||||
| `gitcrawl configure [--summary-model --embed-model --embedding-basis --json]` | Update model fields in `config.toml` | [Configuration](./configuration#gitcrawl-configure) |
|
||||
| `gitcrawl version` | Print version | — |
|
||||
|
||||
## Sync
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl sync owner/repo [--state --since --numbers --limit --include-comments --include-pr-details --with pr-details --json]` | Sync issues and PRs from GitHub into local SQLite | [Sync](./sync) |
|
||||
| `gitcrawl refresh owner/repo [--no-sync --no-embed --no-cluster ...]` | Wrapper that runs sync → embed → cluster | [Refresh and embed](./refresh-and-embed) |
|
||||
| `gitcrawl embed owner/repo [--number --limit --force --include-closed --json]` | Generate OpenAI embeddings for thread documents | [Refresh and embed](./refresh-and-embed#embed) |
|
||||
| `gitcrawl runs owner/repo [--kind sync\|embedding\|cluster --limit --json]` | List recorded run history | [Refresh and embed](./refresh-and-embed#runs) |
|
||||
|
||||
## Inspect
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl threads owner/repo [--include-closed --numbers --limit --json]` | List threads from local cache | — |
|
||||
| `gitcrawl search owner/repo --query <text> [--mode keyword\|semantic\|hybrid --limit --json]` | Local search (direct mode) | [Search](./search) |
|
||||
| `gitcrawl search issues\|prs <query> -R owner/repo [--state --json --limit --sync-if-stale]` | Local search (`gh search` shape) | [Search](./search#gh-search-compatibility-mode) |
|
||||
| `gitcrawl neighbors owner/repo --number <n> [--limit --threshold --json]` | Vector-similar threads to a specific issue/PR | [Clustering](./clustering#find-similar-threads-neighbors) |
|
||||
|
||||
## Cluster
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl cluster owner/repo [--threshold --min-size --max-cluster-size --k --cross-kind-threshold --limit --model --basis --include-closed --json]` | Build durable clusters from vectors | [Clustering](./clustering#generate-clusters) |
|
||||
| `gitcrawl clusters owner/repo [--sort size\|recent\|oldest --min-size --limit --hide-closed --json]` | Latest-run cluster summary, merged with closed durable rows | [Clustering](./clustering#list-clusters) |
|
||||
| `gitcrawl durable-clusters owner/repo [--include-closed --sort --min-size --limit --json]` | Strict durable-cluster audit view | [Clustering](./clustering#list-clusters) |
|
||||
| `gitcrawl cluster-detail owner/repo --id <n> [--member-limit --body-chars --include-closed --json]` | Cluster + members detail | [Clustering](./clustering#inspect-a-cluster) |
|
||||
| `gitcrawl cluster-explain owner/repo --id <n> [...]` | Alias for `cluster-detail` | [Clustering](./clustering#inspect-a-cluster) |
|
||||
|
||||
## Governance
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl close-thread owner/repo --number <n> [--reason --json]` | Local close on a thread | [Governance](./governance#local-close) |
|
||||
| `gitcrawl reopen-thread owner/repo --number <n> [--json]` | Inverse | — |
|
||||
| `gitcrawl close-cluster owner/repo --id <n> [--reason --json]` | Local close on a cluster | [Governance](./governance#local-close) |
|
||||
| `gitcrawl reopen-cluster owner/repo --id <n> [--json]` | Inverse | — |
|
||||
| `gitcrawl exclude-cluster-member owner/repo --id <n> --number <m> [--reason --json]` | Pull a thread out of a cluster | [Governance](./governance#member-exclusion) |
|
||||
| `gitcrawl include-cluster-member owner/repo --id <n> --number <m> [--reason --json]` | Inverse | — |
|
||||
| `gitcrawl set-cluster-canonical owner/repo --id <n> --number <m> [--reason --json]` | Pin canonical thread for a cluster | [Governance](./governance#canonical-member) |
|
||||
|
||||
## TUI
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl tui [owner/repo] [--min-size --sort --limit --hide-closed --json]` | Interactive cluster browser; `--json` emits a snapshot instead of launching the UI | [TUI](./tui) |
|
||||
|
||||
## gh shim
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl gh search issues\|prs <query> -R owner/repo [...]` | Local-first `gh search` | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh issue view <n> -R owner/repo --json <fields>` | Local-first thread view | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh pr view <n> -R owner/repo --json <fields>` | Same, for PRs (with auto-hydration) | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh issue list -R owner/repo [--state --search --author --assignee --label --json]` | Local-first list | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh pr list -R owner/repo [...]` | Same, for PRs | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh pr checks <n> -R owner/repo --json <fields>` | Cached PR checks (auto-hydrates if stale) | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh pr diff <n> -R owner/repo` | Falls through; cached by head SHA | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh run list -R owner/repo [--branch --commit --json]` | Cached workflow runs | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh run view <run-id> -R owner/repo [--json]` | Same, single run | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh repo view\|list ...` | Falls through; cached briefly | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh release list\|view ...` | Falls through; cached briefly | [gh shim](./gh-shim#read-only-fallthroughs-cached) |
|
||||
| `gitcrawl gh workflow list\|view ...` | Falls through; cached briefly | [gh shim](./gh-shim#read-only-fallthroughs-cached) |
|
||||
| `gitcrawl gh secret list ...` / `variable get\|list ...` | Falls through; cached briefly | [gh shim](./gh-shim#read-only-fallthroughs-cached) |
|
||||
| `gitcrawl gh label list ...` | Falls through; cached briefly | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh api <GET path>` | Falls through; cached briefly (GET-only) | [gh shim](./gh-shim) |
|
||||
| `gitcrawl gh xcache stats\|keys\|gc\|flush [--json]` | Cache inspection / housekeeping | [gh shim](./gh-shim#cache-inspection-xcache) |
|
||||
| _Anything else_ | Falls through to real `gh` | [gh shim](./gh-shim) |
|
||||
|
||||
The shim binary can be installed standalone by symlinking the `gitcrawl` binary as `gh` or `gitcrawl-gh`.
|
||||
|
||||
## Portable stores
|
||||
|
||||
| Command | Purpose | Docs |
|
||||
| --- | --- | --- |
|
||||
| `gitcrawl portable prune [--body-chars --no-vacuum --json]` | Truncate thread bodies and (optionally) `VACUUM` for a small publishable database | [Portable stores](./portable-stores#publishing-gitcrawl-portable-prune) |
|
||||
|
||||
## Not yet implemented
|
||||
|
||||
These appear in `SPEC.md` but currently return a "not implemented" error. They are reserved for future versions:
|
||||
|
||||
`summarize`, `key-summaries`, `merge-clusters`, `split-cluster`, `export-sync`, `import-sync`, `validate-sync`, `portable-size`, `sync-status`, `optimize`, `completion`
|
||||
|
||||
If you need any of these to land sooner, [open an issue](https://github.com/openclaw/gitcrawl/issues).
|
||||
100
docs/concepts.md
Normal file
@ -0,0 +1,100 @@
|
||||
---
|
||||
title: Concepts
|
||||
nav_order: 4
|
||||
permalink: /concepts/
|
||||
---
|
||||
|
||||
# Concepts
|
||||
{: .no_toc }
|
||||
|
||||
The handful of nouns gitcrawl uses, and how they connect.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Repository mirror
|
||||
|
||||
A **repository** is the `owner/repo` you sync. Most gitcrawl commands take one, and most state in SQLite is keyed by it. You can mirror as many repos as you like into a single `gitcrawl.db`; commands always scope to the one you name.
|
||||
|
||||
The mirror is metadata-first: titles, bodies, authors, labels, state, timestamps, and IDs land in SQLite immediately. Comments, reviews, review comments, and full PR detail (files, commits, checks, workflow runs) are opt-in on a per-sync basis (see [Sync](./sync)).
|
||||
|
||||
## Thread
|
||||
|
||||
A **thread** is a single GitHub issue or pull request, with its body and metadata. The CLI exposes threads via `gitcrawl threads` and via the `gh` shim's `gh issue/pr view` and `gh issue/pr list` paths.
|
||||
|
||||
Threads have two state dimensions:
|
||||
|
||||
- **GitHub state** — `open` or `closed` upstream.
|
||||
- **Local close** — a maintainer-only override stored locally. `gitcrawl close-thread` and `reopen-thread` flip this without touching GitHub. Local closes drive the `--hide-closed` and `--include-closed` filters across `clusters`, `cluster-detail`, the TUI, and search.
|
||||
|
||||
Local close is for triage workflow: "I have handled this duplicate locally, I do not need it shown next time." It does not write back to GitHub.
|
||||
|
||||
## Document
|
||||
|
||||
A **document** is the canonical text gitcrawl indexes for a thread — title plus body, with comments folded in when present. Documents back the FTS index used by `gitcrawl search` and feed the embedding pipeline.
|
||||
|
||||
Most users never interact with documents directly; they show up in JSON output as a `document` field on neighbors and search hits.
|
||||
|
||||
## Embedding
|
||||
|
||||
An **embedding** is a vector representation of a thread's document, produced by an OpenAI model (default `text-embedding-3-small`, 1024 dimensions). Vectors live in `~/.config/gitcrawl/vectors` and are referenced from the `thread_vectors` table.
|
||||
|
||||
The **embedding basis** controls what text gets embedded. The default `title_original` uses title plus an excerpt of the original body. This is configurable via `gitcrawl configure --embedding-basis ...` but only `title_original` is currently implemented.
|
||||
|
||||
`gitcrawl embed` is the explicit command that fills the vector table. `gitcrawl refresh` runs it automatically as part of its sync → embed → cluster pipeline.
|
||||
|
||||
When the embedding input rune cap or model changes, vectors are rebuilt to avoid stale comparisons.
|
||||
|
||||
## Cluster
|
||||
|
||||
A **cluster** is a group of related threads inferred from vector similarity, with deterministic GitHub reference evidence (`#123`, `pull/123`, `issues/123`) folded in to harden weak edges.
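
As a rough illustration of what "deterministic GitHub reference evidence" can look like, the sketch below pulls thread numbers out of text using the three shapes named above. The regular expression and the function name are assumptions for this example; the exact patterns gitcrawl matches are not documented here.

```python
import re

# Matches "#123", "pull/123", and "issues/123" (illustrative pattern, not gitcrawl's own).
REFERENCE_RE = re.compile(r"(?:#|\bpull/|\bissues/)(\d+)\b")


def referenced_numbers(text):
    """Return the set of thread numbers mentioned in `text`."""
    return {int(match.group(1)) for match in REFERENCE_RE.finditer(text)}
```

Because the match is purely textual, the same input always yields the same references, which is what makes this kind of evidence deterministic compared with an embedding score.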
|
||||
|
||||
Clustering is run by `gitcrawl cluster` (or as part of `gitcrawl refresh`). Defaults are tuned to ghcrawl's profile: `--threshold 0.80`, `--min-size 1`, `--max-cluster-size 40`, `--k 16` nearest-neighbor fanout, `--cross-kind-threshold 0.93` for issue↔PR edges.
|
||||
|
||||
Two safeguards keep mega-clusters from forming:
|
||||
|
||||
- **Title-token overlap.** A weak embedding edge needs concrete shared title tokens unless its similarity is already high or there is direct GitHub reference evidence.
|
||||
- **Cross-kind pruning.** Issue↔PR edges need a higher similarity floor (`--cross-kind-threshold`) than issue↔issue or PR↔PR.
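
The two safeguards can be read as a single edge-acceptance rule. The Python sketch below is one possible reading under stated assumptions: the `strong_score` cutoff for "already high" similarity is hypothetical, the function name is invented for the example, and the real ordering of checks inside gitcrawl may differ.

```python
def keep_edge(score, kind_a, kind_b, shared_title_tokens, has_reference,
              threshold=0.80, cross_kind_threshold=0.93, strong_score=0.90):
    """Decide whether a candidate edge between two threads survives pruning.

    `strong_score` is a hypothetical cutoff; the documented defaults are only
    `threshold` and `cross_kind_threshold`.
    """
    # Issue<->PR edges use the stricter cross-kind floor.
    floor = cross_kind_threshold if kind_a != kind_b else threshold
    if score < floor:
        return False
    # Strong similarity or a direct GitHub reference needs no further evidence.
    if score >= strong_score or has_reference:
        return True
    # Otherwise a weak edge must share at least one concrete title token.
    return shared_title_tokens > 0
```

In this reading, an issue and a PR scoring 0.85 are never linked at the default settings, while two issues scoring 0.85 are linked only if their titles overlap or one references the other.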
|
||||
|
||||
### Cluster kinds
|
||||
|
||||
Every cluster ships with a kind that explains its shape:
|
||||
|
||||
- `singleton_orphan` — one member, no neighbors above threshold. Useful for surfacing isolated reports.
|
||||
- `duplicate_candidate` — multiple members above the merge threshold. The default duplicate triage row.
|
||||
|
||||
### Durable clusters
|
||||
|
||||
A **durable cluster** is a stable, long-lived row in `durable_clusters` with a stable ID derived from its representative thread. Durable cluster IDs survive re-runs of `cluster` and `refresh`, so the local close, exclusion, and canonical-member overrides you apply persist across re-clustering.
|
||||
|
||||
`gitcrawl clusters` and `gitcrawl tui` show the latest raw run's clusters first, with closed durable rows merged in as historical context. Use `gitcrawl durable-clusters` for an audit view that stays on the durable rows.
|
||||
|
||||
### Cluster overrides (governance)
|
||||
|
||||
Per-cluster maintainer overrides let you correct what the algorithm produced without re-tuning thresholds:
|
||||
|
||||
- **Local close** (`close-cluster`/`reopen-cluster`) — hides a duplicate-candidate from active triage.
|
||||
- **Member exclusion** (`exclude-cluster-member`/`include-cluster-member`) — pulls a specific thread out of a cluster and remembers why.
|
||||
- **Canonical member** (`set-cluster-canonical`) — pins which thread represents the cluster.
|
||||
|
||||
See [Governance](./governance) for the full workflow.
|
||||
|
||||
## Run
|
||||
|
||||
Every sync, embed, and cluster operation records a **run** in `run_records` with start/finish timestamps, status, and stage-specific stats. `gitcrawl runs owner/repo --kind sync|embedding|cluster` lists them, which is useful for debugging or auditing.
|
||||
|
||||
## Portable store
|
||||
|
||||
A **portable store** is a Git-backed publish target for a `gitcrawl.db` plus its derived bodies, designed for sharing a local cache across agents or machines without a hosted service.
|
||||
|
||||
`gitcrawl init --portable-store https://github.com/org/repo` clones a portable store into `~/.config/gitcrawl/portable/` and points the runtime at it. On the publishing side, `gitcrawl portable prune --body-chars 256` keeps the published payload small. Read-only commands run against a portable store refresh the checkout before reading. See [Portable stores](./portable-stores).
|
||||
|
||||
## Cache
|
||||
|
||||
The `cache/` directory under `~/.config/gitcrawl/` holds:
|
||||
|
||||
- `cache/gh-shim/` — the short-lived fallthrough cache for the `gh` shim, keyed by config path, CWD, `GH_HOST`, `GH_REPO`, and command args. Inspect or clean it with `gitcrawl gh xcache stats|keys|gc|flush`.
|
||||
- `cache/pr/` — hydrated PR detail blobs used to answer `gh pr view`, `gh pr checks`, and `gh run` reads from local SQLite.
|
||||
|
||||
See [gh shim](./gh-shim) for the cache key composition and TTL behavior.
|
||||
144
docs/configuration.md
Normal file
@ -0,0 +1,144 @@
|
||||
---
|
||||
title: Configuration
|
||||
nav_order: 5
|
||||
permalink: /configuration/
|
||||
---
|
||||
|
||||
# Configuration
|
||||
{: .no_toc }
|
||||
|
||||
Where gitcrawl reads settings from, and how to override them.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Resolution order
|
||||
|
||||
For each setting, gitcrawl looks in this order and uses the first match:
|
||||
|
||||
1. CLI flag (e.g., `--config`, `--summary-model`)
|
||||
2. Environment variable (`GITCRAWL_*`, then standard `GITHUB_TOKEN` / `OPENAI_API_KEY`)
|
||||
3. `[env]` table inside `config.toml`
|
||||
4. Top-level config field inside `config.toml`
|
||||
5. Built-in default
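
The lookup order above can be sketched as a first-match function. This is a minimal illustration, not gitcrawl's code: the function signature is invented for the example, and treating empty strings as "unset" is an assumption.

```python
def resolve_setting(flag_value, env_names, config_key, environ, config, default=None):
    """Return the first value found, following the documented lookup order.

    `env_names` lists candidate variable names with GITCRAWL_* names first;
    `config` is the parsed config.toml as a dict.
    """
    # 1. CLI flag
    if flag_value is not None:
        return flag_value
    # 2. Environment variables
    for name in env_names:
        if environ.get(name):
            return environ[name]
    # 3. [env] table inside config.toml
    env_table = config.get("env", {})
    for name in env_names:
        if env_table.get(name):
            return env_table[name]
    # 4. Top-level config field
    if config.get(config_key) is not None:
        return config[config_key]
    # 5. Built-in default
    return default
```

For example, with `embed_model` set in `config.toml` and `GITCRAWL_EMBED_MODEL` exported in the shell, the environment value wins unless a CLI flag is also passed.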
|
||||
|
||||
## Default paths
|
||||
|
||||
| Path | Purpose |
|
||||
| --- | --- |
|
||||
| `~/.config/gitcrawl/config.toml` | Configuration file |
|
||||
| `~/.config/gitcrawl/gitcrawl.db` | SQLite database |
|
||||
| `~/.config/gitcrawl/cache/` | Caches (PR detail, gh shim fallthrough) |
|
||||
| `~/.config/gitcrawl/cache/gh-shim/` | gh shim fallthrough cache |
|
||||
| `~/.config/gitcrawl/vectors/` | Vector store backing embeddings |
|
||||
| `~/.config/gitcrawl/logs/` | Operational logs |
|
||||
|
||||
Override the config path by setting `GITCRAWL_CONFIG=/path/to/config.toml` or by passing `--config` to any command.
|
||||
|
||||
## `config.toml`
|
||||
|
||||
`gitcrawl init` writes a minimal config. You can edit it by hand or with `gitcrawl configure`:
|
||||
|
||||
```toml
|
||||
summary_model = "gpt-5.4"
|
||||
embed_model = "text-embedding-3-small"
|
||||
embed_dimensions = 1024
|
||||
embedding_basis = "title_original"
|
||||
|
||||
[env]
|
||||
GITHUB_TOKEN = "ghp_xxx"
|
||||
OPENAI_API_KEY = "sk-xxx"
|
||||
|
||||
[portable_store]
|
||||
url = "https://github.com/org/portable-store.git"
|
||||
db_path = "data/openclaw__openclaw.sync.db"
|
||||
checkout_dir = "/Users/me/.config/gitcrawl/portable"
|
||||
```
|
||||
|
||||
### Notable fields
|
||||
|
||||
| Field | Default | Notes |
|
||||
| --- | --- | --- |
|
||||
| `summary_model` | `gpt-5.4` | Reserved for future summary commands |
|
||||
| `embed_model` | `text-embedding-3-small` | OpenAI embedding model |
|
||||
| `embed_dimensions` | `1024` | Must match the model |
|
||||
| `embedding_basis` | `title_original` | Only `title_original` is implemented |
|
||||
| `[env]` | _(empty)_ | Sets process env at startup; useful for tokens you do not want in your shell rc |
|
||||
| `[portable_store]` | _(empty)_ | Used when working from a shared, Git-backed cache |
|
||||
|
||||
## Environment variables
|
||||
|
||||
### Core
|
||||
|
||||
| Variable | Purpose |
|
||||
| --- | --- |
|
||||
| `GITCRAWL_CONFIG` | Override config path |
|
||||
| `GITCRAWL_DB_PATH` | Override database path |
|
||||
| `GITHUB_TOKEN` | GitHub API token (required for `sync`, `gh` shim fallthroughs) |
|
||||
| `OPENAI_API_KEY` | OpenAI API key (required for `embed`) |
|
||||
|
||||
### Model overrides
|
||||
|
||||
| Variable | Purpose |
|
||||
| --- | --- |
|
||||
| `GITCRAWL_SUMMARY_MODEL` | Override summary model |
|
||||
| `GITCRAWL_EMBED_MODEL` | Override embedding model |
|
||||
| `GITCRAWL_OPENAI_RETRY_DISABLED` | Set to `1` to disable OpenAI retry/backoff |
|
||||
| `GITCRAWL_OPENAI_BASE_URL` / `OPENAI_BASE_URL` | Custom OpenAI endpoint (e.g., for a proxy) |
|
||||
|
||||
### GitHub overrides
|
||||
|
||||
| Variable | Purpose |
|
||||
| --- | --- |
|
||||
| `GITCRAWL_GITHUB_BASE_URL` / `GITHUB_BASE_URL` | Custom GitHub API endpoint (used by `sync` and the `gh` shim) |
|
||||
| `GH_HOST` | GitHub host; included in the `gh` shim cache key |
|
||||
| `GH_REPO` | Default repo when `-R` is omitted; included in the `gh` shim cache key |
|
||||
|
||||
### gh shim
|
||||
|
||||
| Variable | Purpose |
|
||||
| --- | --- |
|
||||
| `GITCRAWL_GH_PATH` | Path to the real `gh` binary used for fallthrough |
|
||||
| `GITCRAWL_GH_AUTO_HYDRATE` | Set to `0` to disable PR auto-hydration on cache miss |
|
||||
| `GITCRAWL_GH_CACHE_TTL` | Override fallthrough cache TTL (e.g., `5m`, `1h`) |
|
||||
|
||||
If `GITCRAWL_GH_PATH` is unset, the shim probes common Homebrew install paths and then your `PATH`. Set it explicitly when you symlink the gitcrawl binary as `gh` (otherwise the shim will recurse into itself).
|
||||
|
||||
## Global flags
|
||||
|
||||
These flags work on every command:
|
||||
|
||||
| Flag | Default | Description |
|
||||
| --- | --- | --- |
|
||||
| `--config <path>` | `$GITCRAWL_CONFIG` or default | Override config path for this invocation |
|
||||
| `--format text\|json\|log` | `text` | Output format |
|
||||
| `--json` | _(off)_ | Shorthand for `--format json` |
|
||||
| `--no-color` | _(off)_ | Suppress ANSI color codes |
|
||||
| `--version` | _(off)_ | Print version and exit (global only) |
|
||||
|
||||
`--json` overrides `--format`. Both are honored on subcommands that produce output.
|
||||
|
||||
## `gitcrawl configure`
|
||||
|
||||
Edit config fields from the command line without opening the file:
|
||||
|
||||
```bash
|
||||
gitcrawl configure --summary-model gpt-5.4
|
||||
gitcrawl configure --embed-model text-embedding-3-small
|
||||
gitcrawl configure --embedding-basis title_original
|
||||
gitcrawl configure --json
|
||||
```
|
||||
|
||||
Returns the resolved config path, the values that were updated, and the now-current model selection. See `gitcrawl configure --help`.
|
||||
|
||||
## `gitcrawl doctor`
|
||||
|
||||
A health check for everything covered above:
|
||||
|
||||
```bash
|
||||
gitcrawl doctor # human-readable
|
||||
gitcrawl doctor --json # for scripts
|
||||
```
|
||||
|
||||
Reports the config path and whether it exists, the database path, whether `GITHUB_TOKEN` and `OPENAI_API_KEY` are present (and whether each came from the environment or from config), the active summary and embedding models, the embedding basis, counts of repositories, threads, open threads, and clusters, and the last sync timestamp. If the API call surface is unsupported (older Go, missing crypto), `api_supported: false` is reported so you can investigate.
|
||||
175
docs/gh-shim.md
Normal file
@ -0,0 +1,175 @@
|
||||
---
|
||||
title: gh shim
|
||||
nav_order: 12
|
||||
permalink: /gh-shim/
|
||||
---
|
||||
|
||||
# gh shim
|
||||
{: .no_toc }
|
||||
|
||||
A `gh`-compatible binary that answers from local SQLite first and falls through to the real `gh` for everything else. The fastest way to cut GitHub API load across an agent fleet.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## What it is
|
||||
|
||||
The same `gitcrawl` binary serves a `gh`-compatible mode. Invoked as `gitcrawl gh ...`, or as `gh` / `gitcrawl-gh` via symlink, it intercepts read-only commands and serves them from the local mirror. Anything it cannot serve locally falls through to the real `gh` binary you already have installed, with a short persistent cache layered on top.
|
||||
|
||||
The shim never adds GitHub write behavior. Mutating commands (`gh issue close`, `gh pr merge`, `gh api -X POST ...`, `gh label create`, etc.) pass straight through to the real `gh`, increment a write counter, and clear the relevant cache entries on success.
|
||||
|
||||
## Install
|
||||
|
||||
```bash
|
||||
# Side-by-side: agents opt in by calling `gitcrawl-gh`.
|
||||
ln -s "$(command -v gitcrawl)" /usr/local/bin/gitcrawl-gh
|
||||
|
||||
# Or replace the global `gh` so every caller picks up the cache automatically.
|
||||
ln -s "$(command -v gitcrawl)" /usr/local/bin/gh
|
||||
export GITCRAWL_GH_PATH=/opt/homebrew/bin/gh # tell the shim where the real gh is
|
||||
```
|
||||
|
||||
If `GITCRAWL_GH_PATH` is unset, the shim probes common Homebrew paths and then `PATH`. Set it explicitly when you replace the global `gh` so the shim does not recurse into itself.
|
||||
|
||||
## Supported local reads
|
||||
|
||||
### `gh search issues|prs`
|
||||
|
||||
```bash
|
||||
gh search issues "download stalls" -R owner/repo --state open \
|
||||
--match comments --json number,title,url
|
||||
gh search prs "manifest cache" -R owner/repo --state open \
|
||||
--json number,title,url --limit 20
|
||||
```
|
||||
|
||||
Answered from the local FTS index. Honors `--state`, `--json`, `--limit`. `--match` is accepted for parity (the local index already covers documents). Falls through if an unsupported filter combination is requested.
|
||||
|
||||
### `gh issue view` / `gh pr view`
|
||||
|
||||
```bash
|
||||
gh issue view 123 -R owner/repo --json number,title,state,url,body,labels,author
|
||||
gh pr view 123 -R owner/repo --json number,title,state,url,isDraft,author,headRef,baseRef
|
||||
```
|
||||
|
||||
Supported JSON fields include `number`, `title`, `state`, `url`, `body`, `author`, `createdAt`, `updatedAt`, `closedAt`, `labels`, plus PR-specific `isDraft`, `headRef`, `baseRef`. PR detail fields (`files`, `commits`, `checks`, `statusCheckRollup`) are answered from cached PR detail and trigger [auto-hydration](#auto-hydration) on miss.
|
||||
|
||||
### `gh issue list` / `gh pr list`
|
||||
|
||||
```bash
|
||||
gh issue list -R owner/repo --state open --search "hot loop" \
|
||||
--author octocat --label bug --label triage --json number,title,url
|
||||
gh pr list -R owner/repo --state open --search "manifest cache" \
|
||||
--assignee me --json number,title,url
|
||||
```
|
||||
|
||||
Supports `--state`, `--search` (keyword search), `--author`, `--assignee`, repeated `--label`, `--limit`, and `--json`. Falls through for unsupported filters.
|
||||
|
||||
### `gh pr checks`
|
||||
|
||||
```bash
|
||||
gh pr checks 123 -R owner/repo --json name,state,conclusion,detailsUrl
|
||||
```
|
||||
|
||||
Returns the cached check/status summary for the PR. If the cached PR detail is older than 90 seconds or its head SHA is stale, [auto-hydration](#auto-hydration) refreshes it before answering. Supported fields: `name`, `state`, `status`, `conclusion`, `detailsUrl`, `workflow`, `startedAt`, `completedAt`.
|
||||
|
||||
### `gh run list` / `gh run view`
|
||||
|
||||
```bash
|
||||
gh run list -R owner/repo --branch main --limit 20 \
|
||||
--json databaseId,workflowName,status,conclusion
|
||||
gh run view 123456789 -R owner/repo --json status,conclusion,headSha
|
||||
```
|
||||
|
||||
Workflow runs come from cached PR detail. Filters: `--branch`, `--commit` (head SHA). Supported fields: `databaseId`, `id`, `number`, `workflowName`, `name`, `displayTitle`, `status`, `conclusion`, `url`, `event`, `headBranch`, `headSha`, `createdAt`, `updatedAt`.
|
||||
|
||||
## Read-only fallthroughs (cached)
|
||||
|
||||
These commands are always answered by the real `gh`, but the response body is cached for the next caller in the same workspace:
|
||||
|
||||
- `gh pr diff` — keyed by the cached PR head SHA when available, so the cache is stable across many sequential agent reads
|
||||
- `gh issue list/status/view`, `gh pr list/status/view/checks`, and unsupported read-only local shim shapes
|
||||
- `gh release list/view`, `gh workflow list/view`, `gh secret list`, and `gh variable get/list`
|
||||
- `gh project list/view/field-list/item-list`, `gh ruleset check/list/view`, `gh gist list/view`, and `gh org list`
|
||||
- `gh repo view` / `gh repo list`
|
||||
- `gh search code/commits/issues/prs/repos`
|
||||
- `gh label list`
|
||||
- `gh api <GET path>` — only `GET` requests; never cached for `POST`/`PATCH`/`DELETE`/`PUT`
|
||||
|
||||
Default cache TTLs depend on the command: 30 seconds for most reads, 60 seconds for `gh api`, 5 minutes for `gh pr diff` without a stable SHA, and 7 days for `gh pr diff` with a stable SHA. Override with `GITCRAWL_GH_CACHE_TTL`, for example `GITCRAWL_GH_CACHE_TTL=5m`.
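
The default TTLs can be summarized as a small lookup. The sketch below is illustrative only: the function name and the `stable_sha` flag are assumptions, and it deliberately ignores the `GITCRAWL_GH_CACHE_TTL` override.

```python
from datetime import timedelta


def default_ttl(args, stable_sha=False):
    """Default fallthrough TTL for a gh argument vector (without the leading `gh`)."""
    if args[:2] == ["pr", "diff"]:
        # A diff pinned to a head SHA cannot change, so it can be kept much longer.
        return timedelta(days=7) if stable_sha else timedelta(minutes=5)
    if args[:1] == ["api"]:
        return timedelta(seconds=60)
    return timedelta(seconds=30)
```

The long TTL for SHA-pinned diffs is safe because a new push changes the head SHA and therefore the cache identity.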
|
||||
|
||||
Repeated read failures are also cached by default. That avoids a fleet of agents all rediscovering the same missing release, workflow, secret, or unsupported field. Set `GITCRAWL_GH_CACHE_ERRORS=0` to cache successful reads only.
|
||||
|
||||
## Auto-hydration
|
||||
|
||||
When a local PR-detail read misses the cache, the shim can auto-hydrate exactly one PR before falling back:
|
||||
|
||||
1. Shim detects missing or stale PR detail (older than 90s, or head SHA mismatch)
|
||||
2. If `GITCRAWL_GH_AUTO_HYDRATE != 0` (the default), runs `gitcrawl sync --numbers <n> --with pr-details`
|
||||
3. Retries the local query against the freshly populated cache
|
||||
4. Falls through to the real `gh` if hydration failed
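
The steps above can be sketched as the following control flow. Everything here is hypothetical scaffolding for illustration: `load_detail`, `hydrate`, and `fall_through` are caller-supplied callables, the cached detail is modeled as a dict with `fetched_at` and `head_sha` keys, and how the shim learns the current head SHA is not specified in this page.

```python
import os
import time

STALE_AFTER_SECONDS = 90


def needs_hydration(detail, current_head_sha, now=None):
    """True when cached PR detail is missing, older than 90s, or for another head SHA."""
    if detail is None:
        return True
    now = time.time() if now is None else now
    if now - detail["fetched_at"] > STALE_AFTER_SECONDS:
        return True
    return detail["head_sha"] != current_head_sha


def answer_pr_read(load_detail, hydrate, fall_through, current_head_sha, environ=os.environ):
    """Serve from cache, hydrate once on a miss, and fall through if that fails."""
    detail = load_detail()
    if not needs_hydration(detail, current_head_sha):
        return detail
    # Auto-hydration is on unless explicitly disabled with GITCRAWL_GH_AUTO_HYDRATE=0.
    if environ.get("GITCRAWL_GH_AUTO_HYDRATE", "1") != "0":
        if hydrate():
            detail = load_detail()
            if not needs_hydration(detail, current_head_sha):
                return detail
    return fall_through()
```

Hydrating at most once per read keeps the worst case bounded: one targeted sync, then the real `gh`.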
|
||||
|
||||
This keeps `gh pr view`, `gh pr checks`, and `gh run` reads cheap and fresh without manual sync orchestration. Disable with `GITCRAWL_GH_AUTO_HYDRATE=0` if you want the shim to be strictly cache-or-fallthrough.
|
||||
|
||||
## Cache inspection: `xcache`
|
||||
|
||||
```bash
|
||||
gitcrawl gh xcache stats # summary
|
||||
gitcrawl gh xcache keys # per-entry detail
|
||||
gitcrawl gh xcache gc # remove expired entries + stale lock files
|
||||
gitcrawl gh xcache flush # clear everything
|
||||
```
|
||||
|
||||
All accept `--json` for scripting.
|
||||
|
||||
`stats` JSON:
|
||||
|
||||
```json
|
||||
{
|
||||
"cache_dir": "/Users/me/.config/gitcrawl/cache/gh-shim",
|
||||
"entries": 142,
|
||||
"expired": 6,
|
||||
"locks": 0,
|
||||
"bytes": 1841234,
|
||||
"counters": {
|
||||
"local_hits": 540,
|
||||
"fallback_hits": 88,
|
||||
"backend_misses": 12,
|
||||
"pass_through_writes": 4
|
||||
},
|
||||
"commands": {
|
||||
"gh pr view": { "entries": 30, "bytes": 184320 },
|
||||
"gh search issues": { "entries": 14, "bytes": 18230 }
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
`local_hits` are answered from SQLite; `fallback_hits` are answered from the fallthrough cache; `backend_misses` actually hit GitHub. Watching the ratio is the easiest way to confirm the shim is paying for itself.
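
One way to watch that ratio is to compute it from the `--json` output. The sketch below assumes the `counters` shape shown in the sample above; the function name is made up for this example.

```python
import json


def offload_ratio(stats_json):
    """Fraction of shim reads answered without contacting GitHub."""
    counters = json.loads(stats_json)["counters"]
    served_locally = counters["local_hits"] + counters["fallback_hits"]
    total = served_locally + counters["backend_misses"]
    return served_locally / total if total else 0.0
```

Feeding it the output of `gitcrawl gh xcache stats --json` gives a single number to track over time; the sample counters above work out to 628 of 640 reads served without a GitHub call.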
|
||||
|
||||
## Cache key composition
|
||||
|
||||
Cache keys are deterministic SHA-256 hashes of:
|
||||
|
||||
- A version tag (`v2`)
|
||||
- The resolved gitcrawl config path
|
||||
- The current working directory
|
||||
- The `GH_HOST` env var
|
||||
- The `GH_REPO` env var
|
||||
- For `gh pr diff`: the stable identity `pr-diff:owner/repo:number:head-sha` (when available)
|
||||
- The full command argument vector, null-separated
|
||||
|
||||
This isolates sibling checkouts and portable stores while still coalescing repeated calls from the same agent workspace. Concurrent cache misses use a lock file so one process populates the entry while peers wait for the result, instead of all of them firing at GitHub.
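
A minimal sketch of such a key, assuming the components are simply joined with null bytes in the order listed. The exact byte layout gitcrawl hashes is not documented here, so the function below illustrates the idea and will not reproduce real cache keys.

```python
import hashlib


def cache_key(config_path, cwd, gh_host, gh_repo, args, stable_identity=None):
    """Deterministic SHA-256 key over the listed components (illustrative layout)."""
    parts = ["v2", config_path, cwd, gh_host, gh_repo]
    if stable_identity:
        # e.g. "pr-diff:owner/repo:123:<head-sha>" for `gh pr diff`
        parts.append(stable_identity)
    parts.extend(args)
    return hashlib.sha256("\0".join(parts).encode("utf-8")).hexdigest()
```

Because the working directory is part of the input, the same `gh` command run from two sibling checkouts produces two different keys, which is the isolation described above.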
|
||||
|
||||
## What does not flow through the shim
|
||||
|
||||
- **Mutating commands** — `gh issue close`, `gh pr merge`, `gh pr comment`, `gh api -X POST`, etc. These pass straight through, increment `pass_through_writes`, and clear the relevant cache entries on success.
|
||||
- **Auth flows** — `gh auth login`, `gh auth refresh`, etc. Always real `gh`.
|
||||
- **Anything the shim does not recognize** — falls through unmodified.
|
||||
|
||||
## Agent integration
|
||||
|
||||
Pattern: replace `gh` with `gitcrawl-gh` (or symlink to `gh`) for every agent in the fleet, then keep your existing prompts and tools. Most read-only triage flows ("look up this issue", "check the PR status", "list open issues for this label") become local-only without any prompt changes.
|
||||
|
||||
For best results, schedule a periodic `gitcrawl refresh owner/repo` (every few minutes per repo, depending on activity) so the local mirror stays warm. The shim's `--sync-if-stale` (via `gitcrawl search`) and auto-hydration handle the rest.
|
||||
|
||||
See [Automation](./automation) for full agent recipes and JSON contracts.
|
||||
130
docs/governance.md
Normal file
@ -0,0 +1,130 @@
|
||||
---
|
||||
title: Governance
|
||||
nav_order: 10
|
||||
permalink: /governance/
|
||||
---
|
||||
|
||||
# Governance
|
||||
{: .no_toc }
|
||||
|
||||
Maintainer overrides on top of the cluster algorithm. All changes are local; gitcrawl never writes back to GitHub.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Why governance exists
|
||||
|
||||
The cluster algorithm is good but not perfect. Sometimes it misses an obvious duplicate, or glues two unrelated reports together, or picks a poor representative thread. Governance commands let you correct the result without re-tuning thresholds or re-running embeddings.
|
||||
|
||||
Every override is recorded with a reason and persists across `cluster`/`refresh` runs because durable cluster IDs are stable. The TUI exposes the same actions via right-click and the `a` action menu.
|
||||
|
||||
## Local close
|
||||
|
||||
Mark a thread or a cluster as "handled locally — do not show me this again."
|
||||
|
||||
```bash
|
||||
gitcrawl close-thread owner/repo --number 123 --reason "duplicate handled"
|
||||
gitcrawl reopen-thread owner/repo --number 123
|
||||
|
||||
gitcrawl close-cluster owner/repo --id 42 --reason "all members handled"
|
||||
gitcrawl reopen-cluster owner/repo --id 42
|
||||
```
|
||||
|
||||
The reason defaults to `CLI manual close` and is stored alongside the override for audit. Locally closed threads and clusters are filtered out by `--hide-closed` across `clusters`, `cluster-detail`, the TUI, and search.
|
||||
|
||||
This **does not** change anything on GitHub. It is purely a local triage signal — useful when you have already commented "duplicate of #X" on the upstream issue and want to clear it from your maintainer view.
|
||||
|
||||
JSON output:
|
||||
|
||||
```json
|
||||
{ "repository": "owner/repo", "number": 123, "reason": "duplicate handled", "closed": true }
|
||||
```
|
||||
|
||||
## Member exclusion
|
||||
|
||||
Pull a single thread out of a cluster, or pull it back in.
|
||||
|
||||
```bash
|
||||
gitcrawl exclude-cluster-member owner/repo --id 42 --number 456 --reason "different repro"
|
||||
gitcrawl include-cluster-member owner/repo --id 42 --number 456
|
||||
```
|
||||
|
||||
Use this when the algorithm is mostly right but caught one false positive. The override travels with the cluster's stable ID, so re-clustering does not undo your decision.
|
||||
|
||||
JSON output:
|
||||
|
||||
```json
{
  "repository": "owner/repo",
  "override": { "cluster_id": 42, "thread_number": 456, "kind": "exclude", "reason": "different repro", "created_at": "..." },
  "excluded": true
}
```
|
||||
|
||||
## Canonical member
|
||||
|
||||
Pin which thread represents the cluster — this is what shows up as the row title in `clusters` and the TUI summary.
|
||||
|
||||
```bash
|
||||
gitcrawl set-cluster-canonical owner/repo --id 42 --number 123 --reason "main tracking issue"
|
||||
```
|
||||
|
||||
The chosen `--number` must already be a member of the cluster. The TUI's right-click menu has a "set canonical" entry that calls this command.
|
||||
|
||||
## Reopen and undo
|
||||
|
||||
There is no separate `undo`. The inverse commands are explicit:
|
||||
|
||||
| Action | Inverse |
| --- | --- |
| `close-thread` | `reopen-thread` |
| `close-cluster` | `reopen-cluster` |
| `exclude-cluster-member` | `include-cluster-member` |
| `set-cluster-canonical` | `set-cluster-canonical --number <other>` |
|
||||
|
||||
Each call records a fresh override row, so the audit history is preserved.
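
For scripts that need to reverse an override, the action/inverse pairs map directly onto a lookup. This is a small sketch: `inverse_of` is a hypothetical helper, not a gitcrawl command, and it only returns the command names listed in this section.

```bash
# Hypothetical helper: print the inverse governance command for a given action.
inverse_of() {
  case "$1" in
    close-thread)            echo "reopen-thread" ;;
    close-cluster)           echo "reopen-cluster" ;;
    exclude-cluster-member)  echo "include-cluster-member" ;;
    # Re-pinning a canonical is its own inverse: call it again with another number.
    set-cluster-canonical)   echo "set-cluster-canonical" ;;
    *) echo "no known inverse for: $1" >&2; return 1 ;;
  esac
}
```

For example, `gitcrawl "$(inverse_of close-thread)" owner/repo --number 123` runs `reopen-thread`.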
|
||||
|
||||
## Reading overrides
|
||||
|
||||
`gitcrawl cluster-detail` returns active overrides as part of the JSON payload, and `gitcrawl runs --kind cluster` lists when each clustering run was performed. To inspect raw override history you can query SQLite directly:
|
||||
|
||||
```bash
sqlite3 ~/.config/gitcrawl/gitcrawl.db \
  "SELECT cluster_id, thread_number, kind, reason, created_at
   FROM cluster_member_overrides ORDER BY created_at DESC LIMIT 20;"
```
|
||||
|
||||
(The schema is internal and may change between versions — prefer the JSON outputs from the CLI for stable contracts.)
|
||||
|
||||
## Workflow patterns
|
||||
|
||||
### "Triage this cluster, then move on"
|
||||
|
||||
```bash
|
||||
gitcrawl cluster-detail owner/repo --id 42 --body-chars 600 | less
|
||||
# ...read, decide canonical, add labels via gh, comment via gh...
|
||||
gitcrawl set-cluster-canonical owner/repo --id 42 --number 123
|
||||
gitcrawl close-cluster owner/repo --id 42 --reason "consolidated under #123"
|
||||
```
|
||||
|
||||
### "This thread doesn't belong here"
|
||||
|
||||
```bash
|
||||
gitcrawl exclude-cluster-member owner/repo --id 42 --number 456 --reason "different repro"
|
||||
gitcrawl neighbors owner/repo --number 456 --limit 10 # find a better home manually
|
||||
```
|
||||
|
||||
### "I'm done with this issue locally even though upstream is still open"
|
||||
|
||||
```bash
|
||||
gitcrawl close-thread owner/repo --number 789 --reason "answered in chat"
|
||||
```
|
||||
|
||||
The thread stays open on GitHub; only your local triage view hides it.
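
When several threads were resolved the same way, the single-thread command can be looped. A minimal sketch, using only the `close-thread` flags shown above; `close_threads` is a hypothetical helper name.

```bash
# Hypothetical helper: locally close several threads with one shared reason.
# Usage: close_threads owner/repo "reason text" 789 790 791
close_threads() {
  local repo="$1" reason="$2" number
  shift 2
  for number in "$@"; do
    # Stop at the first failure so a typo in a number is noticed immediately.
    gitcrawl close-thread "$repo" --number "$number" --reason "$reason" || return 1
  done
}
```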
|
||||
|
||||
## What governance does *not* do
|
||||
|
||||
- It does not edit, label, comment on, or close GitHub issues. Use `gh` for that.
|
||||
- It does not retrain embeddings or reshape the underlying graph — it overlays decisions on top of the algorithm output.
|
||||
- It does not propagate to other gitcrawl installations unless you publish your database via a [portable store](./portable-stores).
|
||||
69
docs/index.md
Normal file
@ -0,0 +1,69 @@
|
||||
---
|
||||
title: Home
|
||||
layout: home
|
||||
nav_order: 1
|
||||
description: "gitcrawl is a local-first GitHub issue and pull request crawler for maintainer triage."
|
||||
permalink: /
|
||||
---
|
||||
|
||||
# gitcrawl
|
||||
{: .fs-9 }
|
||||
|
||||
A local-first GitHub issue and pull request crawler for maintainer triage. Sync, search, cluster, and review related threads from a SQLite cache that lives entirely on your machine.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
[Quickstart](./quickstart){: .btn .btn-primary .fs-5 .mb-4 .mb-md-0 .mr-2 }
|
||||
[View on GitHub](https://github.com/openclaw/gitcrawl){: .btn .fs-5 .mb-4 .mb-md-0 }
|
||||
|
||||
---
|
||||
|
||||
## What gitcrawl does
|
||||
|
||||
`gitcrawl` mirrors a GitHub repository's issues and pull requests into local SQLite, then layers semantic clustering, full-text search, and a `gh`-compatible shim on top so a maintainer (or an agent acting on their behalf) can triage threads without burning live API quota.
|
||||
|
||||
- **Local SQLite first.** All issues, PRs, comments, reviews, files, commits, checks, and workflow runs land in `~/.config/gitcrawl/gitcrawl.db`. Queries hit the disk, not GitHub.
|
||||
- **Semantic clustering.** OpenAI embeddings group related reports, with deterministic GitHub reference evidence (`#123`, `pull/123`) preventing weak similarity bridges from forming mega-clusters.
|
||||
- **`gh`-compatible shim.** Drop `gitcrawl gh` (or symlink it as `gh`) into agent workflows and most read-only `gh search`, `gh issue/pr view`, `gh pr checks`, and `gh run` calls answer from local cache instead of the GitHub API.
|
||||
- **Terminal UI.** `gitcrawl tui` is a keyboard- and mouse-driven cluster browser with bidirectional sort, jump-to-number, neighbors, and member-level governance actions.
|
||||
- **Agent-friendly JSON.** Every command supports `--json` for clean automation surfaces.
|
||||
|
||||
---
|
||||
|
||||
## Pick your path
|
||||
|
||||
<div class="code-example" markdown="1">
|
||||
|
||||
### I want to try it
|
||||
[Quickstart](./quickstart) walks you from `git clone` to a populated cluster view in five minutes.
|
||||
|
||||
### I want to wire up an agent
|
||||
The [`gh` shim](./gh-shim) is the fastest way to cut GitHub API load: point your agent at `gitcrawl-gh` and keep the agent's `gh` calls intact.
|
||||
|
||||
### I want to triage a busy repo
|
||||
Read [Sync](./sync) to bring data local, then [Clustering](./clustering) and the [TUI](./tui) for the maintainer workflow.
|
||||
|
||||
### I want the full reference
|
||||
[Commands](./commands) lists every flag and JSON field. [Configuration](./configuration) covers env vars and paths.
|
||||
|
||||
</div>
|
||||
|
||||
---
|
||||
|
||||
## Project status
|
||||
|
||||
Early bootstrap. The implementation is being built in small commits — see the [changelog](https://github.com/openclaw/gitcrawl/blob/main/CHANGELOG.md) for what shipped recently.
|
||||
|
||||
The product contract in [SPEC.md](https://github.com/openclaw/gitcrawl/blob/main/SPEC.md) is the source of truth for what is in and out of scope.
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Local HTTP API
|
||||
- Hosted service runtime
|
||||
- Browser web UI
|
||||
- GitHub write-back actions (use `gh` for those)
|
||||
|
||||
---
|
||||
|
||||
## License
|
||||
|
||||
Released under the [MIT license](https://github.com/openclaw/gitcrawl/blob/main/LICENSE).
|
||||
80
docs/installation.md
Normal file
@ -0,0 +1,80 @@
|
||||
---
|
||||
title: Installation
|
||||
nav_order: 2
|
||||
permalink: /installation/
|
||||
---
|
||||
|
||||
# Installation
|
||||
{: .no_toc }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Requirements
|
||||
|
||||
- **Go 1.26+** if building from source
|
||||
- **Git** for cloning the repository (and for portable stores)
|
||||
- **A GitHub token** for any command that talks to GitHub (`sync`, `refresh`, `gh` shim fallthroughs)
|
||||
- **An OpenAI API key** only for `embed`, `refresh` (embed stage), and any future summary commands
|
||||
- **`gh` CLI** if you want the shim to fall through to the real GitHub CLI for unsupported commands
|
||||
|
||||
gitcrawl runs on macOS and Linux. Windows is not actively tested.
|
||||
|
||||
## Install from a GitHub release
|
||||
|
||||
Each tagged release publishes archives for `darwin_amd64`, `darwin_arm64`, `linux_amd64`, and `linux_arm64` via [GoReleaser](https://github.com/openclaw/gitcrawl/blob/main/.goreleaser.yaml).
|
||||
|
||||
```bash
# Replace the version (0.1.2) and platform (darwin_arm64) with the values you want.
curl -L "https://github.com/openclaw/gitcrawl/releases/download/v0.1.2/gitcrawl_0.1.2_darwin_arm64.tar.gz" \
  | tar -xz -C /usr/local/bin gitcrawl

gitcrawl --version
```
|
||||
|
||||
Browse the [releases page](https://github.com/openclaw/gitcrawl/releases) for the latest tag and the full asset list.
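
If you script the download, the archive name can be derived from the machine it runs on. The sketch below assumes the asset naming seen in the example URL (`gitcrawl_<version>_<os>_<arch>.tar.gz`); `release_platform` and `release_url` are hypothetical helper names, and the real asset names should be confirmed on the releases page.

```bash
# Map `uname` output onto the platform suffixes used by the release archives.
# Arguments default to the current machine; pass explicit values to override.
release_platform() {
  local os="${1:-$(uname -s)}" arch="${2:-$(uname -m)}"
  case "$os" in
    Darwin) os="darwin" ;;
    Linux)  os="linux" ;;
    *) echo "unsupported OS: $os" >&2; return 1 ;;
  esac
  case "$arch" in
    x86_64|amd64)  arch="amd64" ;;
    arm64|aarch64) arch="arm64" ;;
    *) echo "unsupported architecture: $arch" >&2; return 1 ;;
  esac
  echo "${os}_${arch}"
}

# Build the download URL for a version such as 0.1.2 (without the leading "v").
release_url() {
  local version="$1" platform="$2"
  echo "https://github.com/openclaw/gitcrawl/releases/download/v${version}/gitcrawl_${version}_${platform}.tar.gz"
}
```

A call such as `release_url 0.1.2 "$(release_platform)"` then yields the URL to pass to `curl -L`.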
|
||||
|
||||
## Install from source
|
||||
|
||||
```bash
git clone https://github.com/openclaw/gitcrawl.git
cd gitcrawl
go build \
  -ldflags "-X github.com/openclaw/gitcrawl/internal/cli.version=$(git describe --tags --always --dirty)" \
  -o bin/gitcrawl ./cmd/gitcrawl

./bin/gitcrawl --version
```
|
||||
|
||||
Symlink or copy `bin/gitcrawl` somewhere on your `PATH` (`~/bin`, `/usr/local/bin`, `~/.local/bin`).
|
||||
|
||||
## Install the `gh` shim
|
||||
|
||||
The shim is the same binary. Symlink it as `gh` (replacing the real CLI) or as `gitcrawl-gh` (running side by side):
|
||||
|
||||
```bash
|
||||
# Side-by-side install — agents can opt in by calling `gitcrawl-gh`.
|
||||
ln -s "$(command -v gitcrawl)" /usr/local/bin/gitcrawl-gh
|
||||
|
||||
# Or replace the global `gh` so every agent picks up the cache automatically.
|
||||
ln -s "$(command -v gitcrawl)" /usr/local/bin/gh
|
||||
export GITCRAWL_GH_PATH="$(command -v /opt/homebrew/bin/gh)" # point shim at the real gh
|
||||
```
|
||||
|
||||
When invoked as `gh` or `gitcrawl-gh`, the binary auto-detects shim mode. See [the gh shim guide](./gh-shim) for details.
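
Dispatching on the invoked name is a common pattern for multi-call binaries. The sketch below shows the idea in shell; it illustrates the technique and is not gitcrawl's implementation. `mode_for_name` is a hypothetical name.

```bash
# Hypothetical sketch: choose a mode from the name the program was invoked as.
mode_for_name() {
  # Strip any directory prefix, as a program would do with its argv[0].
  case "$(basename "$1")" in
    gh|gitcrawl-gh) echo "shim" ;;
    *)              echo "cli" ;;
  esac
}
```

A symlink points at the same file on disk, so only the invoked name differs between `gitcrawl` and `gitcrawl-gh`.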
|
||||
|
||||
## Verify the install
|
||||
|
||||
```bash
|
||||
gitcrawl init # creates ~/.config/gitcrawl/{config.toml,gitcrawl.db,...}
|
||||
gitcrawl doctor # confirms config, database, and credential discovery
|
||||
gitcrawl doctor --json # same, machine-readable
|
||||
```
|
||||
|
||||
`doctor` reports whether `GITHUB_TOKEN` and `OPENAI_API_KEY` are present, where they came from, the version, repository count, and the last sync timestamp. If anything is missing, the message tells you which env var or config field to set.
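
In CI or agent bootstrap scripts, a quick environment check before calling `gitcrawl doctor` lets a job fail fast. This sketch only inspects environment variables; unlike `doctor`, it does not read `config.toml`. `missing_credentials` is a hypothetical helper.

```bash
# Print the names of credential variables that are unset or empty.
# Returns non-zero when at least one is missing.
missing_credentials() {
  local name missing=0
  for name in GITHUB_TOKEN OPENAI_API_KEY; do
    # printenv only sees exported variables, which is what a child
    # process such as gitcrawl inherits.
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```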
|
||||
|
||||
## Updating
|
||||
|
||||
- **Release archives:** download the new tarball and replace the binary.
|
||||
- **Source builds:** `git pull && go build ...` — the version string comes from `git describe`.
|
||||
- **Configuration is forward-compatible.** Existing `config.toml` and `gitcrawl.db` files are reused across versions; no migration step is needed for normal point releases.
|
||||
112
docs/portable-stores.md
Normal file
@ -0,0 +1,112 @@
|
||||
---
|
||||
title: Portable stores
|
||||
nav_order: 13
|
||||
permalink: /portable-stores/
|
||||
---
|
||||
|
||||
# Portable stores
|
||||
{: .no_toc }
|
||||
|
||||
A Git-backed publish target for a `gitcrawl.db` plus its derived bodies — share a local cache across agents and machines without running a hosted service.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## When to use one
|
||||
|
||||
- You want every agent on a team to read from a shared, recently synced cache without each agent making its own GitHub calls.
|
||||
- You want a backup of the SQLite cache that someone else can clone and use immediately.
|
||||
- You want a deterministic snapshot of "what gitcrawl knew at time T" for reproducible triage.
|
||||
|
||||
A portable store is just a Git repository whose contents include a SQLite database (and optionally derived bodies and vectors). Anyone with read access to the repository can `git clone` it and have a fully populated gitcrawl mirror in seconds.
|
||||
|
||||
## Setup: pointing gitcrawl at a portable store
|
||||
|
||||
```bash
gitcrawl init \
  --portable-store https://github.com/openclaw/gitcrawl-store.git \
  --portable-db data/openclaw__openclaw.sync.db \
  --store-dir ~/.config/gitcrawl/portable
```
|
||||
|
||||
`init` will:
|
||||
|
||||
1. Clone the portable store to `--store-dir`
|
||||
2. Wire `~/.config/gitcrawl/config.toml` to use the database at `--portable-db` inside that checkout
|
||||
3. Create the runtime cache, vector, and log directories in the standard locations
|
||||
|
||||
JSON output reports `portable_store_url`, `portable_store_dir`, and `portable_store: cloned|pulled|reset-pulled` so automation can tell what happened.
|
||||
|
||||
## How read-only commands behave
|
||||
|
||||
Read-only commands (`search`, `threads`, `clusters`, `cluster-detail`, `neighbors`, the TUI) refresh the portable-store checkout before reading, so they always see the latest published data:
|
||||
|
||||
- The refresh is best-effort and non-interactive
|
||||
- SSH attempts are bounded so an offline remote does not hang the CLI
|
||||
- Stale SQLite sidecars (WAL, SHM) are cleared after the pull so queries see freshly pulled data
|
||||
- Local Git pull configuration that tries to rebase onto multiple branch merge refs is handled cleanly
|
||||
|
||||
If the remote is unreachable, the read still answers from the local checkout.
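
The refresh behavior described in this section can be approximated in shell. This is a sketch under stated assumptions, not gitcrawl's implementation: `refresh_checkout` is a hypothetical name, `--ff-only` is one conservative way to pull, and the sidecar names follow SQLite's standard `-wal` and `-shm` suffixes.

```bash
# Hypothetical sketch: best-effort refresh of a portable-store checkout.
# Usage: refresh_checkout /path/to/checkout data/owner__repo.sync.db
refresh_checkout() {
  local dir="$1" db="$1/$2"
  # GIT_TERMINAL_PROMPT=0 keeps Git from prompting for credentials.
  if GIT_TERMINAL_PROMPT=0 git -C "$dir" pull --ff-only --quiet; then
    # Drop stale SQLite sidecars so readers see the freshly pulled database.
    rm -f "${db}-wal" "${db}-shm"
  else
    echo "pull failed; answering from the existing checkout" >&2
  fi
  # Always succeed: an unreachable remote must not block local reads.
  return 0
}
```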
|
||||
|
||||
## How write commands behave
|
||||
|
||||
Write commands (`embed`, `refresh`, `cluster`, neighbor generation) need to persist new data without mutating the published portable store. They open a **writable runtime mirror** alongside the portable checkout so vectors and overrides land in the runtime cache while the portable database remains read-only.
|
||||
|
||||
This separation means:
|
||||
|
||||
- You can `gitcrawl embed` against a portable store without dirtying the Git checkout
|
||||
- Local cluster overrides (`close-cluster`, exclusions, canonicals) live in the runtime mirror
|
||||
- Only the publishing workflow writes back into the portable checkout
|
||||
|
||||
## Publishing: `gitcrawl portable prune`
|
||||
|
||||
```bash
|
||||
gitcrawl portable prune
|
||||
gitcrawl portable prune --body-chars 256 # default
|
||||
gitcrawl portable prune --body-chars 512 --no-vacuum
|
||||
gitcrawl portable prune --json
|
||||
```
|
||||
|
||||
`prune` truncates thread bodies in the database to the requested character cap and (by default) runs SQLite `VACUUM` to reclaim space. The result is a smaller database suitable for committing back to Git.
|
||||
|
||||
| Flag | Default | Description |
| --- | --- | --- |
| `--body-chars <n>` | `256` | Maximum body characters to keep per thread |
| `--no-vacuum` | _(off)_ | Skip the post-prune `VACUUM` |
| `--json` | _(off)_ | JSON output |
|
||||
|
||||
After pruning, commit and push the database file from the portable checkout the way you would for any Git repository.
|
||||
|
||||
## A typical publishing flow
|
||||
|
||||
```bash
# In the portable store checkout, refresh upstream data into the local runtime mirror.
gitcrawl refresh owner/repo

# Prune for a small, shareable footprint.
gitcrawl portable prune --body-chars 256

# Commit and push using normal Git.
cd ~/.config/gitcrawl/portable
git add data/openclaw__openclaw.sync.db
git commit -m "data: refresh openclaw/openclaw"
git push
```
|
||||
|
||||
Other agents and machines pull the new commit on their next read-only command.
|
||||
|
||||
## Cached search against a portable store
|
||||
|
||||
`gitcrawl search` (and the gh-shim's search) works against portable-store data with one wrinkle: when the portable store has been pruned, generated document indexes may not be present. Search then falls back to compact thread title/body data automatically, so you keep useful results without the publisher needing to ship the full document indexes.
|
||||
|
||||
## Caveats
|
||||
|
||||
- The portable store carries the SQLite database. It does not carry the runtime cache or the vector store unless you explicitly publish them.
|
||||
- Vectors regenerated on each consumer's machine after `embed` are not shared; if you want shared vectors, publish the `vectors/` directory alongside the database.
|
||||
- Portable stores are read-mostly. Multiple writers pushing concurrently will race the way any Git workflow does — gate writes through a single publisher or a CI workflow.
|
||||
|
||||
## See also
|
||||
|
||||
- [Sync](./sync) — what gets written into the database that ends up in the portable store
|
||||
- [gh shim](./gh-shim) — agents reading a shared portable store benefit doubly from the shim's local-first answers
|
||||
132
docs/quickstart.md
Normal file
@ -0,0 +1,132 @@
|
||||
---
|
||||
title: Quickstart
|
||||
nav_order: 3
|
||||
permalink: /quickstart/
|
||||
---
|
||||
|
||||
# Quickstart
|
||||
{: .no_toc }
|
||||
|
||||
Five minutes from clean machine to a populated cluster view.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## 1. Install and initialize
|
||||
|
||||
```bash
|
||||
# Build (or download a release archive — see Installation).
|
||||
git clone https://github.com/openclaw/gitcrawl.git
|
||||
cd gitcrawl
|
||||
go build -o /usr/local/bin/gitcrawl ./cmd/gitcrawl
|
||||
|
||||
# Create config + database under ~/.config/gitcrawl.
|
||||
gitcrawl init
|
||||
```
|
||||
|
||||
Defaults written:
|
||||
|
||||
- `~/.config/gitcrawl/config.toml`
|
||||
- `~/.config/gitcrawl/gitcrawl.db`
|
||||
- `~/.config/gitcrawl/cache/`
|
||||
- `~/.config/gitcrawl/vectors/`
|
||||
- `~/.config/gitcrawl/logs/`
|
||||
|
||||
## 2. Set credentials
|
||||
|
||||
```bash
|
||||
export GITHUB_TOKEN=ghp_xxx # required for sync
|
||||
export OPENAI_API_KEY=sk-xxx # required for embeddings
|
||||
```
|
||||
|
||||
Either set them in your shell profile or store them in `~/.config/gitcrawl/config.toml`:
|
||||
|
||||
```toml
|
||||
[env]
|
||||
GITHUB_TOKEN = "ghp_xxx"
|
||||
OPENAI_API_KEY = "sk-xxx"
|
||||
```
|
||||
|
||||
`gitcrawl doctor` confirms the credentials are visible and reports their source.
|
||||
|
||||
## 3. Sync a repository
|
||||
|
||||
```bash
|
||||
gitcrawl sync openclaw/gitcrawl
|
||||
```
|
||||
|
||||
By default this fetches **open** issues and pull requests, plus a sweep of recently closed rows so the local store does not go stale. Add `--include-comments` for review threads, and `--include-pr-details` (or `--with pr-details`) for PR files, commits, checks, and workflow runs.
|
||||
|
||||
Need exact rows? Use `--numbers`:
|
||||
|
||||
```bash
|
||||
gitcrawl sync openclaw/gitcrawl --numbers 123,456 --include-comments
|
||||
```
|
||||
|
||||
## 4. Embed and cluster
|
||||
|
||||
The `refresh` command runs sync → embed → cluster end to end:
|
||||
|
||||
```bash
|
||||
gitcrawl refresh openclaw/gitcrawl
|
||||
```
|
||||
|
||||
You can run the stages individually if you want finer control — see [Refresh and embed](./refresh-and-embed) and [Clustering](./clustering).
|
||||
|
||||
## 5. Browse clusters
|
||||
|
||||
Open the TUI:
|
||||
|
||||
```bash
|
||||
gitcrawl tui openclaw/gitcrawl
|
||||
# or just `gitcrawl tui` and the most recently synced repo is inferred
|
||||
```
|
||||
|
||||
- `↑`/`↓` navigate clusters, `Enter` opens member detail
|
||||
- `a` opens the action menu, `#` jumps to a number, `n` loads neighbors, `p` switches repo
|
||||
- Right-click and mouse wheel work in most terminals
|
||||
|
||||
For a non-interactive view:
|
||||
|
||||
```bash
|
||||
gitcrawl clusters openclaw/gitcrawl --sort size --min-size 5
|
||||
gitcrawl cluster-detail openclaw/gitcrawl --id 12
|
||||
gitcrawl neighbors openclaw/gitcrawl --number 123 --limit 10
|
||||
```
|
||||
|
||||
## 6. Search the local cache
|
||||
|
||||
```bash
|
||||
gitcrawl search openclaw/gitcrawl --query "download stalls" --mode hybrid
|
||||
```
|
||||
|
||||
The same command also accepts the `gh search` shape, which makes it a drop-in for scripts that already speak `gh`:
|
||||
|
||||
```bash
gitcrawl search issues "manifest cache" \
  -R openclaw/gitcrawl \
  --state open \
  --json number,title,state,url,updatedAt,labels \
  --limit 30
```
|
||||
|
||||
Add `--sync-if-stale 5m` to refresh the local mirror first when it is older than the duration you tolerate.
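
The staleness test behind a flag like `--sync-if-stale 5m` amounts to comparing the age of the last sync with a tolerated duration. A sketch of that arithmetic follows; `duration_seconds` and `is_stale` are hypothetical helpers, and the accepted suffixes (`s`, `m`, `h`) are an assumption rather than gitcrawl's documented grammar.

```bash
# Convert a duration such as 30s, 5m, or 1h into seconds.
duration_seconds() {
  local value="${1%[smh]}" unit="${1#"${1%?}"}"
  # Reject anything whose numeric part is empty or not purely digits.
  case "$value" in
    ''|*[!0-9]*) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
  case "$unit" in
    s) echo "$value" ;;
    m) echo $((value * 60)) ;;
    h) echo $((value * 3600)) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

# Succeed (exit 0) when the last sync is older than the tolerated duration.
# Usage: is_stale LAST_SYNC_EPOCH MAX_AGE [NOW_EPOCH]
is_stale() {
  local last="$1" max_age now="${3:-$(date +%s)}"
  max_age=$(duration_seconds "$2") || return 2
  [ $((now - last)) -gt "$max_age" ]
}
```

Passing the current time as an optional third argument keeps the check easy to test with fixed timestamps.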
|
||||
|
||||
## 7. Wire up the `gh` shim (optional)
|
||||
|
||||
```bash
|
||||
ln -s "$(command -v gitcrawl)" /usr/local/bin/gitcrawl-gh
|
||||
gitcrawl-gh search issues "download stalls" -R openclaw/gitcrawl --json number,title,url
|
||||
gitcrawl-gh pr view 123 -R openclaw/gitcrawl --json number,title,state,url
|
||||
gitcrawl-gh xcache stats
|
||||
```
|
||||
|
||||
Most read-only `gh` calls answer locally; mutating commands pass straight through to the real `gh`. See [gh shim](./gh-shim) for the full surface.
|
||||
|
||||
## Where to next
|
||||
|
||||
- [Concepts](./concepts) — what threads, durable clusters, and embeddings actually mean
|
||||
- [Sync](./sync) — every flag for hydrating the local store
|
||||
- [Clustering](./clustering) — tuning the cluster graph for a specific repo
|
||||
- [Automation](./automation) — JSON contracts for agents and scripts
|
||||
164
docs/reference.md
Normal file
@ -0,0 +1,164 @@
|
||||
---
|
||||
title: Reference
|
||||
nav_order: 16
|
||||
permalink: /reference/
|
||||
---
|
||||
|
||||
# Reference
|
||||
{: .no_toc }
|
||||
|
||||
Lookup tables for paths, environment variables, and defaults.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Paths
|
||||
|
||||
| Path | Purpose |
| --- | --- |
| `~/.config/gitcrawl/config.toml` | Configuration file |
| `~/.config/gitcrawl/gitcrawl.db` | SQLite database |
| `~/.config/gitcrawl/cache/` | Caches (PR detail, gh-shim fallthrough) |
| `~/.config/gitcrawl/cache/gh-shim/` | gh-shim fallthrough cache |
| `~/.config/gitcrawl/vectors/` | Vector store backing embeddings |
| `~/.config/gitcrawl/logs/` | Operational logs |
| `~/.config/gitcrawl/portable/` | Portable-store checkout (when configured) |
|
||||
|
||||
Override the config root with `--config <path>` or `GITCRAWL_CONFIG`.
|
||||
|
||||
## Environment variables
|
||||
|
||||
### Core
|
||||
|
||||
| Variable | Default | Used by | Purpose |
| --- | --- | --- | --- |
| `GITCRAWL_CONFIG` | `~/.config/gitcrawl/config.toml` | All commands | Override config path |
| `GITCRAWL_DB_PATH` | `~/.config/gitcrawl/gitcrawl.db` | All commands | Override database path |
| `GITHUB_TOKEN` | _(none)_ | `sync`, `gh` shim | GitHub API token |
| `OPENAI_API_KEY` | _(none)_ | `embed`, `refresh` | OpenAI API key |
|
||||
|
||||
### Models
|
||||
|
||||
| Variable | Default | Purpose |
| --- | --- | --- |
| `GITCRAWL_SUMMARY_MODEL` | `gpt-5.4` | Summary model (reserved for future commands) |
| `GITCRAWL_EMBED_MODEL` | `text-embedding-3-small` | OpenAI embedding model |
| `GITCRAWL_OPENAI_RETRY_DISABLED` | _(off)_ | Set `1` to disable OpenAI retry/backoff |
| `GITCRAWL_OPENAI_BASE_URL` / `OPENAI_BASE_URL` | OpenAI default | Custom OpenAI endpoint |
|
||||
|
||||
### GitHub overrides
|
||||
|
||||
| Variable | Default | Purpose |
| --- | --- | --- |
| `GITCRAWL_GITHUB_BASE_URL` / `GITHUB_BASE_URL` | GitHub default | Custom GitHub API endpoint |
| `GH_HOST` | _(none)_ | Included in gh-shim cache key |
| `GH_REPO` | _(none)_ | Default `-R` value; included in gh-shim cache key |
|
||||
|
||||
### gh shim
|
||||
|
||||
| Variable | Default | Purpose |
| --- | --- | --- |
| `GITCRAWL_GH_PATH` | _(probed)_ | Path to the real `gh` binary |
| `GITCRAWL_GH_AUTO_HYDRATE` | _(on)_ | Set `0` to disable PR auto-hydration on cache miss |
| `GITCRAWL_GH_CACHE_TTL` | `30s` for most commands | Override fallthrough cache TTL (e.g., `5m`, `1h`) |
|
||||
|
||||
## Configuration defaults
|
||||
|
||||
| Field | Default |
| --- | --- |
| `summary_model` | `gpt-5.4` |
| `embed_model` | `text-embedding-3-small` |
| `embed_dimensions` | `1024` |
| `embedding_basis` | `title_original` |
| `batch_size` (embeddings) | `64` |
| `concurrency` (embeddings) | `2` |
| `tui_default_sort` | `size` |
|
||||
|
||||
## Clustering defaults
|
||||
|
||||
| Parameter | Default | Source |
| --- | --- | --- |
| `--threshold` | `0.80` | `cluster`, `refresh` |
| `--cross-kind-threshold` | `0.93` | `cluster`, `refresh` |
| `--min-size` | `1` | `cluster`, `refresh` |
| `--max-cluster-size` | `40` | `cluster`, `refresh` |
| `--k` (nearest-neighbor fanout) | `16` | `cluster`, `refresh` |
| Weak-edge title overlap floor | `0.18` | internal |
| High-confidence edge score | `0.90` | internal |
| Deterministic reference edge score | `0.94` | internal |
| Body-only reference prefix length | `240` chars | internal |
|
||||
|
||||
## TUI defaults
|
||||
|
||||
| Parameter | Default |
| --- | --- |
| `--min-size` | `5` |
| `--sort` | `size` |
| Working set limit | `500` rows |
| Refresh interval | `15s` |
|
||||
|
||||
## gh shim cache TTLs
|
||||
|
||||
| Cache class | TTL |
| --- | --- |
| Most read-only fallthroughs | `30s` |
| `gh api` (GET) | `60s` |
| `gh pr diff` without stable head SHA | `5m` |
| `gh pr diff` with stable head SHA | `7d` |
| Override | `GITCRAWL_GH_CACHE_TTL` |
| Cache read failures | on by default; disable with `GITCRAWL_GH_CACHE_ERRORS=0` |
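
The TTL table can be read as a small decision function. The sketch below mirrors the documented values; `ttl_for` and its argument convention (a command class plus an optional head SHA) are hypothetical, and it assumes the `GITCRAWL_GH_CACHE_TTL` override applies to every class, which the table does not spell out.

```bash
# Hypothetical sketch: pick a cache TTL for a fallthrough command class.
# Usage: ttl_for CLASS [HEAD_SHA]   where CLASS is e.g. "api-get" or "pr-diff"
ttl_for() {
  # An explicit override wins over the built-in defaults
  # (assumed here to apply to all classes).
  if [ -n "${GITCRAWL_GH_CACHE_TTL:-}" ]; then
    echo "$GITCRAWL_GH_CACHE_TTL"
    return 0
  fi
  case "$1" in
    api-get) echo "60s" ;;
    pr-diff)
      # A known head SHA pins the diff content, so it can be cached far longer.
      if [ -n "${2:-}" ]; then echo "7d"; else echo "5m"; fi ;;
    *) echo "30s" ;;
  esac
}
```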
|
||||
|
||||
## gh shim cache key composition
|
||||
|
||||
A SHA-256 hash of:
|
||||
|
||||
- Version tag (`v2`)
|
||||
- Resolved gitcrawl config path
|
||||
- Current working directory
|
||||
- `GH_HOST` env var
|
||||
- `GH_REPO` env var
|
||||
- For `gh pr diff`: `pr-diff:owner/repo:number:head-sha` (when head SHA is known)
|
||||
- Full command argument vector (null-separated)
|
||||
|
||||
This isolates sibling checkouts and portable stores while coalescing repeated calls from the same workspace.
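
As an illustration, a key of this shape can be computed in shell. The component order follows the list above, but the exact byte layout gitcrawl hashes is an assumption here, as are the helper names; the `gh pr diff` component is omitted for brevity.

```bash
# Use whichever SHA-256 tool is available (GNU coreutils or shasum).
sha256_hex() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | cut -d' ' -f1
  else
    shasum -a 256 | cut -d' ' -f1
  fi
}

# Usage: cache_key CONFIG_PATH CWD GH_ARGS...
# Every component is terminated by a NUL byte, so the argument vectors
# ("pr view") and ("pr" "view") produce different keys.
cache_key() {
  local config="$1" cwd="$2"
  shift 2
  printf '%s\0' "v2" "$config" "$cwd" "${GH_HOST:-}" "${GH_REPO:-}" "$@" | sha256_hex
}
```

Because the working directory is part of the input, the same `gh` call from two sibling checkouts yields two distinct keys, while repeated calls from one workspace share an entry.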
|
||||
|
||||
## Output formats
|
||||
|
||||
| Format | Where to use |
| --- | --- |
| `text` | Human terminal use (default) |
| `json` | Pipelines, scripts, agents (also via `--json`) |
| `log` | Internal structured logging output |
|
||||
|
||||
## Exit codes
|
||||
|
||||
- `0` — success
|
||||
- non-zero — usage error, "not implemented" command, or runtime failure
|
||||
|
||||
stderr always carries error messages. stdout is reserved for command output.
|
||||
|
||||
## File-system layout (worked example)
|
||||
|
||||
```
~/.config/gitcrawl/
├── config.toml
├── gitcrawl.db            # SQLite mirror
├── gitcrawl.db-shm        # SQLite shared-memory file
├── gitcrawl.db-wal        # SQLite write-ahead log
├── cache/
│   ├── gh-shim/           # gh fallthrough cache; inspect with xcache
│   └── pr/                # hydrated PR detail blobs
├── vectors/               # vector store backing embeddings
├── logs/
└── portable/              # portable-store checkout (optional)
    └── data/
        └── owner__repo.sync.db
```
|
||||
|
||||
## See also
|
||||
|
||||
- [Configuration](./configuration) — narrative version of this reference
|
||||
- [Commands](./commands) — every command and flag, in one table
|
||||
- [SPEC.md](https://github.com/openclaw/gitcrawl/blob/main/SPEC.md) — product contract
|
||||
- [CHANGELOG.md](https://github.com/openclaw/gitcrawl/blob/main/CHANGELOG.md) — what shipped recently
|
||||
130
docs/refresh-and-embed.md
Normal file
@ -0,0 +1,130 @@
|
||||
---
|
||||
title: Refresh and embed
|
||||
nav_order: 7
|
||||
permalink: /refresh-and-embed/
|
||||
---
|
||||
|
||||
# Refresh and embed
|
||||
{: .no_toc }
|
||||
|
||||
`gitcrawl refresh` is the one command most users want. It runs sync → embed → cluster in order, with the same flags you would use individually.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## refresh
|
||||
|
||||
```bash
|
||||
gitcrawl refresh owner/repo
|
||||
```
|
||||
|
||||
By default this performs:
|
||||
|
||||
1. **Sync** — open + recently closed issues and PRs (see [Sync](./sync))
|
||||
2. **Embed** — fill `thread_vectors` for any thread whose document changed
|
||||
3. **Cluster** — rebuild durable clusters with the standard thresholds
|
||||
|
||||
Disable any stage with `--no-sync`, `--no-embed`, or `--no-cluster`. The remaining stages still run; failures are reported per stage in the JSON output.
|
||||
|
||||
### Stage-specific flags
|
||||
|
||||
`refresh` forwards flags through to the underlying stages:
|
||||
|
||||
| Forwarded to | Flag |
| --- | --- |
| sync | `--since`, `--state`, `--limit`, `--include-comments` |
| embed | `--limit` |
| cluster | `--threshold` (0.80), `--min-size` (1), `--max-cluster-size` (40), `--k` (16), `--cross-kind-threshold` (0.93) |
|
||||
|
||||
`--include-code` is accepted but currently a no-op.
|
||||
|
||||
### JSON output
|
||||
|
||||
```bash
|
||||
gitcrawl refresh owner/repo --json
|
||||
```
|
||||
|
||||
```json
{
  "repository": "owner/repo",
  "sync": { "selected": 124, "inserted": 12, "updated": 9, "run_id": 42 },
  "embed": { "selected": 21, "embedded": 21, "skipped": 0, "failed": 0, "model": "text-embedding-3-small", "run_id": 43 },
  "cluster": {
    "threshold": 0.8, "cross_kind": 0.93, "min_size": 1, "max_size": 40, "k": 16,
    "vector_count": 312, "edge_count": 1042, "cluster_count": 87, "member_count": 312, "run_id": 44
  }
}
```
|
||||
|
||||
Each stage object mirrors the JSON shape of the standalone command. You can read the per-stage `run_id` later via `gitcrawl runs --kind sync|embedding|cluster`.
|
||||
|
||||
## embed
|
||||
|
||||
```bash
|
||||
gitcrawl embed owner/repo
|
||||
```
|
||||
|
||||
Generates OpenAI embeddings for any thread whose document hash has changed since its last embedding. Works through the database in batches (default size 64) with bounded concurrency (default 2).
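
To estimate how much work a run involves, divide the selected rows by the batch size and round up. A small sketch of that arithmetic; `batch_count` is a hypothetical helper, and treating each batch as one upstream request is an assumption the source does not state.

```bash
# Number of batches needed for ROWS rows at BATCH_SIZE rows per batch (default 64).
batch_count() {
  local rows="$1" size="${2:-64}"
  # Integer ceiling division: (rows + size - 1) / size.
  echo $(( (rows + size - 1) / size ))
}
```

With the defaults, 312 threads split into 5 batches, of which at most 2 are in flight at a time.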
|
||||
|
||||
### Flags
|
||||
|
||||
| Flag | Default | Description |
| --- | --- | --- |
| `--number <n>` | _(any)_ | Embed a single issue/PR by number |
| `--limit <n>` | _(no limit)_ | Maximum rows to embed in this run |
| `--force` | _(off)_ | Re-embed every selected row, ignoring content hash |
| `--include-closed` | _(off)_ | Include closed threads |
|
||||
|
||||
### When to `--force`
|
||||
|
||||
You should rarely need it. The pipeline auto-forces a rebuild when:
|
||||
|
||||
- The configured embedding model changes (`GITCRAWL_EMBED_MODEL` or `embed_model` in config)
|
||||
- The embedding input rune cap changes (so older, larger-cap vectors are not silently mixed in)
|
||||
|
||||
Use `--force` yourself only if you have hand-edited vectors or want to confirm that an output is reproducible from scratch.
|
||||
|
||||
### Failure handling
|
||||
|
||||
OpenAI errors are retried with backoff unless `GITCRAWL_OPENAI_RETRY_DISABLED=1` is set. The JSON output includes a `failures` array with batch-level diagnostics (`batch_start`, `batch_end`, `attempts`, `status`, `type`, `code`, `message`) so partial failures do not silently drop rows.
|
||||
|
||||
Oversized inputs are capped before being sent upstream so a single huge body cannot exceed the model's input limit.
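Capping by runes (Unicode code points) rather than bytes matters because a byte cut can split a multi-byte character. A minimal sketch of the idea follows; the function name and the cap of 10 are hypothetical, and the real cap used by gitcrawl is not stated here.

```python
def cap_runes(text: str, max_runes: int) -> str:
    """Truncate to at most `max_runes` Unicode code points (runes in Go terms)."""
    if max_runes < 0:
        raise ValueError("max_runes must be non-negative")
    # Python strings index by code point, so slicing never splits a character.
    return text[:max_runes]


body = "na\u00efve r\u00e9sum\u00e9 \U0001F680" * 3
capped = cap_runes(body, 10)
# The code point count is bounded by the cap; the UTF-8 byte count can be larger.
print(len(capped), len(capped.encode("utf-8")))
```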
|
||||
|
||||
### JSON output
|
||||
|
||||
```json
|
||||
{
|
||||
"repository": "owner/repo",
|
||||
"model": "text-embedding-3-small",
|
||||
"basis": "title_original",
|
||||
"selected": 21,
|
||||
"embedded": 20,
|
||||
"skipped": 0,
|
||||
"failed": 1,
|
||||
"retries": 3,
|
||||
"status": "ok",
|
||||
"failures": [
|
||||
{ "batch_start": 16, "batch_end": 17, "attempts": 3, "status": 429, "type": "rate_limit", "code": "rate_limit_exceeded", "message": "..." }
|
||||
],
|
||||
"run_id": 43
|
||||
}
|
||||
```
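A caller can use the `failures` array to decide whether another `embed` run is worthwhile. The sketch below is an assumption-laden illustration: treating HTTP 429 and 5xx as transient is this example's choice, not documented gitcrawl behavior, and `is_transient` and `summarize` are hypothetical helpers.

```python
import json

# Trimmed from the embed example in this section.
EMBED_JSON = """
{
  "selected": 21, "embedded": 20, "skipped": 0, "failed": 1, "retries": 3,
  "failures": [
    {"batch_start": 16, "batch_end": 17, "attempts": 3, "status": 429,
     "type": "rate_limit", "code": "rate_limit_exceeded", "message": "..."}
  ]
}
"""


def is_transient(failure: dict) -> bool:
    """Treat HTTP 429 and 5xx as worth retrying later (an assumption of this sketch)."""
    status = failure.get("status", 0)
    return status == 429 or 500 <= status < 600


def summarize(result: dict) -> dict:
    """Split batch failures into transient and permanent groups."""
    failures = result.get("failures", [])
    return {
        "complete": result.get("failed", 0) == 0,
        "transient": [f for f in failures if is_transient(f)],
        "permanent": [f for f in failures if not is_transient(f)],
    }


summary = summarize(json.loads(EMBED_JSON))
if summary["transient"]:
    print("transient failures, consider rerunning embed:", len(summary["transient"]))
```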
|
||||
|
||||
## runs
|
||||
|
||||
Inspect what `refresh`, `sync`, `embed`, or `cluster` actually did:
|
||||
|
||||
```bash
|
||||
gitcrawl runs owner/repo --kind sync # default kind
|
||||
gitcrawl runs owner/repo --kind embedding
|
||||
gitcrawl runs owner/repo --kind cluster
|
||||
```
|
||||
|
||||
Each row carries `started_at`, `finished_at`, `status`, and `stats_json` — useful when an agent needs to know whether a sync is fresh enough or whether the last cluster pass converged.
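For the freshness question, an agent only needs the newest successful row and the current time. The following sketch assumes the rows arrive as a list of dictionaries with the four documented fields and that a successful run has status `"ok"`; both the container shape and that status value are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_ts(value: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2026-05-05T07:30:43Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def latest_success_age(rows: list, now: datetime, ok_status: str = "ok") -> Optional[timedelta]:
    """Age of the newest finished run whose status equals `ok_status`, or None."""
    finished = [
        parse_ts(r["finished_at"])
        for r in rows
        if r.get("status") == ok_status and r.get("finished_at")
    ]
    return now - max(finished) if finished else None


rows = [
    {"started_at": "2026-05-05T07:30:11Z", "finished_at": "2026-05-05T07:30:43Z", "status": "ok"},
    {"started_at": "2026-05-05T08:00:00Z", "finished_at": "2026-05-05T08:00:05Z", "status": "error"},
]
now = datetime(2026, 5, 5, 7, 40, 43, tzinfo=timezone.utc)
age = latest_success_age(rows, now)
print(age)  # 0:10:00
```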
|
||||
|
||||
## Cost notes
|
||||
|
||||
- **GitHub.** Sync uses standard REST endpoints; the API quota is the dominant cost on busy repos. Use `--include-comments` and `--with pr-details` selectively.
|
||||
- **OpenAI.** `text-embedding-3-small` is inexpensive but not free. `embed` is bounded by `--limit` if you want to stay under a budget on initial backfills.
|
||||
- **Disk.** Vectors and PR detail blobs grow with the repo. The portable-store flow includes `gitcrawl portable prune` to keep published payloads small — see [Portable stores](./portable-stores).
|
||||
127
docs/search.md
Normal file
@ -0,0 +1,127 @@
|
||||
---
|
||||
title: Search
|
||||
nav_order: 8
|
||||
permalink: /search/
|
||||
---
|
||||
|
||||
# Search
|
||||
{: .no_toc }
|
||||
|
||||
Local full-text and semantic search over the SQLite mirror, plus a `gh search`-compatible surface for scripts.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Why local search
|
||||
|
||||
`gitcrawl search` runs against the local SQLite cache and the local vector store. It does not consume GitHub REST search quota and it returns deterministically ordered hits with full thread metadata. It is intended for **discovery**, not for write actions — use `gh` for the final live verification before commenting, closing, labeling, or merging.
|
||||
|
||||
## Direct mode
|
||||
|
||||
```bash
|
||||
gitcrawl search owner/repo --query "download stalls"
|
||||
gitcrawl search owner/repo --query "manifest cache" --mode hybrid --limit 30 --json
|
||||
```
|
||||
|
||||
| Flag | Default | Description |
| --- | --- | --- |
| `--query <text>` | _(required)_ | Search text |
| `--mode keyword\|semantic\|hybrid` | `keyword` | `keyword` uses SQLite FTS, `semantic` uses vector cosine, `hybrid` blends them |
| `--limit <n>` | _(implementation default)_ | Maximum hits |
|
||||
|
||||
**Hybrid mode** is the most robust choice, although `keyword` is the default. It blends full-text recall with semantic neighbors so typos, synonyms, and stack-trace fragments still surface relevant rows.
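How gitcrawl blends the two result lists is not specified here. One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion, sketched below purely as an illustration of the idea; it is not presented as gitcrawl's actual scoring, and the function name and `k = 60` constant are this example's choices.

```python
def reciprocal_rank_fusion(rankings: list, k: int = 60) -> list:
    """Fuse ranked lists of thread numbers; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, number in enumerate(ranking, start=1):
            scores[number] = scores.get(number, 0.0) + 1.0 / (k + rank)
    # Sort by score descending, then by number for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


keyword_hits = [123, 456, 789]   # e.g. ordered by full-text relevance
semantic_hits = [456, 999, 123]  # e.g. ordered by cosine similarity
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print([number for number, _ in fused])  # [456, 123, 999, 789]
```

With an empty semantic list the fused order equals the keyword order, which is one way a hybrid ranker can fall back to keyword behavior when no vectors exist.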
|
||||
|
||||
JSON output:
|
||||
|
||||
```json
|
||||
{
|
||||
"repository": "owner/repo",
|
||||
"query": "download stalls",
|
||||
"mode": "hybrid",
|
||||
"hits": [
|
||||
{ "number": 123, "kind": "issue", "title": "...", "score": 0.81, "url": "...", "updated_at": "..." }
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
## `gh search` compatibility mode
|
||||
|
||||
The same command also accepts the `gh search` shape so scripts that already speak `gh` work without rewriting:
|
||||
|
||||
```bash
|
||||
gitcrawl search issues "download stalls" \
|
||||
-R owner/repo \
|
||||
--state open \
|
||||
--json number,title,state,url,updatedAt,labels \
|
||||
--limit 30
|
||||
|
||||
gitcrawl search prs "manifest cache" \
|
||||
-R owner/repo \
|
||||
--state open \
|
||||
--json number,title,state,url,updatedAt,isDraft,author \
|
||||
--limit 20
|
||||
```
|
||||
|
||||
Recognized flags in this mode:
|
||||
|
||||
| Flag | Description |
| --- | --- |
| `-R` / `--repo` | Target repository (also reads `GH_REPO`) |
| `--state open\|closed\|all` | Issue or PR state filter |
| `--json` | Comma-separated field list (gh-compatible) |
| `--limit` / `-L` | Maximum rows |
| `--match` | Accepted for parity; the local FTS index already covers documents |
| `--sort` / `--order` | Accepted for parity |
| `--sync-if-stale <duration>` | Run one metadata sync first if the local mirror is older than the duration |
|
||||
|
||||
The output shape matches `gh search issues|prs --json ...` exactly so you can pipe into the same `jq` filters you already have.
|
||||
|
||||
## `--sync-if-stale`
|
||||
|
||||
```bash
|
||||
gitcrawl search issues "hot loop" \
|
||||
-R owner/repo \
|
||||
--state open \
|
||||
--sync-if-stale 5m \
|
||||
--json number,title,url
|
||||
```
|
||||
|
||||
If the most recent successful sync for this repo is older than `5m`, gitcrawl runs one metadata sync first and then answers the search from the freshly populated cache. The search result still comes from SQLite; GitHub is contacted only when the staleness check finds the mirror too old.
|
||||
|
||||
This is the right pattern for agents: keep latency predictable on cache hits, and bound the staleness window for everything else.
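The decision behind `--sync-if-stale` reduces to comparing the age of the last successful sync with a duration. A minimal sketch follows, assuming durations use integer `h`/`m`/`s` parts such as `5m` or `1h30m`; the exact duration grammar gitcrawl accepts is not documented here, and `parse_duration` and `needs_sync` are hypothetical names.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

_UNITS = {"h": 3600, "m": 60, "s": 1}


def parse_duration(text: str) -> timedelta:
    """Parse strings like '5m', '90s', or '1h30m' (integer h/m/s parts only)."""
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input with leftover characters that the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=sum(int(n) * _UNITS[u] for n, u in parts))


def needs_sync(last_success: Optional[datetime], max_age: str, now: datetime) -> bool:
    """True when there is no successful sync yet or it is older than max_age."""
    return last_success is None or now - last_success > parse_duration(max_age)


now = datetime(2026, 5, 5, 8, 0, 0, tzinfo=timezone.utc)
print(needs_sync(now - timedelta(minutes=3), "5m", now))   # False
print(needs_sync(now - timedelta(minutes=12), "5m", now))  # True
```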
|
||||
|
||||
## Search vs. the `gh` shim
|
||||
|
||||
There are two ways to run cached searches:
|
||||
|
||||
| Command | Best for |
| --- | --- |
| `gitcrawl search issues\|prs ...` | Human use; mixes naturally with the rest of the gitcrawl CLI |
| `gitcrawl gh search issues\|prs ...` | Agents and scripts that call `gh` directly. Symlinked as `gh` or `gitcrawl-gh`, it is invisible to callers |
|
||||
|
||||
Both paths share the same local cache and produce gh-shaped JSON. The shim adds the additional `gh issue/pr view`, `gh issue/pr list`, `gh pr checks`, `gh run`, and `xcache` surface — see [gh shim](./gh-shim).
|
||||
|
||||
## Combining with sync
|
||||
|
||||
A common discovery pattern:
|
||||
|
||||
```bash
|
||||
# 1. Find candidates locally.
|
||||
NUMS=$(gitcrawl search issues "download stalls" -R owner/repo \
|
||||
--json number --limit 20 \
|
||||
| jq -r '[.[].number] | join(",")')
|
||||
|
||||
# 2. Hydrate them with comments + PR detail in one round-trip.
|
||||
gitcrawl sync owner/repo --numbers "$NUMS" --include-comments --with pr-details
|
||||
|
||||
# 3. Re-query with full conversational context (or open in TUI).
|
||||
gitcrawl tui owner/repo
|
||||
```
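If the search returns no hits, the `jq` pipeline above yields an empty string for `--numbers`. A script may prefer to guard against that case. The sketch below does the same join in Python with de-duplication; `numbers_arg` is a hypothetical helper, and the command list is only printed, not executed.

```python
import json


def numbers_arg(search_json: str) -> str:
    """Turn gh-shaped search output into a value for `--numbers`, deduplicated in order."""
    seen = {}
    for row in json.loads(search_json):
        seen.setdefault(int(row["number"]), None)
    return ",".join(str(n) for n in seen)


sample = '[{"number": 123}, {"number": 456}, {"number": 123}]'
nums = numbers_arg(sample)
if nums:  # skip the sync entirely when the search found nothing
    print(["gitcrawl", "sync", "owner/repo", "--numbers", nums, "--include-comments"])
```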
|
||||
|
||||
## Limits
|
||||
|
||||
- The keyword index covers titles, bodies, and (when synced) comments and review comments.
|
||||
- Semantic search relies on the local vector store. Run `gitcrawl embed` first.
|
||||
- Hybrid mode degrades gracefully: with no vectors, it behaves like keyword.
|
||||
- Closed threads are included in the FTS index once synced; locally closed threads are filtered out by the `--hide-closed` flag where applicable.
|
||||
155
docs/sync.md
Normal file
@ -0,0 +1,155 @@
|
||||
---
|
||||
title: Sync
|
||||
nav_order: 6
|
||||
permalink: /sync/
|
||||
---
|
||||
|
||||
# Sync
|
||||
{: .no_toc }
|
||||
|
||||
Bring GitHub issues and pull requests into local SQLite. Idempotent, incremental, and tunable per workflow.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## The default
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo
|
||||
```
|
||||
|
||||
This fetches **open** issues and pull requests for the repository. To keep local state from rotting, an incremental sync also sweeps recently closed items so that issues and PRs closed between runs are reflected locally.
|
||||
|
||||
A sync writes:
|
||||
|
||||
- `repositories` — repo metadata
|
||||
- `threads` — issues and PRs (titles, bodies, authors, labels, state, timestamps)
|
||||
- `documents` — canonical thread documents (when bodies change)
|
||||
- `run_records` — sync run statistics
|
||||
|
||||
## State filters
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --state open # default
|
||||
gitcrawl sync owner/repo --state closed # only closed
|
||||
gitcrawl sync owner/repo --state all # full backfill
|
||||
```
|
||||
|
||||
`--state all` is the right choice for a one-shot historical backfill on a new repository. After that, the default `--state open` (with its closed sweep) is enough for ongoing freshness.
|
||||
|
||||
## Time-windowed sync
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --since 2026-04-01T00:00:00Z
|
||||
```
|
||||
|
||||
`--since` accepts an RFC 3339 timestamp and limits the GitHub query to threads updated after that point. Combine with `--state` to scope tightly:
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --state all --since 2026-04-01T00:00:00Z
|
||||
```
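Scheduled jobs usually want a relative window ("the last 7 days") rather than a fixed date. A small sketch of producing an RFC 3339 value for `--since`; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def since_days_ago(days: int, now: datetime) -> str:
    """Format `now - days` as an RFC 3339 UTC timestamp with a trailing Z."""
    moment = now.astimezone(timezone.utc) - timedelta(days=days)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


now = datetime(2026, 4, 8, 0, 0, 0, tzinfo=timezone.utc)
print(since_days_ago(7, now))  # 2026-04-01T00:00:00Z
```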
|
||||
|
||||
## Exact rows
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --numbers 123,456 --include-comments
|
||||
```
|
||||
|
||||
`--numbers` is the safest way to refresh specific issues or PRs — it bypasses list ordering and the updated-time window, fetching exactly the rows you ask for. Pair it with `--include-comments` and/or `--include-pr-details` to hydrate the conversation and PR-only data at the same time.
|
||||
|
||||
This is also what the `gh` shim uses internally for [auto-hydration](./gh-shim#auto-hydration).
|
||||
|
||||
## Hydration depth
|
||||
|
||||
| Flag | What it adds |
| --- | --- |
| `--include-comments` | Issue comments, PR review comments, reviews |
| `--include-pr-details` | PR files, commits, status checks, workflow runs |
| `--with pr-details` | Same as `--include-pr-details` (gh-style flag) |
|
||||
|
||||
PR details land in `pr_files`, `pr_commits`, `pr_checks`, and `pr_runs` tables and back the `gh pr view`, `gh pr checks`, and `gh run list/view` shim paths. See [gh shim](./gh-shim).
|
||||
|
||||
`--include-code` is accepted for compatibility but is currently a no-op.
|
||||
|
||||
## Limit and pagination
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --limit 200
|
||||
```
|
||||
|
||||
`--limit` caps the number of rows fetched in this invocation. The underlying GitHub paginator surfaces total page counts in run records and honors GitHub's `Retry-After` and rate-limit response headers, so partial syncs interrupted by rate limiting resume cleanly.
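To make the header handling concrete, here is a sketch of how a client can turn GitHub's rate-limit headers into a pause. The header names (`Retry-After`, `X-RateLimit-Remaining`, `X-RateLimit-Reset`) are the ones GitHub documents; how gitcrawl's paginator actually combines them is an assumption, and `wait_seconds` is a hypothetical helper.

```python
def wait_seconds(headers: dict, now_epoch: float) -> float:
    """Seconds to pause before the next request, based on GitHub response headers."""
    lowered = {k.lower(): v for k, v in headers.items()}
    # Retry-After carries a number of seconds on GitHub rate-limit responses.
    if "retry-after" in lowered:
        return max(0.0, float(lowered["retry-after"]))
    # With no remaining quota, wait until the reset time (UTC epoch seconds).
    if lowered.get("x-ratelimit-remaining") == "0" and "x-ratelimit-reset" in lowered:
        return max(0.0, float(lowered["x-ratelimit-reset"]) - now_epoch)
    return 0.0


print(wait_seconds({"Retry-After": "30"}, now_epoch=1000.0))  # 30.0
print(wait_seconds({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1060"}, now_epoch=1000.0))  # 60.0
```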
|
||||
|
||||
## JSON output
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --json
|
||||
```
|
||||
|
||||
```json
|
||||
{
|
||||
"repository": "owner/repo",
|
||||
"state": "open",
|
||||
"since": "",
|
||||
"selected": 124,
|
||||
"inserted": 12,
|
||||
"updated": 9,
|
||||
"deleted": 0,
|
||||
"comments_inserted": 0,
|
||||
"comments_updated": 0,
|
||||
"reviews_inserted": 0,
|
||||
"pr_files_inserted": 0,
|
||||
"pr_commits_inserted": 0,
|
||||
"run_id": 42,
|
||||
"started_at": "2026-05-05T07:30:11Z",
|
||||
"finished_at": "2026-05-05T07:30:43Z"
|
||||
}
|
||||
```
|
||||
|
||||
## Common workflows
|
||||
|
||||
### First-time setup for a repo
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --state all --include-comments
|
||||
gitcrawl embed owner/repo
|
||||
gitcrawl cluster owner/repo
|
||||
```
|
||||
|
||||
Or in one step:
|
||||
|
||||
```bash
|
||||
gitcrawl refresh owner/repo --state all --include-comments
|
||||
```
|
||||
|
||||
### Periodic incremental refresh
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo
|
||||
```
|
||||
|
||||
The closed sweep keeps the open list honest without paying for a full backfill.
|
||||
|
||||
### Pull a specific issue + comments + PR detail
|
||||
|
||||
```bash
|
||||
gitcrawl sync owner/repo --numbers 123 --include-comments --with pr-details
|
||||
```
|
||||
|
||||
### Refresh a batch you got from search
|
||||
|
||||
```bash
|
||||
NUMS=$(gitcrawl gh search issues "manifest cache" -R owner/repo --json number --limit 20 \
|
||||
| jq -r '[.[].number] | join(",")')
|
||||
gitcrawl sync owner/repo --numbers "$NUMS" --with pr-details
|
||||
```
|
||||
|
||||
## Required credentials
|
||||
|
||||
`sync` requires a GitHub token. gitcrawl resolves it from `GITHUB_TOKEN`, the `[env]` table in `config.toml`, or from `gh auth token` if the real `gh` CLI is installed and authenticated. `gitcrawl doctor` reports the source.
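The lookup can be pictured as a fallback chain over the three sources. This sketch tries them in the order they are listed above, which is an assumption, since the precedence is not stated. The `GITHUB_TOKEN` key inside the `[env]` table and the `resolve_token` helper are also hypothetical.

```python
from typing import Callable, Mapping, Optional, Tuple


def resolve_token(
    environ: Mapping[str, str],
    config_env: Mapping[str, str],
    gh_auth_token: Callable[[], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (token, source), trying the three documented sources in listed order."""
    if environ.get("GITHUB_TOKEN"):
        return environ["GITHUB_TOKEN"], "GITHUB_TOKEN"
    # `config_env` stands in for the parsed [env] table of config.toml.
    if config_env.get("GITHUB_TOKEN"):
        return config_env["GITHUB_TOKEN"], "config.toml [env]"
    # `gh_auth_token` stands in for running `gh auth token`; None means unavailable.
    token = gh_auth_token()
    if token:
        return token, "gh auth token"
    return None, "none"


token, source = resolve_token({}, {"GITHUB_TOKEN": "from-config"}, lambda: None)
print(source)  # config.toml [env]
```

Returning the source alongside the token mirrors what `gitcrawl doctor` is described as reporting.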
|
||||
|
||||
## See also
|
||||
|
||||
- [Refresh and embed](./refresh-and-embed) — the wrapper that runs sync, embed, and cluster end to end
|
||||
- [gh shim](./gh-shim) — how synced PR details power `gh pr view` / `gh pr checks` / `gh run` from local cache
|
||||
- [Portable stores](./portable-stores) — sharing the synced cache across machines
|
||||
113
docs/tui.md
Normal file
@ -0,0 +1,113 @@
|
||||
---
|
||||
title: TUI
|
||||
nav_order: 11
|
||||
permalink: /tui/
|
||||
---
|
||||
|
||||
# TUI
|
||||
{: .no_toc }
|
||||
|
||||
`gitcrawl tui` is the interactive cluster browser. Keyboard-first, mouse-friendly, refreshes from local SQLite every 15 seconds.
|
||||
{: .fs-6 .fw-300 }
|
||||
|
||||
1. TOC
|
||||
{:toc}
|
||||
|
||||
## Launching
|
||||
|
||||
```bash
|
||||
gitcrawl tui owner/repo
|
||||
gitcrawl tui # infers the most recently updated local repo
|
||||
gitcrawl tui --min-size 5 # default; show clusters with ≥5 active members
|
||||
gitcrawl tui --sort recent # alternate sort
|
||||
gitcrawl tui --hide-closed # focus only on currently open clusters
|
||||
```
|
||||
|
||||
| Flag | Default | Description |
| --- | --- | --- |
| `--min-size <n>` | `5` | Minimum active member count |
| `--sort recent\|oldest\|size` | `size` | Cluster ordering |
| `--limit <n>` | `500` | Working-set cap (rows fetched into the TUI) |
| `--hide-closed` | _(off)_ | Hide locally closed clusters |
| `--include-closed` | _(deprecated)_ | Closed clusters are included by default |
| `--json` | _(off)_ | Emit a non-interactive JSON snapshot instead of launching the UI |
|
||||
|
||||
When `--json` is passed, the TUI command produces the same cluster summary the interactive view would render — useful for CI checks or for an agent that wants the same view a human would see.
|
||||
|
||||
## Default behavior
|
||||
|
||||
The TUI starts at `--min-size 5` and `--sort size` so the first screen is the useful triage workload, not singleton noise. Pass `--min-size 1` when you intentionally want singletons (e.g., looking for orphans).
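The effect of these defaults can be sketched as a filter followed by a sort. The cluster dictionaries, the field names `active_members` and `updated_at` (an epoch number here), and the helper itself are hypothetical; this is not the TUI's implementation.

```python
def visible_clusters(clusters: list, min_size: int = 5, sort: str = "size") -> list:
    """Apply the documented defaults: drop small clusters, largest first."""
    keys = {
        "size": lambda c: -c["active_members"],
        "recent": lambda c: -c["updated_at"],
        "oldest": lambda c: c["updated_at"],
    }
    kept = [c for c in clusters if c["active_members"] >= min_size]
    return sorted(kept, key=keys[sort])


clusters = [
    {"id": "a", "active_members": 1, "updated_at": 300},
    {"id": "b", "active_members": 7, "updated_at": 100},
    {"id": "c", "active_members": 12, "updated_at": 200},
]
print([c["id"] for c in visible_clusters(clusters)])              # ['c', 'b']
print([c["id"] for c in visible_clusters(clusters, min_size=1)])  # ['c', 'b', 'a']
```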
|
||||
|
||||
The view auto-refreshes from the local store every 15 seconds. There is no GitHub call from the TUI itself — to bring in fresh upstream data, run `gitcrawl sync` (or `refresh`) in another terminal and the TUI picks it up on the next tick.
|
||||
|
||||
## Keyboard
|
||||
|
||||
| Key | Action |
| --- | --- |
| `↑` / `↓` | Move within the active pane |
| `Tab` / `Shift+Tab` | Switch panes |
| `Enter` | Open detail for selected cluster or member; on a member, loads neighbors first |
| `a` | Open the action menu (cluster or member, depending on focus) |
| `#` | Jump to a specific issue or PR number |
| `n` | Load neighbors for the selected issue or PR |
| `p` | Switch between repositories already present in the local store |
| `s` | Cycle sort mode (`size` ↔ `recent` ↔ `oldest`, both directions) |
| `/` | Filter rows by substring |
| `q` | Quit |
|
||||
|
||||
The action menu opened with `a` mirrors the right-click menu, so every mouse action has a keyboard equivalent.
|
||||
|
||||
## Mouse
|
||||
|
||||
Mouse support is built in and works in most modern terminals (iTerm2, Kitty, Alacritty, WezTerm, recent macOS Terminal):
|
||||
|
||||
- **Click** a row to select it
|
||||
- **Double-click** to open detail
|
||||
- **Wheel** scrolls the focused pane
|
||||
- **Right-click** opens the cluster or member action menu
|
||||
- **Trackpad scroll** is buffered to avoid jumpy redraws
|
||||
|
||||
If your terminal does not pass through mouse events, all actions remain available via keyboard.
|
||||
|
||||
## Action menu
|
||||
|
||||
Cluster actions:
|
||||
|
||||
- Copy issue/PR URL or number
|
||||
- Sort cluster members
|
||||
- Filter to a member subset
|
||||
- Jump to a referenced issue or PR
|
||||
- Open canonical thread on GitHub
|
||||
- Load neighbors for the canonical
|
||||
- Local close / reopen
|
||||
- Set canonical member
|
||||
- Exclude / include member
|
||||
|
||||
Member actions:
|
||||
|
||||
- Copy URL / number
|
||||
- Load neighbors
|
||||
- Open on GitHub
|
||||
- Local close / reopen this thread
|
||||
- Exclude from cluster
|
||||
|
||||
These map directly onto the [governance](./governance) commands. Anything you can do interactively, you can also script.
|
||||
|
||||
## Display rules
|
||||
|
||||
`gitcrawl clusters` and the TUI use the same display rules:
|
||||
|
||||
- Clusters from the latest raw run come first
|
||||
- Closed durable rows merged in as historical context
|
||||
- Default sort is `size` (largest active membership first)
|
||||
- GitHub-closed members are hidden from the latest-run view; pass `--include-closed` to see the full historical cluster
|
||||
|
||||
For an audit-style view that does not merge with the latest run, use `gitcrawl durable-clusters --include-closed`.
|
||||
|
||||
## Tips
|
||||
|
||||
- Resize your terminal — the panes reflow.
|
||||
- A single repo with thousands of threads is fine; the working set is capped at 500 rows so the UI stays snappy.
|
||||
- Run `gitcrawl refresh owner/repo` periodically in a sibling terminal; the TUI reflects new data on the next 15s tick.
|
||||
- If the cluster you are looking for is missing, check `--min-size` and `--hide-closed`.
|
||||
- The status bar at the bottom shows the active sort, filter, repo, and any warnings (e.g., "vector model mismatch — re-run embed").
|
||||