6.3 KiB
| title | nav_order | permalink |
|---|---|---|
| Automation | 14 | /automation/ |
Automation
{: .no_toc }
Stable JSON contracts, agent recipes, and patterns for keeping the local mirror warm without manual ceremony. {: .fs-6 .fw-300 }
- TOC {:toc}
JSON output is first class
Every command supports --json (or the global --format json). The resulting payload is pretty-printed with stable field names so you can pipe it directly into jq or feed it to an agent as structured context.
gitcrawl sync owner/repo --json | jq '{run_id, inserted, updated}'
gitcrawl clusters owner/repo --json --sort size --min-size 5 \
| jq '.clusters[] | {id, members: .member_count, latest: .latest_thread_number}'
For the full per-command JSON shapes, see the individual feature pages and the Commands reference.
Exit codes
0— success- non-zero — usage error, command not implemented, runtime error
Stderr always carries a human-readable error message; stdout is reserved for the requested output (text or JSON) so you can pipe stdout to jq without losing diagnostics.
Keeping the mirror fresh
Three patterns, in increasing order of automation:
On-demand staleness check
Use --sync-if-stale on gitcrawl search (or the gh-shim's search):
gitcrawl search issues "manifest cache" \
-R owner/repo \
--sync-if-stale 5m \
--json number,title,url
Best for ad-hoc agent tools that should bound staleness but minimize sync calls.
Auto-hydration via the gh shim
Symlink the gitcrawl binary as gh (or gitcrawl-gh) and let the shim pull a single PR's detail when an agent calls gh pr view or gh pr checks against an unhydrated PR. See gh shim → auto-hydration.
This is the lowest-overhead pattern for fleets of agents — no scheduling required.
Periodic background refresh
Run gitcrawl refresh owner/repo on a cron, systemd timer, or launchd agent every few minutes per repo. Combine with the gh shim and your agents almost never have to wait on GitHub.
# Every 5 minutes, refresh the active repos.
*/5 * * * * $HOME/bin/gitcrawl refresh openclaw/gitcrawl --json > /tmp/gitcrawl.openclaw.json 2>&1
For multiple repos, loop in a small shell script — gitcrawl is happy to run sequentially against a shared SQLite file.
Agent recipes
"Look up an issue without burning quota"
gh issue view 123 -R owner/repo --json number,title,state,body,labels,author
With the shim symlinked as gh, this answers from local SQLite if the issue is cached. Auto-hydration covers PR-detail fields. The agent prompt does not change.
"Find candidates, hydrate them, summarize"
NUMS=$(gh search issues "checksum mismatch" -R owner/repo \
--json number --limit 30 \
| jq -r '[.[].number] | join(",")')
gitcrawl sync owner/repo --numbers "$NUMS" --include-comments --with pr-details
gitcrawl cluster-detail owner/repo --id "$(gitcrawl clusters owner/repo --json \
| jq '.clusters[0].id')"
Search is local; the targeted sync brings exactly the rows you need; cluster-detail returns the structured triage view.
"Find duplicates around a new bug report"
NUM=789
gitcrawl sync owner/repo --numbers "$NUM" --include-comments
gitcrawl embed owner/repo --number "$NUM"
gitcrawl neighbors owner/repo --number "$NUM" --limit 10 --json
"Triage a cluster end to end"
ID=42
# Read.
gitcrawl cluster-detail owner/repo --id "$ID" --body-chars 600 --json
# Decide canonical, then close locally.
gitcrawl set-cluster-canonical owner/repo --id "$ID" --number 123
gitcrawl close-cluster owner/repo --id "$ID" --reason "consolidated under #123"
# Comment upstream via real gh.
gh issue comment 456 -R owner/repo --body "Duplicate of #123"
"Prove the shim is paying off"
# Periodically log cache stats — watch local_hits climb relative to backend_misses.
gitcrawl gh xcache stats --json \
| jq '{local: .counters.local_hits, fallback: .counters.fallback_hits, github: .counters.backend_misses}'
# During release/debug sessions, compare a recent window or snapshot before reset.
gitcrawl gh xcache stats --since 1h --json
gitcrawl gh xcache snapshot --reset --json
Multi-repo automation
A single gitcrawl.db can hold many repositories. Loop in shell:
for repo in openclaw/gitcrawl steipete/repo-a octocat/repo-b; do
gitcrawl refresh "$repo" --json | jq '{repo: "'"$repo"'", sync: .sync, embed: .embed}'
done
Or maintain a small script that reads a list of repos from a file and runs them on a schedule.
Output formats
| Format | When to use |
|---|---|
text (default) |
Humans at a terminal |
json (or --json) |
Pipelines, scripts, agents |
log |
Internal logging output; structured key/value pairs |
Force a format globally with --format json or per-command with --json. The log format is mostly used internally and is subject to change.
CI integration
Run gitcrawl in CI to validate a portable store's freshness, sanity-check cluster shapes, or produce a triage report:
- name: Refresh and snapshot clusters
run: |
gitcrawl init --portable-store $PORTABLE_STORE_URL
gitcrawl refresh openclaw/gitcrawl --json > sync.json
gitcrawl clusters openclaw/gitcrawl --json --sort size --min-size 5 > clusters.json
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
- uses: actions/upload-artifact@v4
with: { name: triage, path: "*.json" }
The artifact gives reviewers a structured view of what changed and how the cluster graph looks today.
Best practices
- Set both tokens in a single place. Either env or
[env]inconfig.toml. Mixing sources tends to confusedoctorreports. - Bound the staleness window.
--sync-if-staleon every agent-driven search is cheaper than a hot cron loop. - Monitor
xcache stats. Ifbackend_missesdwarfslocal_hits, you are not yet getting the shim's benefit — usually means agents are callingghdirectly without going through the symlink. - Re-cluster after a backfill. A large
--state allsync should be followed bygitcrawl refresh --no-sync(or justgitcrawl embed && gitcrawl cluster) so the durable graph reflects the new content. - Pin the
ghbinary. SetGITCRAWL_GH_PATHexplicitly so the shim does not accidentally invoke itself.