Peter Steinberger e5621d1b78

feat: improve gh shim cache observability

2026-05-05 21:23:39 +01:00

6.3 KiB

Raw Permalink Blame History

title	nav_order	permalink
Automation	14	/automation/

Automation

{: .no_toc }

Stable JSON contracts, agent recipes, and patterns for keeping the local mirror warm without manual ceremony. {: .fs-6 .fw-300 }

TOC {:toc}

JSON output is first class

Every command supports --json (or the global --format json). The resulting payload is pretty-printed with stable field names so you can pipe it directly into jq or feed it to an agent as structured context.

gitcrawl sync owner/repo --json | jq '{run_id, inserted, updated}'
gitcrawl clusters owner/repo --json --sort size --min-size 5 \
  | jq '.clusters[] | {id, members: .member_count, latest: .latest_thread_number}'

For the full per-command JSON shapes, see the individual feature pages and the Commands reference.

Exit codes

0 — success
non-zero — usage error, command not implemented, runtime error

Stderr always carries a human-readable error message; stdout is reserved for the requested output (text or JSON) so you can pipe stdout to jq without losing diagnostics.

Keeping the mirror fresh

Three patterns, in increasing order of automation:

On-demand staleness check

Use --sync-if-stale on gitcrawl search (or the gh-shim's search):

gitcrawl search issues "manifest cache" \
  -R owner/repo \
  --sync-if-stale 5m \
  --json number,title,url

Best for ad-hoc agent tools that should bound staleness but minimize sync calls.

Auto-hydration via the gh shim

Symlink the gitcrawl binary as gh (or gitcrawl-gh) and let the shim pull a single PR's detail when an agent calls gh pr view or gh pr checks against an unhydrated PR. See gh shim → auto-hydration.

This is the lowest-overhead pattern for fleets of agents — no scheduling required.

Periodic background refresh

Run gitcrawl refresh owner/repo on a cron, systemd timer, or launchd agent every few minutes per repo. Combine with the gh shim and your agents almost never have to wait on GitHub.

# Every 5 minutes, refresh the active repos.
*/5 * * * * $HOME/bin/gitcrawl refresh openclaw/gitcrawl --json > /tmp/gitcrawl.openclaw.json 2>&1

For multiple repos, loop in a small shell script — gitcrawl is happy to run sequentially against a shared SQLite file.

Agent recipes

"Look up an issue without burning quota"

gh issue view 123 -R owner/repo --json number,title,state,body,labels,author

With the shim symlinked as gh, this answers from local SQLite if the issue is cached. Auto-hydration covers PR-detail fields. The agent prompt does not change.

"Find candidates, hydrate them, summarize"

NUMS=$(gh search issues "checksum mismatch" -R owner/repo \
        --json number --limit 30 \
        | jq -r '[.[].number] | join(",")')

gitcrawl sync owner/repo --numbers "$NUMS" --include-comments --with pr-details

gitcrawl cluster-detail owner/repo --id "$(gitcrawl clusters owner/repo --json \
        | jq '.clusters[0].id')"

Search is local; the targeted sync brings exactly the rows you need; cluster-detail returns the structured triage view.

"Find duplicates around a new bug report"

NUM=789
gitcrawl sync owner/repo --numbers "$NUM" --include-comments
gitcrawl embed owner/repo --number "$NUM"
gitcrawl neighbors owner/repo --number "$NUM" --limit 10 --json

"Triage a cluster end to end"

ID=42

# Read.
gitcrawl cluster-detail owner/repo --id "$ID" --body-chars 600 --json

# Decide canonical, then close locally.
gitcrawl set-cluster-canonical owner/repo --id "$ID" --number 123
gitcrawl close-cluster owner/repo --id "$ID" --reason "consolidated under #123"

# Comment upstream via real gh.
gh issue comment 456 -R owner/repo --body "Duplicate of #123"

"Prove the shim is paying off"

# Periodically log cache stats — watch local_hits climb relative to backend_misses.
gitcrawl gh xcache stats --json \
  | jq '{local: .counters.local_hits, fallback: .counters.fallback_hits, github: .counters.backend_misses}'

# During release/debug sessions, compare a recent window or snapshot before reset.
gitcrawl gh xcache stats --since 1h --json
gitcrawl gh xcache snapshot --reset --json

Multi-repo automation

A single gitcrawl.db can hold many repositories. Loop in shell:

for repo in openclaw/gitcrawl steipete/repo-a octocat/repo-b; do
  gitcrawl refresh "$repo" --json | jq '{repo: "'"$repo"'", sync: .sync, embed: .embed}'
done

Or maintain a small script that reads a list of repos from a file and runs them on a schedule.

Output formats

Format	When to use
`text` (default)	Humans at a terminal
`json` (or `--json`)	Pipelines, scripts, agents
`log`	Internal logging output; structured key/value pairs

Force a format globally with --format json or per-command with --json. The log format is mostly used internally and is subject to change.

CI integration

Run gitcrawl in CI to validate a portable store's freshness, sanity-check cluster shapes, or produce a triage report:

- name: Refresh and snapshot clusters
  run: |
    gitcrawl init --portable-store $PORTABLE_STORE_URL
    gitcrawl refresh openclaw/gitcrawl --json > sync.json
    gitcrawl clusters openclaw/gitcrawl --json --sort size --min-size 5 > clusters.json    
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- uses: actions/upload-artifact@v4
  with: { name: triage, path: "*.json" }

The artifact gives reviewers a structured view of what changed and how the cluster graph looks today.

Best practices

Set both tokens in a single place. Either env or [env] in config.toml. Mixing sources tends to confuse doctor reports.
Bound the staleness window. --sync-if-stale on every agent-driven search is cheaper than a hot cron loop.
Monitor xcache stats. If backend_misses dwarfs local_hits, you are not yet getting the shim's benefit — usually means agents are calling gh directly without going through the symlink.
Re-cluster after a backfill. A large --state all sync should be followed by gitcrawl refresh --no-sync (or just gitcrawl embed && gitcrawl cluster) so the durable graph reflects the new content.
Pin the gh binary. Set GITCRAWL_GH_PATH explicitly so the shim does not accidentally invoke itself.

6.3 KiB Raw Permalink Blame History