feat: add repo metadata sync, wiki/release mirroring, disk alerts, and updated README

mineracks 2026-04-03 00:02:55 +10:00
parent f6b3bde95d
commit 8f1add8346
2 changed files with 153 additions and 29 deletions

129
README.md

@@ -6,12 +6,12 @@ Append-only, tamper-resistant mirroring of GitHub repositories to a self-hosted
This tool is built for the scenario where upstream repos you depend on are deliberately destroyed — whether by a compromised maintainer, a platform takedown, account suspension, or coerced force-push. Specifically:
- **Upstream force-pushes empty history** Your copy keeps all previous commits, branches, and tags via timestamped backup refs. The wipe is detected and blocked.
- **Upstream deletes branches or tags** Your copy retains them. The sync script never deletes local refs.
- **Upstream repo is deleted entirely** Fetch fails gracefully; your existing local copy and Gitea copy are untouched.
- **GitHub account is banned/suspended** Same as deletion — your copies persist.
- **DMCA takedown** Your pre-takedown copy is preserved.
- **Subtle history rewrite** (less than 50% of refs removed) Still captured in backup refs, and the live refs are updated so you can diff the before and after.
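The 50% boundary between a blocked wipe and a captured rewrite can be sketched as a percentage comparison. This is a minimal illustration, assuming the check compares ref counts before and after a fetch; the function names `refs_lost_percent` and `is_wipe` are hypothetical and the real script may count refs differently.

```shell
# Hypothetical sketch of the wipe check; not the project's actual code.
WIPE_THRESHOLD="${WIPE_THRESHOLD:-50}"

# Print the integer percentage of refs that disappeared upstream.
refs_lost_percent() {
    before="$1"; after="$2"
    if [ "$before" -le 0 ] || [ "$after" -ge "$before" ]; then
        echo 0
        return 0
    fi
    echo $(( (before - after) * 100 / before ))
}

# Succeed when the loss exceeds the threshold, meaning the sync should be blocked.
is_wipe() {
    [ "$(refs_lost_percent "$1" "$2")" -gt "$WIPE_THRESHOLD" ]
}
```

With the default threshold, `is_wipe 40 3` succeeds (92% of refs lost), while `is_wipe 40 30` does not (25% lost), so the second case would fall under the subtle-rewrite path.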
## How it works
@@ -46,14 +46,51 @@ Every sync writes a structured audit log recording exactly what happened: which
The audit directory is set with the `+a` (append-only) filesystem attribute during install, so even the breakglass service user can't delete or modify previous audit entries.
## Features
### Organisation and repo metadata sync
On first sync, the script automatically creates Gitea organisations matching each GitHub owner and syncs their avatars. Repository metadata — including default branch, description, and homepage URL — is read from the GitHub API and applied to the Gitea repo. Descriptions are prefixed with `[BREAKGLASS]` so it's always clear which repos are mirrors.
This runs once per repo (tracked by marker files) and prevents Gitea from showing 500 errors due to default branch mismatches.
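The once-per-repo behaviour can be sketched with a small marker-file helper. This is an illustration only: `run_once` and `MARKER_DIR` are hypothetical names, and the sketch records the marker only when the command succeeds, whereas the `sync_repo_metadata` function in this commit touches its marker unconditionally.

```shell
# Hypothetical helper: run a command once per marker file.
# MARKER_DIR is an assumed location, not the project's real path.
MARKER_DIR="${MARKER_DIR:-/tmp/breakglass-markers}"

run_once() {
    marker="$MARKER_DIR/$1.meta.synced"
    shift
    # Marker present: the work was already done, so skip it.
    [ -f "$marker" ] && return 0
    mkdir -p "$MARKER_DIR"
    # Record success only; a failed command is retried on the next run.
    "$@" && touch "$marker"
}
```

Deleting a marker forces the work to run again on the next sync, which is the same mechanism as the re-sync command under day-to-day commands.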
### Wiki mirroring
When `SYNC_WIKIS=true` (the default), the script checks whether each GitHub repo has an associated wiki. If one exists, it clones the wiki as a separate bare repo and pushes it to the matching Gitea repo's wiki. This preserves project documentation alongside the code.
Wiki repos use the standard `.wiki.git` suffix and are pushed via HTTP with credential-store authentication.
### Release asset downloads
When `SYNC_RELEASES=true`, the script downloads release assets (binaries, source archives, installers) for the latest N releases per repo (configured by `RELEASE_KEEP`, default 3). Assets are stored locally under `RELEASE_ROOT` and uploaded to Gitea as proper releases via the API, preserving the tag name, release title, and body text.
This ensures that even if GitHub removes download links, you have local copies of the actual release binaries people need to verify and install software.
### LFS support with timeouts
Repos using Git LFS are handled automatically. LFS objects are fetched and pushed alongside regular git objects. To prevent massive LFS repos (like seedsigner/buildroot) from blocking the entire sync indefinitely, each LFS operation is wrapped in a configurable timeout (`LFS_TIMEOUT`, default 600 seconds). If a timeout is hit, the sync continues with remaining repos rather than stalling.
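One way to bound an LFS operation is the coreutils `timeout` command. The wrapper below is a sketch under that assumption; `with_lfs_timeout` is a hypothetical name and the actual script may implement the limit differently.

```shell
# Hypothetical wrapper; LFS_TIMEOUT=0 disables the limit, as in mirror.env.
LFS_TIMEOUT="${LFS_TIMEOUT:-600}"

with_lfs_timeout() {
    if [ "$LFS_TIMEOUT" -gt 0 ]; then
        # GNU coreutils `timeout` exits with status 124 when the limit is hit.
        timeout "$LFS_TIMEOUT" "$@"
    else
        "$@"
    fi
}

# Example call (repo path and remote are placeholders):
# with_lfs_timeout git -C /var/lib/breakglass/repos/owner/repo.git lfs fetch --all origin \
#     || echo "LFS fetch skipped or timed out; continuing"
```

Because the wrapper only returns a non-zero status, the caller decides whether to log a warning and move on to the next repo.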
### Push notifications
The sync script and health check both send push notifications for significant events. Supported backends are ntfy (recommended — free, no server needed, push to phone), email, and Telegram. Notifications include priority levels and tags:
- **Urgent** — wipe detection triggered, sync blocked
- **High** — errors during sync, healthcheck failures
- **Default** — sync completed successfully, new repos mirrored
- **Low** — routine status updates
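For the ntfy backend, the priority levels above map onto ntfy's `Priority` header. The sketch below assumes illustrative event names (`wipe_blocked`, `sync_error`, and so on) that are not taken from the script; only the `Title`, `Priority`, and `Tags` headers are part of ntfy's publishing interface.

```shell
# Hypothetical sketch; event names are illustrative, not the script's own.
NTFY_SERVER="${NTFY_SERVER:-https://ntfy.sh}"
NTFY_TOPIC="${NTFY_TOPIC:-breakglass}"

# Map an event class to an ntfy priority name.
priority_for() {
    case "$1" in
        wipe_blocked)                echo urgent ;;
        sync_error|healthcheck_fail) echo high ;;
        sync_ok|new_repo)            echo default ;;
        *)                           echo low ;;
    esac
}

# Publish a message; ntfy reads title, priority, and tags from HTTP headers.
send_ntfy() {
    event="$1"; title="$2"; body="$3"
    curl -fsS \
        -H "Title: $title" \
        -H "Priority: $(priority_for "$event")" \
        -H "Tags: $event" \
        -d "$body" \
        "$NTFY_SERVER/$NTFY_TOPIC" >/dev/null
}
```

A call such as `send_ntfy wipe_blocked "Wipe detected" "Sync blocked for owner/repo"` would then arrive as an urgent notification on the configured topic.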
### Disk space monitoring
The health check includes disk usage monitoring. It warns at 80% usage and sends a critical alert at 90%, giving you time to expand storage or prune release assets before the mirror runs out of space.
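The two thresholds can be sketched as a simple classification of the usage percentage reported by `df`. The function names are hypothetical, and treating exactly 80% and 90% as already over the line is an assumption.

```shell
# Hypothetical sketch; thresholds mirror the 80%/90% levels described above.
DISK_WARN=80
DISK_CRIT=90

# Classify a usage percentage as ok, warning, or critical.
disk_level() {
    if [ "$1" -ge "$DISK_CRIT" ]; then
        echo critical
    elif [ "$1" -ge "$DISK_WARN" ]; then
        echo warning
    else
        echo ok
    fi
}

# Read the usage percentage of the filesystem holding a path (POSIX df output).
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `disk_level "$(disk_usage_percent /var/lib/breakglass)"` prints one of the three levels, which a health check could translate into a notification priority.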
## Quick start
### Prerequisites
- Fresh Ubuntu 22.04+ VM (2 GB RAM, 20+ GB disk)
- Your Gitea instance accessible via HTTPS
- Your Gitea instance accessible via HTTP/HTTPS
- Gitea personal access token (repo read/write scope)
- Optional: GitHub token for higher API rate limits
- Optional: GitHub token for higher API rate limits (60 → 5000 req/h)
### Install
@@ -72,18 +109,24 @@ The installer handles everything interactively: packages, user creation, config,
/etc/breakglass/mirror.env # tokens, URLs, settings (mode 600)
/etc/breakglass/sources.yml # GitHub owners to mirror
/var/lib/breakglass/repos/ # bare git clones
/var/lib/breakglass/releases/ # downloaded release assets
/var/lib/breakglass/audit/ # tamper-evident audit logs (+a attr)
/var/log/breakglass/ # sync logs (90-day rotation)
```
Systemd timers:
- `breakglass-sync.timer` — daily at 04:00 UTC (with 30min random jitter)
- `breakglass-healthcheck.timer` — daily at 08:00 UTC
- `breakglass-sync.timer` — daily at 02:00 local time, with `Persistent=true` so missed runs fire on next boot
- `breakglass-healthcheck.timer` — daily at 08:00 local time
Both services use `Restart=on-failure` with a 5-minute backoff and an 8-hour timeout to handle large initial syncs.
## Configuration
### sources.yml
Define which GitHub owners to mirror. You can mirror entire organisations or filter to specific repos:
```yaml
owners:
- github: bitcoin
@@ -91,26 +134,37 @@ owners:
- github: seedsigner
- github: seedhammer
# With filters:
- github: some-large-org
# Mirror only specific repos from an org:
- github: cmyk
include:
- "important-repo"
- "seedetcher"
# Exclude repos by pattern:
- github: some-large-org
exclude:
- "test-*"
- "deprecated-*"
```
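The include and exclude semantics can be sketched with shell glob matching. This is an assumption about how the filters behave, not the script's actual parser: `matches_any` and `repo_selected` are hypothetical helpers that take the patterns as space-separated globs.

```shell
# Hypothetical filter; the real script's parsing of sources.yml may differ.
# Patterns are passed as space-separated shell globs.

matches_any() {
    name="$1"; patterns="$2"
    set -f                      # keep globs in $patterns from expanding to files
    for pat in $patterns; do
        case "$name" in
            $pat) set +f; return 0 ;;
        esac
    done
    set +f
    return 1
}

# A repo is selected if it matches the include list (when one is given)
# and does not match the exclude list.
repo_selected() {
    repo="$1"; include="$2"; exclude="$3"
    if [ -n "$include" ] && ! matches_any "$repo" "$include"; then
        return 1
    fi
    if matches_any "$repo" "$exclude"; then
        return 1
    fi
    return 0
}
```

Under these assumptions, `repo_selected test-vectors "" "test-* deprecated-*"` fails, so that repo would be skipped, while `repo_selected seedetcher "seedetcher" ""` succeeds.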
### mirror.env
Key settings:
| Variable | Purpose | Default |
|----------|---------|---------|
| `GITEA_URL` | Your Gitea instance URL | — |
| `GITEA_TOKEN` | Gitea API token | — |
| `GITHUB_TOKEN` | GitHub token (optional) | — |
| `WIPE_THRESHOLD` | Block sync if upstream loses >N% of refs | 50 |
| `NOTIFY_METHOD` | `ntfy`, `email`, `telegram`, or `none` | none |
| `STALE_DAYS` | Alert if a repo hasn't synced in N days | 7 |
| `GITEA_USER` | Gitea username for push auth | — |
| `GITHUB_TOKEN` | GitHub token (optional, raises rate limit) | — |
| `WIPE_THRESHOLD` | Block sync if upstream loses >N% of refs | `50` |
| `NOTIFY_METHOD` | `ntfy`, `email`, `telegram`, or `none` | `none` |
| `NTFY_TOPIC` | ntfy topic name (make it unguessable) | `breakglass` |
| `NTFY_SERVER` | ntfy server URL | `https://ntfy.sh` |
| `STALE_DAYS` | Alert if a repo hasn't synced in N days | `7` |
| `LFS_TIMEOUT` | Max seconds per LFS fetch/push (0 = no limit) | `600` |
| `SYNC_WIKIS` | Mirror GitHub wikis to Gitea | `true` |
| `SYNC_RELEASES` | Download and mirror release assets | `true` |
| `RELEASE_KEEP` | How many releases to keep per repo | `3` |
| `RELEASE_ROOT` | Where to store downloaded release assets | `/var/lib/breakglass/releases` |
| `FORCE_HTTP11` | Force HTTP/1.1 (helps with Cloudflare Tunnel) | `true` |
## Day-to-day commands
@@ -121,8 +175,14 @@ sudo systemctl status breakglass-sync.timer
# Trigger immediate sync
sudo systemctl start breakglass-sync.service
# Watch sync live
sudo journalctl -u breakglass-sync.service -f
# Run sync in foreground (useful for debugging)
sudo -u breakglass /opt/breakglass/scripts/breakglass-sync.sh
# Run sync detached from your SSH session (won't die if you disconnect)
sudo -u breakglass nohup /opt/breakglass/scripts/breakglass-sync.sh &
# Watch sync logs live
tail -f /var/log/breakglass/sync-$(date +%Y%m%d)*.log
# Run health check
sudo systemctl start breakglass-healthcheck.service
@@ -133,30 +193,41 @@ ls -lt /var/lib/breakglass/audit/
# View recent sync logs
ls -lt /var/log/breakglass/ | head
# Check disk usage
du -sh /var/lib/breakglass/repos/ /var/lib/breakglass/releases/
# Re-sync metadata for all repos (e.g., after fixing a bug)
sudo rm -f /var/lib/breakglass/repos/.avatars/*.meta.synced
sudo systemctl start breakglass-sync.service
# Add a new GitHub org
sudo nano /etc/breakglass/sources.yml
```
## Health checks
The healthcheck script (runs daily) verifies:
The healthcheck script (runs daily at 08:00) verifies:
1. Gitea is reachable
2. Sync timer is active
2. Sync timer is active and enabled
3. Recent sync logs exist
4. No repos have gone stale
4. No repos have gone stale (configurable threshold)
5. Backup refs exist in all repos (append-only is working)
6. Audit log checksums haven't been tampered with
7. Local ref counts haven't decreased (local deletion detection)
8. Disk usage is below warning (80%) and critical (90%) thresholds
Results are sent as a push notification with appropriate priority levels.
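Check 7 (local ref counts) can be sketched as a comparison against a count saved on the previous run. The state-file layout and the name `check_ref_count` are assumptions made for illustration; the healthcheck script may track counts differently.

```shell
# Hypothetical sketch of check 7; the state-file layout is an assumption.

# Compare a repo's current ref count with the count saved on the last run.
# Prints "decreased" and fails if refs were lost; otherwise saves the new count.
check_ref_count() {
    state_file="$1"; current="$2"
    previous=0
    [ -f "$state_file" ] && previous="$(cat "$state_file")"
    if [ "$current" -lt "$previous" ]; then
        echo "decreased: $previous -> $current"
        return 1
    fi
    echo "$current" > "$state_file"
    echo "ok: $current"
}

# In practice the count could come from git, for example:
# current=$(git -C /var/lib/breakglass/repos/owner/repo.git for-each-ref | wc -l)
```

The saved count is left untouched when a decrease is detected, so the alert keeps firing until someone investigates.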
## What this does NOT protect against
To be transparent about limitations:
- **VM compromise**: If an attacker gets root on your mirror VM, they can delete everything. Mitigate with VM-level snapshots, ZFS snapshots, or offsite backups of `/var/lib/breakglass/repos/`.
- **Gitea compromise**: If someone gets admin on your Gitea, they could delete repos there. The bare clones on the VM are the primary archive; Gitea is a secondary copy and convenient browsing interface.
- **Disk failure**: Standard hardware risk. Use RAID or VM-level redundancy.
- **Repos you don't know about yet**: This only mirrors repos from the owners you've configured. If a new critical repo appears, you need to add the owner to sources.yml.
- **VM compromise** — If an attacker gets root on your mirror VM, they can delete everything. Mitigate with VM-level snapshots, ZFS snapshots, or offsite backups of `/var/lib/breakglass/repos/`.
- **Gitea compromise** — If someone gets admin on your Gitea, they could delete repos there. The bare clones on the VM are the primary archive; Gitea is a secondary copy and convenient browsing interface.
- **Disk failure** — Standard hardware risk. Use RAID or VM-level redundancy.
- **Repos you don't know about yet** — This only mirrors repos from the owners you've configured. If a new critical repo appears, you need to add the owner to `sources.yml`.
- **GitHub API rate limits** — Without a `GITHUB_TOKEN`, you're limited to 60 requests/hour. Large orgs with many repos will hit this. A token raises the limit to 5000/hour.
For maximum paranoia, consider also running periodic `tar` backups of `/var/lib/breakglass/repos/` to an offsite location (S3, another server, external drive).
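A periodic backup along these lines might look like the sketch below. The helper name `backup_dir` and the destination path are hypothetical; copying the archive to S3 or another server is left out.

```shell
# Hypothetical backup helper; destination handling is an assumption.

# Write a dated, compressed archive of a directory and print the archive path.
backup_dir() {
    src="$1"; dest="$2"
    archive="$dest/$(basename "$src")-$(date +%Y%m%d).tar.gz"
    mkdir -p "$dest"
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Example (destination is a placeholder for an offsite mount):
# backup_dir /var/lib/breakglass/repos /mnt/offsite/breakglass
```

Archiving relative to the parent directory keeps paths inside the tarball short (`repos/...`), which makes a restore to a different location straightforward.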
@@ -164,7 +235,7 @@ For maximum paranoia, consider also running periodic `tar` backups of `/var/lib/
The original ran inside Umbrel's managed Docker environment. Umbrel silently recycled containers and broke the automation after a few weeks. This version runs on a plain Ubuntu VM where nothing can interfere with the systemd timers or filesystem.
Key improvements: wipe detection, audit trail, filesystem-level append-only on audit dir, staging namespace for safe fetch, health monitoring with notifications, and no `--mirror` flag (which enables destructive pruning).
Key improvements over v1: wipe detection with configurable threshold, tamper-evident audit trail with checksums, filesystem-level append-only on audit directory, staging namespace for safe fetch, wiki and release mirroring, LFS support with timeouts, org/repo avatar and metadata sync, 8-point health monitoring with disk space alerts, push notifications via ntfy/email/Telegram, systemd timers with persistence and failure restart, and no `--mirror` flag (which enables destructive pruning).
## License


@@ -745,9 +745,62 @@ push_to_gitea() {
warn "LFS push skipped/incomplete for ${gitea_org}/${repo} (may have timed out)"
fi
# ── Sync default branch and description from GitHub ───
sync_repo_metadata "$gitea_org" "$repo"
audit "PUSHED repo=${gitea_org}/${repo}"
}
# ═══════════════════════════════════════════════════════════
# METADATA: Sync default branch, description, website from GitHub
# ═══════════════════════════════════════════════════════════
sync_repo_metadata() {
local gitea_org="$1" repo="$2"
local marker="${MIRROR_ROOT}/.avatars/${gitea_org}_${repo}.meta.synced"
# Only sync metadata once per repo (delete marker to force re-sync)
[[ -f "$marker" ]] && return 0
# Get GitHub repo info
local gh_data
gh_data=$(gh_api "/repos/${gitea_org}/${repo}" 2>/dev/null) || return 0
# Extract default branch
local default_branch
default_branch=$(echo "$gh_data" | grep -o '"default_branch"[[:space:]]*:[[:space:]]*"[^"]*"' \
| sed 's/"default_branch"[[:space:]]*:[[:space:]]*"//;s/"//')
# Extract description
local description
description=$(echo "$gh_data" | grep -o '"description"[[:space:]]*:[[:space:]]*"[^"]*"' | head -1 \
| sed 's/"description"[[:space:]]*:[[:space:]]*"//;s/"//')
# Extract homepage
local homepage
homepage=$(echo "$gh_data" | grep -o '"homepage"[[:space:]]*:[[:space:]]*"[^"]*"' \
| sed 's/"homepage"[[:space:]]*:[[:space:]]*"//;s/"//')
if [[ -n "$default_branch" ]]; then
# Build the update payload
local payload="{\"default_branch\":\"${default_branch}\""
if [[ -n "$description" ]]; then
# Escape special JSON chars in description
description=$(echo "$description" | sed 's/\\/\\\\/g;s/"/\\"/g')
payload+=",\"description\":\"[BREAKGLASS] ${description}\""
fi
if [[ -n "$homepage" ]]; then
payload+=",\"website\":\"${homepage}\""
fi
payload+="}"
gitea_api PATCH "/repos/${gitea_org}/${repo}" -d "$payload" &>/dev/null || true
log " metadata synced (default_branch: $default_branch)"
fi
touch "$marker"
}
# ── Notifications ────────────────────────────────────────
notify() {