chore: scaffold notioncrawl
This commit is contained in:
commit
a03b44eb6f
7
.gitignore
vendored
Normal file
7
.gitignore
vendored
Normal file
@ -0,0 +1,7 @@
|
||||
bin/
|
||||
dist/
|
||||
*.db
|
||||
*.db-*
|
||||
*.log
|
||||
.DS_Store
|
||||
.notioncrawl/
|
||||
19
CONTRIBUTING.md
Normal file
19
CONTRIBUTING.md
Normal file
@ -0,0 +1,19 @@
|
||||
# Contributing to notioncrawl
|
||||
|
||||
Keep real Notion workspace data, secrets, tokens, cookies, and exported private
|
||||
content out of git.
|
||||
|
||||
Useful local checks:
|
||||
|
||||
```bash
|
||||
go test ./...
|
||||
go build ./cmd/notioncrawl
|
||||
```
|
||||
|
||||
Implementation notes:
|
||||
|
||||
- read Notion Desktop data through snapshots only
|
||||
- prefer stable normalized rows plus raw source payloads
|
||||
- keep Markdown rendering deterministic
|
||||
- add comments only where Notion-specific behavior is not obvious
|
||||
- keep `README.md`, `SPEC.md`, and examples in sync
|
||||
21
LICENSE
Normal file
21
LICENSE
Normal file
@ -0,0 +1,21 @@
|
||||
MIT License
|
||||
|
||||
Copyright (c) 2026 Vincent Koc
|
||||
|
||||
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||
of this software and associated documentation files (the "Software"), to deal
|
||||
in the Software without restriction, including without limitation the rights
|
||||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||
copies of the Software, and to permit persons to whom the Software is
|
||||
furnished to do so, subject to the following conditions:
|
||||
|
||||
The above copyright notice and this permission notice shall be included in all
|
||||
copies or substantial portions of the Software.
|
||||
|
||||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||
SOFTWARE.
|
||||
15
Makefile
Normal file
15
Makefile
Normal file
@ -0,0 +1,15 @@
|
||||
BINARY ?= bin/notioncrawl
|
||||
|
||||
.PHONY: build test run fmt
|
||||
|
||||
build:
|
||||
go build -o $(BINARY) ./cmd/notioncrawl
|
||||
|
||||
test:
|
||||
go test ./...
|
||||
|
||||
run:
|
||||
go run ./cmd/notioncrawl $(ARGS)
|
||||
|
||||
fmt:
|
||||
gofmt -w $$(find . -name '*.go' -not -path './.git/*')
|
||||
71
README.md
Normal file
71
README.md
Normal file
@ -0,0 +1,71 @@
|
||||
# notioncrawl
|
||||
|
||||
`notioncrawl` mirrors Notion workspace data into local SQLite and normalized
|
||||
Markdown so you can search, query, diff, and share your Notion memory without
|
||||
depending on the Notion UI.
|
||||
|
||||
It has two ingestion paths:
|
||||
|
||||
- `desktop`: read-only snapshots of the local Notion desktop cache
|
||||
- `api`: official Notion API sync with rate-limit aware crawling
|
||||
|
||||
SQLite is the canonical archive. Markdown is the durable human/agent surface.
|
||||
Git share mode publishes normalized snapshots that other machines can subscribe
|
||||
to without holding Notion credentials.
|
||||
|
||||
## Current Scope
|
||||
|
||||
- local SQLite storage with FTS5
|
||||
- read-only local desktop cache ingestion from macOS Notion
|
||||
- official API page/block/user/comment ingestion
|
||||
- normalized Markdown export organized by space and page path
|
||||
- compressed JSONL git-share snapshots plus import/update workflows
|
||||
- read-only SQL access for ad hoc inspection
|
||||
|
||||
## Quick Start
|
||||
|
||||
```bash
|
||||
go build -o bin/notioncrawl ./cmd/notioncrawl
|
||||
bin/notioncrawl init
|
||||
bin/notioncrawl doctor
|
||||
bin/notioncrawl sync --source desktop
|
||||
bin/notioncrawl export-md
|
||||
bin/notioncrawl search "launch plan"
|
||||
```
|
||||
|
||||
For API sync:
|
||||
|
||||
```bash
|
||||
export NOTION_TOKEN="secret_..."
|
||||
bin/notioncrawl sync --source api
|
||||
```
|
||||
|
||||
Default paths:
|
||||
|
||||
- config: `~/.notioncrawl/config.toml`
|
||||
- database: `~/.notioncrawl/notioncrawl.db`
|
||||
- cache: `~/.notioncrawl/cache`
|
||||
- Markdown archive: `~/.notioncrawl/pages`
|
||||
- git share repo: `~/.notioncrawl/share`
|
||||
|
||||
## Commands
|
||||
|
||||
- `init` writes a starter config
|
||||
- `doctor` checks config, SQLite, desktop cache, and token presence
|
||||
- `sync` ingests from `desktop`, `api`, or `all`
|
||||
- `export-md` renders normalized Markdown files from SQLite
|
||||
- `search` searches page and comment text through FTS5
|
||||
- `sql` runs read-only SQL against the archive
|
||||
- `publish` exports SQLite tables and Markdown into a git share repo
|
||||
- `subscribe` clones a share repo and imports the latest snapshot
|
||||
- `update` pulls and imports a subscribed share repo
|
||||
|
||||
## Safety Model
|
||||
|
||||
Desktop mode is read-only. It snapshots Notion's local SQLite database before
|
||||
reading it and never writes to Notion application storage.
|
||||
|
||||
API mode uses the official Notion API. It stores raw API payloads alongside
|
||||
normalized rows so renderers can improve without recrawling.
|
||||
|
||||
Secrets are never exported into Markdown or git-share snapshots.
|
||||
132
SPEC.md
Normal file
132
SPEC.md
Normal file
@ -0,0 +1,132 @@
|
||||
# notioncrawl Spec
|
||||
|
||||
## Goals
|
||||
|
||||
- build a local-first Notion crawler
|
||||
- mirror Notion pages, blocks, databases, comments, and workspace metadata
|
||||
- store normalized records in SQLite
|
||||
- preserve raw source records for future re-rendering
|
||||
- render normalized Markdown blobs into an organized file tree
|
||||
- support fast text search and raw SQL
|
||||
- support one-shot backfill and incremental repair
|
||||
- publish and subscribe private git-backed snapshots
|
||||
|
||||
## Product Summary
|
||||
|
||||
`notioncrawl` is a Go CLI that turns Notion workspace memory into a local
|
||||
SQLite archive plus normalized Markdown files.
|
||||
|
||||
V1 scope:
|
||||
|
||||
- macOS Notion Desktop cache discovery
|
||||
- read-only desktop snapshot ingestion
|
||||
- official Notion API sync
|
||||
- pages and blocks
|
||||
- databases/data sources as collections
|
||||
- comments and discussions where available
|
||||
- users and spaces/workspaces
|
||||
- FTS5 search over rendered page/comment text
|
||||
- raw SQL access
|
||||
- Markdown export
|
||||
- git-backed archive publishing and subscription
|
||||
|
||||
Out of scope for V1:
|
||||
|
||||
- write-back actions
|
||||
- modifying Notion local storage
|
||||
- bypassing workspace permissions
|
||||
- full attachment blob mirroring by default
|
||||
- public integration Marketplace hardening
|
||||
|
||||
## Data Sources
|
||||
|
||||
### Desktop Source
|
||||
|
||||
Default macOS path:
|
||||
|
||||
```text
|
||||
~/Library/Application Support/Notion/notion.db
|
||||
```
|
||||
|
||||
Desktop sync must:
|
||||
|
||||
1. locate Notion Desktop storage
|
||||
2. snapshot `notion.db` into the cache dir
|
||||
3. open the snapshot read-only
|
||||
4. ingest supported tables into the local archive
|
||||
5. record unsupported source records in `raw_records`
|
||||
|
||||
Desktop cache coverage is opportunistic. It only includes what Notion has
|
||||
cached, downloaded, or recently touched locally.
|
||||
|
||||
### API Source
|
||||
|
||||
API sync uses `NOTION_TOKEN` by default. It must:
|
||||
|
||||
1. search/list pages and data sources visible to the integration
|
||||
2. recursively fetch block children
|
||||
3. fetch users
|
||||
4. fetch comments where the integration has access
|
||||
5. obey `Retry-After` on rate limits
|
||||
6. store raw JSON plus normalized rows
|
||||
|
||||
## SQLite Archive
|
||||
|
||||
SQLite is canonical. Markdown is generated output.
|
||||
|
||||
Core tables:
|
||||
|
||||
- `spaces`
|
||||
- `users`
|
||||
- `pages`
|
||||
- `blocks`
|
||||
- `collections`
|
||||
- `collection_views`
|
||||
- `comments`
|
||||
- `discussions`
|
||||
- `raw_records`
|
||||
- `sync_state`
|
||||
- `page_fts`
|
||||
- `comment_fts`
|
||||
|
||||
## Markdown Archive
|
||||
|
||||
Markdown export writes deterministic paths:
|
||||
|
||||
```text
|
||||
pages/<space-slug>/<page-title>-<short-id>.md
|
||||
```
|
||||
|
||||
Each file starts with YAML-ish front matter:
|
||||
|
||||
```yaml
|
||||
---
|
||||
id: ...
|
||||
space_id: ...
|
||||
title: ...
|
||||
source: desktop+api
|
||||
notion_url: ...
|
||||
created_time: ...
|
||||
last_edited_time: ...
|
||||
---
|
||||
```
|
||||
|
||||
The body renders blocks into normalized Markdown. Unsupported blocks should be
|
||||
represented with concise placeholders, not silently dropped.
|
||||
|
||||
## Git Share
|
||||
|
||||
Git share mode exports:
|
||||
|
||||
```text
|
||||
manifest.json
|
||||
data/*.jsonl.gz
|
||||
pages/**/*.md
|
||||
```
|
||||
|
||||
`publish` writes a snapshot and optionally commits/pushes it.
|
||||
|
||||
`subscribe` clones a snapshot repo, writes reader config, and imports data into
|
||||
SQLite without requiring Notion credentials.
|
||||
|
||||
`update` pulls the latest snapshot and imports it.
|
||||
15
cmd/notioncrawl/main.go
Normal file
15
cmd/notioncrawl/main.go
Normal file
@ -0,0 +1,15 @@
|
||||
package main
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"os"
|
||||
)
|
||||
|
||||
func main() {
|
||||
if len(os.Args) > 1 && (os.Args[1] == "-h" || os.Args[1] == "--help") {
|
||||
fmt.Print("Usage of notioncrawl:\n notioncrawl [global flags] <command> [args]\n")
|
||||
return
|
||||
}
|
||||
fmt.Fprintln(os.Stderr, "notioncrawl: implementation in progress")
|
||||
os.Exit(2)
|
||||
}
|
||||
19
config.example.toml
Normal file
19
config.example.toml
Normal file
@ -0,0 +1,19 @@
|
||||
db_path = "~/.notioncrawl/notioncrawl.db"
|
||||
cache_dir = "~/.notioncrawl/cache"
|
||||
markdown_dir = "~/.notioncrawl/pages"
|
||||
|
||||
[notion.desktop]
|
||||
enabled = true
|
||||
path = ""
|
||||
|
||||
[notion.api]
|
||||
enabled = true
|
||||
token_env = "NOTION_TOKEN"
|
||||
base_url = "https://api.notion.com/v1"
|
||||
version = "2022-06-28"
|
||||
|
||||
[share]
|
||||
remote = ""
|
||||
branch = "main"
|
||||
repo_path = "~/.notioncrawl/share"
|
||||
stale_after = "1h"
|
||||
Loading…
Reference in New Issue
Block a user