notioncrawl Spec
Goals
- build a local-first Notion crawler
- mirror Notion pages, blocks, databases, comments, and workspace metadata
- store normalized records in SQLite
- preserve raw source records for future re-rendering
- render normalized Markdown blobs into an organized file tree
- support fast text search and raw SQL
- support one-shot backfill and incremental repair
- publish and subscribe to private git-backed snapshots
Product Summary
notioncrawl is a Go CLI that turns Notion workspace memory into a local
SQLite archive plus normalized Markdown files.
V1 scope:
- macOS Notion Desktop cache discovery
- read-only desktop snapshot ingestion
- official Notion API sync
- pages and blocks
- databases/data sources as collections
- comments and discussions where available
- users and spaces/workspaces
- FTS5 search over rendered page/comment text
- raw SQL access
- Markdown export
- git-backed archive publishing and subscription
Out of scope for V1:
- write-back actions
- modifying Notion local storage
- bypassing workspace permissions
- full attachment blob mirroring by default
- public integration Marketplace hardening
Data Sources
Desktop Source
Default macOS path:
~/Library/Application Support/Notion/notion.db
Desktop sync must:
- locate Notion Desktop storage
- snapshot notion.db into the cache dir
- open the snapshot read-only
- ingest supported tables into the local archive
- record unsupported source records in raw_records
Desktop cache coverage is opportunistic. It only includes what Notion has cached, downloaded, or recently touched locally.
API Source
API sync uses NOTION_TOKEN by default. It must:
- search/list pages and data sources visible to the integration
- recursively fetch block children
- fetch users
- fetch comments where the integration has access
- obey Retry-After on rate limits
- store raw JSON plus normalized rows
SQLite Archive
SQLite is canonical. Markdown is generated output.
Core tables:
- spaces
- users
- pages
- blocks
- collections
- collection_views
- comments
- discussions
- raw_records
- sync_state
- page_fts
- comment_fts
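As an illustration of how the page_fts table might be queried, the sketch below builds an FTS5 MATCH expression from free-form user input. The column layout of page_fts is not specified here, so the query string uses only rowid and rank, which every FTS5 table provides. The names searchPagesSQL and ftsMatchExpr are hypothetical.

```go
package main

import "strings"

// searchPagesSQL is an illustrative query against the page_fts table.
// It relies only on the implicit rowid and the built-in FTS5 rank column.
const searchPagesSQL = `SELECT rowid FROM page_fts WHERE page_fts MATCH ? ORDER BY rank LIMIT ?`

// ftsMatchExpr turns free-form input into an FTS5 query in which every
// whitespace-separated term is a quoted string. Quoting keeps characters and
// keywords such as AND, OR, and NOT from being parsed as query syntax.
// Embedded double quotes are escaped by doubling them.
func ftsMatchExpr(input string) string {
	fields := strings.Fields(input)
	quoted := make([]string, 0, len(fields))
	for _, f := range fields {
		quoted = append(quoted, `"`+strings.ReplaceAll(f, `"`, `""`)+`"`)
	}
	// Space-separated strings are combined with an implicit AND in FTS5.
	return strings.Join(quoted, " ")
}
```

The result of ftsMatchExpr would be bound to the first placeholder in searchPagesSQL rather than spliced into the SQL text.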
Markdown Archive
Markdown export writes deterministic paths:
pages/<space-slug>/<page-title>-<short-id>.md
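One way to derive such a path is sketched below. The slug rules (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) and the short ID (first 8 characters of the ID with hyphens removed) are assumptions made for illustration. The spec fixes only the path shape, and the function names are hypothetical.

```go
package main

import (
	"path"
	"strings"
	"unicode"
)

// slugify lowercases s and collapses every run of characters that are not
// letters or digits into a single hyphen, with no leading or trailing hyphen.
func slugify(s string) string {
	var b strings.Builder
	pendingHyphen := false
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			if pendingHyphen && b.Len() > 0 {
				b.WriteByte('-')
			}
			pendingHyphen = false
			b.WriteRune(r)
		} else {
			pendingHyphen = true
		}
	}
	if b.Len() == 0 {
		return "untitled" // assumption: fallback for empty or symbol-only titles
	}
	return b.String()
}

// shortID returns up to the first 8 characters of id with hyphens removed.
// Assumption: IDs are ASCII, as Notion UUIDs are.
func shortID(id string) string {
	compact := strings.ReplaceAll(id, "-", "")
	if len(compact) > 8 {
		compact = compact[:8]
	}
	return compact
}

// pagePath builds pages/<space-slug>/<page-title>-<short-id>.md.
func pagePath(space, title, id string) string {
	return path.Join("pages", slugify(space), slugify(title)+"-"+shortID(id)+".md")
}
```

Because the output depends only on the space name, title, and ID, repeated exports of an unchanged page land on the same file, and the ID suffix keeps pages with identical titles apart.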
Each file starts with YAML-ish front matter:
---
id: ...
space_id: ...
title: ...
source: desktop+api
notion_url: ...
created_time: ...
last_edited_time: ...
---
The body renders blocks into normalized Markdown. Unsupported blocks should be represented with concise placeholders, not silently dropped.
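The placeholder rule can be sketched as the default branch of a render switch. The Block type is a simplified, hypothetical stand-in for a normalized block row, and the HTML-comment placeholder format is an assumption; the spec requires only that the placeholder be concise and that nothing be dropped silently.

```go
package main

import "fmt"

// Block is a simplified, hypothetical view of a normalized block.
type Block struct {
	ID   string
	Type string // Notion block type, e.g. "paragraph"
	Text string // plain text content, already flattened
}

// renderBlock renders a handful of block types to Markdown and emits a
// visible placeholder for everything else instead of dropping it.
func renderBlock(b Block) string {
	switch b.Type {
	case "paragraph":
		return b.Text + "\n"
	case "heading_1":
		return "# " + b.Text + "\n"
	case "bulleted_list_item":
		return "- " + b.Text + "\n"
	default:
		// Assumed placeholder format: an HTML comment naming type and ID.
		return fmt.Sprintf("<!-- unsupported block: type=%s id=%s -->\n", b.Type, b.ID)
	}
}
```

Keeping the block ID in the placeholder means the original record can still be looked up in the archive when a renderer for that type is added later.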
Git Share
Git share mode exports:
manifest.json
data/*.jsonl.gz
pages/**/*.md
publish writes a snapshot and optionally commits/pushes it.
subscribe clones a snapshot repo, writes reader config, and imports data into
SQLite without requiring Notion credentials.
update pulls the latest snapshot and imports it.
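The data/*.jsonl.gz files in the export could be written and read back roughly as follows: one JSON object per line, gzip-compressed. This is a sketch using only the Go standard library. The record shape (a generic map) and the function names are assumptions, since the spec does not define the row schema.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"io"
)

// writeJSONLGz writes each record as one JSON line into a gzip stream.
// encoding/json sorts map keys, so identical records produce identical lines.
func writeJSONLGz(w io.Writer, records []map[string]any) error {
	zw := gzip.NewWriter(w)
	enc := json.NewEncoder(zw) // Encode appends a newline after each value
	for _, rec := range records {
		if err := enc.Encode(rec); err != nil {
			zw.Close()
			return err
		}
	}
	return zw.Close() // flushes the remaining compressed data
}

// readJSONLGz decodes every JSON value from a gzip-compressed JSONL stream.
func readJSONLGz(r io.Reader) ([]map[string]any, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	var out []map[string]any
	dec := json.NewDecoder(zr)
	for {
		var rec map[string]any
		if err := dec.Decode(&rec); err == io.EOF {
			return out, nil
		} else if err != nil {
			return nil, err
		}
		out = append(out, rec)
	}
}

// roundTrip writes records to an in-memory buffer and reads them back,
// mirroring what publish and subscribe would do through files.
func roundTrip(records []map[string]any) ([]map[string]any, error) {
	var buf bytes.Buffer
	if err := writeJSONLGz(&buf, records); err != nil {
		return nil, err
	}
	return readJSONLGz(&buf)
}
```

On the subscribe side, the decoded records would be inserted into the local SQLite tables, which is why no Notion credentials are needed to read a snapshot.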