openclaw/notcrawl

Vincent Koc a03b44eb6f

chore: scaffold notioncrawl

2026-04-22 14:41:56 -07:00


notioncrawl Spec

Goals

build a local-first Notion crawler
mirror Notion pages, blocks, databases, comments, and workspace metadata
store normalized records in SQLite
preserve raw source records for future re-rendering
render normalized Markdown blobs into an organized file tree
support fast text search and raw SQL
support one-shot backfill and incremental repair
publish and subscribe to private git-backed snapshots

Product Summary

notioncrawl is a Go CLI that turns Notion workspace memory into a local SQLite archive plus normalized Markdown files.

V1 scope:

macOS Notion Desktop cache discovery
read-only desktop snapshot ingestion
official Notion API sync
pages and blocks
databases/data sources as collections
comments and discussions where available
users and spaces/workspaces
FTS5 search over rendered page/comment text
raw SQL access
Markdown export
git-backed archive publishing and subscription

Out of scope for V1:

write-back actions
modifying Notion local storage
bypassing workspace permissions
full attachment blob mirroring by default
public integration Marketplace hardening

Data Sources

Desktop Source

Default macOS path:

~/Library/Application Support/Notion/notion.db

Desktop sync must:

locate Notion Desktop storage
snapshot notion.db into the cache dir
open the snapshot read-only
ingest supported tables into the local archive
record unsupported source records in raw_records

Desktop cache coverage is opportunistic. It only includes what Notion has cached, downloaded, or recently touched locally.

API Source

API sync uses NOTION_TOKEN by default. It must:

search/list pages and data sources visible to the integration
recursively fetch block children
fetch users
fetch comments where the integration has access
obey Retry-After on rate limits
store raw JSON plus normalized rows

SQLite Archive

SQLite is canonical. Markdown is generated output.

Core tables:

spaces
users
pages
blocks
collections
collection_views
comments
discussions
raw_records
sync_state
page_fts
comment_fts

Markdown Archive

Markdown export writes deterministic paths:

pages/<space-slug>/<page-title>-<short-id>.md
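
A path builder along these lines would keep output deterministic. The slug rules and the choice of the first 8 characters of the hyphen-stripped ID as the short ID are assumptions; the spec does not define them. IDs are assumed to be ASCII.

```go
package main

import (
	"fmt"
	"path"
	"strings"
	"unicode"
)

// slugify lowercases s and collapses every run of non-alphanumeric
// characters into a single hyphen.
func slugify(s string) string {
	var b strings.Builder
	lastDash := false
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
			lastDash = false
		} else if b.Len() > 0 && !lastDash {
			b.WriteByte('-')
			lastDash = true
		}
	}
	out := strings.TrimSuffix(b.String(), "-")
	if out == "" {
		return "untitled"
	}
	return out
}

// shortID strips hyphens and keeps at most the first 8 characters.
func shortID(id string) string {
	id = strings.ReplaceAll(id, "-", "")
	if len(id) > 8 {
		id = id[:8]
	}
	return id
}

// pagePath builds pages/<space-slug>/<page-title>-<short-id>.md.
func pagePath(space, title, id string) string {
	return path.Join("pages", slugify(space), slugify(title)+"-"+shortID(id)+".md")
}

func main() {
	fmt.Println(pagePath("Acme Corp", "Q3 Planning: Goals!", "1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d"))
	// prints pages/acme-corp/q3-planning-goals-1a2b3c4d.md
}
```

Including the short ID in the filename keeps two pages with the same title in the same space from colliding, as long as their ID prefixes differ.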

Each file starts with YAML-ish front matter:

---
id: ...
space_id: ...
title: ...
source: desktop+api
notion_url: ...
created_time: ...
last_edited_time: ...
---

The body renders blocks into normalized Markdown. Unsupported blocks should be represented with concise placeholders, not silently dropped.

Git share mode exports:

manifest.json
data/*.jsonl.gz
pages/**/*.md

publish writes a snapshot and optionally commits/pushes it.

subscribe clones a snapshot repo, writes reader config, and imports data into SQLite without requiring Notion credentials.

update pulls the latest snapshot and imports it.