openclaw/notcrawl

Vincent Koc b80faa9e7e

fix(markdown): preserve unicode export paths

2026-04-27 11:03:23 -07:00

notcrawl Spec

Goals

build a local-first Notion crawler
mirror Notion pages, blocks, databases, comments, and workspace metadata
store normalized records in SQLite
preserve raw source records for future re-rendering
render normalized Markdown blobs into an organized file tree
support fast text search and raw SQL
support one-shot backfill and incremental repair
publish and subscribe to private git-backed snapshots

Product Summary

notcrawl is a Go CLI that turns Notion workspace memory into a local SQLite archive plus normalized Markdown files.

V1 scope:

macOS Notion Desktop cache discovery
read-only desktop snapshot ingestion
official Notion API sync
pages and blocks
databases/data sources as collections, including current data-source API endpoints
database rows as pages linked to their collection
comments and discussions where available
users and spaces/workspaces
FTS5 search over rendered page/comment text
raw SQL access
archive status, activity reporting, and SQLite maintenance commands
Markdown export
CSV/TSV export for database rows
git-backed archive publishing and subscription

Out of scope for V1:

write-back actions
modifying Notion local storage
bypassing workspace permissions
full attachment blob mirroring by default
public integration Marketplace hardening

Data Sources

Desktop Source

Default macOS path:

~/Library/Application Support/Notion/notion.db

Desktop sync must:

locate Notion Desktop storage
snapshot notion.db into the cache dir
open the snapshot read-only
ingest supported tables into the local archive
record unsupported source records in raw_records

Desktop cache coverage is opportunistic. It only includes what Notion has cached, downloaded, or recently touched locally.
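The snapshot step listed above can be sketched as a plain file copy into the cache directory. This is a minimal sketch, not the project's implementation: the function names, the snapshot file name `notion.snapshot.db`, and the temp-file pattern are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// snapshotFile copies src into cacheDir and returns the path of the copy.
// The file name "notion.snapshot.db" is a hypothetical choice.
func snapshotFile(src, cacheDir string) (string, error) {
	in, err := os.Open(src) // os.Open opens the file read-only
	if err != nil {
		return "", fmt.Errorf("open source: %w", err)
	}
	defer in.Close()

	if err := os.MkdirAll(cacheDir, 0o700); err != nil {
		return "", fmt.Errorf("create cache dir: %w", err)
	}

	// Copy into a temp file and rename it into place, so an interrupted
	// copy never leaves a partial snapshot under the final name.
	tmp, err := os.CreateTemp(cacheDir, "notion-*.tmp")
	if err != nil {
		return "", fmt.Errorf("create temp file: %w", err)
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename succeeded

	if _, err := io.Copy(tmp, in); err != nil {
		tmp.Close()
		return "", fmt.Errorf("copy: %w", err)
	}
	if err := tmp.Close(); err != nil {
		return "", fmt.Errorf("close snapshot: %w", err)
	}

	dst := filepath.Join(cacheDir, "notion.snapshot.db")
	if err := os.Rename(tmp.Name(), dst); err != nil {
		return "", fmt.Errorf("rename: %w", err)
	}
	return dst, nil
}

// demoSnapshot exercises snapshotFile against a throwaway file and returns
// the bytes read back from the snapshot.
func demoSnapshot() (string, error) {
	dir, err := os.MkdirTemp("", "notcrawl-demo-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	src := filepath.Join(dir, "notion.db")
	if err := os.WriteFile(src, []byte("fake notion.db bytes"), 0o600); err != nil {
		return "", err
	}
	dst, err := snapshotFile(src, filepath.Join(dir, "cache"))
	if err != nil {
		return "", err
	}
	got, err := os.ReadFile(dst)
	return string(got), err
}

func main() {
	got, err := demoSnapshot()
	if err != nil {
		fmt.Fprintln(os.Stderr, "snapshot failed:", err)
		os.Exit(1)
	}
	fmt.Printf("snapshot holds %d bytes\n", len(got))
}
```

If the source database happens to run in WAL mode, recent writes may live in a `-wal` sidecar file next to it, so a real snapshot would probably need to copy that file as well or use SQLite's own backup mechanism.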

API Source

API sync uses NOTION_TOKEN by default. It must:

search/list pages and data sources visible to the integration
recursively fetch block children
fetch users
fetch comments where the integration has access
obey Retry-After on rate limits
store raw JSON plus normalized rows
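One way to obey Retry-After is to turn the header value into a wait duration before retrying. The sketch below assumes the value arrives as a string (for example from `resp.Header.Get("Retry-After")`). It accepts both forms HTTP allows, a number of seconds or an HTTP date. The names `retryDelay`, `defaultBackoff`, and `demoNow` are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
	"time"
)

// defaultBackoff is used when the header is missing or unparseable.
// The 5-second value is an assumption, not taken from the spec.
const defaultBackoff = 5 * time.Second

// demoNow is a fixed clock so the example output is reproducible.
var demoNow = time.Date(2026, time.April, 27, 18, 0, 0, 0, time.UTC)

// retryDelay converts a Retry-After header value into a wait duration.
func retryDelay(value string, now time.Time, fallback time.Duration) time.Duration {
	v := strings.TrimSpace(value)
	if v == "" {
		return fallback
	}
	// Form 1: a non-negative number of seconds.
	if secs, err := strconv.Atoi(v); err == nil {
		if secs < 0 {
			return fallback
		}
		return time.Duration(secs) * time.Second
	}
	// Form 2: an HTTP date, interpreted relative to now.
	if at, err := http.ParseTime(v); err == nil {
		if d := at.Sub(now); d > 0 {
			return d
		}
		return 0 // the date is already in the past: retry immediately
	}
	return fallback
}

func main() {
	for _, v := range []string{"3", "Mon, 27 Apr 2026 18:00:30 GMT", "", "soon"} {
		fmt.Printf("%q -> %s\n", v, retryDelay(v, demoNow, defaultBackoff))
	}
}
```

Passing the clock in as a parameter keeps the function pure, which makes the date form easy to test without sleeping.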

New configs should use the current Notion API version. Existing configs pinned to the legacy 2022-06-28 version must continue to use the deprecated database query endpoints.
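The version rule can be pictured as a small routing function that picks the row-query path from the configured API version. This is only a sketch under assumptions: the cut-over version `2025-09-03` and both paths reflect the public Notion API as best understood here and are not stated in this spec, and `queryPath` is a hypothetical name. Under the newer API the ID would be a data source ID rather than a database ID.

```go
package main

import "fmt"

// firstDataSourceVersion is assumed to be the first Notion-Version value
// that exposes data-source endpoints. It is not stated in the spec.
const firstDataSourceVersion = "2025-09-03"

// queryPath returns the endpoint path used to query rows for a collection.
func queryPath(apiVersion, id string) string {
	// Notion versions are ISO dates (YYYY-MM-DD), so plain string
	// comparison orders them chronologically.
	if apiVersion >= firstDataSourceVersion {
		return "/v1/data_sources/" + id + "/query"
	}
	// Legacy configs, such as 2022-06-28, keep the database endpoint.
	return "/v1/databases/" + id + "/query"
}

func main() {
	fmt.Println(queryPath("2022-06-28", "abc"))
	fmt.Println(queryPath("2025-09-03", "abc"))
}
```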

SQLite Archive

SQLite is canonical. Markdown is generated output.

Store startup must enable WAL, foreign keys, a busy timeout, normal synchronous writes, in-memory temp storage, and the crawler query indexes needed for common page, collection, comment, raw-record, and sync-state lookups.
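The startup settings above map onto standard SQLite PRAGMA statements. The following sketch applies them through a small interface that `*sql.DB` and `*sql.Conn` both satisfy. The busy-timeout value and all Go names are assumptions, and the index creation is left out. A recording stand-in replaces a real driver so the example runs with the standard library alone.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// execer is the subset of *sql.DB / *sql.Conn needed to run a statement.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// startupPragmas lists the settings named in the spec, in order.
var startupPragmas = []string{
	"PRAGMA journal_mode = WAL",
	"PRAGMA foreign_keys = ON",
	"PRAGMA busy_timeout = 5000", // milliseconds; the value is an assumption
	"PRAGMA synchronous = NORMAL",
	"PRAGMA temp_store = MEMORY",
}

// applyPragmas runs every startup pragma and stops at the first failure.
func applyPragmas(ctx context.Context, db execer) error {
	for _, p := range startupPragmas {
		if _, err := db.ExecContext(ctx, p); err != nil {
			return fmt.Errorf("%s: %w", p, err)
		}
	}
	return nil
}

// recorder is a stand-in execer that records statements instead of
// sending them to a database.
type recorder struct{ queries []string }

func (r *recorder) ExecContext(_ context.Context, q string, _ ...any) (sql.Result, error) {
	r.queries = append(r.queries, q)
	return nil, nil
}

// dryRun shows which statements applyPragmas would issue.
func dryRun() ([]string, error) {
	rec := &recorder{}
	err := applyPragmas(context.Background(), rec)
	return rec.queries, err
}

func main() {
	queries, err := dryRun()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, q := range queries {
		fmt.Println(q)
	}
}
```

Most of these settings are per-connection (the WAL journal mode is the exception, since it persists in the database file). With a `database/sql` pool they would need to be applied to every connection, for example through the driver's connection options or by capping the pool at one connection.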

The report command must provide a SQL-free archive summary: total record counts, pages and comments edited within recent time windows, top databases, top spaces, and recently edited pages.

Core tables:

spaces
users
pages
blocks
collections
collection_views
comments
discussions
raw_records
sync_state
page_fts
comment_fts

Markdown Archive

Markdown export writes deterministic Unicode-safe paths. Path components keep readable letters, numbers, CJK text, and emoji while replacing filesystem path separators and unsafe punctuation with dashes:

pages/<space-slug>/<page-title>-<short-id>.md
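A path builder along these lines could look like the sketch below. It is an illustration under assumptions: whitespace is treated like unsafe punctuation, runs of replaced characters collapse into one dash, empty results become `untitled`, and the short ID is the first eight characters of the ID with dashes removed. None of these details are specified above, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"path"
	"strings"
	"unicode"
)

// safeComponent keeps letters, numbers, combining marks, and "other symbol"
// runes (which covers most single-codepoint emoji). Every other run of
// characters becomes a single dash; leading and trailing dashes are dropped.
func safeComponent(s string) string {
	var b strings.Builder
	pendingDash := false
	for _, r := range s {
		keep := r != unicode.ReplacementChar &&
			(unicode.IsLetter(r) || unicode.IsNumber(r) ||
				unicode.IsMark(r) || unicode.Is(unicode.So, r))
		if !keep {
			pendingDash = true
			continue
		}
		if pendingDash && b.Len() > 0 {
			b.WriteByte('-')
		}
		pendingDash = false
		b.WriteRune(r)
	}
	if b.Len() == 0 {
		return "untitled" // assumption: fallback for titles with no safe runes
	}
	return b.String()
}

// shortID returns the first 8 characters of the ID with dashes removed.
// The length is an assumption.
func shortID(id string) string {
	id = strings.ReplaceAll(id, "-", "")
	if len(id) > 8 {
		id = id[:8]
	}
	return id
}

// pagePath builds pages/<space-slug>/<page-title>-<short-id>.md.
func pagePath(space, title, id string) string {
	return path.Join("pages", safeComponent(space), safeComponent(title)+"-"+shortID(id)+".md")
}

func main() {
	fmt.Println(pagePath("Team Wiki", "設計/メモ: v2 🚀", "1f2e3d4c-5b6a-7980-1a2b-3c4d5e6f7081"))
}
```

Because the output depends only on the inputs, the same page always lands at the same path, which keeps re-exports and git diffs stable.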

Each file starts with YAML-ish front matter:

---
id: ...
space_id: ...
title: ...
source: desktop+api
notion_url: ...
created_time: ...
last_edited_time: ...
---
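A front-matter writer for the fields shown above might look like the following sketch. The struct, the function name, and the choice to double-quote only the title (so colons or quotes in a title cannot break the key/value layout) are assumptions, not behavior taken from the spec.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// pageMeta holds the front-matter fields listed in the spec.
type pageMeta struct {
	ID, SpaceID, Title, Source, NotionURL, CreatedTime, LastEditedTime string
}

// frontMatter renders the fields in a fixed order so output is deterministic.
func frontMatter(m pageMeta) string {
	var b strings.Builder
	b.WriteString("---\n")
	fmt.Fprintf(&b, "id: %s\n", m.ID)
	fmt.Fprintf(&b, "space_id: %s\n", m.SpaceID)
	// Assumption: the title is quoted because it is free-form user text.
	fmt.Fprintf(&b, "title: %s\n", strconv.Quote(m.Title))
	fmt.Fprintf(&b, "source: %s\n", m.Source)
	fmt.Fprintf(&b, "notion_url: %s\n", m.NotionURL)
	fmt.Fprintf(&b, "created_time: %s\n", m.CreatedTime)
	fmt.Fprintf(&b, "last_edited_time: %s\n", m.LastEditedTime)
	b.WriteString("---\n")
	return b.String()
}

func main() {
	fmt.Print(frontMatter(pageMeta{
		ID:             "abc",
		SpaceID:        "sp1",
		Title:          "Roadmap: Q2",
		Source:         "desktop+api",
		NotionURL:      "https://www.notion.so/abc",
		CreatedTime:    "2026-01-02T03:04:05Z",
		LastEditedTime: "2026-04-27T18:03:23Z",
	}))
}
```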

The body renders blocks into normalized Markdown. Unsupported blocks should be represented with concise placeholders, not silently dropped.
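The placeholder rule can be illustrated with a renderer whose default branch emits a marker instead of returning nothing. This is a deliberately tiny sketch: the `block` type, the handful of supported block types, and the HTML-comment placeholder format are assumptions chosen for illustration.

```go
package main

import "fmt"

// block is a simplified stand-in for a normalized Notion block.
type block struct {
	ID   string
	Type string
	Text string
}

// renderBlock converts one block to Markdown. Types it does not know are
// rendered as a short placeholder so readers can see something was there.
func renderBlock(b block) string {
	switch b.Type {
	case "paragraph":
		return b.Text
	case "heading_1":
		return "# " + b.Text
	case "bulleted_list_item":
		return "- " + b.Text
	default:
		// Placeholder format is hypothetical; it keeps the type and ID
		// so the block can be found again in raw_records.
		return fmt.Sprintf("<!-- unsupported block: %s (%s) -->", b.Type, b.ID)
	}
}

func main() {
	blocks := []block{
		{ID: "b1", Type: "heading_1", Text: "Notes"},
		{ID: "b2", Type: "paragraph", Text: "Hello."},
		{ID: "b3", Type: "synced_block"},
	}
	for _, b := range blocks {
		fmt.Println(renderBlock(b))
	}
}
```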

Git share mode exports:

manifest.json
data/*.jsonl.gz
pages/**/*.md
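The `data/*.jsonl.gz` files suggest gzip-compressed JSON Lines, one record per line. The sketch below writes and reads that shape with the standard library. The record type (`map[string]any`) and function names are assumptions, and the spec does not say how records are split across files.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/json"
	"fmt"
	"io"
)

// writeJSONLGz writes one JSON object per line into a gzip stream.
func writeJSONLGz(w io.Writer, records []map[string]any) error {
	zw := gzip.NewWriter(w)
	enc := json.NewEncoder(zw) // Encode appends a newline after each value
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			zw.Close()
			return err
		}
	}
	return zw.Close() // flushes remaining data and the gzip footer
}

// readJSONLGz reads records back from a gzip-compressed JSON Lines stream.
func readJSONLGz(r io.Reader) ([]map[string]any, error) {
	zr, err := gzip.NewReader(r)
	if err != nil {
		return nil, err
	}
	defer zr.Close()

	dec := json.NewDecoder(zr)
	var out []map[string]any
	for {
		var rec map[string]any
		if err := dec.Decode(&rec); err == io.EOF {
			return out, nil
		} else if err != nil {
			return nil, err
		}
		out = append(out, rec)
	}
}

// roundTrip writes records to memory and reads them back.
func roundTrip(records []map[string]any) ([]map[string]any, error) {
	var buf bytes.Buffer
	if err := writeJSONLGz(&buf, records); err != nil {
		return nil, err
	}
	return readJSONLGz(&buf)
}

func main() {
	got, err := roundTrip([]map[string]any{
		{"id": "p1", "title": "Home"},
		{"id": "p2", "title": "Notes"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("read back %d records\n", len(got))
}
```

`encoding/json` sorts map keys when encoding, so identical records produce identical lines, which helps keep snapshot diffs small.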

publish writes a snapshot and optionally commits/pushes it.

subscribe clones a snapshot repo, writes reader config, and imports data into SQLite without requiring Notion credentials.

update pulls the latest snapshot and imports it.

Database Export

API sync discovers databases/data sources visible to the integration, stores metadata in collections, queries each collection for row pages, and links those pages through pages.collection_id.

export-db renders row properties into delimited text:

notcrawl export-db --database <database-id> --format csv --output rows.csv
notcrawl export-db --database <database-id> --format tsv --output rows.tsv

The first columns are stable metadata:

page_id
page_title
url

Remaining columns come from the database schema, with any extra row properties appended alphabetically.
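The column ordering rule can be sketched as follows: fixed metadata columns first, then schema columns in schema order, then any properties that appear only on rows, sorted. The `row` type and function names are hypothetical, properties are simplified to strings, and "alphabetically" is taken here as Go's byte-wise string order, which is an assumption.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"os"
	"sort"
)

// metadataColumns are the stable leading columns named in the spec.
var metadataColumns = []string{"page_id", "page_title", "url"}

// row is a simplified database row with already-rendered property values.
type row struct {
	PageID, PageTitle, URL string
	Props                  map[string]string
}

// exportColumns returns metadata columns, then schema columns in schema
// order, then properties found only on rows in sorted order.
func exportColumns(schema []string, rows []row) []string {
	cols := append([]string{}, metadataColumns...)
	seen := map[string]bool{}
	for _, name := range schema {
		if !seen[name] {
			seen[name] = true
			cols = append(cols, name)
		}
	}
	var extra []string
	for _, r := range rows {
		for name := range r.Props {
			if !seen[name] {
				seen[name] = true
				extra = append(extra, name)
			}
		}
	}
	sort.Strings(extra)
	return append(cols, extra...)
}

// writeDelimited writes a header and rows using sep (',' or '\t').
func writeDelimited(w io.Writer, sep rune, schema []string, rows []row) error {
	cw := csv.NewWriter(w)
	cw.Comma = sep
	cols := exportColumns(schema, rows)
	if err := cw.Write(cols); err != nil {
		return err
	}
	for _, r := range rows {
		rec := []string{r.PageID, r.PageTitle, r.URL}
		for _, name := range cols[len(metadataColumns):] {
			rec = append(rec, r.Props[name]) // missing properties become ""
		}
		if err := cw.Write(rec); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

func main() {
	rows := []row{{
		PageID: "p1", PageTitle: "Ship v1", URL: "https://www.notion.so/p1",
		Props: map[string]string{"Status": "Done", "Zeta": "z", "Alpha": "a"},
	}}
	if err := writeDelimited(os.Stdout, ',', []string{"Status", "Owner"}, rows); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

Sorting the extra properties keeps the header stable across runs even though Go map iteration order is random.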
