On-disk substrate¶
Goals of the layout¶
- Portable.
tarit, mail it, untar it on another machine, runanchor serve --data-dir ./received. Same canvases, same documents. - Inspectable. Plain
.jsonand.md.catworks.jqworks.git logof the data folder tells a coherent story. - Independently lifecycled. Documents and canvases don't depend on each other. Add or remove either without orphaning the other.
- Producer-extensible. Every extension owns a top-level folder. Adding a new extension doesn't change paths owned by existing ones.
The full tree¶
data/
├── canvases/ (canvas core)
│ └── <slug>/
│ ├── meta.json slug, title, created_at, schema_version
│ ├── events.jsonl append-only event log, one event per line
│ └── snapshot.json cached fold of events at known version
│
├── bronze/ (anchor_pdfs)
│ └── <slug>/<original>.pdf raw PDF bytes
│
├── silver/ (anchor_pdfs)
│ └── <slug>.json Docling extraction — items, pages, bboxes
│
├── gold/ (anchor_pdfs, OIP-shaped)
│ ├── manifest.json anchor_pdfs's OIP manifest
│ └── artefacts/
│ └── <slug>/
│ ├── document.json OIP document — title, source_kind, ingested_at
│ ├── regions.json OIP regions — id, kind, source_ref, content paths
│ └── content/
│ ├── page_001.md polished markdown per page
│ ├── r0042.png region screenshot
│ └── r0042.json structured region data (e.g. spec table rows)
│
├── fmus/ (anchor_fmus)
│ ├── manifest.json anchor_fmus's OIP manifest
│ └── artefacts/
│ └── <slug>/
│ ├── document.json title (model name), source_kind=application/x-fmu
│ ├── regions.json regions: variables, parameters, simulation results
│ ├── source/<original>.fmu
│ └── content/
│ ├── model.json parsed modelDescription.xml
│ └── simulations/<sim-id>.json
│
└── .oip/
└── producers.d/ third-party OIP manifests, per-project scope
└── <name>.json
Per-canvas folder, line by line¶
data/canvases/<slug>/
├── meta.json
│ {
│ "slug": "alfa-laval-eval",
│ "title": "Alfa Laval LKH evaluation",
│ "created_at": "2026-04-30T08:14:22Z",
│ "schema_version": "0.2"
│ }
│
├── events.jsonl
│ {"id": "uuid", "ts": ..., "version": 1, "type": "NodeAdded", ...}
│ {"id": "uuid", "ts": ..., "version": 2, "type": "EdgeAdded", ...}
│ ...
│
└── snapshot.json
{
"version": 142,
"nodes": { "n1": {...}, "n2": {...}, ... },
"edges": { "e1": {...}, ... }
}
events.jsonl is the truth. snapshot.json is a performance
optimisation — without it cold-boot would have to replay every event;
with it the server reads the snapshot and only replays events with
version > snapshot.version. Compaction (rewriting the snapshot at
checkpoints, optionally truncating event history) is on the roadmap
but not implemented; today the log grows monotonically.
Per-document artefact, line by line¶
A data/gold/artefacts/alfa-laval-lkh/ produced by anchor_pdfs looks
like:
document.json
{
"slug": "alfa-laval-lkh",
"title": "Alfa Laval LKH centrifugal pump",
"source_kind": "application/pdf",
"source_path": "../../bronze/alfa-laval-lkh/original.pdf",
"ingested_at": "2026-04-30T08:14:22Z",
"ingested_by": "anchor-pdfs/0.2.0",
"size_units": {"page_count": 4, "item_count": 187}
}
regions.json
[
{
"id": "alfa-laval-lkh:r0007",
"kind": "spec_block",
"title": "Pump performance — model LKH-5",
"source_ref": {
"kind": "pdf-page-bbox",
"page": 2,
"bbox": [72.0, 144.0, 540.0, 312.0]
},
"content": {
"text": "content/r0007.md",
"image": "content/r0007.png",
"json": "content/r0007.json"
}
},
...
]
content/
├── page_001.md full polished markdown for page 1
├── page_002.md
├── r0007.md markdown for region 7 (a spec block)
├── r0007.png cropped screenshot of the bbox
├── r0007.json structured rows (canonical headers + units)
└── ...
This is the unit of provenance. Every value on the canvas points
back to a region. A spec-table row is a data field on a node;
that field carries the region id; the region carries the page and bbox;
the bbox is rendered as a highlight in the PDF viewer. There is no
detached "metadata table" — the source ref is the data.
Why JSONL events instead of a database¶
- Events are tiny, mostly < 1 KB. A canvas with 10 000 events is 10 MB.
- One file per workspace makes copying, moving, and diffing trivial.
- An append-only file is the simplest possible crash-safe write. No transactions, no migrations, no DBA required.
- A future Postgres / SQLite store is a port (the
WorkspaceStoreprotocol). The day we need indexed queries across workspaces is the day we add a second implementation. The file format stays.
Multi-tenant note¶
The current layout assumes one user, one machine. The multi-tenant
roadmap (separate memory entry) adds a tenants/<tenant>/ prefix and
moves canvases under tenants/<tenant>/canvases/. Documents stay
shared by default with opt-in scoping. Nothing in the current paths
needs to change to accommodate that — they become subpaths.