In a production database indexing 2.4 million personal files across 4.2 TB of data, the metadata is 4.6 GB. That is roughly a 1:1,000 ratio: one byte of metadata for every thousand bytes of content. This ratio is the foundation of an entire product architecture.
The Measurement
On 2026-03-20, during product design work on Heirloom (a local-first personal data management system), a query was run against a SQLite database built by scanning a real personal storage collection:
- 2,404,812 files indexed
- 4.2 TB of data (photos, videos, documents, RAW files)
- 4.6 GB SQLite database containing all metadata
That works out to ~1,900 bytes per file on average, or roughly 1:1,000 by byte count relative to the underlying data.
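The per-file figure and the ratio follow directly from the three numbers above. A quick check, assuming decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the measurement does not state explicitly:

```python
# Figures from the measurement above; decimal units are an assumption.
FILES = 2_404_812
CONTENT_BYTES = 4.2e12   # 4.2 TB
METADATA_BYTES = 4.6e9   # 4.6 GB

bytes_per_file = METADATA_BYTES / FILES
content_per_metadata_byte = CONTENT_BYTES / METADATA_BYTES
avg_file_size = CONTENT_BYTES / FILES

print(f"{bytes_per_file:,.0f} metadata bytes per file")
print(f"1:{content_per_metadata_byte:,.0f} metadata-to-content ratio")
print(f"{avg_file_size / 1e6:.2f} MB average file size")
```

Under these assumptions the unrounded ratio comes out a little above 1:900, and the average file is about 1.75 MB. The 1:1,000 figure is the rounded version.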
This isn’t a benchmark. It’s a production measurement from a real system. And it has significant consequences for how personal storage products should be architected.
What 1,900 Bytes Per File Actually Contains
The metadata record for a single file is not just a name and timestamp. Fully indexed, it includes:
| Field category | Contents | Approx size |
|---|---|---|
| Path and identity | Full path, filename, extension | 100–400 bytes |
| POSIX metadata | mtime, ctime, atime, size, inode, link count, permissions | ~100 bytes |
| Content hash | BLAKE3 (32 bytes) | 32 bytes |
| Directory tree | Struct hash of containing directory | 32 bytes |
| Extended attributes | EXIF for photos, ID3 for audio, container metadata for video | 200–800 bytes |
| AI annotations | Labels, scene descriptions, content classifications | 200–600 bytes |
| Replication state | Priority, disposition, replica locations | 50–200 bytes |
| Database overhead | B-tree nodes, indexes, row headers | ~100 bytes |
BLAKE3 hashes are almost negligible at 32 bytes per file. The bulk of per-file storage is paths, human-readable metadata, and AI-derived annotations. The more you index, the more valuable the record becomes, and the storage cost grows only linearly with the bytes you add per file.
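To make the field categories concrete, here is a minimal sketch of how such a record might be laid out in SQLite. The table name, column names, and the choice of JSON text columns are all hypothetical. They illustrate the categories in the table and are not Heirloom's actual schema.

```python
import sqlite3

# Hypothetical schema: names and column layout are illustrative only.
SCHEMA = """
CREATE TABLE files (
    id            INTEGER PRIMARY KEY,
    path          TEXT NOT NULL,        -- full path, filename, extension
    size_bytes    INTEGER NOT NULL,     -- POSIX metadata starts here
    mtime_ns      INTEGER,
    ctime_ns      INTEGER,
    atime_ns      INTEGER,
    inode         INTEGER,
    nlink         INTEGER,
    mode          INTEGER,
    content_hash  BLOB NOT NULL CHECK (length(content_hash) = 32),
    dir_hash      BLOB CHECK (length(dir_hash) = 32),
    extended_json TEXT,                 -- EXIF / ID3 / container metadata
    ai_json       TEXT,                 -- labels, descriptions, classifications
    replica_json  TEXT                  -- priority, disposition, locations
);
CREATE INDEX files_by_hash ON files(content_hash);
CREATE INDEX files_by_path ON files(path);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO files (path, size_bytes, content_hash) VALUES (?, ?, ?)",
    ("/photos/2019/img_0001.jpg", 5_000_000, bytes(32)),  # placeholder hash
)
row_count = conn.execute("SELECT COUNT(*) FROM files").fetchone()[0]
```

The two indexes are where the "database overhead" row in the table comes from: each one stores a second copy of the indexed column in its own B-tree.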
SQLite itself scales cleanly here. The format supports databases up to 281 TB with the maximum 64 KB page size, and the PRAGMA max_page_count default since 3.45.0 (January 2024) supports ~17.6 TB at the default 4 KB page size, far beyond any personal collection. Throughput in WAL mode exceeds 70,000 writes/second and 496,000 reads/second on commodity hardware. At 10 million files, a full table scan still completes in seconds.
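The size limits can be checked against whatever SQLite build is installed locally. The sketch below reads `PRAGMA page_size` and `PRAGMA max_page_count` from a fresh in-memory connection and multiplies them. The values printed depend on the SQLite version linked into Python, so no expected output is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
page_size = conn.execute("PRAGMA page_size").fetchone()[0]
max_pages = conn.execute("PRAGMA max_page_count").fetchone()[0]
limit_bytes = page_size * max_pages

print(f"SQLite {sqlite3.sqlite_version}: "
      f"page_size={page_size}, max_page_count={max_pages}")
print(f"size limit for this connection: {limit_bytes / 1e12:.2f} TB")

# Format-level ceiling: 2**32 - 2 pages at the 65,536-byte maximum page size.
FORMAT_MAX_BYTES = (2**32 - 2) * 65_536
print(f"format ceiling: {FORMAT_MAX_BYTES / 1e12:.0f} TB")
```

A catalog expected to grow past the default limit would need its page size set before the first table is created, since changing it later requires rebuilding the file.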
The Ratio at Scale
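The ~1,900 bytes per file holds regardless of file size, so the 1:1,000 ratio is an average for a mixed collection (about 1.75 MB per file in the measured database). The metadata cost per unit of content varies dramatically with average file size: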
| Average file size | Files per PB of content | Metadata per PB | What fits on a 4 TB NVMe index |
|---|---|---|---|
| 100 KB (documents) | 10 billion | 19 TB | 0.2 PB of content |
| 1 MB (photos, compressed) | 1 billion | 1.9 TB | 2 PB of content |
| 5 MB (RAW / HEIC / ProRes) | 200 million | 380 GB | 10 PB of content |
| 100 MB (video clips) | 10 million | 19 GB | 200 PB of content |
| 1 GB (feature films) | 1 million | 1.9 GB | 2,000 PB of content |
A 4 TB NVMe drive dedicated purely to indexing video clips can represent 200 petabytes of content. The math is favorable in every direction that matters for consumer and prosumer use cases.
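The table can be reproduced from a single constant. This sketch assumes the measured ~1,900 bytes per file applies uniformly and uses decimal units. The function names are illustrative.

```python
BYTES_PER_FILE = 1_900      # measured average from the production database
PB = 10**15                 # decimal petabyte (assumption)
INDEX_DRIVE = 4 * 10**12    # 4 TB NVMe

def metadata_per_pb(avg_file_size: int) -> int:
    """Bytes of metadata needed to index 1 PB of files of this average size."""
    files = PB // avg_file_size
    return files * BYTES_PER_FILE

def content_indexable(avg_file_size: int, index_bytes: int = INDEX_DRIVE) -> float:
    """Petabytes of content that a given index capacity can describe."""
    files = index_bytes / BYTES_PER_FILE
    return files * avg_file_size / PB

for label, size in [("100 KB", 10**5), ("1 MB", 10**6), ("5 MB", 5 * 10**6),
                    ("100 MB", 10**8), ("1 GB", 10**9)]:
    print(f"{label:>7}: {metadata_per_pb(size) / 1e9:>10,.1f} GB per PB, "
          f"{content_indexable(size):>8,.1f} PB per 4 TB index")
```

The "what fits" column in the table is rounded down slightly from these values (for example 210 PB becomes 200 PB).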
Real-world Heirloom target: a family with 50 TB of photos and video, averaging ~5 MB per file, has roughly 10 million files and generates approximately 20 GB of metadata. That fits in the RAM of a well-configured 2022 MacBook Pro. It is larger than most cloud free tiers (typically 5 to 15 GB), but at R2 prices storing it costs cents per month.
The R2 Economics
Cloudflare R2 charges $0.015 per GB per month with zero egress fees. No bandwidth charge to read data out. No bandwidth charge to sync to another device. No lock-in through bandwidth pricing.
Run the numbers for Heirloom customers storing only metadata in R2:
| Customer profile | Content size | Files (avg 5 MB) | Metadata | R2 cost/month | Product price | Gross margin on infra |
|---|---|---|---|---|---|---|
| Family | 50 TB | ~10 million | ~20 GB | $0.30 | $10 | 97% |
| Power user | 500 TB | ~100 million | ~190 GB | $2.85 | $25 | 89% |
| Small studio | 5 PB | ~1 billion | ~1.9 TB | $28.50 | $99 | 71% |
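The cost and margin columns follow from the storage price and the per-file metadata size. This sketch counts storage only: it ignores per-request operation charges and any free allowance, and it uses 1,900 bytes per file rather than the rounded ~20 GB in the Family row.

```python
R2_USD_PER_GB_MONTH = 0.015   # storage price quoted above
BYTES_PER_FILE = 1_900

def monthly_cost(files: int) -> float:
    """Monthly storage cost in USD for a catalog of this many files."""
    metadata_gb = files * BYTES_PER_FILE / 1e9
    return metadata_gb * R2_USD_PER_GB_MONTH

def gross_margin(price: float, cost: float) -> float:
    """Fraction of the product price left after infrastructure cost."""
    return (price - cost) / price

for name, files, price in [("Family", 10_000_000, 10),
                           ("Power user", 100_000_000, 25),
                           ("Small studio", 1_000_000_000, 99)]:
    cost = monthly_cost(files)
    print(f"{name:<12} ${cost:.3f}/month  margin {gross_margin(price, cost):.0%}")
```

Margin falls as the catalog grows because the three example prices rise more slowly than file count does.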
Compare to what iCloud costs to store the content itself: 2 TB on iCloud+ costs $9.99/month, and the largest tier is 12 TB. That ceiling stops an archive of this size cold: storing 50 TB on iCloud is not an option at any price. On Google One, 2 TB is also $9.99/month, and its larger consumer tiers still fall short of 50 TB.
These services are selling you cloud storage. Heirloom is not a cloud storage service. The content never leaves your drives. The catalog — what you have, where it is, what it means — syncs across devices for $0.30/month.
For comparison, Backblaze B2 charges $6/TB/month for content storage with egress at $0.01/GB. AWS S3 Standard runs approximately $23/TB/month with egress at $0.09/GB. Google Cloud Storage Standard is similarly priced. Every major content-cloud provider charges for both storage and retrieval. R2 charges no egress fee, and reads incur only per-request operation charges, making it structurally suited for catalog-only sync patterns that require frequent reads across devices.
The Architecture This Enables
Scan once, query forever
Because the metadata database is small relative to the data it describes, the entire catalog can live on a fast local NVMe and be queried without touching the underlying files. Drives can be offline. Files can be on a shelf in a hard case. The system can still answer: “What do I have? Where is it? Is it duplicated? What’s in that folder from 2019?”
This is a fundamental UX shift. Traditional file browsers require the drive to be mounted. Heirloom’s browser requires only the index. A laptop with the catalog synced from R2 can navigate the full family archive without spinning up a single external drive.
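The questions above map onto ordinary SQL against the catalog, with no drive mounted. A minimal sketch follows, using a cut-down hypothetical `files` table. The column names, drive labels, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE files (
    path TEXT, size_bytes INTEGER, content_hash BLOB, drive_label TEXT)""")
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?)", [
    ("/2019/italy/img_001.raw", 25_000_000, b"\x01" * 32, "shelf-A"),
    ("/backup/italy/img_001.raw", 25_000_000, b"\x01" * 32, "shelf-B"),
    ("/2019/italy/img_002.raw", 24_000_000, b"\x02" * 32, "shelf-A"),
    ("/2021/taxes/return.pdf", 300_000, b"\x03" * 32, "laptop"),
])

# "What's in that folder from 2019?"
in_2019 = conn.execute(
    "SELECT path FROM files WHERE path LIKE '/2019/%' ORDER BY path"
).fetchall()

# "Is it duplicated, and where?"
dupes = conn.execute("""
    SELECT hex(content_hash), COUNT(*), group_concat(drive_label, ', ')
    FROM files
    GROUP BY content_hash
    HAVING COUNT(*) > 1
""").fetchall()

print(in_2019)
print(dupes)
```

Neither query opens a file on `shelf-A` or `shelf-B`. The drive label is just another column in the catalog.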
Pre-compute aggressively
At 1,900 bytes per file and 20 GB for a 10-million-file collection, every reasonable derived value — directory trees, content hashes, AI labels, perceptual hashes for deduplication — can be materialized in the index at scan time. There is no need to recompute on query. The index is cheap; query time should be instant.
SQLite’s FTS5 full-text search extension handles filename and annotation search across millions of rows in milliseconds. Combined with pre-computed vector embeddings for semantic search — stored in a sidecar LanceDB or SQLite-vec file — the full query surface can be served from disk at sub-100ms latency.
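As an illustration of the FTS5 part, the sketch below builds a small full-text index over paths and annotations and runs one query. It assumes the SQLite library linked into Python was compiled with FTS5, which is common but not guaranteed. The table name and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumes the linked SQLite library was compiled with FTS5.
conn.execute("CREATE VIRTUAL TABLE file_search USING fts5(path, annotations)")
conn.executemany("INSERT INTO file_search VALUES (?, ?)", [
    ("/2019/italy/img_001.raw", "beach sunset family"),
    ("/2019/italy/img_002.raw", "cathedral interior"),
    ("/2021/taxes/return.pdf", "tax return document"),
])

hits = conn.execute(
    "SELECT path FROM file_search WHERE file_search MATCH ? ORDER BY rank",
    ("sunset",),
).fetchall()
print(hits)
```

Vector search through LanceDB or sqlite-vec is not shown here, since both are third-party components with their own APIs.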
GDPR-trivial compliance
The metadata database contains hashes, paths, and machine-generated annotations. It contains no content: no image pixels, no document text, no audio samples. That does not put it outside GDPR, because a file path can contain a name, EXIF can contain GPS coordinates, and an AI label can describe a person, and all of these can be personal data. What it does is shrink the compliance surface: a data subject’s deletion request is satisfied by removing the relevant index entries, and the content itself, on private drives, never reaches a server where the regulation’s obligations for hosted content would apply.
This is a structural advantage over iCloud, Google Photos, Dropbox, and every other service that uploads your actual files. Those services carry the full weight of GDPR compliance: subject access requests, data portability, right to erasure, security obligations for content-at-rest. Heirloom’s server sees only catalog entries.
Content identity without content storage
BLAKE3 hashing provides content-addressed identity for every file. Two files with the same hash are, for all practical purposes, identical (an accidental collision in a 256-bit hash is vanishingly unlikely), regardless of filename, path, or modification timestamp. This enables:
- Cross-device deduplication without transferring content
- Integrity verification of local files against the catalog record
- Offline replica tracking — the system knows which drives hold which files by hash, not by path
This is the same content-addressing model used by IPFS, Git, and Perkeep (formerly Camlistore). Perkeep, now at version 0.12, has been building this model since 2013 — their “blobs” are content-addressed chunks, and their schema layer records metadata separately. Spacedrive (37k GitHub stars, $2M seed round) uses the same BLAKE3-adaptive-hashing approach with SQLite as the local metadata store and synchronizes metadata — not content — across devices. The convergence on this architecture across independent projects is not coincidental. It is the correct decomposition.
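A minimal sketch of hash-based deduplication and integrity checking follows. BLAKE3 is not in the Python standard library, so this uses `hashlib.blake2b` with a 32-byte digest as a stand-in. The third-party `blake3` package would be the natural substitute in a real implementation. The function names and sample data are illustrative.

```python
import hashlib
from collections import defaultdict

def content_id(data: bytes) -> bytes:
    # Stand-in: BLAKE2b with a 32-byte digest, NOT BLAKE3.
    # BLAKE3 is not in the Python standard library.
    return hashlib.blake2b(data, digest_size=32).digest()

def group_by_content(files: dict[str, bytes]) -> dict[bytes, list[str]]:
    """Map each content hash to every path that holds those bytes."""
    groups: dict[bytes, list[str]] = defaultdict(list)
    for path, data in files.items():
        groups[content_id(data)].append(path)
    return dict(groups)

def verify(data: bytes, recorded_hash: bytes) -> bool:
    """Integrity check of local bytes against the catalog record."""
    return content_id(data) == recorded_hash

files = {
    "driveA/2019/img_001.raw": b"pixels-1",
    "driveB/backup/img_001_copy.raw": b"pixels-1",
    "driveA/2019/img_002.raw": b"pixels-2",
}
groups = group_by_content(files)
duplicates = [paths for paths in groups.values() if len(paths) > 1]
print(duplicates)
```

Once the hashes are in the catalog, finding duplicates across two drives means comparing 32-byte values, so neither drive's content has to be read or transferred again.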
What This Is Not
It is not a backup system. Heirloom knows where your files are and maintains their catalog. It does not move or copy your files unless you explicitly trigger replication. The catalog is not a backup; it is a map.
It is not a cloud drive. Google Drive, iCloud Drive, Dropbox — these are remote filesystems. Files live in the cloud, with local caches. Heirloom inverts this: files live on your drives, with a cloud catalog. The content is always local. The intelligence is everywhere.
It is not a NAS or RAID replacement. Those are hardware solutions to physical redundancy. Heirloom operates at the intelligence layer above them. You can use NAS, RAID, or bare drives; Heirloom indexes whatever storage you have.
Enterprise Headroom
The same economics that make a family deployment cheap make enterprise deployments tractable.
A photojournalism agency with 1 PB of RAW files (average 25 MB, ~40 million files) generates approximately 76 GB of metadata. A 1 TB NVMe holds the complete catalog. R2 cost: $1.14/month for catalog sync. The agency’s content — 1 PB of irreplaceable images — never leaves their on-premises storage. The catalog lives on every editor’s laptop.
A university library digitizing 500 years of physical documents, producing 10 billion files averaging 100 KB (1 PB total), generates ~19 TB of metadata. That is a large SQLite database, beyond casual use and above the limit at the default 4 KB page size, so a single-file catalog would need a larger page size and a high-capacity NVMe. Sharded across 20 drives, each holding metadata for 500 million files (950 GB), the architecture scales without a distributed system.
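One way the 20-way split could work is to route each record by its content hash. This is a sketch of that idea only: the source does not say how Heirloom would shard, and the file naming scheme is hypothetical.

```python
NUM_SHARDS = 20  # matches the 20-drive example above

def shard_for(content_hash: bytes, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from the first 8 bytes of the content hash."""
    return int.from_bytes(content_hash[:8], "big") % num_shards

def shard_filename(content_hash: bytes) -> str:
    # Hypothetical naming scheme: one SQLite file per shard.
    return f"catalog-{shard_for(content_hash):02d}.sqlite"

example_hash = bytes(range(32))
print(shard_for(example_hash), shard_filename(example_hash))
```

Because cryptographic hash output is uniformly distributed, routing by hash should spread files evenly. The trade-off is that a query by path or folder has to fan out to all 20 shards, whereas routing by path prefix would keep folders together but risk uneven shard sizes.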
At the opposite extreme, 1 PB of video averaging 100 MB per clip (10 million files) needs a catalog of only ~19 GB. That is trivially queryable on a laptop. R2 cost: $0.285/month.
The Competitive Moat
Cloud storage providers are locked into their current model by their own revenue structure. iCloud charges $9.99/month for 2 TB because storing and serving 2 TB of content costs money and they need a margin on top. Backblaze B2 charges $6/TB/month for content. AWS S3’s egress fees make moving data out expensive, which reinforces lock-in. These are not bugs; they are the business model.
Heirloom’s business model does not require storing content. The product price is not set by infrastructure cost — the infrastructure cost is essentially zero. The product price is set by value delivered: the ability to find, understand, and trust your entire personal archive, across every device, forever.
The pitch: “We never touch your files. We just remember everything about them.”
This is not a marketing statement. It is a precise description of the technical architecture. The catalog is cheap. The catalog is the product. The catalog is the moat.
Prior Art
- Perkeep (camlistore.org): Content-addressed personal storage since 2013. Version 0.12, November 2025. Technically correct but requires programmers to operate. No consumer UX.
- Spacedrive: Virtual distributed filesystem using SQLite + BLAKE3, Rust, local-first metadata sync. Pivoted to “indexing as the core product.” 37k GitHub stars. Aimed at power users and developers.
- Apple Photos: Maintains a SQLite metadata database in ~/Pictures/Photos Library.photoslibrary/database/. Indexes every photo’s EXIF, face recognition data, scene labels, and relationship graph locally. Content stays on device (unless iCloud Photos enabled). The metadata database for a large library routinely reaches 2–5 GB. Apple solved this for photos; Heirloom solves it for everything.
- rsync / rclone: Transfer tools with basic checksumming. No persistent catalog, no query layer, no AI annotations.
- Syncthing: Peer-to-peer content sync. Syncs the content, not just the catalog. Useful but opposite of the Heirloom model.
None of these combine: consumer UX + full-collection catalog + local-first + cloud catalog sync + AI annotations + drive-agnostic identity. That combination is the product.
Summary
| Principle | Consequence |
|---|---|
| Metadata is ~1,900 bytes per file | A 4 TB NVMe indexes 2+ billion files |
| 1:1,000 ratio vs content | 50 TB family archive = 20 GB catalog |
| R2 at $0.015/GB, zero egress | $0.30/month for a full family catalog |
| Content never uploaded | GDPR compliance is structurally easy |
| BLAKE3 content addressing | Deduplication and integrity without file transfer |
| SQLite to 281 TB, 70k+ writes/s | No distributed database needed at any personal scale |
| Catalog queryable without drives | Offline navigation of the full archive |
The ratio is 1:1,000. The margin is 97%. The moat is that the content never moves.
Derived from a product design conversation, 2026-03-20. Measurements from a production SQLite database (2,404,812 files, 4.2 TB, 4.6 GB metadata). Pricing current as of March 2026: Cloudflare R2 $0.015/GB/month, Backblaze B2 $6/TB/month, iCloud+ 2 TB $9.99/month. SQLite limits from sqlite.org/limits.html. Spacedrive architecture from github.com/spacedriveapp/spacedrive. Perkeep from perkeep.org.