Distributed Block Storage for Home Drives

Most people have more storage than they think. The problem is that no software has the intelligence to use it. Content-addressed block distribution fixes this, not by adding hardware, but by treating the drives you already own as a single pool.

The Problem Nobody Has Solved

A typical person owns between 3 and 8 storage devices right now.

A laptop with a 1 TB SSD. One or two external USB drives bought over the years, probably 2–4 TB each. An old laptop in a drawer with 512 GB. Maybe a spare NAS drive. Most of these are 40–70% full of files the owner doesn’t fully understand, many of them duplicated across devices.

Ask them: “How much free space do you have total?” They can’t answer. Ask them: “Is your wedding video backed up?” They’re not sure. Ask them: “What would you lose if that drive died tonight?” They’d have to spend an hour figuring it out.

This is not a hardware problem. The hardware is cheap and abundant. A 2 TB external drive costs around $60. Most people have significant storage capacity sitting idle.

The problem is intelligence. There is no system that knows what’s on all those drives, understands relationships between files, distributes data across available space, or maintains redundancy without a sysadmin running it.

Cloud backup services (Backblaze, iCloud, Google Photos) offer some of this, but they hit a hard constraint: bandwidth. Uploading 10 TB at 100 Mbps, which is already fast for a residential uplink, takes about 9 days of continuous transfer. Most people never finish their first full backup. And once the data is there, restoring that same 10 TB can take days more, right when you need it most.
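The 9-day figure is plain arithmetic. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes, 1 Mbps = 10^6 bits per second) and a link that sustains its full rate with no protocol overhead; the function name is illustrative:

```python
def transfer_days(size_bytes: float, link_mbps: float) -> float:
    """Days of continuous transfer needed to move size_bytes over a link_mbps link."""
    bits = size_bytes * 8
    seconds = bits / (link_mbps * 1_000_000)
    return seconds / 86_400  # seconds per day

TB = 10**12
days_at_100 = transfer_days(10 * TB, 100)
print(f"10 TB at 100 Mbps: {days_at_100:.1f} days")
```

Halving the link speed doubles the time, so real-world figures depend heavily on the uplink actually available.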

The solution isn’t cloud. The solution is to apply the intelligence of cloud storage to the drives already in the house.

The Core Insight: Blocks, Not Files

Traditional file storage has a fundamental limitation: a file must fit on a single drive.

If you have a 50 GB video and your drives are 30 GB free, 25 GB free, and 20 GB free, traditional systems tell you: nowhere to put it. The file won’t fit anywhere.

This is wrong. You have 75 GB of free space. The file is 50 GB. There is space.

Content-addressed block storage breaks that constraint.

Every file is atomized into fixed-size chunks. A 50 GB video becomes ~12,800 chunks of 4 MB each. Each chunk gets a hash — specifically a BLAKE3 hash — that is its permanent, universal identity. The chunk is stored by its hash. The file is reconstructed by assembling its chunks in order.

Now the distribution problem becomes tractable. Those 12,800 chunks can be spread across however many drives have free space. The 30 GB drive takes some, the 25 GB drive takes some, the 20 GB drive takes the rest. The file is logically stored. It’s split physically — but that’s invisible to the user.
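The arithmetic behind that example can be sketched in a few lines. This is an illustration only: the function name and the first-fit order are assumptions, sizes use binary units (4 MiB blocks), and a real placement policy would balance drives rather than fill them in order.

```python
import math

BLOCK_SIZE = 4 * 1024**2  # 4 MiB
GIB = 1024**3

def split_across_drives(file_size: int, free_bytes: dict):
    """Assign whole blocks to drives first-fit.

    Returns {drive: block_count}, or None if the pool cannot hold the file.
    """
    remaining = math.ceil(file_size / BLOCK_SIZE)
    plan = {}
    for drive, free in free_bytes.items():
        take = min(remaining, free // BLOCK_SIZE)
        if take:
            plan[drive] = take
            remaining -= take
    return plan if remaining == 0 else None

free = {"drive_a": 30 * GIB, "drive_b": 25 * GIB, "drive_c": 20 * GIB}
plan = split_across_drives(50 * GIB, free)
print(plan)
```

No single drive can hold the file, but the pool can: the 12,800 blocks fit once they are allowed to land on more than one drive.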

This is not a novel insight at the infrastructure layer. IPFS does it. Ceph does it. What’s new here is applying it to offline, intermittently-connected, consumer-grade drives with a UX designed for non-technical users.

How It Works

Step 1: Atomization

When a file is ingested into the pool, it is split into fixed-size blocks. 4 MB is a good default: it is large enough to amortize seek time on spinning disks, limits per-block overhead for large files, and keeps block counts manageable. A 100 MB file produces 25 blocks. A 4 TB video archive produces ~1 million blocks.

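Each block is hashed with BLAKE3. The hash is the block’s identity. If two files contain an identical block, such as two copies of the same photo or two videos that share the same opening 4 MB, they share that block on disk. Storage deduplication happens automatically, at the content level, with no extra work. One caveat: with fixed-size blocks this only catches content that lines up on the same block boundaries. The same bytes at a different offset (a photo embedded in two different documents, for example) produce different blocks and are not deduplicated; that case would need content-defined chunking.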
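A minimal sketch of atomization, under stated assumptions: BLAKE2b with a 32-byte digest stands in for BLAKE3 (which is not in the Python standard library), a plain dict stands in for on-disk block storage, and all function names are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB

def block_hash(chunk: bytes) -> str:
    # ASSUMPTION: BLAKE2b (32-byte digest) stands in for BLAKE3, which the
    # Python standard library does not provide.
    return hashlib.blake2b(chunk, digest_size=32).hexdigest()

def atomize(data: bytes, store: dict, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks, store each by hash, return the manifest."""
    manifest = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        digest = block_hash(chunk)
        store.setdefault(digest, chunk)  # an identical block is stored only once
        manifest.append(digest)
    return manifest

def reassemble(manifest: list, store: dict) -> bytes:
    """Rebuild the original bytes by concatenating blocks in manifest order."""
    return b"".join(store[digest] for digest in manifest)

# Tiny 4-byte blocks keep the demo readable.
store = {}
manifest = atomize(b"AAAABBBBAAAACC", store, block_size=4)
print(len(manifest), len(store))  # 4 blocks in the manifest, 3 unique blocks stored
```

The repeated `AAAA` block is stored once. Prepending a single byte to the input shifts every boundary and none of the blocks match any more, which is the alignment limit of fixed-size chunking.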

Step 2: Distribution

The system maintains a live map of all drives in the pool: which drives are connected, how much free space each has, and which blocks each currently holds.

When blocks need to be stored, they are distributed according to a placement policy:

Fill factor: prefer drives with more free space
Redundancy: store each block N times across N different physical drives
Locality hints: if a file is frequently accessed, prefer blocks on faster drives (NVMe > USB 3.0 > USB 2.0)
Geographic diversity: if drives are tagged with locations, prefer distributing across locations

The default redundancy factor is 2. Every block exists on at least 2 drives, so the pool recovers fully from a single drive failure. With drives at different physical locations, this is geographic redundancy without a cloud bill.
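One way the fill-factor and redundancy rules could combine is sketched below. The `Drive` class and `place_block` function are hypothetical, and the locality and geographic hints are left out to keep the example short.

```python
from dataclasses import dataclass, field

@dataclass
class Drive:
    drive_id: str
    free_bytes: int
    online: bool = True
    blocks: set = field(default_factory=set)

def place_block(block_hash: str, size: int, drives: list, redundancy: int = 2) -> list:
    """Store block_hash on enough drives to reach `redundancy` copies.

    Returns the ids of the drives that received a new copy.
    """
    existing = sum(1 for d in drives if block_hash in d.blocks)
    needed = max(redundancy - existing, 0)
    # Fill factor: emptiest drives first. A drive never holds two copies.
    candidates = sorted(
        (d for d in drives
         if d.online and d.free_bytes >= size and block_hash not in d.blocks),
        key=lambda d: d.free_bytes,
        reverse=True,
    )
    if len(candidates) < needed:
        raise RuntimeError(f"only {len(candidates)} eligible drives, need {needed}")
    chosen = candidates[:needed]
    for d in chosen:
        d.blocks.add(block_hash)
        d.free_bytes -= size
    return [d.drive_id for d in chosen]

drives = [
    Drive("laptop", 100),
    Drive("blue_wd", 300),
    Drive("old_laptop", 200, online=False),
    Drive("nas", 150),
]
first = place_block("h1", 10, drives)
print(first)
```

The offline drive is skipped even though it has more free space than the NAS, and calling `place_block` again for the same block is a no-op because the redundancy target is already met.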

Step 3: The Metadata Layer

Here is where the economics get interesting.

The physical blocks can be large — 4 MB each. But the metadata describing them is tiny. For each file, the system needs:

The original file path and name
File timestamps and attributes
An ordered list of block hashes (the “block manifest”)
A record of which drive holds each block

The numbers:

Metric	Value
Bytes per block hash (BLAKE3, 32 bytes)	32 B
Average file size	5 MB (1.25 blocks)
Metadata per file (path + hashes + manifest)	~1,900 B
1 million files	~1.9 GB metadata
10 million files (50 TB photos/video)	~19 GB metadata
4 TB NVMe metadata capacity	~10 PB of content described

The entire metadata layer for a family’s lifetime of digital content fits in RAM. At ~1,900 bytes per file, a laptop with 32 GB of RAM can hold the index for roughly 17 million files, or about 84 TB of content at the 5 MB average.
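The rows of the table can be reproduced directly. A small sketch, assuming decimal units and the table’s own inputs (1,900 bytes of metadata per file, 5 MB average file size); the function names are illustrative:

```python
META_BYTES_PER_FILE = 1_900   # per-file metadata figure from the table
AVG_FILE_BYTES = 5 * 10**6    # 5 MB average file size from the table

GB, TB, PB = 10**9, 10**12, 10**15

def metadata_bytes(n_files: int) -> int:
    """Total metadata size for n_files files."""
    return n_files * META_BYTES_PER_FILE

def content_described(index_capacity_bytes: int) -> int:
    """Bytes of content whose metadata fits in index_capacity_bytes."""
    n_files = index_capacity_bytes // META_BYTES_PER_FILE
    return n_files * AVG_FILE_BYTES

print(metadata_bytes(1_000_000) / GB)    # 1 million files, in GB
print(metadata_bytes(10_000_000) / GB)   # 10 million files, in GB
print(content_described(4 * TB) / PB)    # content indexed by 4 TB of metadata, in PB
```

Changing either input shifts every row proportionally, so the estimates are only as good as the 1,900-byte and 5 MB assumptions.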

This changes the access pattern entirely. Finding a file, checking what blocks it needs, determining which drives hold those blocks — all of this is RAM-speed. The only disk I/O is reading the actual blocks during reconstruction.

Step 4: Reconstruction

When a user requests a file, the system:

Looks up the file in the metadata index (RAM, instant)
Identifies all block hashes in the manifest (RAM, instant)
Queries which drives currently hold each block
Determines the optimal retrieval path (fastest available drive per block)
Reads blocks from drives, assembles in order, streams to user

If all drives holding a block are disconnected, reconstruction for that file is deferred until at least one drive returns. The system knows this in advance and can alert the user: “Plug in the blue WD drive to access this video.”
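The lookup and retrieval-path steps might look like the sketch below. The data shapes are assumptions: `placements` maps a block hash to the drives holding it, and `online_speed` maps each currently connected drive to a rough throughput figure.

```python
def plan_reconstruction(manifest: list, placements: dict, online_speed: dict):
    """Pick the fastest online drive for each block.

    Returns (plan, missing): plan is an ordered list of (block, drive) pairs;
    missing maps each unreachable block to the offline drives that hold it.
    """
    plan, missing = [], {}
    for block in manifest:
        holders = placements.get(block, set())
        reachable = [d for d in holders if d in online_speed]
        if reachable:
            plan.append((block, max(reachable, key=online_speed.get)))
        else:
            missing[block] = sorted(holders)
    return plan, missing

placements = {
    "h1": {"laptop_nvme", "usb3_drive"},
    "h2": {"usb3_drive"},
    "h3": {"blue_wd"},
}
online_speed = {"laptop_nvme": 3000, "usb3_drive": 100}  # MB/s, illustrative
plan, missing = plan_reconstruction(["h1", "h2", "h3"], placements, online_speed)
print(plan)
print(missing)
```

A non-empty `missing` result is what would drive the “plug in the blue WD drive” prompt: the index already knows which offline drive holds the block.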

Step 5: The “Never Delete” Model

Deletion in this system is a metadata operation.

When a user deletes a file, its blocks are dereferenced: the file entry is removed from the index, but the block data remains on disk. Dereferenced blocks are physically reclaimed only when garbage collection (GC) runs to free space.

This enables:

Undo at any time: a deleted file can be restored until its blocks are GC’d
Version history: previous versions of modified files keep their old blocks
Ransomware recovery: encrypted file versions are new blocks; original blocks persist until GC

GC is explicitly triggered, never automatic. The user sees: “You have 47 GB of recoverable deleted content. Free space by cleaning it now?” They decide.
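A sketch of deletion as a metadata operation, with user-confirmed GC. `BlockIndex` is a hypothetical in-memory stand-in for the real index, and keeping deleted manifests in a `trash` map is one assumed way to support undo.

```python
class BlockIndex:
    """In-memory stand-in for the metadata index (hypothetical structure)."""

    def __init__(self):
        self.files = {}        # path -> ordered list of block hashes
        self.trash = {}        # path -> manifest of a deleted file, kept for undo
        self.block_sizes = {}  # block hash -> size in bytes, for blocks on disk

    def add_file(self, path, blocks):
        """blocks: ordered list of (hash, size) pairs."""
        self.files[path] = [h for h, _ in blocks]
        self.block_sizes.update(blocks)

    def delete_file(self, path):
        # Metadata only: the entry moves to trash, block data stays on disk.
        self.trash[path] = self.files.pop(path)

    def restore_file(self, path):
        manifest = self.trash[path]
        if any(h not in self.block_sizes for h in manifest):
            raise LookupError("blocks already garbage-collected")
        self.files[path] = self.trash.pop(path)

    def unreferenced(self):
        live = {h for manifest in self.files.values() for h in manifest}
        return set(self.block_sizes) - live

    def reclaimable_bytes(self):
        return sum(self.block_sizes[h] for h in self.unreferenced())

    def collect_garbage(self, user_confirmed: bool) -> int:
        """Reclaim unreferenced blocks, only when the user has said yes."""
        if not user_confirmed:
            return 0
        return sum(self.block_sizes.pop(h) for h in self.unreferenced())

idx = BlockIndex()
idx.add_file("a.mp4", [("h1", 4), ("h2", 4)])
idx.add_file("b.mp4", [("h2", 4), ("h3", 2)])
idx.delete_file("b.mp4")
print(idx.reclaimable_bytes())
```

Only `h3` is reclaimable after the delete, because `h2` is still referenced by `a.mp4`. The value of `reclaimable_bytes()` is what the “recoverable deleted content” prompt would display.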

Why Existing Solutions Don’t Fit

Cloud Backup (Backblaze B2, iCloud, Google Photos)

The bandwidth constraint is fundamental, not incidental. Residential upload speeds are 10–100 Mbps. At those speeds, 10 TB of data takes roughly 9–93 days of continuous upload. During that window, the backup is incomplete. After a failure, a full restore can also take days.

Cloud backup works well for small datasets — documents, a few thousand photos. It breaks down for video, raw photo archives, and anything over 2–3 TB.

More importantly, cloud backup is a single point of failure of a different kind: the service. Backblaze changes pricing. iCloud terms change. Google shuts down products. The user’s data is dependent on an ongoing commercial relationship.

Content-addressed local pools have no monthly bill beyond a few cents for metadata sync.

RAID

RAID requires drives of identical size (or wastes space), in the same physical location, connected to the same machine, and managed by a hardware controller or a software RAID layer. It’s designed for always-on servers, not a mix of a laptop, two USB drives, and an old machine at a parent’s house.

RAID also has no concept of metadata-layer operations, deduplication, or partial availability. If a hardware RAID controller fails, recovery can require a compatible replacement controller or professional services.

Syncthing / rsync

These replicate whole files between locations. The file must fit on a single drive at both ends. There is no block-level distribution. A 50 GB file cannot be synced to a drive with only 30 GB free.

Syncthing is excellent for keeping files in sync across machines. It is not a storage pool. It does not solve the “scattered files on many drives” problem — it requires those files to exist on a drive before it can help.

Ceph / IPFS

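Ceph is distributed object and block storage that spreads data across many drives; it places data by object name rather than by content hash. IPFS is content-addressed object distribution. Between them, they cover what’s described here at the technical level.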

The difference is the deployment model and the user model.

Ceph requires dedicated hardware, always-on nodes, network infrastructure, and an administrator who understands CRUSH maps and OSD management. IPFS is designed for internet-scale distribution with untrusted peers. Neither is designed for:

A USB drive that goes in a bag and comes back three days later
A household where nobody knows what a daemon is
Drives at a parent’s house connected via sneakernet (physical transport)
A single user who just wants their files to be safe

The insight is not technical novelty. The insight is applying content-addressed block distribution to the home user’s actual constraints: offline drives, intermittent connectivity, consumer hardware, no network between nodes, no always-on server.

Metadata Economics

The metadata layer is the architectural key that makes the rest tractable.

What metadata stores

For each file:

{
  id: uuid,
  path: "/Photos/2023/vacation.mp4",
  size: 52_428_800,           // 50 MB in bytes
  mtime: 1704067200,
  blake3_hash: "abc123...",   // whole-file hash (identity)
  blocks: [
    { seq: 0, hash: "def456...", size: 4_194_304 },
    { seq: 1, hash: "ghi789...", size: 4_194_304 },
    // ... 11 more blocks (13 in total for a 50 MB file; the last one is 2_097_152 bytes)
  ]
}

For each block placement:

{
  block_hash: "def456...",
  drive_id: "WD_ELEMENTS_2TB",   // volume label, never drive letter
  drive_path: "/blocks/de/f4/def456...",
  stored_at: 1704067200,
  verified_at: 1704153600
}
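The placement record could be expressed as a typed structure along these lines. This is a sketch, not the product’s schema: the helper names are hypothetical, and the two-level directory sharding is inferred from the `/blocks/de/f4/...` example path.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4 * 1024 * 1024  # 4_194_304 bytes

@dataclass(frozen=True)
class Placement:
    block_hash: str
    drive_id: str      # volume identity, never a drive letter
    drive_path: str
    stored_at: int     # Unix timestamp
    verified_at: int   # Unix timestamp of the last integrity check

def block_sizes(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Sizes of the blocks a file of file_size bytes splits into."""
    full, tail = divmod(file_size, block_size)
    return [block_size] * full + ([tail] if tail else [])

def drive_path(block_hash: str) -> str:
    """Shard blocks into directories named after the first four hex characters."""
    return f"/blocks/{block_hash[:2]}/{block_hash[2:4]}/{block_hash}"

def make_placement(block_hash: str, drive_id: str, now: int) -> Placement:
    return Placement(block_hash, drive_id, drive_path(block_hash), now, now)

sizes = block_sizes(52_428_800)
placement = make_placement("def456", "WD_ELEMENTS_2TB", 1704067200)
print(len(sizes), sizes[-1], placement.drive_path)
```

Sharding by hash prefix keeps any single directory from accumulating millions of entries, which matters on consumer filesystems.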

Storage ratios

The metadata-to-content ratio is approximately 1:2,600 for typical media files (5 MB average size, 4 MB blocks, 1,900 bytes of metadata per file).

For a family with 10 TB of photos and video across their lifetime:

~2 million files
~3.8 GB of metadata (fits in RAM on any modern laptop)
Metadata sync to R2: ~$0.06/month (3.8 GB at $0.015/GB/month)
Metadata query latency: sub-millisecond (it’s all in RAM)

For a professional photographer with 50 TB of raw assets:

~10 million files
~19 GB of metadata (still fits in 32 GB RAM)
Metadata sync: ~$0.30/month (19 GB at $0.015/GB/month)

The metadata layer is effectively free. The expensive operations — I/O — happen only during block read/write, which is already bounded by physical drive speed.

The metadata sync model

Metadata lives in a SQLite database on the primary machine. It is synced periodically to Cloudflare R2 (object storage) as a compressed snapshot.

R2 pricing: $0.015/GB/month. 20 GB of metadata = $0.30/month. This is the only ongoing cost.

The R2 copy enables disaster recovery: if the primary machine is lost, a new machine can download the metadata snapshot and know exactly which blocks are on which drives. Plug in the drives, and the pool re-assembles.
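A sketch of the snapshot-and-recover cycle using only the standard library. The table layout is an assumption modeled on the records shown earlier, the snapshot is a gzip-compressed SQL dump rather than any particular production format, and the upload to R2 is omitted (it would go through an S3-compatible client).

```python
import gzip
import sqlite3

SCHEMA = """
CREATE TABLE files (
    id TEXT PRIMARY KEY, path TEXT NOT NULL, size INTEGER,
    mtime INTEGER, blake3_hash TEXT
);
CREATE TABLE file_blocks (
    file_id TEXT REFERENCES files(id), seq INTEGER, block_hash TEXT,
    size INTEGER, PRIMARY KEY (file_id, seq)
);
CREATE TABLE placements (
    block_hash TEXT, drive_id TEXT, drive_path TEXT,
    stored_at INTEGER, verified_at INTEGER,
    PRIMARY KEY (block_hash, drive_id)
);
"""

def snapshot(conn: sqlite3.Connection) -> bytes:
    """Serialize the whole database as a gzip-compressed SQL dump."""
    return gzip.compress("\n".join(conn.iterdump()).encode("utf-8"))

def restore(blob: bytes) -> sqlite3.Connection:
    """Rebuild an in-memory database from a snapshot."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(gzip.decompress(blob).decode("utf-8"))
    return conn

def drives_needed(conn: sqlite3.Connection, path: str) -> list:
    """Which drives hold blocks of the file at `path`."""
    rows = conn.execute(
        """SELECT DISTINCT p.drive_id FROM files f
           JOIN file_blocks b ON b.file_id = f.id
           JOIN placements p ON p.block_hash = b.block_hash
           WHERE f.path = ? ORDER BY p.drive_id""",
        (path,),
    )
    return [row[0] for row in rows]

primary = sqlite3.connect(":memory:")
primary.executescript(SCHEMA)
primary.execute("INSERT INTO files VALUES "
                "('f1', '/Photos/2023/vacation.mp4', 52428800, 1704067200, 'abc123')")
primary.execute("INSERT INTO file_blocks VALUES ('f1', 0, 'def456', 4194304)")
primary.execute("INSERT INTO placements VALUES "
                "('def456', 'WD_ELEMENTS_2TB', '/blocks/de/f4/def456', 1704067200, 1704153600)")
primary.commit()

blob = snapshot(primary)    # this is what would be uploaded
recovered = restore(blob)   # this is what a replacement machine would run
print(drives_needed(recovered, "/Photos/2023/vacation.mp4"))
```

After restoring, the new machine can answer “which drives do I need?” before any drive is plugged in.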

The Redundancy Model

This is not RAID. RAID provides synchronous redundancy across co-located drives, usually of identical size, managed by one controller or one host.

This is asynchronous, configurable, location-aware replication at the block level.

Configuration

The user sets a redundancy factor N (default 2). Every block is stored on at least N distinct physical drives. The system continuously monitors:

Which blocks are under-replicated (drive went offline, block now has only 1 copy)
Which blocks are over-replicated (user added a drive, rebalancing available)
Which blocks violate location policy (both copies on drives at the same address)

Under-replication triggers a background job: find an online drive with space, copy the block there, update metadata.
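The three monitoring checks can be sketched as a single pass over the placement map. Assumptions: copies are counted on online drives only (following the under-replication bullet above), a location violation means every copy of a block sits at one site, and all names are hypothetical.

```python
def audit_replication(placements: dict, drive_site: dict, online: set, n: int = 2) -> dict:
    """Classify blocks by replication state.

    placements: block hash -> set of drive ids holding a copy
    drive_site: drive id -> location tag (for example "home", "parents")
    online:     set of drive ids currently connected
    """
    report = {"under": [], "over": [], "same_site": []}
    for block, holders in sorted(placements.items()):
        live = holders & online
        if len(live) < n:
            report["under"].append(block)
        elif len(live) > n:
            report["over"].append(block)
        sites = {drive_site[d] for d in holders}
        if len(holders) >= 2 and len(sites) == 1:
            report["same_site"].append(block)
    return report

placements = {
    "b1": {"laptop", "blue_wd"},
    "b2": {"laptop", "blue_wd", "nas"},
    "b3": {"blue_wd", "away_drive"},
}
drive_site = {"laptop": "home", "blue_wd": "home", "nas": "home", "away_drive": "parents"}
report = audit_replication(placements, drive_site, online={"laptop", "blue_wd", "nas"})
print(report)
```

Each entry in `report["under"]` would become a background copy job; the other two lists are candidates for rebalancing when a suitable drive is connected.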

Physical transport as a valid replication mechanism

Cloud systems assume network connectivity. This system does not.

A user can take a drive to their parents’ house. The system knows the drive is offline at a different location. When the drive returns, the system syncs: pushes new blocks, pulls any changes, updates the location metadata.

This is sneakernet-compatible distributed storage. No network required between sites.

Drive plug/unplug lifecycle

Drive plugged in
  → System reads drive label (volume identity, never drive letter)
  → Compares drive's block manifest to metadata database
  → Identifies: new blocks, missing blocks, changed blocks
  → Starts replication jobs for under-replicated blocks
  → Updates drive's free space in pool map

Drive unplugged
  → System marks drive as offline
  → Identifies which blocks are now under-replicated
  → If critical: alerts user ("Plug in a drive — 847 blocks have only 1 copy")
  → If redundancy maintained: silent, continue operating

The system degrades gracefully. If blocks are unreachable, the affected files are listed. Other files are unaffected. No single drive is a single point of failure.

Implications for Product Design

The technical architecture above constrains and enables specific product decisions.

Volume identity, never drive letters

Drive letters (C:, D:, E:) are volatile. Replug a drive in a different port and it gets a different letter. The system must use stable volume identifiers: volume labels (user-set names like “WD_ELEMENTS_2TB”) or volume GUIDs where available.

This is a hard rule with no exceptions. Any system that uses drive letters as identifiers will have data integrity failures in normal use.

The pool is the product, not individual files

Users should not think in terms of “my files are on the D: drive.” They should think: “my files are in my pool.” The pool happens to span D:, the blue WD drive, and the old laptop in the closet. That’s an implementation detail.

The product surfaces:

Pool total capacity and free space (aggregate, not per-drive)
Pool health (redundancy coverage percentage)
Which drives are online/offline and what they contribute
File search and access (abstracted from physical location)
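The aggregate figures on that surface could be derived from the pool map roughly as follows. The input shapes and the definition of coverage (share of blocks with at least N copies on online drives) are assumptions made for illustration.

```python
def pool_summary(drives: dict, placements: dict, n: int = 2) -> dict:
    """Aggregate pool view.

    drives:     drive id -> (capacity_bytes, free_bytes, online)
    placements: block hash -> set of drive ids holding a copy
    """
    online = {d for d, (_, _, up) in drives.items() if up}
    capacity = sum(cap for cap, _, _ in drives.values())
    free_online = sum(free for _, free, up in drives.values() if up)
    covered = sum(1 for holders in placements.values() if len(holders & online) >= n)
    coverage = 100.0 * covered / len(placements) if placements else 100.0
    return {
        "capacity": capacity,
        "free_online": free_online,
        "online": sorted(online),
        "offline": sorted(set(drives) - online),
        "redundancy_coverage_pct": coverage,
    }

drives = {
    "laptop": (1000, 400, True),
    "blue_wd": (2000, 847, True),
    "old_laptop": (512, 100, False),
}
placements = {
    "b1": {"laptop", "blue_wd"},
    "b2": {"laptop", "old_laptop"},
    "b3": {"blue_wd", "laptop"},
    "b4": {"old_laptop"},
}
summary = pool_summary(drives, placements)
print(summary)
```

Free space is reported for online drives only, since space on a disconnected drive cannot accept new blocks until it returns.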

Reconstruction on demand, not sync

This system does not keep files on the primary machine. Files exist as blocks distributed across the pool. When you want a file, it’s reconstructed from blocks.

This is a significant UX departure from traditional file sync. The product must handle:

First-open latency (reconstruction time for large files)
Prefetch for frequently accessed files
Offline access (blocks on local machine for files used without external drives)

The payoff: you can access the full pool even when a drive is gone, as long as the blocks you need are available elsewhere.

Drive arrival ceremony

When a drive is plugged in, the product should acknowledge it. Not a silent background process. A visible, named event: “Blue WD Drive connected. 847 GB available. 12 blocks replicated.” This builds the mental model: drives are participants in the pool.

Similarly, drive removal: “Blue WD Drive removed. All files remain accessible. 3 files now have only 1 copy — connect another drive to restore redundancy.”

The GC decision belongs to the user

Garbage collection — reclaiming blocks from deleted files — should always be user-initiated and clearly presented. “You have 234 GB of recoverable deleted content from the last 60 days. Permanently delete to free space?”

The system’s job is to make this decision easy to understand, not to make it automatically.

What This Enables for Heirloom

Heirloom is the consumer product built on this architecture. The storage model described here is the foundation layer.

With a content-addressed block pool:

Any file is safe as soon as its blocks have reached N drives. Distribution starts the moment the file is added, with no waiting for a sync or backup window.
Drive failures are events, not disasters. The system knows what was on the failed drive and what needs replication.
Files find available space automatically. The user never needs to decide which drive to put something on.
The family’s storage is one pool. Drives at different houses, owned by different family members, contribute to a shared pool with appropriate access controls.
Version history is free. Old blocks aren’t deleted; they’re dereferenced. Restore any version up to the GC horizon.

The hard problems are not the storage protocol — IPFS and Ceph have solved those. The hard problems are the UX translation: making a drive pool feel as simple as a folder, handling drives that come and go without user anxiety, and building the trust that files are actually safe without requiring users to understand any of this.

That’s the product. The architecture is the foundation.

Summary

Property	This System	Cloud Backup	RAID	Syncthing
Handles a 50 GB file when no single drive has more than 30 GB free	Yes	Yes (slow)	No	No
Files stay local	Yes	No	Yes	Yes
Works across houses	Yes	Yes	No	Yes
No always-on node required	Yes	N/A	No	No
Drive can be unplugged freely	Yes	N/A	No	Partial
Consumer hardware	Yes	N/A	Partial	Yes
Deduplication	Yes	Varies	No	No
Version history	Yes	Paid tier	No	No
Monthly cost (10 TB)	~$0.06 metadata	$70–100	$0	$0
Upload time for 10 TB at 10–100 Mbps	None	9–93 days	None	Varies

Content-addressed block distribution on home drives is not a new idea. The implementation primitives exist and are well-understood. What has not existed is a product that brings them to people who own an external drive and a laptop and just want their stuff to be safe.

That’s the gap. That’s what this architecture is for.