Org Status: 🟡 Dormant Cloudflare: N/A Last Audited: 2026-04-28
A deep technical reference on why the order you read files matters — and how seven decades of engineering have been shaped by that fact.
- The Machine That Started It All
- Physics of the Spinning Disk
- How Disk Geometry Evolved
- The Seek Problem
- Disk Scheduling Algorithms
- OS I/O Schedulers
- Filesystem Layout and Inode Ordering
- FIEMAP and Physical Extent Information
- posix_fadvise, Readahead, and Hinting the Kernel
- Backup Tools and Their Scanning Strategies
- Databases and Sequential I/O
- Classic Algorithms and Engineering Tricks
- Modern Developments
- Side Roads: Topics Worth Their Own Articles
On September 13, 1956, IBM shipped the first unit of what would become the most consequential storage technology in the history of computing: the IBM 350 Disk Storage Unit, part of the IBM 305 RAMAC system (Random Access Method of Accounting and Control).
The team that built it was assembled in 1954 at IBM’s R&D laboratory at 99 Notre Dame Avenue in San Jose, California, led by Reynold B. Johnson. Johnson’s team included R. Manning Hermes, William A. Goddard, Louis D. Stevens, Arthur J. Critchlow, and John W. Haanstra. The first paying customer was Zellerbach Paper in San Francisco.
The specifications of the IBM 350 seem almost comically modest by modern standards, but they were revolutionary for their era:
- 50 platters, each 24 inches in diameter, stacked on a single spindle
- Spindle rotation: 1,200 RPM
- Total capacity: 5 million 6-bit characters (approximately 3.75 MB)
- A single head assembly with two read/write heads served all 50 platters — the head carriage moved between platters mechanically
- Average access time: just under 1 second
- Physical weight: approximately 1,730 pounds for the drive unit alone, plus a 441-pound air compressor required for operation
The “random access” in RAMAC’s name was the revolutionary claim. Prior storage — punched cards and magnetic tape — forced sequential access. You could not jump to the middle of a reel of tape without spooling past everything that came before. RAMAC changed that contract with the computer: any record could be retrieved in roughly one second, regardless of its physical location.
But embedded in that engineering triumph was the seed of a problem that would persist for the next seventy years: seek time. Getting to a random location took almost a full second. Reading data sequentially from a position you had already reached was dramatically faster. The tension between random and sequential access — the cost of the seek — would drive an enormous body of operating systems research, filesystem design, and application engineering.
To understand why sequential I/O is faster, you need to understand the physics of what a hard drive actually does.
The Mechanical Components
A modern hard disk drive (and the IBM 350 before it) consists of:
- Platters: Rigid disks coated with a magnetic material. Data is encoded by magnetizing tiny regions of the surface. Modern drives spin platters at 5,400 or 7,200 RPM (enterprise drives historically ran at 10,000 or 15,000 RPM).
- Read/Write Heads: Electromagnetic transducers that fly nanometers above the platter surface on a cushion of air generated by the platter’s rotation. A head crash — contact between the head and the platter — is catastrophic.
- Actuator Arm: The arm that positions the heads over the correct track. Early drives (including the IBM 350) used a screw mechanism or stepper motor. By the mid-1980s, voice coil motors (VCMs) became standard. A VCM works on the same principle as a speaker: current flowing through a coil in a magnetic field creates a Lorentz force. Reversing the current reverses the direction of movement. The transition from stepper motors to VCMs was essential for the precision needed as track density increased; by 1986, when Conner Peripherals introduced the first voice-coil PC hard disk, track density had reached roughly 1,000 tracks per inch.
- Tracks and Sectors: Each platter surface is divided into concentric rings called tracks. Tracks are further divided into sectors, the smallest individually addressable unit (traditionally 512 bytes, now 4,096 bytes on “Advanced Format” drives). A cylinder is the set of all tracks at the same radial position across all platters — a concept important for CHS addressing.
The Three Latencies
Any read or write operation involves three distinct time components:
1. Seek Time: The time for the actuator to move the head from its current track to the target track. Seek time is non-linear — acceleration and deceleration dominate short seeks, while longer seeks are limited by maximum actuator velocity. Typical values for a 7,200 RPM consumer drive:
- Minimum seek (adjacent track): ~0.1 ms
- Average seek: ~8–12 ms
- Full stroke (innermost to outermost track): 15–20 ms
2. Rotational Latency: After the head reaches the correct track, it must wait for the target sector to rotate under it. Since the disk is always spinning, you might arrive just after the sector passed — in the worst case, you wait for a nearly full rotation. Average rotational latency is half a full rotation:
Average rotational latency = (60 seconds / RPM) / 2
For 7,200 RPM: (60 / 7200) / 2 = 4.17 ms
3. Transfer Time: Once the head is positioned and the sector arrives, data is transferred at the platter’s linear velocity. This is the fast part — sequential reads just keep streaming as sectors pass under the head.
Total random access time for a 7,200 RPM drive: roughly 10 ms average seek + 4.17 ms average rotational latency = ~14 ms per random I/O. That is 140 random reads per second, or about 70 MB/s if each read is 512 KB. Compare that to sustained sequential reads, which easily hit 200–250 MB/s on the same drive. The ratio is around 3–5x for large transfers, and much worse — sometimes 100x or more — for tiny random reads.
Zone Bit Recording: The Outer-Track Advantage
A subtler physics effect compounds the above: Zone Bit Recording (ZBR).
The circumference of a track increases with its distance from the spindle. Outer tracks are physically longer than inner tracks. Early drives used the same number of sectors per track on every track (Constant Angular Velocity recording, CAV), which meant outer tracks had lower bit density — wasted magnetic surface.
Zone Bit Recording, pioneered and patented by Chuck Peddle in 1961 while working for General Electric, and introduced commercially by Bryant’s 4000 series in 1961, solves this by dividing the disk into zones. Each zone is a band of tracks, and outer zones get more sectors per track than inner zones. The result: when reading data from the outermost tracks, more bits pass under the head per unit time, giving higher sustained transfer rates.
ZBR saw widespread commercial adoption in personal computer hard drives in the mid-1990s. Seagate’s Medalist series (1994) and IBM’s Deskstar series (1995) were among the early implementations. By the late 1990s, virtually every new drive used ZBR.
The practical implication: on an HDD, partition placement matters. The first partition on a disk lives on the outermost tracks (lowest LBA addresses) and is physically faster. Tools like hdparm measuring sequential throughput will consistently show higher MB/s at the beginning of a disk than at the end.
CHS: Cylinder, Head, Sector
The original addressing scheme exposed the physical geometry directly. To address a sector, you specified:
- Cylinder: which concentric ring (0 to N-1)
- Head: which platter surface (0 to H-1)
- Sector: which sector within the track (1 to S, 1-indexed)
The BIOS INT 13h interface constrained CHS to: 1,024 cylinders Ă— 16 heads Ă— 63 sectors Ă— 512 bytes = 504 MB. This limit was the bane of PC users in the early 1990s. Extended INT 13h later pushed it to 8.4 GB using 1,024 Ă— 255 Ă— 63.
LBA: Logical Block Addressing
By the mid-1990s, the lie at the heart of CHS had become untenable. ZBR meant drives no longer had a uniform number of sectors per track, making CHS geometry fictitious at the hardware level — drives were already translating. The 1994 ATA-1 standard introduced 28-bit LBA. The 2002 ATA-6 specification introduced 48-bit LBA and formally declared CHS obsolete.
LBA simply numbers every sector sequentially from 0 to N-1. The drive’s internal controller maps LBA addresses to physical locations however it sees fit, applying ZBR geometry and internal optimizations transparently. From the OS’s perspective, the disk is a flat array of 512-byte (or 4,096-byte) blocks.
This abstraction would later become important: the OS has no direct knowledge of where data physically sits on the platters, and must either trust the filesystem’s allocation strategy or query the drive through special interfaces.
Shingled Magnetic Recording (SMR)
Around 2013, drive manufacturers introduced Shingled Magnetic Recording to push areal density beyond what conventional perpendicular recording could achieve.
In conventional recording, each track is written with a gap between adjacent tracks to prevent interference. In SMR, write tracks overlap like roof shingles — each new track partially overwrites the outer edge of the previous one. Because read heads are narrower than write heads, reading still works fine. But writing is a problem: you cannot update an arbitrary track without first reading and rewriting all the tracks that were “shingled” over.
The solution is to organize SMR drives into zones of tracks that must be written sequentially. You write forward through a zone; to rewrite data in the middle of a zone, you must effectively erase (reset) and rewrite the entire zone from the beginning. Random write performance collapses. Sequential write performance is excellent, and read performance is comparable to conventional drives.
SMR drives come in two management models:
- Drive Managed (DM-SMR): The drive’s firmware hides the complexity, using an internal buffer zone for random writes and background migration. Works with any OS, but performance can become unpredictable under heavy random write load.
- Host Managed (HM-SMR): The drive exposes the zone structure to the host OS and requires the host to respect sequential write requirements. Used with specialized software stacks.
SMR is a concrete, modern example where the physics of the storage medium enforce what was previously just good practice: writing sequentially.
NVMe and SSD: The End of Seek Time
NAND flash storage has no moving parts. There is no head to position, no platter to spin. “Seek time” on an NVMe SSD is the time for the controller to identify which flash die contains the data and begin reading — measured in microseconds (typically 70–100 µs for a consumer NVMe), not milliseconds.
NVMe (Non-Volatile Memory Express), introduced in 2011 as a standard interface optimized for flash, supports up to 65,535 queues with up to 65,535 commands per queue, compared to SATA’s single queue with 32 commands. This parallelism means the device can process many I/O operations simultaneously.
The consequences for software are significant. On an HDD, the difference between sequential and random I/O throughput can be 100x or more. On a modern NVMe SSD, random reads are often within 2–3x of sequential reads at high queue depths, because the controller internally parallelizes flash die access. The entire conceptual framework of “sort your work to be sequential so the disk arm doesn’t have to seek” becomes less critical — though not entirely irrelevant, as sequential I/O still has advantages for memory bandwidth, cache coherence, and CPU prefetching.
Imagine 1,000 files scattered randomly across an HDD. Reading each one requires:
- Seek to the inode (metadata) location
- Rotational wait
- Read the inode (512–4096 bytes)
- Seek to the file data location (potentially far away)
- Rotational wait
- Read the file data
For 1,000 small files, you perform approximately 2,000 seeks. At 14 ms average per seek, that is 28 seconds of mechanical time. The same data, if laid out sequentially, could be read in under a second.
This is not a hypothetical edge case. Backup tools, search indexers, antivirus scanners, and find-based scripts all face this problem when operating on large collections of small files on spinning disks.
The solution has multiple levels: the OS can reorder requests internally (scheduling), the filesystem can try to allocate files near each other (allocation policy), and applications can sort their work before issuing reads (algorithmic optimization).
When multiple processes issue I/O requests simultaneously, the OS must decide the order in which to serve them. The naive approach serves requests in arrival order; smarter algorithms minimize total head movement.
FCFS: First-Come, First-Served
The simplest algorithm: serve requests in the order they arrive in the queue. No starvation, perfectly fair. But if requests are spread across the disk, the head may thrash back and forth, covering enormous distances. In the worst case — requests alternating between the innermost and outermost tracks — FCFS wastes most of its time seeking and very little time transferring data.
FCFS is the correct algorithm for SSDs (where seek time doesn’t exist) and is often used in benchmarking to expose raw device characteristics.
SSTF: Shortest Seek Time First
Greedy algorithm: always serve the request closest to the head’s current position. Dramatically reduces average seek time compared to FCFS. The problem is starvation: if the head is near the center of the disk and requests keep arriving nearby, requests at the outer tracks may never be served. SSTF can theoretically delay a request indefinitely. It behaves like a poor elevator that never goes to the top floor because someone always calls it at a middle floor.
SCAN: The Elevator Algorithm
The head moves in one direction, serving all requests it encounters, then reverses direction and serves requests on the way back — exactly like an elevator. This eliminates starvation: every request will eventually be reached as the head sweeps past. SCAN provides better overall throughput than SSTF in practice because it reduces head movement while guaranteeing bounded wait time.
The worst case for SCAN is for requests just behind the head’s current position: they must wait for a full sweep to the disk’s far end and back. This gives non-uniform service times: requests near the edges get served infrequently (the head just reversed) compared to requests near the center (the head passes twice per cycle).
C-SCAN: Circular SCAN
Modification of SCAN: the head sweeps in one direction only, serving requests along the way. When it reaches the last request in that direction, it jumps immediately back to the beginning (the other end) without serving any requests on the return trip. This gives more uniform wait times: every request waits at most one full sweep, regardless of disk position.
LOOK and C-LOOK
Optimizations of SCAN and C-SCAN respectively: instead of sweeping all the way to the physical edge of the disk, the head reverses (or jumps back) at the last pending request in that direction. This eliminates pointless travel to disk regions with no pending I/O. LOOK and C-LOOK are typically more efficient than their SCAN/C-SCAN counterparts in practice.
Why This Matters for Scanning Workloads
A backup or indexing tool that reads thousands of files simultaneously benefits enormously from the OS’s scheduler coalescing nearby reads. But the scheduler can only help if requests are submitted concurrently. A single-threaded tool that opens a file, reads it completely, closes it, then opens the next file gets no benefit from scheduling — each operation completes before the next is submitted, leaving the scheduler with nothing to optimize.
This is one reason why multi-threaded or asynchronous I/O (see io_uring below) enables scheduling optimizations that serial code cannot.
The theoretical algorithms above are implemented — with significant engineering additions — in operating system I/O schedulers.
Linux: The Classic Era
Linux’s block layer sits between the filesystem and the device driver, and for most of Linux’s history, it provided pluggable “elevator” schedulers:
CFQ (Completely Fair Queuing): The default scheduler in Red Hat Enterprise Linux 4, 5, and 6, and many other distributions. CFQ maintains a per-process queue and allocates time slices to each process in a round-robin fashion, attempting to give each process equal I/O bandwidth. Within each time slice, it serves requests in sector-sorted order (a form of LOOK). CFQ performed well for interactive desktop workloads and general-purpose servers but was suboptimal for SSDs (where the seek optimization is irrelevant) and large sequential streaming (where per-process fairness imposes overhead).
Deadline: Designed to prevent starvation. Each request is assigned an expiry deadline (default: 500 ms for reads, 5,000 ms for writes). The scheduler normally serves requests in sorted sector order (like SCAN) but will jump to serve an expiring request regardless of position. Simple and predictable. A good default for SSDs and many database workloads.
Noop: The simplest possible scheduler: a FIFO queue with basic request merging. No reordering. Appropriate for virtual block devices (where the hypervisor handles scheduling) and SSDs (where reordering provides no benefit).
Linux: The Multi-Queue Era
As NVMe and other fast storage emerged, the single-threaded Linux block layer became a bottleneck. The blk-mq (multi-queue block layer) was introduced to support per-CPU submission queues, enabling SSD devices to receive I/O from multiple CPU cores simultaneously without lock contention.
Ubuntu 19.04 (Linux 5.0) switched to blk-mq by default. The classic single-queue schedulers (CFQ, deadline, noop) were deprecated in Linux 5.3. Their multi-queue successors:
BFQ (Budget Fair Queuing): The spiritual successor to CFQ, designed for blk-mq. BFQ tracks “budgets” of I/O service rather than time slices, and includes special logic for interactive applications — it detects synchronous processes waiting for I/O and boosts their priority temporarily. BFQ borrows much of its general structure and code from CFQ. It is the default scheduler for rotational drives in many modern Linux distributions.
mq-deadline: Multi-queue successor to deadline. Same concept — sorted sector-order service with expiry deadlines — adapted for the multi-queue architecture. This is the default in RHEL 8 and newer, and the common choice for SSDs.
Kyber: A newer scheduler designed specifically for low-latency SSDs. It maintains separate queues for reads and writes and uses token buckets to control latency rather than throughput. Less commonly seen as a default.
None: Equivalent to noop for multi-queue devices. Submits I/O in arrival order. The correct choice for NVMe SSDs behind a hypervisor where the virtual device driver handles all scheduling.
Windows I/O Scheduler
Windows does not expose a user-configurable I/O scheduler in the same way as Linux. The kernel’s I/O Manager and Storage Driver Stack perform request coalescing and reordering internally. Windows uses a variation of the SCAN algorithm in its disk I/O reordering. For SSDs, Windows automatically disables seek optimization when it detects a device does not have rotational media.
Even before a request reaches the OS scheduler, the filesystem’s allocation policy determines whether related data lives near each other on disk. A good filesystem layout can make the seek problem nearly disappear; a bad one can defeat any amount of scheduler cleverness.
What Is an Inode?
In Unix-derived filesystems, every file and directory is described by an inode (index node). The inode is a fixed-size data structure (typically 128 or 256 bytes) containing:
- File owner, permissions, timestamps
- File size
- Pointers to the data blocks containing the file’s content
The inode does not contain the filename — that lives in the directory entry, which maps a human-readable name to an inode number. Inode numbers are assigned sequentially at file creation time.
ext2/ext3/ext4: Block Groups
The ext2 filesystem (introduced in 1993) introduced a critical optimization: block groups. The disk is divided into fixed-size regions (128 MB per block group in ext4). Each block group contains:
- A copy of the superblock and group descriptors (in some groups)
- A block bitmap tracking free/allocated blocks
- An inode bitmap
- An inode table
- Data blocks
When the filesystem creates a new file, it tries to allocate both the inode and the data blocks in the same block group as the file’s parent directory. The result: files in the same directory tend to be near each other on disk. Because directories are typically accessed together, this dramatically reduces seek distance for typical workloads.
The Orlov Allocator
The Orlov allocator (implemented in Linux 2.5/2.6 and described in a 2002 LWN article) improved on ext2’s directory-placement strategy. When a new top-level directory is created, the Orlov allocator uses a load-balancing heuristic to distribute directories across block groups, preventing all directories from clustering in block group 0. The allocator tracks an “inodes debt” and “blocks debt” per group (with constants INODE_COST = 64 and BLOCK_COST = 256) and chooses the least-loaded group for new directories. For subdirectories, it prefers the same block group as the parent, maintaining locality.
The insight: directories and their children should be co-located (subdirectory optimization), but distinct top-level directory trees should be spread out (load balancing) to prevent hot spots and allow concurrent access patterns to benefit from parallelism.
ext4: Extents and Delayed Allocation
ext4 (merged into the Linux kernel in 2006) added two further improvements:
Extents: Instead of storing a list of individual block pointers, ext4 stores contiguous ranges of blocks as (start_block, length) pairs. A single file that occupies 1,000 contiguous blocks needs only one extent record rather than 1,000 pointers. This reduces metadata overhead and, more importantly, keeps large files physically contiguous.
Delayed allocation: When an application writes data, ext4 does not immediately choose physical block locations. Instead, it buffers the data in memory and defers the allocation decision until the data is actually flushed to disk. By delaying, the filesystem can see the full size of what’s being written and allocate a single contiguous extent, rather than greedily grabbing whatever block is free at the moment of each write() call. Delayed allocation dramatically reduces fragmentation for large files written sequentially.
The multiblock allocator (mballoc) pairs with delayed allocation: when it is time to allocate, mballoc allocates many blocks in a single operation using a buddy allocator, finding the largest contiguous free region available.
NTFS: The Master File Table
Windows NTFS (introduced in Windows NT 3.1, 1993) uses a different structure called the Master File Table (MFT). The MFT is itself a special file that occupies a reserved zone near the beginning of the volume. Every file and directory on the volume is represented by an MFT record — a fixed-size entry (typically 1,024 bytes) assigned a sequential MFT record number analogous to an inode number.
NTFS reserves the first 16 MFT records for internal metadata files:
- Record 0:
$MFT— the MFT itself - Record 1:
$MFTMirr— a partial backup of the MFT - Record 2:
$LogFile— the journal - Record 5:
$Root— the root directory
For small files (under ~900 bytes), NTFS stores the file data inside the MFT record itself (called a “resident” attribute), eliminating the need for separate data block allocation and dramatically speeding up small-file access.
NTFS tries to keep the MFT contiguous. A defragmented MFT means that scanning directory entries — which involves reading many MFT records — can proceed largely sequentially. However, NTFS does not have block groups in the ext sense; its data allocation is more global, and fragmentation accumulates faster on heavily-used NTFS volumes.
The Windows FSCTL_GET_RETRIEVAL_POINTERS control code (the equivalent of Linux’s FIEMAP) returns the extent map for a file: a list of virtual-to-logical cluster number (VCN-to-LCN) mappings. This is the interface used by Windows defragmentation tools to query and optimize file placement.
HFS+/APFS: macOS Filesystems
HFS+ (Hierarchical File System Plus, introduced in 1998 as a replacement for the original HFS) organizes its metadata into several special files, each implemented as a B-tree:
- The Catalog B-tree: contains all files and directories
- The Extents Overflow B-tree: handles files with more than 8 extents (HFS+ gives each file 8 extent records inline; overflow goes to this tree)
APFS (Apple File System, introduced in 2017) was designed from the ground up for flash storage and SSD characteristics. APFS uses a single container that can span multiple volumes, with all volumes sharing the same free space pool. Its structure relies on B-tree variants for all metadata, including file records and extent maps. APFS supports 64-bit inode numbers, snapshots, clones (copy-on-write file copies with zero space usage until modification), and native encryption.
The key difference between HFS+/APFS and ext4 for sequential I/O: APFS has no concept equivalent to ext4’s block groups. It does not try to co-locate directory contents. This was an acceptable tradeoff for the SSD-first world APFS targets — seek time is irrelevant on flash — but it means APFS filesystems on spinning disks fragment more easily and benefit less from inode-order sorting.
Inode Order as a Proxy, Not a Guarantee
A critical point: inode numbers correlate with, but do not guarantee, physical disk proximity. The correlation is strongest on freshly-created ext2/ext4 filesystems with low fragmentation. Over time, as files are deleted and recreated, inode numbers and physical locations diverge. An inode-sorted scan is a heuristic — usually better than alphabetical order, but not as good as sorting by actual physical extent.
File fragmentation explicitly breaks this model. A fragmented file may have inode number 1,000 but its data blocks scattered across the entire disk. Sorting by inode only helps with the inode table reads, not with the data reads.
If inode ordering is a proxy, can we do better? Yes — with direct knowledge of physical disk layout.
FIBMAP: The Old Way
The original Linux interface for querying block locations was the FIBMAP ioctl. You pass a logical block number within a file, and the kernel returns the corresponding physical block number on the device. Querying a large file required calling FIBMAP once per block — potentially thousands of system calls. FIBMAP required root privileges on many distributions because physical block addresses could be used to construct raw-disk attacks. FIBMAP also didn’t handle extents well and was deprecated as filesystems moved away from block-level addressing.
FIEMAP: The Modern Interface
The FIEMAP ioctl (merged into the Linux kernel in 2008) returns a file’s complete extent map in a single call. Each extent is described by:
fe_logical: offset within the file (bytes)fe_physical: offset on the device (bytes)fe_length: length of the extent (bytes)fe_flags: metadata flags (last extent, unwritten, shared, encrypted, etc.)
A fragmented file produces multiple extents. A file that occupies a single contiguous region has one extent. The theoretical maximum information for optimal I/O ordering is embedded here: sort files by their first physical extent, and you know exactly what order to read them in to minimize head movement.
The filefrag utility (part of the e2fsprogs package) exposes FIEMAP to the command line:
$ filefrag -v somefile
File size of somefile is 1073741824 (262144 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length:
0: 0.. 65535: 16384.. 81919: 65536
1: 65536.. 262143: 102400.. 299007: 196608
This file has two extents. The optimal read order would be determined by sorting by physical_offset.
The Theoretically Optimal Approach
Tools like disk defragmenters, backup software, and search indexers that want maximum HDD performance should:
- Collect FIEMAP data for all files
- Sort files by their first physical extent (or, for fine-grained optimization, sort extents individually)
- Read data in that order
In practice, few tools implement this. The FIEMAP interface requires root privileges or CAP_SYS_RAWIO on some filesystems, and the complexity of issuing one ioctl per file adds overhead for large directory trees. Sorting by inode number is the common compromise: much cheaper to obtain (it’s in the directory entry), generally correlated with physical position, and good enough for most workloads.
Windows Equivalent: FSCTL_GET_RETRIEVAL_POINTERS
On Windows, DeviceIoControl with the FSCTL_GET_RETRIEVAL_POINTERS control code returns the same conceptual information: a list of Virtual Cluster Number (VCN) to Logical Cluster Number (LCN) mappings for a file. Each mapping corresponds to a contiguous run of clusters on the volume. Windows defragmentation APIs use this extensively. The Windows built-in defrag utility and third-party tools like Defraggler operate by reading retrieval pointers, identifying fragmented files, and using FSCTL_MOVE_FILE to relocate extents to contiguous positions.
The Linux kernel performs readahead by default: when you read a file sequentially, the kernel notices the access pattern and speculatively reads ahead of your current position into the page cache, so that the data is already in memory when your read() call reaches it. This converts synchronous I/O from the application’s perspective into asynchronous I/O at the hardware level.
The default readahead window starts small (typically 128 KB) and can grow to 512 KB or more as the kernel detects a sustained sequential pattern.
posix_fadvise()
Applications can explicitly communicate their access pattern to the kernel using posix_fadvise():
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
The advice parameter accepts:
POSIX_FADV_SEQUENTIAL: The application intends to access data in sequential order, from lower to higher offsets. Under Linux, this doubles the readahead window, allowing the kernel to read even further ahead. For large sequential scans, this can meaningfully reduce total wall-clock time.
POSIX_FADV_WILLNEED: The application expects to need the specified region soon. The kernel initiates a non-blocking readahead for that range immediately. Unlike FADV_SEQUENTIAL, this provides explicit, offset-specific prefetch rather than auto-detection. Useful for “I’m about to seek to offset X” situations where the automatic pattern detector wouldn’t catch it.
POSIX_FADV_DONTNEED: The application is done with this region and will not access it again. The kernel can evict these pages from the page cache. Critical for large sequential scans like backups: without DONTNEED, a scan of a 10 TB disk would pollute the page cache with data that will never be accessed again, evicting hot application data. A backup tool that calls FADV_DONTNEED after reading each file leaves the page cache in the same state it found it — a significant courtesy to the rest of the system.
POSIX_FADV_NOREUSE: Hint that the data will not be reused (similar effect to DONTNEED but expressed differently; implementation varies). On Linux, this is essentially a synonym for DONTNEED.
POSIX_FADV_RANDOM: Inhibits readahead, since random access patterns make predictive prefetch counterproductive. Use this when you know you will access a file in random order.
readahead()
Linux also provides a standalone readahead() system call that initiates readahead for a specific file region asynchronously. Like FADV_WILLNEED, it returns immediately without guaranteeing the data is in cache. Useful for implementing manual prefetch pipelines.
madvise() for Memory-Mapped Files
For files opened with mmap(), the equivalent interface is madvise():
int madvise(void *addr, size_t length, int advice);
MADV_SEQUENTIAL: Hints sequential access, triggering aggressive readahead.
MADV_WILLNEED: Initiates immediate readahead for the range.
MADV_DONTNEED: Releases the mapping’s backing pages from the page cache, returning physical memory to the system.
MADV_RANDOM: Disables readahead for random access patterns.
Windows FILE_FLAG_SEQUENTIAL_SCAN
Windows achieves similar results through flags passed to CreateFile():
FILE_FLAG_SEQUENTIAL_SCAN: Optimizes caching for sequential access, analogous to FADV_SEQUENTIAL. Windows will read ahead more aggressively and will not cache data for future random access.
FILE_FLAG_RANDOM_ACCESS: Disables sequential readahead, keeping more of the recently accessed regions in cache. Appropriate for databases and index-heavy workloads.
The Page Cache Interaction
All of the above operate through the page cache (Linux) or system file cache (Windows): a region of RAM managed by the OS that caches disk pages. When you read a file, the OS checks the page cache first; if the page is there (a “hit”), no I/O occurs. Readahead works by pre-populating the cache with pages the application hasn’t requested yet.
The page cache is a shared resource. Large sequential scans are notorious for thrashing the page cache: reading 10 TB of backup data replaces gigabytes of frequently-used application data with data that will never be touched again. FADV_DONTNEED is the correct tool for preventing this.
Backup software is where sequential I/O optimization has the most immediate practical stakes. A backup that reads files in random order on a spinning disk can run 5–10x slower than the same backup reading files in disk order.
tar: The Original Sequential Archiver
The tar command (Tape ARchive) was introduced to Unix in January 1979 in Version 7 Unix, replacing the earlier tp program. Its design philosophy is inherently sequential: there is no central index. Instead, each file’s metadata (name, permissions, timestamps, size) is stored in a 512-byte header immediately preceding the file’s data. The archive is a linear stream.
On tape drives — the target medium — this design is mandatory. Tape drives can only read and write sequentially; random access requires physically spooling the tape, taking minutes. tar’s format is a direct expression of the tape drive’s physics: one block after another, forward only.
When tar reads from disk to write to tape (or to a compressed file or network socket), it traverses the directory tree in filesystem order. The resulting stream is as sequential as the filesystem’s layout allows, making tar a reasonable choice for disk-to-disk backups on well-maintained filesystems.
rsync: Delta Transfer with Checksum
rsync (Remote Sync, created by Andrew Tridgell in 1996) operates differently from tar. Rather than producing a stream archive, it synchronizes a source tree to a destination tree, transferring only changed data. Its key innovation is a rolling checksum algorithm that identifies which portions of a file have changed, allowing partial transfers.
rsync traverses the source directory tree, collects a list of files and their metadata, and then processes files for transfer. Its default directory traversal uses the OS’s readdir() order (alphabetical on most filesystems), which is not optimal for sequential disk access. rsync does not sort by inode or physical extent.
For large-scale backups on HDDs, rsync’s I/O pattern can be inefficient: it interleaves reads of inodes (metadata) with reads of file data, without batching like-positioned operations. On SSDs or over fast networks, this rarely matters; on slow HDDs with thousands of small files, it can be a meaningful bottleneck.
BorgBackup: Chunking and Content-Addressed Deduplication
BorgBackup (derived from Attic, developed from 2010) implements content-addressable deduplication with variable-size chunking. It:
- Reads source files and splits them into chunks (variable-length, based on a rolling hash like Buzhash)
- Hashes each chunk (BLAKE2b by default)
- Checks whether the hash already exists in the repository (deduplication)
- Stores new chunks in “segment files” (sequential append-only log files)
The inode-ordering optimization was explicitly discussed in the BorgBackup (and Attic) issue trackers. Sorting directory entries by inode number before processing — rather than alphabetically — was found to give 30%+ speed improvements on ext3/ext4 for large directory trees, because inode-sorted traversal reads the inode table sequentially instead of jumping between inode table blocks.
BorgBackup’s repository format is itself sequential-write optimized: new data is always appended to segment files, never randomly inserted.
restic: Content-Addressable with Packs
restic (2014) uses a similar content-addressable model. It reads files, splits them into chunks, deduplicates by hash, and stores unique chunks in “pack files”. Pack files are uploaded to the backend storage (filesystem, SFTP, S3, etc.) as large blobs, naturally creating sequential write patterns.
restic uses multiple goroutines for parallel reading and uploading, which improves throughput on fast storage but also means its disk read pattern is not sequential in the strict sense — multiple files may be read concurrently, with the OS scheduler determining actual I/O order.
Bacula/Bareos: Tape Thinking on Disk
Bacula (2000, by Kern Sibbald) and its fork Bareos applied the thinking of tape management software to disk. Tape’s fundamental constraint — sequential access only — shaped the entire architecture: jobs are written sequentially to volumes, volumes are managed as if they were tape reels, and restore operations work by seeking to the right position in a volume file and reading forward.
On disk, this creates naturally sequential write patterns (good for HDDs) at the cost of more complex restore operations that must locate the right position in large volume files.
Apple Time Machine: FSEvents and Change Journaling
Time Machine launched with Mac OS X 10.5 Leopard on October 26, 2007. Its key efficiency innovation is the FSEvents subsystem — a kernel-level filesystem change journal that records every path that was modified since the last observed event ID.
When Time Machine starts a backup, it does not scan the entire disk looking for changed files (which would require reading every directory). Instead, it queries the FSEvents database for paths that changed since the last backup’s event ID. The FSEvents log is stored in a hidden .fseventsd folder at the root of each volume.
This approach inverts the scanning problem: instead of asking “which files changed?”, Time Machine asks “what changes were recorded?” If the FSEvents database is complete and intact, Time Machine builds its change list in seconds regardless of disk size. Only when the FSEvents log is unavailable (disk was disconnected for too long, log corruption, first backup) does Time Machine fall back to a full directory scan with timestamp comparison.
On APFS, Time Machine gained snapshot-based backup starting with macOS 10.13 (2017), further improving efficiency.
The Common Thread
Every backup tool must solve the same problem: how to efficiently identify what has changed and read it. The solutions span a spectrum:
- Full scan + sequential read optimization (inode sort, FIEMAP sort): good for first backups
- Change journaling (FSEvents, Windows USN Journal): eliminates most scanning for incremental backups
- Content-addressed deduplication (Borg, restic): reduces data transferred at the cost of more CPU and memory
Database systems have independently rediscovered the same truths about sequential I/O that filesystem designers learned, and have developed their own solutions.
grep and Text Search: Sequential Scanning Excellence
Before databases, the canonical sequential scanner was grep. Ken Thompson wrote the original grep for Unix Version 4 (around 1973) as a tool to search files for regular expression matches. The name derives from the ed editor command g/re/p (global regular expression print).
GNU grep, the standard implementation on Linux, achieves remarkable performance through:
- Boyer-Moore-Horspool algorithm for fixed strings: skips over sections of text that cannot possibly match, reading far less than the full file content
- DFA/NFA engines for general regular expressions
- Memory-mapping files with
mmap()rather than usingread()system calls, reducing kernel/userspace boundary crossings
ripgrep (created by Andrew Gallant, released 2016) pushes further:
- Uses the Rust
regexcrate with finite automata and SIMD (Single Instruction Multiple Data) acceleration - Implements the Teddy algorithm — a SIMD-based literal matching algorithm that can check 16 or 32 bytes simultaneously for potential pattern matches before invoking the full regex engine
- Automatically chooses between
mmap(for single large files) and buffered reading (for directory traversal) - Respects
.gitignoreand similar filter files, skipping unneeded reads entirely
The performance of both grep and ripgrep depends fundamentally on sequential disk access. A search across a codebase on an HDD in random order (alphabetical) is meaningfully slower than the same search in inode order or physical order.
SQLite: B-tree Layout and Sequential Scans
SQLite stores its database in a single file, organized as a collection of fixed-size pages (default 4,096 bytes, matching the OS page size). Tables are represented as B-trees stored across these pages, with interior nodes pointing to child nodes and leaf nodes containing the actual row data.
For sequential table scans (SELECT * FROM table), SQLite reads leaf pages of the B-tree in order. If the table’s pages are physically contiguous on disk, this is an efficient sequential operation. If the database file is fragmented, or if the table was built through many random insertions, leaf pages may be scattered.
SQLite supports mmap() access via PRAGMA mmap_size, which allows the OS to manage page reads through the virtual memory system rather than explicit read() calls. For databases that fit in memory, mmap effectively eliminates I/O latency — pages are in memory after their first access and subsequent reads hit RAM. For databases larger than available RAM, mmap’s benefit depends on access patterns.
PostgreSQL: CLUSTER and Physical Row Order
PostgreSQL stores table rows as heap files — rows are appended in insertion order, not sorted by any key. A B-tree index provides sorted access but requires random I/O: the index is traversed in key order, but each index entry points to a heap tuple that could be anywhere in the file.
The CLUSTER command physically rewrites the heap file in the order defined by a specified index:
CLUSTER users USING users_created_at_idx;
After clustering, a sequential scan of users in created_at order becomes a sequential disk read rather than a series of random heap accesses. The benefit can be dramatic for range queries on the clustered key: instead of thousands of random I/Os, the database performs one large sequential read.
The limitation: clustering is a one-time operation. New rows inserted after the CLUSTER command go to new heap pages appended at the end, and the physical order diverges from the index order over time. Periodic re-clustering (or using a fill factor that leaves room for ordered inserts) is necessary to maintain the benefit.
PostgreSQL’s planner has a concept of correlation: a statistics value between -1 and 1 indicating how well the physical row order matches the index key order. A correlation near 1 means the table is effectively clustered; the planner will favor index scans for range queries. A correlation near 0 means rows are randomly distributed; the planner will often prefer sequential heap scans + sort over index scans that require random I/O.
MySQL/InnoDB: Clustered Primary Key
InnoDB (MySQL’s default storage engine since MySQL 5.5) takes a different architectural approach: every table has exactly one clustered index, always the primary key. Row data is stored directly in the B-tree leaf pages, in primary key order. There is no separate heap file.
This means a range query on the primary key is inherently a sequential disk read: SELECT * FROM orders WHERE order_id BETWEEN 1000 AND 2000 traverses leaf pages of the primary key B-tree in order, which corresponds to a sequential read of contiguous disk extents (assuming the B-tree is not fragmented).
Secondary indexes in InnoDB store the index key plus the primary key value (not a physical row pointer). A secondary index lookup requires:
- Sequential (or selective) read of the secondary index B-tree
- For each matching entry, a random access to the primary key B-tree to retrieve the full row
This “double lookup” is a known InnoDB performance characteristic. For queries that access many rows through a secondary index, it can generate substantial random I/O.
RocksDB and LSM-Trees: Write-Optimized Sequential
RocksDB (Facebook, 2012, forked from Google’s LevelDB which was created in 2011) implements the Log-Structured Merge-tree (LSM-tree), a data structure whose origins trace to a 1996 paper by Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil.
The LSM-tree’s key insight: sequential writes are fast; random writes are slow. Therefore, never update data in place. Instead:
- Write incoming mutations to an in-memory data structure (a MemTable, typically an ordered skip list or red-black tree)
- When the MemTable is full, sort it and flush it sequentially to disk as an immutable Sorted String Table (SSTable)
- Periodically compact SSTables by merging and re-sorting them (sequential read + sequential write)
- Reads must check the MemTable and all SSTables, using Bloom filters to skip SSTables that cannot contain the key
All disk I/O in an LSM-tree is sequential: writes are append-only flushes and compaction rewrites. The cost is read amplification (multiple files to check) and write amplification (data is written multiple times as SSTables are compacted). The design is explicitly optimized for write throughput, at the expense of read latency compared to B-trees.
RocksDB’s architecture reflects the same engineering truth the IBM 350 team discovered in 1956: organizing your work to be sequential is worth the investment.
Btrfs and ZFS: Copy-on-Write Complexity
Both Btrfs (Linux, initial release 2007, merged 2009) and ZFS (Sun Microsystems, 2005) use copy-on-write (COW) semantics: when a block is modified, the new version is written to a fresh location; the old block is invalidated after the metadata tree is updated to point to the new location.
COW has significant benefits: snapshots are instant and cheap (no data copying), data integrity is strong (old data remains valid until new data is confirmed written). But for sequential read performance on HDDs, COW creates a problem over time: files that are frequently modified accumulate extents scattered across the disk as each modification writes to a new location. A file that was originally sequential becomes fragmented.
Btrfs supports online defragmentation (btrfs filesystem defragment) to consolidate extents, but this breaks COW snapshots (defragmented extents are no longer shared with snapshots). ZFS does not support online defragmentation of fragmented files at the filesystem level; its solution is the VDEV-level prefetcher and the ZFS Adaptive Replacement Cache (ARC), which are optimized for keeping hot data in memory and reducing the impact of fragmentation.
The Inode-Sort Trick
The simplest, most widely applicable optimization for sequential HDD scanning is to sort files by inode number before reading them. The trick is known in the performance optimization community and predates any specific attribution.
The reason it works: on ext2/ext3/ext4 filesystems, the inode table is stored in fixed positions within each block group. Inode N is physically close to inode N+1 in the same block group. Reading inodes in numerical order approximates reading them sequentially from disk, even though the actual file data may not be in perfect order.
This was explicitly implemented in BorgBackup and its predecessor Attic after community benchmarking showed 30%+ speed improvements on ext3/ext4 for large directory trees. The mlocate/updatedb database-building tool also notes in its design that sorting directory entries by inode number improves disk performance.
In practice, the optimization is implemented as:
files.sort(key=lambda f: os.lstat(f).st_ino)
This sort imposes O(N log N) cost in inode lookups, but these lookups are cheap compared to the random I/O they prevent.
Two-Pass Scanning
Many high-performance tools separate the problem into two phases:
- Metadata pass: Walk the directory tree, collect all file paths and their
stat()metadata (including inode numbers, sizes, and modification times). This can be done with relatively few disk seeks if directory entries are read in order. - Content pass: Sort the collected files by inode (or physical extent), then read their content in that order.
This is the approach used by locate/mlocate (which stores pre-built metadata in a sorted database) and by sophisticated backup tools. The cost is the overhead of the metadata pass and the sort; the benefit is dramatically more sequential content reads.
Windows Superfetch and Prefetch
Windows XP introduced the Prefetcher — a component that monitors which files and pages are accessed during application startup and Windows boot. It stores this trace in .pf files in C:\Windows\Prefetch. When an application is next launched, the Prefetcher reads the trace and initiates readahead for all the listed files before the application explicitly requests them.
Windows Vista extended this with SuperFetch (now called SysMain). SuperFetch learns which applications are used at which times of day and begins loading their data into RAM during idle periods, so that commonly-used programs launch from RAM rather than from disk. It applies machine learning to observed access patterns, updating its model continuously.
Critically, Windows Prefetch stores access patterns as a list of files and their physical locations. The Disk Defragmenter reads Layout.ini (generated from Prefetch traces) and physically moves those files to contiguous positions near the beginning of the disk, so that boot-time reads are maximally sequential.
This is the same insight in software form: knowing the access pattern in advance allows you to arrange data so that reads are sequential.
io_uring: Async I/O for the Modern Era
Prior to io_uring, Linux had POSIX AIO (aio_read(), aio_write()): an asynchronous I/O interface introduced in the early 2000s. POSIX AIO had significant problems — many operations secretly fell back to blocking I/O in kernel threads, and the interface was complex and inconsistent across different file types.
io_uring was designed by Jens Axboe and merged into Linux 5.1 in March 2019. It uses two shared ring buffers between the kernel and userspace:
- Submission Queue (SQ): The application writes I/O requests here
- Completion Queue (CQ): The kernel writes completion notifications here
Because the buffers are shared memory, submitting an I/O request does not require a write() or ioctl() system call. The application can batch many submissions and then use io_uring_enter() once to submit them all — or in SQPOLL mode, a kernel thread polls the SQ continuously, eliminating system calls entirely.
For sequential scanning workloads, io_uring enables a pattern previously difficult to implement: submitting many reads in parallel, getting completions in any order, and processing each file as its data arrives. This allows the OS scheduler to reorder requests for optimal disk access while the application remains non-blocking. The result is that even a single-threaded application can have many concurrent I/O operations in flight, letting the elevator scheduler do its work.
io_uring also supports linked requests: operation B can be made to start automatically when operation A completes, without returning to userspace. This enables building pipelines — read file, then process data, then write result — without a round-trip through userspace for each step.
O_DIRECT: Bypassing the Page Cache
For large sequential workloads like backup jobs, the page cache can be counterproductive: reading 10 TB pollutes the cache with data that will never be accessed again, while evicting hot pages used by the rest of the system.
Opening a file with the O_DIRECT flag bypasses the page cache entirely. Data is transferred directly between the device and the application’s buffer via DMA. Requirements:
- All I/O offsets and sizes must be aligned to the device’s logical block size (typically 512 or 4,096 bytes)
- Buffers must be aligned to the same boundary
For streaming workloads (sequential read from start to end, no re-reads), O_DIRECT can achieve the same throughput as cached I/O while leaving the page cache undisturbed. The alternative — buffered I/O with posix_fadvise(FADV_DONTNEED) after each read — achieves similar cache behavior without the alignment requirements and is generally more portable.
Self-caching applications like RocksDB use O_DIRECT to implement their own buffer management, bypassing the OS cache which is unaware of their internal data structure semantics.
NVMe Queues and the Narrowing Gap
As described earlier, NVMe’s multi-queue architecture reduces the performance gap between sequential and random I/O on SSDs. At queue depth 1 (one outstanding I/O at a time), a modern NVMe drive might deliver:
- Sequential read: 3,500 MB/s
- Random 4K read: 50–100 MB/s (a 35–70x gap)
At queue depth 32 (32 concurrent I/Os):
- Sequential read: 3,500 MB/s (saturated by interface)
- Random 4K read: 800 MB/s (8x improvement from parallelism)
The gap is substantially narrowed by concurrency. The implication: on NVMe, submitting many concurrent I/Os often matters more than their ordering. io_uring with SQPOLL mode is the natural fit, enabling applications to maintain high queue depth without the overhead of repeated system calls.
ZNS: Bringing SMR Physics to NVMe
Zoned Namespace (ZNS) NVMe is a command set developed by the NVM Express working group. ZNS Task Group work began in late 2018, with the specification ratified in June 2020 and ZNS Command Set 1.1 released in June 2021.
ZNS divides an NVMe namespace into zones analogous to SMR drive zones: each zone must be written sequentially. Random writes within a zone are not supported; to overwrite data in a zone, the zone must be reset (erased) and rewritten from the beginning.
The motivation is economic: NVMe flash controllers maintain internal overprovisioning buffers and perform write leveling to hide the sequential-write requirement of NAND flash from the host. ZNS eliminates this abstraction layer, exposing the raw sequential-write requirement to the host software in exchange for:
- Less controller complexity and cost
- Better performance predictability (no background garbage collection surprises)
- Higher usable capacity (less overprovisioning needed)
- Lower write amplification (host software decides write ordering)
ZNS SSDs and SMR HDDs share a unified software stack on Linux: the Zoned Block Device (ZBD) interface. Filesystems like f2fs (Flash-Friendly File System) and experimental ZNS-aware extensions to Btrfs and ext4 are being developed to take advantage of ZNS.
ZNS is a striking example of physics reasserting itself: the sequential write requirement of NAND flash, hidden for years behind increasingly complex controllers, is now exposed as an architectural feature.
Cloud Storage and the Reintroduction of Seek Costs
Object storage services like Amazon S3, Google Cloud Storage, and Azure Blob Storage do not have spinning platters (they use HDDs under the hood, at scale). But they reintroduce a different kind of “seek cost”: the round-trip latency of an HTTP request.
A GET request to S3 takes 5–20 ms of overhead before the first byte arrives, regardless of object size. For small objects (< 1 MB), this latency dominates: downloading 1,000 small objects sequentially takes 5–20 seconds of network round-trips alone. The solution — exactly analogous to batching small HDD reads — is to store many small items together in large “pack files” and download them in bulk. This is why backup tools like Borg, restic, and Kopia aggregate chunks into large pack files before uploading: they amortize the per-request overhead over many megabytes, just as good filesystem layout amortizes seek time over large sequential reads.
Amazon S3’s performance guidance explicitly recommends parallelism: using multiple threads to issue concurrent GET requests achieves up to 5,500 read requests per second per prefix, with throughput scaling linearly with concurrency. The S3 backend is itself a massive distributed system backed by billions of HDDs and SSDs, and internally it must perform all the sequential I/O optimizations described in this article — at exabyte scale. As of 2025, S3 stores over 500 trillion objects and serves peaks of approximately 1 petabyte per second.
Content-Addressable Storage and I/O Patterns
Content-addressable storage (CAS) systems — like the repositories used by Borg, restic, and git’s object store — address data by the hash of its content rather than its location. This creates interesting I/O patterns:
- Writes are always sequential appends (new content goes at the end of pack files or object stores)
- Reads for deduplication checks involve reading hash manifests, which can be kept in memory or small sequential files
- Restores require reading pack files to extract specific chunks, which may involve random access within pack files if chunks are not sorted by restore order
The tension in CAS is between write optimization (always append) and read optimization (sequential layout for restores). Systems that pack chunks in write-time order are fast to create but slow to restore if the restore order differs from the write order. Systems that sort chunks by file-and-offset order at write time are faster to restore but add packing overhead at backup time.
The history of sequential I/O optimization connects to many adjacent topics that deserve deeper treatment:
Defragmentation Software History: A complete history of disk defragmenters, from the early DOS tools (Vopt, Speed Disk, Norton Defragmenter) through Windows XP’s built-in defragmenter to modern tools like Defraggler. The evolution from simple “consolidate free space” to sophisticated algorithms that account for file access frequency (Superfetch-informed defrag).
RAID and Artificial Sequentiality: How RAID striping (RAID-0, RAID-5, RAID-6) creates parallel sequential access from distributed writes, effectively converting a set of independent sequential channels into one wider channel. The relationship between stripe size and sequential vs. random I/O performance.
Filesystem Journaling and I/O Overhead: How journaling (write-ahead logging of metadata changes) adds sequential I/O overhead but prevents corruption on crash. The difference between metadata journaling (ext3 default) and data journaling (every data write goes through the journal, halving write throughput but maximizing integrity). How modern filesystems use journal checkpointing to manage this overhead.
Memory-Mapped Files vs. read() for Scanning: The detailed performance comparison between mmap()+MADV_SEQUENTIAL and read() with large buffers. Cases where mmap wins (the OS page table serves as a zero-copy buffer) and where read() wins (large sequential reads where mmap’s page fault overhead dominates). How ripgrep automatically chooses between them.
B-trees and Disk Seek Patterns: The deep relationship between B-tree node size and disk page size, how B-tree splits and merges generate fragmentation over time, and why B-tree height (the number of seeks per lookup) is the central optimization target. How in-memory B-trees differ from on-disk B-trees in implementation.
Log-Structured Merge Trees (LSM) as Sequential Write Optimization: A full treatment of the write-optimized vs. read-optimized tradeoff — why LSM-trees appear in write-heavy systems (Cassandra, RocksDB, LevelDB, Apache HBase, InfluxDB) and how compaction strategies (Tiered vs. Leveled) affect the balance.
Tape Drives and Absolute Sequentiality: The history and physics of magnetic tape — LTO (Linear Tape-Open), IBM TS series — where sequential access is not an optimization but an absolute requirement. The modern role of tape in cold storage (costs < $0.005/GB at scale). The file format LTFS (Linear Tape File System) and how it adds a POSIX-like interface to sequential media.
Block-Level vs. File-Level Deduplication I/O Patterns: How block-level dedup (implemented in storage arrays and ZFS) differs from file-level dedup (Borg, restic) in its I/O profile — block-level dedup reads every write in random order for hash comparison, while file-level dedup processes files sequentially but must maintain a large hash index.
Virtual Memory and TLB Effects on Sequential Scans: How CPU translation lookaside buffer (TLB) pressure affects the performance of sequential scans with mmap(), and why huge pages (MAP_HUGETLB on Linux) can significantly improve throughput for large sequential scans by reducing TLB misses.
The Relationship Between SSDs and Software Architecture: How the shift from HDD to SSD has changed or should change software design assumptions — which optimizations are still worthwhile (sequential writes for write amplification), which are less critical (inode sorting), and which new considerations appear (write endurance, SSD internal parallelism, ZNS).
The IBM 350 RAMAC’s 1-second seek time in 1956 set the terms of a contest that would last seven decades. Every engineering artifact described in this article — the elevator algorithm, ext4 block groups, the Orlov allocator, FIEMAP, posix_fadvise, BorgBackup’s inode sort, RocksDB’s LSM-tree, io_uring’s ring buffer, ZNS’s sequential write zones — is a response to the same fundamental physical constraint: moving a mechanical head across a spinning disk takes time, and that time compounds catastrophically when reads are not sequential.
The sophistication of the solutions increased in proportion to the importance of the problem. As drive capacities grew from megabytes to terabytes and the number of files on a typical system grew from thousands to millions, the cost of random I/O shifted from “noticeable” to “completely untenable.” The response was layers of optimization: hardware scheduling in the drive, kernel-level I/O scheduling, filesystem allocation policy, application-level sorting, and finally, async submission pipelines that let the kernel reorder requests on the application’s behalf.
NVMe and SSDs have changed the picture considerably. The seek time that drove all of this engineering is now measured in microseconds rather than milliseconds. But sequential I/O optimization has not become irrelevant — it has shifted. Write amplification in NAND flash, S3 request overhead, CPU cache and prefetcher behavior, and the raw economics of memory bandwidth all still reward applications that access data in order. The principle generalizes beyond spinning rust: reading in order is almost always faster than reading at random, whatever the medium.
Sources consulted:
- Computer History Museum: First commercial hard disk drive shipped (1956)
- Tom’s Hardware: IBM RAMAC 350 69th anniversary
- LWN.net: Orlov block allocator for ext3
- Linux Kernel Documentation: Ext4 Block and Inode Allocation Policy
- Linux Kernel Documentation: Fiemap Ioctl
- LWN.net: Fiemap, an extent mapping ioctl
- man7.org: posix_fadvise(2)
- Linux Kernel Documentation: BFQ I/O Scheduler
- Jens Axboe: Efficient IO with io_uring
- NVM Express: ZNS Command Set Specification
- BurntsushI: ripgrep is faster than grep
- GitHub: Speed up directory scanning by ordering by inode (Attic issue #91)
- GitHub: Order of traversing files in a directory (Borg issue #905)
- The Eclectic Light Company: A brief history of Time Machine
- Wikipedia: Zone bit recording
- Wikipedia: Cylinder-head-sector
- Wikipedia: tar (computing)
- PostgreSQL Documentation: CLUSTER
- GetStream: The Fundamentals of RocksDB
- Microsoft Learn: FSCTL_GET_RETRIEVAL_POINTERS
- Computer History Museum: First cartridge HDD and voice coil actuator (1965)