DEV Community: Joichiro Mitaka

Using Zstd Frames to Egress Partial Parquet Files

Joichiro Mitaka — Wed, 24 Jun 2026 15:26:22 +0000

Jump Tables, TLV Footers, and the Real Cost of Reading What You Don't Need

You're paying for bytes you never read.

A data engineer on a busy pipeline touches dozens of Parquet files a day: schema discovery, predicate pushdown, column pruning, metadata scrapes for a data catalog sync. In each case, the application needs maybe 200 KB of context from a file that is 4 GB on disk. Without a seekable archive format and a jump table to find the right frame, your HTTP client fetches the whole thing, and your cloud egress invoice reflects every unnecessary gigabyte.

This post quantifies the problem, then walks through how HuskHoard uses seekable Zstd frames, a per-volume jump table, and TLV-encoded footer metadata to make partial egress a first-class citizen across multi-volume archives — disk, cloud, and LTO tape alike.

The Problem, In Dollars

S3 standard egress runs $0.09/GB. GCS is $0.08/GB. Even Cloudflare R2, which is free for egress from R2 to the internet, still costs you in latency and API call count when you cannot bound the range of bytes you need.

Here is a representative read pattern for a cold analytics archive:

Operation	Bytes Needed	Bytes Fetched (naïve)	Ratio
Schema discovery	~50 KB (Parquet footer)	1–8 GB (full file)	~1:20,000 to 1:160,000
Single column scan	~200 MB (one column chunk)	4 GB (full file)	1:20
Data catalog sync (1M files)	~50 GB (footers only)	~4 PB (full files)	1:80,000
Selective restore (1 row group)	~128 MB	4 GB	1:32

On 100 TB of cold Parquet data with $0.09/GB egress:

Full read for schema sync: 100 TB × $0.09 = $9,216
Partial read (footers only, avg 100 KB/file, 1M files): ~100 GB × $0.09 = $9.00
Savings per catalog sync: $9,207 — 99.9% reduction

Even a conservative column-scan scenario (pulling 15% of each file's bytes) cuts a $9,216 monthly read bill to $1,382. The ceiling on savings is determined entirely by how precisely you can address the bytes you actually need.
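To make the arithmetic above easy to re-run with your own prices, here is a minimal sketch. The function name `egress_cost` and the constant `GB_PER_TB` are made up for illustration; the only inputs taken from the post are the $0.09/GB price, the 100 TB archive, the 1M files at ~100 KB of footer each, and the 15% column-scan case.

```python
# Sketch: egress cost for a full read versus partial reads.
GB_PER_TB = 1024  # the post's $9,216 figure implies 1 TB = 1,024 GB


def egress_cost(gigabytes: float, price_per_gb: float = 0.09) -> float:
    """Return the egress charge in dollars for a given transfer size."""
    return gigabytes * price_per_gb


full_read = egress_cost(100 * GB_PER_TB)           # every byte of 100 TB
footers_only = egress_cost(1_000_000 * 100 / 1e6)  # 1M files x 100 KB = ~100 GB
column_scan = egress_cost(100 * GB_PER_TB * 0.15)  # 15% of each file's bytes

print(f"full read:    ${full_read:,.2f}")
print(f"footers only: ${footers_only:,.2f}")
print(f"column scan:  ${column_scan:,.2f}")
print(f"catalog-sync reduction: {1 - footers_only / full_read:.1%}")
```

Swap in your provider's tiered pricing and your real footer sizes before trusting the totals; the shape of the result (cost scales with bytes addressed, not bytes stored) is the point.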

That precision is what frames and jump tables buy you.

Zstd Frames: What They Are and Why They Matter

A single .zst file produced by the standard zstd CLI is one frame. Everything inside is a single compressed stream. You have to start decompression at byte 0 to reach any byte inside.

But the Zstd spec allows a concatenation of independent frames. Each frame is a complete, self-contained unit:

[Frame 0][Frame 1][Frame 2]...[Frame N]
 ^         ^         ^           ^
 16 MB     16 MB     16 MB       partial

Every frame has a known compressed_size and decompressed_size. If you know those sizes in advance (stored in a jump table), you can seek directly to Frame N by summing the compressed sizes of frames 0 through N-1. You never decompress anything you don't need. Frame N is fetched with a single HTTP Range request, decompressed independently, and the relevant bytes are piped downstream.
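The offset arithmetic is just a prefix sum. A minimal sketch, assuming the frames sit back to back on the volume with nothing between them; `frame_byte_range` and the sizes in `sizes` are hypothetical, not taken from HuskHoard:

```python
from itertools import accumulate


def frame_byte_range(compressed_sizes: list[int], n: int) -> tuple[int, int]:
    """Return the inclusive (start, end) byte range of frame n.

    Assumes frames are stored contiguously, so frame n starts at the sum
    of the compressed sizes of frames 0..n-1.
    """
    starts = [0, *accumulate(compressed_sizes)]
    start = starts[n]
    end = start + compressed_sizes[n] - 1  # HTTP ranges are inclusive
    return start, end


sizes = [7_000_000, 6_500_000, 7_200_000, 3_100_000]  # made-up compressed sizes
start, end = frame_byte_range(sizes, 2)
print(f"Range: bytes={start}-{end}")  # prints "Range: bytes=13500000-20699999"
```

In practice you would not recompute the sum per request; storing the running offset per frame (as the jump table described later does) turns the lookup into a single indexed read.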

This is the architectural core of HuskHoard's egress model, and it maps cleanly onto how the Parquet format itself carves up a file.

The Parquet Parallel: Row Groups as Frames

Parquet is deliberately designed for partial reads. A Parquet file contains:

Row groups — horizontal partitions of the data, each independently readable
Column chunks — vertical slices within a row group
Page headers — per-page metadata within each column chunk
Footer: the FileMetaData Thrift struct at the end of the file, containing the schema, row group offsets, column statistics, and key-value metadata. It is followed by a 4-byte little-endian footer length and then the magic bytes PAR1, which terminate the file.

A reader that wants only the footer performs two range requests: one to get the last 8 bytes (footer length + magic), and one to get the footer itself. Everything else stays on the remote. A reader that wants one column from one row group consults the footer to find the column chunk's byte offset and length, then fires a single range request.
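The two-request footer read can be sketched in a few lines. This only does the byte arithmetic; it does not decode the Thrift struct. `footer_range` is a hypothetical helper, and the `blob` below is stand-in bytes shaped like a Parquet file, not a real one.

```python
import struct

PARQUET_MAGIC = b"PAR1"


def footer_range(file_size: int, tail: bytes) -> tuple[int, int]:
    """Given a file's last 8 bytes, return the inclusive byte range of its footer."""
    if len(tail) != 8 or tail[4:] != PARQUET_MAGIC:
        raise ValueError("tail does not end with the PAR1 magic")
    (footer_len,) = struct.unpack("<I", tail[:4])  # little-endian u32
    end = file_size - 8 - 1          # last footer byte sits just before the tail
    start = end - footer_len + 1
    return start, end


# Stand-in layout: magic, data pages, a fake 50-byte "footer", length, magic.
fake_footer = b"F" * 50
blob = (PARQUET_MAGIC + b"D" * 1000 + fake_footer
        + struct.pack("<I", len(fake_footer)) + PARQUET_MAGIC)

# Request 1 would be "Range: bytes=-8"; request 2 uses the range computed here.
start, end = footer_range(len(blob), blob[-8:])
print(f"Range: bytes={start}-{end}")  # prints "Range: bytes=1004-1053"
```

Many readers collapse this into a single speculative request for the last 64 KB or so and only issue a second one when the footer turns out to be larger.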

HuskHoard's frame model mirrors this exactly, but at the archive level rather than within a single Parquet file.

HuskHoard's Implementation: Frames, the Catalog, and the Jump Table

When HuskHoard archives a file to any backend — a flat image file acting as a tape volume, a physical LTO cartridge, or an rclone cloud remote — it writes the payload as a sequence of 16 MB Zstd frames. For each frame, it records the mapping between uncompressed byte position and compressed byte position on the volume.

That mapping is the object_frames table in husk_catalog.db:

CREATE TABLE IF NOT EXISTS object_frames (
    file_path           TEXT    NOT NULL,
    version             INTEGER NOT NULL,
    uncompressed_offset INTEGER NOT NULL,   -- where this frame starts in the original file
    compressed_offset   INTEGER NOT NULL,   -- where this frame starts on the storage volume
    compressed_size     INTEGER NOT NULL    -- how many bytes to fetch from the volume
);

CREATE INDEX IF NOT EXISTS idx_frames
    ON object_frames (file_path, version);

This is the jump table. Given a byte range request for bytes=2147483648-2281701375 (a 128 MB window starting at the 2 GB mark; HTTP ranges are inclusive), the gateway first locates the frame that contains the starting byte:

SELECT compressed_offset, compressed_size
FROM   object_frames
WHERE  file_path = '/warehouse/events/2024-01-01.parquet'
  AND  version   = (SELECT MAX(version) FROM object_frames WHERE file_path = ...)
  AND  uncompressed_offset <= 2147483648
ORDER  BY uncompressed_offset DESC
LIMIT  1;

One row. One seek. One range request against the volume, running from that starting frame through the last frame the window touches. Everything else stays dark.
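You can try the lookup against an in-memory SQLite database. This sketch reuses the object_frames columns from the schema above; the file contents are invented (256 frames that each happen to compress to exactly 7,000,000 bytes), and `starting_frame` is a hypothetical helper, not a HuskHoard function.

```python
import sqlite3

FRAME = 16 * 1024 * 1024  # 16 MB of uncompressed data per frame, as in the post

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE object_frames (
    file_path           TEXT    NOT NULL,
    version             INTEGER NOT NULL,
    uncompressed_offset INTEGER NOT NULL,
    compressed_offset   INTEGER NOT NULL,
    compressed_size     INTEGER NOT NULL)""")
db.execute("CREATE INDEX idx_frames ON object_frames (file_path, version)")

# Made-up 4 GB file: 256 frames, each compressing to exactly 7,000,000 bytes.
path = "/warehouse/events/2024-01-01.parquet"
db.executemany(
    "INSERT INTO object_frames VALUES (?, 1, ?, ?, ?)",
    [(path, i * FRAME, i * 7_000_000, 7_000_000) for i in range(256)],
)


def starting_frame(conn, file_path, version, byte_offset):
    """Return the jump-table row for the frame that contains byte_offset."""
    return conn.execute(
        """SELECT uncompressed_offset, compressed_offset, compressed_size
           FROM   object_frames
           WHERE  file_path = ? AND version = ? AND uncompressed_offset <= ?
           ORDER  BY uncompressed_offset DESC
           LIMIT  1""",
        (file_path, version, byte_offset),
    ).fetchone()


# A few bytes past the 2 GB mark still resolves to the frame starting at 2 GB.
row = starting_frame(db, path, 1, 2_147_483_648 + 5)
print(row)  # prints "(2147483648, 896000000, 7000000)"
```

If your catalog grows large, extending the index to include uncompressed_offset would let SQLite answer this query straight from the index; that is a tuning suggestion, not something the post's schema does.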

The HTTP gateway loop in StreamGate:

HTTP Range request arrives (bytes=X-Y)
        │
        ▼
Query object_frames → nearest frame boundary ≤ X
        │
        ▼
Seek to compressed_offset on volume (tape block, S3 range, local seek)
        │
        ▼
Decompress forward to exact byte X, stream through Y
        │
        ▼
Client receives exactly what it asked for

For a 4K video file seeking to the 2-hour mark, this is why mpv can start playing from tape or S3 in under a second instead of waiting for a multi-gigabyte download.
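The gateway loop boils down to translating an inclusive uncompressed range X..Y into one compressed range on the volume, plus how many decompressed bytes to discard before streaming. A sketch under two assumptions that the post does not spell out: every frame except the last holds exactly 16 MB of uncompressed data, and a file's frames are contiguous on the volume. `Plan` and `plan_range` are hypothetical names.

```python
from typing import NamedTuple

FRAME = 16 * 1024 * 1024  # uncompressed bytes per frame


class Plan(NamedTuple):
    first_frame: int
    last_frame: int
    volume_start: int   # first compressed byte to fetch (inclusive)
    volume_end: int     # last compressed byte to fetch (inclusive)
    skip: int           # decompressed bytes to discard before byte X
    take: int           # decompressed bytes to stream to the client


def plan_range(frames: list[tuple[int, int]], x: int, y: int) -> Plan:
    """frames[i] = (compressed_offset, compressed_size); x..y is inclusive."""
    first, last = x // FRAME, y // FRAME
    volume_start = frames[first][0]
    volume_end = frames[last][0] + frames[last][1] - 1
    return Plan(first, last, volume_start, volume_end,
                skip=x - first * FRAME, take=y - x + 1)


# Made-up jump table: 256 frames of 7,000,000 compressed bytes each.
frames = [(i * 7_000_000, 7_000_000) for i in range(256)]

plan = plan_range(frames, 536_870_912, 671_088_639)  # bytes 512 MB .. 640 MB - 1
print(plan.last_frame - plan.first_frame + 1, "frames")  # prints "8 frames"
print(f"Range: bytes={plan.volume_start}-{plan.volume_end}")
```

Note the off-by-one trap: asking for `y = 671_088_640` instead of `671_088_639` touches a ninth frame just to deliver one extra byte.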

TLV Footers: Turning the Frame Header Into a Parquet-Style Catalog Entry

Every file archived by HuskHoard is preceded on the storage volume by a strict 4,096-byte ObjectHeader. The first 136 bytes carry the fixed-width mechanics: a magic string (USTDHUSK), the file's UUID, POSIX permissions, BLAKE3 hash, compressed and uncompressed sizes, and a CRC32 of the header itself.

The remaining 3,960 bytes are dedicated to TLV (Type-Length-Value) encoded metadata. This is the same binary framing used in X.509 certificates, SNMP, and dozens of wire protocols, chosen specifically because unknown type codes can be safely skipped by any forward-compatible parser.

Byte  0 –  7: Magic "USTDHUSK"
Byte  8 – 23: Volume UUID (16 bytes)
Byte 24 – 55: BLAKE3 hash (32 bytes)
Byte 56 – 63: Uncompressed payload size
Byte 64 – 71: Compressed payload size
Byte 72 – 79: Original mtime
Byte 80 – 83: POSIX mode
Byte 84 – 87: Header CRC32
Byte 88 –135: File path (null-terminated, 48 bytes max inline)
Byte 136–4095: TLV region (3,960 bytes)
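The layout above maps directly onto a fixed-width struct. Here is a sketch of a parser for the first 136 bytes. The byte offsets come from the table; everything else is an assumption for illustration: little-endian integers, mtime stored as a u64, and the names `FIXED` and `parse_fixed_header`. CRC validation is left out because the post does not say which bytes the CRC32 covers.

```python
import struct

HEADER_SIZE = 4096
MAGIC = b"USTDHUSK"
# magic, UUID, BLAKE3, uncompressed size, compressed size, mtime, mode, CRC32, path
FIXED = struct.Struct("<8s16s32sQQQII48s")  # ASSUMPTION: little-endian, no padding


def parse_fixed_header(header: bytes) -> dict:
    """Split a 4,096-byte ObjectHeader into its fixed fields and raw TLV region."""
    if len(header) != HEADER_SIZE:
        raise ValueError("an ObjectHeader is exactly 4,096 bytes")
    (magic, uuid, blake3, uncompressed, compressed,
     mtime, mode, crc32, raw_path) = FIXED.unpack_from(header, 0)
    if magic != MAGIC:
        raise ValueError("bad magic")
    return {
        "uuid": uuid.hex(),
        "blake3": blake3.hex(),
        "uncompressed_size": uncompressed,
        "compressed_size": compressed,
        "mtime": mtime,
        "mode": mode,
        "crc32": crc32,  # not verified here, see the note above
        "path": raw_path.split(b"\x00", 1)[0].decode("utf-8"),
        "tlv_region": header[FIXED.size:],
    }


# Build a made-up header to exercise the parser.
sample = FIXED.pack(MAGIC, bytes(16), bytes(32), 4 << 30, 1_800_000_000,
                    1_700_000_000, 0o100644, 0, b"/warehouse/events/a.parquet")
fields = parse_fixed_header(sample.ljust(HEADER_SIZE, b"\x00"))
print(FIXED.size, fields["path"], len(fields["tlv_region"]))
# prints "136 /warehouse/events/a.parquet 3960"
```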

A TLV tag entry looks like:

[Type: u8][Key-Length: u16][Key: bytes][Value-Length: u32][Value: bytes]
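To see why unknown type codes are harmless, here is a sketch of an encoder and a forward-compatible decoder for that entry shape. The field widths (u8, u16, u32) come from the line above. The byte order of the length fields and the convention that a zero type byte marks the start of padding are assumptions of this sketch, as are the names `encode_tlv`, `decode_tlvs`, and the `0x7F` "future" type.

```python
import struct

KNOWN_TYPES = {0x01, 0x02}   # the two type codes that appear in this post
TLV_REGION_SIZE = 3960


def encode_tlv(tag_type: int, key: str, value: bytes) -> bytes:
    """[Type: u8][Key-Length: u16][Key][Value-Length: u32][Value]"""
    key_bytes = key.encode("utf-8")
    return (struct.pack("<BH", tag_type, len(key_bytes)) + key_bytes
            + struct.pack("<I", len(value)) + value)


def decode_tlvs(region: bytes):
    """Yield (type, key, value) for every entry, stopping at zero padding."""
    pos = 0
    while pos + 3 <= len(region):
        tag_type, key_len = struct.unpack_from("<BH", region, pos)
        if tag_type == 0:  # ASSUMPTION: a zero type byte means "padding from here on"
            return
        pos += 3
        key = region[pos:pos + key_len].decode("utf-8")
        pos += key_len
        (value_len,) = struct.unpack_from("<I", region, pos)
        pos += 4
        yield tag_type, key, region[pos:pos + value_len]
        pos += value_len  # the lengths alone are enough to step over any entry


entries = (
    encode_tlv(0x02, "parquet.row_count", struct.pack("<Q", 1_000_000))
    + encode_tlv(0x7F, "future.tag", b"\x01\x02")  # a type this parser does not know
    + encode_tlv(0x01, "workflow.owner", b"data-eng-team")
)
region = entries.ljust(TLV_REGION_SIZE, b"\x00")

tags = {key: value for tag_type, key, value in decode_tlvs(region)
        if tag_type in KNOWN_TYPES}
row_count = struct.unpack("<Q", tags["parquet.row_count"])[0]
print(row_count, tags["workflow.owner"])  # prints "1000000 b'data-eng-team'"
```

The decoder never needs to understand a value to get past it, which is the property that lets old readers survive new tag types.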

This is where the Parquet footer analogy becomes structural rather than metaphorical. For a Parquet file being archived, HuskHoard can embed the Parquet FileMetaData statistics directly into this TLV region:

TLV Type	Key	Value
`0x02`	`parquet.schema`	Serialized Thrift schema (JSON or binary)
`0x02`	`parquet.row_count`	Total row count as little-endian u64
`0x02`	`parquet.col.event_ts.min`	Minimum value of `event_ts` column
`0x02`	`parquet.col.event_ts.max`	Maximum value of `event_ts` column
`0x02`	`parquet.col.user_id.null_count`	Null count for `user_id`
`0x02`	`parquet.row_group.count`	Number of row groups
`0x01`	`workflow.pipeline`	`"ingest_v3"` — POSIX xattr from source
`0x01`	`workflow.owner`	`"data-eng-team"`

These statistics travel physically bonded to the data on every storage medium — disk image, tape cartridge, S3 object. If the SQLite catalog is lost, husk rebuild walks the volume, reads every 4 KB header, and reconstructs the catalog complete with all column statistics. The tape is entirely self-describing.

But the real payoff is what this enables while the catalog is present.

Multi-Volume Catalog Queries: Pruning at the Volume Level

A production HuskHoard deployment might span several volumes:

Volume A (tape, 12 TB) — archive 2022–2023
Volume B (tape, 12 TB) — archive 2023–2024
Volume C (NVMe image, 2 TB) — archive 2024–present
Volume D (S3:us-east-1, 50 TB) — cloud replica

The catalog table records which volume holds each archived version of each file:

SELECT
    c.file_path,
    c.tape_uuid,          -- identifies the volume
    c.tape_offset,        -- byte offset of the ObjectHeader on that volume
    c.payload_size,
    c.compressed_size,
    c.custom_metadata     -- mirrors the TLV tags as JSON
FROM   catalog c
WHERE  json_extract(c.custom_metadata, '$.parquet.col.event_ts.min') >= '2024-01-01'
  AND  json_extract(c.custom_metadata, '$.parquet.col.event_ts.max') <= '2024-03-31'
  AND  json_extract(c.custom_metadata, '$.parquet.row_count')        > 0;

This query executes in milliseconds against the SQLite catalog on your SSD. The tape drives stay spun down. S3 is never contacted. You get back a list of (file_path, tape_uuid, tape_offset) tuples — the exact volumes and positions to touch.
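One detail worth testing before you rely on a query like this: in SQLite's JSON path syntax a dot separates nesting levels, so `'$.parquet.row_count'` looks for a `row_count` key inside a `parquet` object. If the JSON mirror is a flat object keyed by the full tag name (an assumption; the post does not show the stored JSON), the key has to be quoted inside the path. A sketch with three invented catalog rows and a trimmed-down table:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE catalog (
    file_path TEXT, tape_uuid TEXT, tape_offset INTEGER, custom_metadata TEXT)""")


def stats(day: str, rows: int) -> str:
    # ASSUMPTION: tags are mirrored as a flat JSON object keyed by full tag name.
    return json.dumps({
        "parquet.col.event_ts.min": day,
        "parquet.col.event_ts.max": day,
        "parquet.row_count": rows,
    })


db.executemany("INSERT INTO catalog VALUES (?, ?, ?, ?)", [
    ("/warehouse/events/2023-11-02.parquet", "volume-b", 4096, stats("2023-11-02", 10)),
    ("/warehouse/events/2024-01-15.parquet", "volume-c", 8192, stats("2024-01-15", 25)),
    ("/warehouse/events/2024-05-01.parquet", "volume-c", 12288, stats("2024-05-01", 40)),
])

QUERY = """
SELECT file_path, tape_uuid, tape_offset
FROM   catalog
WHERE  json_extract(custom_metadata, '$."parquet.col.event_ts.min"') >= ?
  AND  json_extract(custom_metadata, '$."parquet.col.event_ts.max"') <= ?
  AND  json_extract(custom_metadata, '$."parquet.row_count"')        > 0
"""
hits = db.execute(QUERY, ("2024-01-01", "2024-03-31")).fetchall()
print(hits)  # prints "[('/warehouse/events/2024-01-15.parquet', 'volume-c', 8192)]"

# The unquoted path walks nested objects and finds nothing in a flat document.
unquoted = db.execute(
    "SELECT json_extract(custom_metadata, '$.parquet.row_count') FROM catalog LIMIT 1"
).fetchone()[0]
print(unquoted)  # prints "None"
```

If the mirror is nested instead, the unquoted paths in the query above are the right ones. Also note that this predicate keeps only files that fall entirely inside the window; to catch files that merely overlap it, compare max against the window start and min against the window end.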

Then, per file, for each column you actually need:

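SELECT compressed_offset, compressed_size
FROM   object_frames
WHERE  file_path           = '/warehouse/events/2024-01-15.parquet'
  AND  version             = 3
  -- A chunk that starts mid-frame still needs the frame that begins before it,
  -- so look back by one frame length (16 MB = 16777216 bytes).
  AND  uncompressed_offset >  :col_chunk_start - 16777216
  AND  uncompressed_offset <= :col_chunk_end
ORDER  BY uncompressed_offset;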

You issue a range request for only those frames. For a 4 GB Parquet file where you need one 200 MB column chunk:

Step	Data Transferred
Catalog query (SQLite, local)	0 bytes egress
object_frames lookup (SQLite, local)	0 bytes egress
HTTP Range to S3 (compressed frame bytes)	~85 MB (at 2.4:1 Zstd ratio)
Total vs naïve full-file fetch	85 MB vs 1.7 GB

That is a 95% reduction on a per-query basis.

Putting Numbers to the Savings

Let's use a concrete scenario: a data team maintains a 10 TB cold Parquet archive on S3, with an average file size of 4 GB. They run three workloads:

Workload A — Nightly catalog sync (schema + statistics only)

Files: 2,500 Parquet files
Data needed per file: footer only (~150 KB each)
Total needed: ~375 MB
Full-file cost: 10 TB × $0.09 = $921.60 per sync
Partial-frame cost: 375 MB × $0.09 = $0.03 per sync
Savings per sync: $921.57

Workload B — Ad-hoc column scan (one column across 20% of files)

Files queried: 500 (selected by TLV statistics predicate)
Column chunk per file: ~200 MB uncompressed → ~85 MB compressed frames
Total fetched: ~42.5 GB
Full-file cost: 500 × 4 GB × $0.09 = $180.00/query
Partial-frame cost: 42.5 GB × $0.09 = $3.83/query
Per-query savings: $176.17 (97.9%)

Workload C — Point-in-time restore of a single row group

1 file × 1 row group = 128 MB uncompressed → ~54 MB compressed
Full-file cost: 4 GB × $0.09 = $0.36
Partial-frame cost: 54 MB × $0.09 = $0.005
Per-restore savings: $0.355 (98.6%)

At scale, Workload A alone (a catalog sync that most teams run without thinking about the bill) generates ~$11,000/year in unnecessary egress on a 10 TB archive even if it runs only once a month, and roughly 30 times that if it really runs nightly. The frame-indexed approach reduces the once-a-month figure to under $1/year.

The Self-Describing Volume: Your Catalog Backup Is on the Tape

One underappreciated consequence of storing TLV column statistics in every ObjectHeader is that the volume itself becomes a data catalog. After a complete disaster recovery (catalog database lost, fresh server), husk rebuild walks the storage volume 4 KB at a time, reads every ObjectHeader, validates the CRC32, and inserts a new catalog row including all TLV-encoded metadata — column statistics, schema, pipeline tags, everything.
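The walk itself is simple enough to sketch. This version only looks for the magic string at 4 KB boundaries, assuming (as the "4 KB at a time" description suggests) that every ObjectHeader starts on one and that payloads are padded up to the next boundary. It skips the CRC32 check because the post does not say which bytes the checksum covers, so a real rebuild would be stricter. `find_header_offsets` and `fake_object` are hypothetical helpers.

```python
MAGIC = b"USTDHUSK"
BLOCK = 4096


def find_header_offsets(volume: bytes) -> list[int]:
    """Return every 4 KB-aligned offset whose block starts with the magic string."""
    return [off for off in range(0, len(volume) - BLOCK + 1, BLOCK)
            if volume[off:off + len(MAGIC)] == MAGIC]


def fake_object(payload: bytes) -> bytes:
    """Made-up on-volume object: a 4 KB header, then the payload padded to 4 KB."""
    header = MAGIC.ljust(BLOCK, b"\x00")
    padded_len = -(-len(payload) // BLOCK) * BLOCK  # round up to a block multiple
    return header + payload.ljust(padded_len, b"\x00")


volume = fake_object(b"a" * 10_000) + fake_object(b"b" * 100)
offsets = find_header_offsets(volume)
print(offsets)  # prints "[0, 16384]"
```

A payload block could start with the same eight bytes by coincidence, which is exactly why validating the header CRC32 before inserting a catalog row matters.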

The catalog is not a separate system that the archive depends on. The catalog is a cache that accelerates access to information already encoded in the archive itself. This is the same philosophical commitment Parquet makes: the footer is not a separate sidecar file; it is part of the format.

For teams integrating with external data catalogs (Apache Atlas, Hive Metastore, Unity Catalog), this means HuskHoard can emit catalog events on husk rebuild just as well as on initial archive — the metadata survives the worst failure scenario, format-native.

Wiring It Up: What a Partial Read Looks Like End-to-End

A data engineer's dbt model lands at the StreamGate HTTP gateway with a Range: bytes=536870912-671088639 request (the 128 MB window from 512 MB to 640 MB, pulling a specific row group from a 4 GB Parquet file on S3):

1. GET http://localhost:8080/v1/stream/warehouse/events/2024-01-15.parquet
   Range: bytes=536870912-671088639

2. Gateway queries object_frames:
   → nearest frame boundary ≤ 536870912 is at uncompressed_offset=536870912
   → compressed_offset=225,978,112 on Volume D (S3:us-east-1)
   → 8 frames needed (128 MB ÷ 16 MB), compressed total = 57.1 MB

3. Gateway fires:
   GET s3://huskhoard-cold/volume-d.img
   Range: bytes=225978112-285884415

4. Gateway decompresses frames on the fly, streams bytes 536870912–671088639
   to the client.

5. Total egress from S3: 57.1 MB
   Total egress if client had fetched the full file: 1.71 GB
   Savings: 96.7%
   Time to first byte (LAN): ~180ms vs ~14s for full-file download

The client — dbt, Spark, DuckDB, curl, whatever — receives a standard HTTP 206 Partial Content response. No special client library. No SDK. Just the HTTP Range spec, universally supported.

Practical Takeaways for Data Engineers

1. Frame size is a tuning knob. HuskHoard defaults to 16 MB frames, optimized for cloud PUT cost (fewer, larger requests) and Zstd compression ratio. For workloads with very fine-grained access patterns (column-level reads in narrow schemas), smaller frames (1–4 MB) reduce the minimum fetch size at the cost of more catalog rows and higher PUT count. Benchmark against your actual access patterns.

2. TLV statistics are opt-in per file type. For video files you probably don't store column min/max values. For Parquet, CSV, and Arrow IPC files it's worth paying the archiver CPU time to extract and embed statistics at archive time — you pay once and recoup every time a catalog query avoids a volume read.

3. The catalog query is your explain plan. Before a restore or a scan, husk catalog query --path "/warehouse/events/*.parquet" --filter "parquet.col.event_ts.min >= 2024-01-01" shows you which volumes and frame ranges will be touched. Run it first. If the egress estimate is unexpected, the TLV coverage on those files probably needs improving.

4. Multi-volume means cross-volume pruning is free. A query that touches two volumes and skips three is doing volume-level predicate pushdown before any I/O. The catalog does this automatically based on the tape_uuid in each matching row.

5. Egress savings compound with replication. HuskHoard replicates to multiple volumes simultaneously. If your primary volume is on S3 ($0.09/GB egress) and your replica is on Cloudflare R2 ($0.00 egress), the gateway can route the range request to whichever backend minimizes cost. Partial reads from R2 are free. You still benefit from the jump table because API call count and latency still matter.
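Takeaway 1 is easy to put in numbers. For a small read, the worst case is a request that straddles a frame boundary, so two whole frames have to be decompressed no matter how few bytes you want. A sketch of the trade-off between catalog rows and worst-case decompressed bytes (`frame_tradeoff` is a hypothetical helper; sizes count uncompressed bytes, since the compressed cost depends on your data):

```python
MiB = 1024 * 1024
GiB = 1024 * MiB


def ceil_div(a: int, b: int) -> int:
    return -(-a // b)


def frame_tradeoff(file_size: int, frame_size: int, read_size: int) -> tuple[int, int]:
    """Return (jump-table rows, worst-case uncompressed bytes decompressed)."""
    catalog_rows = ceil_div(file_size, frame_size)
    # Worst case: the read begins on the last byte of a frame, so one frame is
    # touched for that byte and the remaining bytes spill into the following ones.
    worst_frames = min(catalog_rows, 1 + ceil_div(read_size - 1, frame_size))
    return catalog_rows, worst_frames * frame_size


for frame_mib in (1, 4, 16):
    rows, worst = frame_tradeoff(4 * GiB, frame_mib * MiB, 1 * MiB)
    print(f"{frame_mib:>2} MB frames: {rows:>4} rows, "
          f"up to {worst // MiB} MB decompressed for a 1 MB read")
```

For a 4 GB file this prints 4096 rows and 2 MB for 1 MB frames, 1024 rows and 8 MB for 4 MB frames, and 256 rows and 32 MB for the 16 MB default.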

Why I Bypassed FUSE: Building a Transparent Data-Tiering Engine in Rust

Joichiro Mitaka — Fri, 19 Jun 2026 05:51:33 +0000

If you run a home lab or manage large datasets, you’ve hit this wall: NVMe drives are fast but too expensive to hoard data on. Hard drives or cloud buckets are cheap, but they are slow and a pain to manage manually.

The enterprise world solves this with HSM (Hierarchical Storage Management), which automatically shuffles cold data to slow storage while keeping a transparent stub on the fast drive. But enterprise HSMs cost thousands of dollars and lock your data in proprietary black boxes.

I wanted this for Linux, for free. So, I started building HuskHoard, an open-source data tiering engine.

My first thought, like almost every Linux developer building a virtual filesystem, was to use FUSE. But I quickly realized FUSE was the wrong tool for the job. Here is why I abandoned it, and how I used the Linux fanotify API and Rust to build a transparent, zero-overhead archiving engine.

The Problem with FUSE

FUSE is fantastic for creating custom filesystems like SSHFS or mounting an S3 bucket. But for an HSM, it creates a massive bottleneck.

When you use FUSE, every single read and write has to take a detour through userspace, with a context switch at each hop:
Application > Kernel > FUSE Daemon (Userspace) > Kernel > Physical Drive.

If 90% of your data is "Hot" (actively being used on your fast NVMe), forcing it through FUSE overhead completely defeats the purpose of buying expensive NVMe drives in the first place. You sacrifice native I/O performance just to manage the 10% of "Cold" data.

I needed a solution where the Hot data ran at native speed, touching nothing but the XFS/Ext4 kernel drivers.

The Solution: Enter fanotify

Instead of intercepting every transaction via FUSE, I realized I only needed to intervene in one specific scenario: when a user tries to open a file that has been archived.

Linux has a kernel API called fanotify, originally designed for antivirus scanners. It allows a userspace program to monitor a mount point and, crucially, block an application from accessing a file until the daemon says it’s okay.

Here is how HuskHoard uses fanotify to create transparent tiering:

The Janitor: A background Rust thread scans my NVMe drive. When it finds a file that hasn’t been touched in 30 days, it compresses it (Zstd) and moves the payload to a cheap HDD, LTO tape, or S3 bucket.
The Husk Stub: It leaves the original file on the NVMe drive but truncates its allocated size to 0 bytes, creating a sparse file. To the OS and the user, the file still looks like it’s 50GB and sits in /home/movies.
The Interceptor: This is where fanotify shines. The HuskHoard daemon listens for FAN_ACCESS_PERM events, which fire when a process asks to read a file. If VLC media player opens that Husk file and tries to read it, fanotify pauses VLC’s execution in the kernel.
The Recall: HuskHoard intercepts the request, streams the 50GB payload from the tape/S3 bucket back into the sparse file on the NVMe, and then tells fanotify to allow VLC to proceed.

VLC thinks it just opened a local file. It has no idea the data was fetched from an S3 bucket 50 milliseconds ago.
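The Husk Stub trick is easy to try on any Linux box. This sketch shrinks a file to zero and then extends it back to its original length, which leaves the apparent size intact while the filesystem stores (almost) nothing. It illustrates the idea only: how HuskHoard itself creates the stub is not shown in this post, `make_stub` is a hypothetical helper, and a real tool would also preserve timestamps and guard against the file being open elsewhere.

```python
import os
import tempfile


def make_stub(path: str) -> int:
    """Replace a file's contents with a hole of the same apparent size."""
    size = os.path.getsize(path)
    os.truncate(path, 0)      # release the data blocks
    os.truncate(path, size)   # restore the length; the content is now one big hole
    return size


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "movie.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(1024 * 1024))  # 1 MB of stand-in "movie" data

    original = make_stub(path)
    st = os.stat(path)
    with open(path, "rb") as f:
        head = f.read(16)  # holes read back as zero bytes

# st_blocks is how much the filesystem actually allocated (usually 0 on ext4/XFS).
print(st.st_size, getattr(st, "st_blocks", None))
```

`ls -l` reports the apparent size while `du` reports the allocated blocks, which is why the stub still "looks like" the full file to the user.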

The Rust Implementation

Rust was the obvious choice for this. When you are blocking kernel-level I/O requests, memory safety and predictable latency are non-negotiable.

Handling the fanotify loop requires specific Linux capabilities (specifically CAP_SYS_ADMIN), but Rust allows us to safely manage the multithreaded heavy lifting of the Archive Worker.

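use std::sync::Arc;
// `info!` and `error!` are assumed to come from a logging crate; `HuskConfig` and
// `mark_directory_recursive` are defined elsewhere in HuskHoard.

pub fn run_interceptor(config: Arc<HuskConfig>, use_direct_io: bool) -> std::io::Result<()> {
    let watch_dir = &config.hot_tier;
    let db_path = &config.db_path;
    info!("\n[Daemon] Starting fanotify interceptor on '{}'...", watch_dir);
    let abs_dir = std::fs::canonicalize(watch_dir)?;

    // Pre-content class: the daemon gets to answer permission events before the
    // file's data is read. This call needs CAP_SYS_ADMIN.
    let fan_fd = unsafe {
        libc::fanotify_init(libc::FAN_CLASS_PRE_CONTENT, libc::O_RDWR as u32)
    };
    if fan_fd < 0 {
        let err = std::io::Error::last_os_error();
        error!(" fanotify_init failed: {}. Missing Root or Capabilities!", err);
        return Err(err);
    }

    // Hold reads until we answer (FAN_ACCESS_PERM), hear about files closed after
    // writing (FAN_CLOSE_WRITE), and include children of marked directories.
    let mark_mask = libc::FAN_ACCESS_PERM | libc::FAN_CLOSE_WRITE | libc::FAN_EVENT_ON_CHILD;

    // 1. Recursively mark the root watch directory and all current subdirectories
    info!("[Daemon]  Scanning and attaching listeners to all subdirectories...");
    mark_directory_recursive(fan_fd, &abs_dir, mark_mask, &config);

    // (excerpt ends here; the event loop is not shown)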

Escaping Vendor Lock-in: The Easy Exit Promise

One of the biggest issues with commercial HSMs is that if the daemon dies, your data is gone, trapped in proprietary metadata.

Because I was building this for the open-source community, I enforced a strict "Easy Exit" architecture:

Payload data is stored in standard Zstd streams verified by BLAKE3.
The catalog metadata (the "Brain" tracking where the cold bytes live) is an SQLite database.
You can natively export the entire catalog to Apache Parquet.

This means if you decide to stop using HuskHoard, you don’t need my software to get your data back. You can query your catalog with DuckDB or Python and manually extract your Zstd archives.

What’s Next?

Building HuskHoard has been a massive deep dive into Linux kernel APIs and SCSI tape drivers. Yes, it natively supports physical LTO drives via /dev/nstX to prevent tape shoeshining.

The engine currently supports automated replication across local drives, tapes, and rclone-supported cloud buckets.

If you are a Rust developer, a home-lab data hoarder, or just interested in Linux storage architecture, I’d love your feedback or code reviews. It is fully AGPL v3 licensed.

Check out the repo here: https://github.com/huskhoard/huskhoard
More architecture details: https://www.huskhoard.com/blog.html