Chapter 54: Project Zip/Unzip With Progress

Overview

The previous project focused on deterministic text analytics; now we bundle those artifacts and the surrounding diagnostics into a reproducible archive pipeline. We will write a minimalist ZIP creator that streams files into memory, emits the central directory, then verifies extraction while reporting incremental progress. The program leans on the standard library’s ZIP reader, manual header encoding, StringHashMap bookkeeping for CRC32 checks, and structured status updates through std.Progress.

Learning Goals

  • Assemble a ZIP archive from scratch by writing Local File Headers, the Central Directory, and the End of Central Directory record in the correct order while honoring size and offset constraints.
  • Capture deterministic integrity metrics (CRC32, SHA-256) alongside the bundle so continuous integration can validate both structure and content on every run.
  • Surface analyst-friendly progress messages that stay scriptable by disabling the animated renderer and emitting plain-text checkpoints with std.Progress.

Designing the pipeline

The workflow runs in three phases: seed sample files, build an archive, and extract plus verify. Each phase increments the root progress node, producing deterministic console summaries that double as acceptance criteria. All filesystem operations occur under a temporary directory managed by std.testing.tmpDir, keeping the real workspace clean.

For archival metadata, we reuse the same relative paths when writing headers and when later validating the extracted files. Storing the CRC32 and byte count per path inside a StringHashMap gives us a straightforward way to diff expectations against actual outputs after extraction.
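
As a minimal sketch of that diff, reusing the Crc32 helper that appears in the full listing later (expectMatch is an illustrative name, not part of the final program):

Zig
const std = @import("std");

const Metrics = struct { crc32: u32, size: usize };

// Compare one extracted file's bytes against the expectation recorded at build time.
fn expectMatch(map: *const std.StringHashMap(Metrics), path: []const u8, data: []const u8) !void {
    const expected = map.get(path) orelse return error.ExpectedEntryMissing;
    var crc = std.hash.crc.Crc32.init();
    crc.update(data);
    if (crc.final() != expected.crc32 or data.len != expected.size)
        return error.VerificationFailed;
}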

Archive assembly

Because Zig 0.15.2 ships a ZIP reader but not a writer, we build the archive in memory using an ArrayList(u8), appending each component in sequence: Local File Header, filename, file bytes. Every header field is written with explicit little-endian helpers so the result is portable across architectures. Once the payloads land in the blob, we append the Central Directory (one record per file) followed by the End of Central Directory record, mirroring the structures defined in the PKWARE APPNOTE and encoded in std.zip.
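
For orientation, the record types and their fixed four-byte signatures, in the order the builder emits them; the signature bytes come from the ZIP format itself, and the constants here are purely illustrative:

Zig
// On-disk layout of the classic ZIP subset produced in this chapter:
//
//   [local file header "PK\x03\x04" + name + data]   once per file
//   [central directory "PK\x01\x02" + name]          once per file
//   [end of central directory "PK\x05\x06"]          exactly once
const local_sig = [4]u8{ 'P', 'K', 3, 4 };
const central_sig = [4]u8{ 'P', 'K', 1, 2 };
const end_sig = [4]u8{ 'P', 'K', 5, 6 };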

While writing headers we ensure sizes and offsets fit in 32-bit fields (sticking to the classic ZIP subset) and duplicate the filename once into the map so we can free resources deterministically later. After the archive image is complete, we persist it to disk and compute a SHA-256 digest for downstream regressions—the digest is rendered with std.fmt.bytesToHex so it can be compared inline without any extra tooling.
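
As a standalone illustration of that digest-to-hex step, a minimal test mirroring the calls used in the full listing:

Zig
const std = @import("std");

test "render a SHA-256 digest as lowercase hex" {
    var digest: [32]u8 = undefined;
    std.crypto.hash.sha2.Sha256.hash("telemetry", &digest, .{});
    const hex = std.fmt.bytesToHex(digest, .lower);
    // 32 digest bytes become 64 hex characters, comparable inline.
    try std.testing.expectEqual(@as(usize, 64), hex.len);
}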

Extraction and verification

Extraction reuses the standard library iterator, which walks through each Central Directory record and hands the data stream to std.zip.Entry.extract; we normalize the root folder name through std.zip.Diagnostics so we can surface it to the caller. After each file lands on disk, we compute CRC32 again and compare the byte count against the recorded expectation. Any mismatch fails the program immediately, making it safe to embed in CI pipelines or deployment hooks.
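
Condensed from the full listing below into a helper (extractAll is an illustrative wrapper name; the buffer size matches the one used later):

Zig
const std = @import("std");

// Unpack archive_file into dest_dir; diagnostics records the common
// root directory name while extracting.
fn extractAll(allocator: std.mem.Allocator, archive_file: std.fs.File, dest_dir: std.fs.Dir) !void {
    var diagnostics = std.zip.Diagnostics{ .allocator = allocator };
    defer diagnostics.deinit();

    var read_buf: [4096]u8 = undefined;
    var reader = archive_file.reader(&read_buf);
    try std.zip.extract(dest_dir, &reader, .{ .diagnostics = &diagnostics });
}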

std.Progress nodes drive the console output: the root node tracks the three high-level stages, while child nodes count through the file list during seeding, building, and verification. Because printing is disabled, the final messages are ordinary text lines (rendered via a buffered stdout writer) that can be diffed verbatim in automated tests.
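
The same pattern in isolation, with printing disabled so the node tree only does bookkeeping:

Zig
const std = @import("std");

pub fn main() void {
    // The root node tracks stages; a child node counts items within one stage.
    var root = std.Progress.start(.{
        .root_name = "demo",
        .estimated_total_items = 1,
        .disable_printing = true,
    });
    defer root.end();

    var stage = root.start("copy", 10);
    var i: usize = 0;
    while (i < 10) : (i += 1) stage.completeOne();
    stage.end();
}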

End-to-end implementation

Zig
const std = @import("std");

const SampleFile = struct {
    path: []const u8,
    contents: []const u8,
};

const sample_files = [_]SampleFile{
    .{ .path = "input/metrics.txt", .contents = "uptime=420s\nrequests=1312\nerrors=3\n" },
    .{ .path = "input/inventory.json", .contents = "{\n  \"service\": \"telemetry\",\n  \"shards\": [\"alpha\", \"beta\", \"gamma\"]\n}\n" },
    .{ .path = "input/logs/app.log", .contents = "[info] ingest started\n[warn] queue delay=87ms\n[info] ingest completed\n" },
    .{ .path = "input/README.md", .contents = "# Telemetry bundle\n\nSynthetic records used for the zip/unzip progress demo.\n" },
};

const EntryMetrics = struct {
    crc32: u32,
    size: usize,
};

const BuildSummary = struct {
    bytes_written: usize,
    sha256: [32]u8,
};

const VerifySummary = struct {
    files_checked: usize,
    total_bytes: usize,
    extracted_root: []const u8,
    owns_root: bool,
};

const archive_path = "artifact/telemetry.zip";
const extract_root = "replay";

// Write the fixture files into the temp workspace, creating parent directories as needed.
fn seedSamples(dir: std.fs.Dir, progress: *std.Progress.Node) !struct { files: usize, bytes: usize } {
    var total_bytes: usize = 0;
    for (sample_files) |sample| {
        if (std.fs.path.dirname(sample.path)) |parent| {
            try dir.makePath(parent);
        }
        var file = try dir.createFile(sample.path, .{ .truncate = true });
        defer file.close();
        try file.writeAll(sample.contents);
        total_bytes += sample.contents.len;
        progress.completeOne();
    }
    return .{ .files = sample_files.len, .bytes = total_bytes };
}

const EntryRecord = struct {
    name: []const u8,
    crc32: u32,
    size: u32,
    offset: u32,
};

// Encode the 30-byte Local File Header ("PK\x03\x04"): version 2.0, no
// flags, stored (uncompressed) payload, zeroed DOS timestamps.
fn makeLocalHeader(name_len: u16, crc32: u32, size: u32) [30]u8 {
    var header: [30]u8 = undefined;
    header[0] = 'P';
    header[1] = 'K';
    header[2] = 3;
    header[3] = 4;
    std.mem.writeInt(u16, header[4..6], 20, .little);
    std.mem.writeInt(u16, header[6..8], 0, .little);
    std.mem.writeInt(u16, header[8..10], 0, .little);
    std.mem.writeInt(u16, header[10..12], 0, .little);
    std.mem.writeInt(u16, header[12..14], 0, .little);
    std.mem.writeInt(u32, header[14..18], crc32, .little);
    std.mem.writeInt(u32, header[18..22], size, .little);
    std.mem.writeInt(u32, header[22..26], size, .little);
    std.mem.writeInt(u16, header[26..28], name_len, .little);
    std.mem.writeInt(u16, header[28..30], 0, .little);
    return header;
}

// Encode the 46-byte Central Directory record ("PK\x01\x02"), mirroring
// the local header fields and adding Unix permissions plus the entry's
// local-header offset.
fn makeCentralHeader(entry: EntryRecord) [46]u8 {
    var header: [46]u8 = undefined;
    header[0] = 'P';
    header[1] = 'K';
    header[2] = 1;
    header[3] = 2;
    std.mem.writeInt(u16, header[4..6], 0x0314, .little);
    std.mem.writeInt(u16, header[6..8], 20, .little);
    std.mem.writeInt(u16, header[8..10], 0, .little);
    std.mem.writeInt(u16, header[10..12], 0, .little);
    std.mem.writeInt(u16, header[12..14], 0, .little);
    std.mem.writeInt(u16, header[14..16], 0, .little);
    std.mem.writeInt(u32, header[16..20], entry.crc32, .little);
    std.mem.writeInt(u32, header[20..24], entry.size, .little);
    std.mem.writeInt(u32, header[24..28], entry.size, .little);
    const name_len_u16 = @as(u16, @intCast(entry.name.len));
    std.mem.writeInt(u16, header[28..30], name_len_u16, .little);
    std.mem.writeInt(u16, header[30..32], 0, .little);
    std.mem.writeInt(u16, header[32..34], 0, .little);
    std.mem.writeInt(u16, header[34..36], 0, .little);
    std.mem.writeInt(u16, header[36..38], 0, .little);
    const unix_mode: u32 = 0o100644 << 16;
    std.mem.writeInt(u32, header[38..42], unix_mode, .little);
    std.mem.writeInt(u32, header[42..46], entry.offset, .little);
    return header;
}

// Encode the 22-byte End of Central Directory record ("PK\x05\x06") for
// a single-disk archive with an empty comment.
fn makeEndRecord(cd_size: u32, cd_offset: u32, entry_count: u16) [22]u8 {
    var footer: [22]u8 = undefined;
    footer[0] = 'P';
    footer[1] = 'K';
    footer[2] = 5;
    footer[3] = 6;
    std.mem.writeInt(u16, footer[4..6], 0, .little);
    std.mem.writeInt(u16, footer[6..8], 0, .little);
    std.mem.writeInt(u16, footer[8..10], entry_count, .little);
    std.mem.writeInt(u16, footer[10..12], entry_count, .little);
    std.mem.writeInt(u32, footer[12..16], cd_size, .little);
    std.mem.writeInt(u32, footer[16..20], cd_offset, .little);
    std.mem.writeInt(u16, footer[20..22], 0, .little);
    return footer;
}

// Assemble the archive image in memory, persist it to disk, and return
// its byte count plus a SHA-256 digest of the final blob.
fn buildArchive(
    allocator: std.mem.Allocator,
    dir: std.fs.Dir,
    metrics: *std.StringHashMap(EntryMetrics),
    progress: *std.Progress.Node,
) !BuildSummary {
    if (std.fs.path.dirname(archive_path)) |parent| {
        try dir.makePath(parent);
    }
    var entries = try std.ArrayList(EntryRecord).initCapacity(allocator, sample_files.len);
    defer entries.deinit(allocator);

    try metrics.ensureTotalCapacity(sample_files.len);

    var blob: std.ArrayList(u8) = .empty;
    defer blob.deinit(allocator);

    for (sample_files) |sample| {
        if (sample.path.len > std.math.maxInt(u16)) return error.NameTooLong;

        var file = try dir.openFile(sample.path, .{});
        defer file.close();

        const max_len = 64 * 1024;
        const data = try file.readToEndAlloc(allocator, max_len);
        defer allocator.free(data);

        if (data.len > std.math.maxInt(u32)) return error.InputTooLarge;
        if (blob.items.len > std.math.maxInt(u32)) return error.ArchiveTooLarge;

        var crc = std.hash.crc.Crc32.init();
        crc.update(data);
        const digest = crc.final();

        const offset_u32 = @as(u32, @intCast(blob.items.len));
        const size_u32 = @as(u32, @intCast(data.len));
        const name_len_u16 = @as(u16, @intCast(sample.path.len));

        const header = makeLocalHeader(name_len_u16, digest, size_u32);
        try blob.appendSlice(allocator, header[0..]);
        try blob.appendSlice(allocator, sample.path);
        try blob.appendSlice(allocator, data);

        try entries.append(allocator, .{
            .name = sample.path,
            .crc32 = digest,
            .size = size_u32,
            .offset = offset_u32,
        });

        const gop = try metrics.getOrPut(sample.path);
        if (!gop.found_existing) {
            gop.key_ptr.* = try allocator.dupe(u8, sample.path);
        }
        gop.value_ptr.* = .{ .crc32 = digest, .size = data.len };

        progress.completeOne();
    }

    const central_offset_usize = blob.items.len;
    if (central_offset_usize > std.math.maxInt(u32)) return error.ArchiveTooLarge;
    const central_offset = @as(u32, @intCast(central_offset_usize));

    for (entries.items) |entry| {
        const header = makeCentralHeader(entry);
        try blob.appendSlice(allocator, header[0..]);
        try blob.appendSlice(allocator, entry.name);
    }

    const central_size = @as(u32, @intCast(blob.items.len - central_offset_usize));
    const footer = makeEndRecord(central_size, central_offset, @as(u16, @intCast(entries.items.len)));
    try blob.appendSlice(allocator, footer[0..]);

    var zip_file = try dir.createFile(archive_path, .{ .truncate = true, .read = true });
    defer zip_file.close();
    try zip_file.writeAll(blob.items);

    var sha256 = std.crypto.hash.sha2.Sha256.init(.{});
    sha256.update(blob.items);
    var digest_bytes: [32]u8 = undefined;
    sha256.final(&digest_bytes);

    return .{ .bytes_written = blob.items.len, .sha256 = digest_bytes };
}

// Re-extract the archive under extract_root, then re-hash every file
// against the metrics captured during the build.
fn extractAndVerify(
    allocator: std.mem.Allocator,
    dir: std.fs.Dir,
    metrics: *const std.StringHashMap(EntryMetrics),
    progress: *std.Progress.Node,
) !VerifySummary {
    try dir.makePath(extract_root);
    var dest_dir = try dir.openDir(extract_root, .{ .access_sub_paths = true, .iterate = true });
    defer dest_dir.close();

    var file = try dir.openFile(archive_path, .{});
    defer file.close();

    var read_buf: [4096]u8 = undefined;
    var reader = file.reader(&read_buf);

    var diagnostics = std.zip.Diagnostics{ .allocator = allocator };
    defer diagnostics.deinit();

    try std.zip.extract(dest_dir, &reader, .{ .diagnostics = &diagnostics });

    var files_checked: usize = 0;
    var total_bytes: usize = 0;

    for (sample_files) |sample| {
        var out_file = try dest_dir.openFile(sample.path, .{});
        defer out_file.close();
        const data = try out_file.readToEndAlloc(allocator, 64 * 1024);
        defer allocator.free(data);

        const expected = metrics.get(sample.path) orelse return error.ExpectedEntryMissing;
        var crc = std.hash.crc.Crc32.init();
        crc.update(data);
        if (crc.final() != expected.crc32 or data.len != expected.size) {
            return error.VerificationFailed;
        }
        files_checked += 1;
        total_bytes += data.len;
        progress.completeOne();
    }

    var result_root: []const u8 = "<scattered>";
    var owns_root = false;
    if (diagnostics.root_dir.len > 0) {
        result_root = try allocator.dupe(u8, diagnostics.root_dir);
        owns_root = true;
    }
    return .{
        .files_checked = files_checked,
        .total_bytes = total_bytes,
        .extracted_root = result_root,
        .owns_root = owns_root,
    };
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer {
        const leak_status = gpa.deinit();
        std.debug.assert(leak_status == .ok);
    }
    const allocator = gpa.allocator();

    var stdout_buffer: [512]u8 = undefined;
    var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
    const out = &stdout_writer.interface;

    var tmp = std.testing.tmpDir(.{});
    defer tmp.cleanup();

    var metrics = std.StringHashMap(EntryMetrics).init(allocator);
    defer {
        var it = metrics.iterator();
        while (it.next()) |kv| {
            allocator.free(kv.key_ptr.*);
        }
        metrics.deinit();
    }

    var progress_root = std.Progress.start(.{
        .root_name = "zip-pipeline",
        .estimated_total_items = 3,
        .disable_printing = true,
    });
    defer progress_root.end();

    var stage_seed = progress_root.start("seed", sample_files.len);
    const seeded = try seedSamples(tmp.dir, &stage_seed);
    stage_seed.end();
    try out.print("[1/3] seeded samples -> files={d}, bytes={d}\n", .{ seeded.files, seeded.bytes });

    var stage_build = progress_root.start("build", sample_files.len);
    const build_summary = try buildArchive(allocator, tmp.dir, &metrics, &stage_build);
    stage_build.end();

    const hex_digest = std.fmt.bytesToHex(build_summary.sha256, .lower);
    try out.print("[2/3] built archive -> bytes={d}\n    sha256={s}\n", .{ build_summary.bytes_written, hex_digest[0..] });

    var stage_verify = progress_root.start("verify", sample_files.len);
    const verify_summary = try extractAndVerify(allocator, tmp.dir, &metrics, &stage_verify);
    stage_verify.end();
    defer if (verify_summary.owns_root) allocator.free(verify_summary.extracted_root);
    try out.print(
        "[3/3] extracted + verified -> files={d}, bytes={d}, root={s}\n",
        .{ verify_summary.files_checked, verify_summary.total_bytes, verify_summary.extracted_root },
    );

    try out.flush();
}
Run
Shell
$ zig run zip_progress_pipeline.zig
Output
Shell
[1/3] seeded samples -> files=4, bytes=250
[2/3] built archive -> bytes=716
    sha256=4a13a3dc1e6ef90c252b0cc797ff14456aa28c670cafbc9d27a025b0079b05d5
[3/3] extracted + verified -> files=4, bytes=250, root=input

The verification step intentionally duplicates the extracted root string when diagnostics discover a common prefix; the summary frees that buffer afterward to keep the general-purpose allocator clean. This mirrors good hygiene for CLI utilities that stream large archives through temporary directories.

Notes & Caveats

  • The writer sticks to the classic (non-Zip64) subset; once files exceed 4 GiB you must upgrade the headers and extra fields, or delegate to a dedicated ZIP library.
  • Progress nodes are nested but printing is disabled; if you want live TTY updates, drop .disable_printing = true and let the renderer clear frames. Remember that doing so sacrifices determinism in captured logs.
  • CRC32 confirms integrity but not authenticity. Combine the SHA-256 digest with a signature or attach the archive to a zig build step for reproducible deployment pipelines; a digest-pinning sketch follows this list.
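
One way to act on that last point is to pin the digest in a regression check. A minimal sketch; the constant is the digest from the sample run above and must be regenerated whenever the sample inputs change, and checkDigest is an illustrative name:

Zig
const std = @import("std");

// Sketch: pin the archive digest for CI. The constant comes from the
// run above and changes whenever the sample inputs change.
const expected_sha256 = "4a13a3dc1e6ef90c252b0cc797ff14456aa28c670cafbc9d27a025b0079b05d5";

fn checkDigest(hex_digest: []const u8) !void {
    try std.testing.expectEqualStrings(expected_sha256, hex_digest);
}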

Exercises

  • Extend the builder to emit Zip64 records when any file crosses the 4 GiB boundary. Keep the legacy path for small bundles and write regression tests that validate both.
  • Replace the in-memory blob with a streaming writer that flushes to disk in chunks; compare throughput and memory consumption under perf or zig build test with large synthetic files.
  • Add a command-line flag that accepts an ignore list (glob patterns) before archiving, then report the exact number of skipped files alongside the existing totals.

Caveats, alternatives, edge cases

  • Streaming archives straight to stdout is great for pipelines but makes verification trickier; consider writing to a temporary file first so you can re-open it for checksums before shipping it onward.
  • ZIP encryption is intentionally out of scope. If you need confidentiality, wrap the resulting file with std.crypto primitives or switch to formats like encrypted tarballs with age or minisign.
  • For multi-gigabyte corpora, read inputs in chunks and update CRC32 incrementally rather than calling readToEndAlloc; otherwise the temporary allocator will balloon. A sketch of the incremental approach follows this list.
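
A minimal sketch of that chunked approach, assuming a plain std.fs.File handle and an arbitrary 64 KiB buffer:

Zig
const std = @import("std");

// Stream a file through CRC32 in fixed-size chunks instead of loading it whole.
fn crc32OfFile(file: std.fs.File) !u32 {
    var crc = std.hash.crc.Crc32.init();
    var buf: [64 * 1024]u8 = undefined;
    while (true) {
        const n = try file.read(&buf);
        if (n == 0) break;
        crc.update(buf[0..n]);
    }
    return crc.final();
}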
