Chapter 53. Project: Top K Word Frequency Analyzer


Overview

The debugging chapter introduced tools for explaining why a program misbehaves (Chapter 52). This project applies similar discipline to build a deterministic text-analytics utility: feed it a log excerpt, collect the most frequent tokens, and emit timing data for each phase. We will combine the tokenization helpers from std.mem, the hashed collections from std, the heap sorter, and the Timer API to produce reproducible rankings with measured costs (see mem.zig, hash_map.zig, sort.zig, and time.zig).

Learning Goals

  • Build an end-to-end I/O pipeline that reads a corpus, normalizes text, and accumulates counts with std.StringHashMap and std.ArrayList (array_list.zig).
  • Rank frequencies deterministically using an explicit comparator that resolves ties without depending on hash map iteration order.
  • Capture per-phase timings with std.time.Timer to validate regressions and communicate performance expectations.

Designing the Pipeline

Our analyzer accepts an optional file path and an optional k parameter (top k tokens); these default to a bundled corpus and 5, respectively. We read the entire file into memory for simplicity, but the normalization and counting loop operates linearly so it can be adapted to stream chunks later. A GeneralPurposeAllocator backs all dynamic structures, and duplicating each token's string only the first time it appears keeps allocations proportional to the vocabulary size rather than the corpus size (see heap.zig and process.zig).

Tokenization happens with std.mem.tokenizeAny, configured with a conservative separator set that trims whitespace, punctuation, and markup characters. Each token is lowercased into a reusable std.ArrayList(u8) before it is inserted into the map; if the token already exists, only its count is incremented, keeping temporary allocations bounded.
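A minimal sketch of that normalization step in isolation; the separator set is shortened here for brevity, while the full analyzer uses the Separators constant shown below:

Zig
const std = @import("std");

// Mirrors the analyzer's normalization: split on a separator set, then
// lowercase each token in a reusable buffer before it would be counted.
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var scratch = try std.ArrayList(u8).initCapacity(allocator, 32);
    defer scratch.deinit(allocator);

    const line = "Latency spikes, latency BUDGETS; latency!";
    var it = std.mem.tokenizeAny(u8, line, " ,;!");
    while (it.next()) |raw| {
        scratch.clearRetainingCapacity();
        try scratch.appendSlice(allocator, raw);
        for (scratch.items) |*byte| byte.* = std.ascii.toLower(byte.*);
        std.debug.print("token: {s}\n", .{scratch.items});
    }
}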

Counting and Ranking

The complete utility demonstrates StringHashMap, ArrayList, sorting, and timing side by side.

Zig
const std = @import("std");

const Separators = " \t\r\n,.;:!?\"'()[]{}<>-/\\|`~*_";

const Entry = struct {
    word: []const u8,
    count: usize,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer if (gpa.deinit() == .leak) @panic("memory leak");
    const allocator = gpa.allocator();

    var stdout_buffer: [256]u8 = undefined;
    var stdout_writer = std.fs.File.stdout().writer(&stdout_buffer);
    const out = &stdout_writer.interface;

    const args = try std.process.argsAlloc(allocator);
    defer std.process.argsFree(allocator, args);

    const path = if (args.len >= 2) args[1] else "chapters-data/code/53__project-top-k-word-frequency-analyzer/sample_corpus.txt";
    const top_k: usize = blk: {
        if (args.len >= 3) {
            const value = std.fmt.parseInt(usize, args[2], 10) catch {
                try out.print("invalid top-k value: {s}\n", .{args[2]});
                return error.InvalidArgument;
            };
            break :blk if (value == 0) 1 else value;
        }
        break :blk 5;
    };

    var timer = try std.time.Timer.start();

    const corpus = try std.fs.cwd().readFileAlloc(allocator, path, 1024 * 1024 * 4);
    defer allocator.free(corpus);
    const read_ns = timer.lap();

    var scratch = try std.ArrayList(u8).initCapacity(allocator, 32);
    defer scratch.deinit(allocator);

    var frequencies = std.StringHashMap(usize).init(allocator);
    defer frequencies.deinit();

    var total_tokens: usize = 0;

    var it = std.mem.tokenizeAny(u8, corpus, Separators);
    while (it.next()) |raw| {
        scratch.clearRetainingCapacity();
        try scratch.appendSlice(allocator, raw);
        const slice = scratch.items;
        for (slice) |*byte| {
            byte.* = std.ascii.toLower(byte.*);
        }
        if (slice.len == 0) continue;

        const gop = try frequencies.getOrPut(slice);
        if (gop.found_existing) {
            gop.value_ptr.* += 1;
        } else {
            const owned = try allocator.dupe(u8, slice);
            gop.key_ptr.* = owned;
            gop.value_ptr.* = 1;
        }
        total_tokens += 1;
    }
    const tokenize_ns = timer.lap();

    var entries = try std.ArrayList(Entry).initCapacity(allocator, frequencies.count());
    defer {
        for (entries.items) |entry| allocator.free(entry.word);
        entries.deinit(allocator);
    }

    var map_it = frequencies.iterator();
    while (map_it.next()) |kv| {
        try entries.append(allocator, .{ .word = kv.key_ptr.*, .count = kv.value_ptr.* });
    }

    const entry_slice = entries.items;
    std.sort.heap(Entry, entry_slice, {}, struct {
        fn lessThan(_: void, a: Entry, b: Entry) bool {
            if (a.count == b.count) return std.mem.lessThan(u8, a.word, b.word);
            return a.count > b.count;
        }
    }.lessThan);
    const sort_ns = timer.lap();

    const unique_words = entries.items.len;
    const limit = if (unique_words < top_k) unique_words else top_k;

    try out.print("source -> {s}\n", .{path});
    try out.print("tokens -> {d}, unique -> {d}\n", .{ total_tokens, unique_words });
    try out.print("top {d} words:\n", .{limit});
    var index: usize = 0;
    while (index < limit) : (index += 1) {
        const entry = entry_slice[index];
        try out.print("  {d:>2}. {s} -> {d}\n", .{ index + 1, entry.word, entry.count });
    }

    try out.print("timings (ns): read={d}, tokenize={d}, sort={d}\n", .{ read_ns, tokenize_ns, sort_ns });
    try out.flush();
}
Run
Shell
$ zig run chapters-data/code/53__project-top-k-word-frequency-analyzer/topk_word_frequency.zig
Output
Shell
source -> chapters-data/code/53__project-top-k-word-frequency-analyzer/sample_corpus.txt
tokens -> 102, unique -> 86
top 5 words:
   1. the -> 6
   2. a -> 3
   3. and -> 3
   4. are -> 2
   5. latency -> 2
timings (ns): read=284745, tokenize=3390822, sort=236725

std.StringHashMap stores the canonical lowercase spellings, and a separate std.ArrayList collects the final (word, count) pairs for sorting. We choose std.sort.heap because it is deterministic for a given comparator, needs no allocator, and performs well on small datasets; the comparator sorts primarily by descending count and secondarily by ascending lexical order so that ties always resolve the same way. This matters when comparing analyses across runs or machines: the resulting reports can be diffed without surprises.
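As a quick check of that comparator contract, the sketch below (runnable with zig test) sorts a hand-written slice and asserts that the tie between "a" and "and" resolves lexically; byCountThenWord is our name for the same comparator logic, not an identifier from the analyzer:

Zig
const std = @import("std");

const Entry = struct { word: []const u8, count: usize };

// Same contract as the analyzer's comparator: descending count first,
// then ascending lexical order so equal counts always land in one order.
fn byCountThenWord(_: void, a: Entry, b: Entry) bool {
    if (a.count == b.count) return std.mem.lessThan(u8, a.word, b.word);
    return a.count > b.count;
}

test "ties resolve lexically" {
    var entries = [_]Entry{
        .{ .word = "and", .count = 3 },
        .{ .word = "a", .count = 3 },
        .{ .word = "the", .count = 6 },
    };
    std.sort.heap(Entry, &entries, {}, byCountThenWord);
    try std.testing.expectEqualStrings("the", entries[0].word);
    try std.testing.expectEqualStrings("a", entries[1].word);
    try std.testing.expectEqualStrings("and", entries[2].word);
}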

Timing and Reproducibility

A single Timer instance measures three phases: file ingestion, tokenization, and sorting. We call lap() after each phase to reset the zero point while recording elapsed nanoseconds, making it easy to spot which step dominates. Because the analyzer normalizes case and uses deterministic sorting, the output for a given corpus remains identical across runs, allowing timing deltas to be attributed to hardware or toolchain changes rather than nondeterministic ordering.
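A stripped-down sketch of the same pattern, with busy loops standing in for the real phases, shows how lap() carves a single Timer into per-phase measurements:

Zig
const std = @import("std");

pub fn main() !void {
    var timer = try std.time.Timer.start();

    // Phase 1: stand-in work for "read".
    var acc: usize = 0;
    for (0..1_000_000) |i| acc +%= i;
    const phase1_ns = timer.lap(); // nanoseconds since start; resets the zero point

    // Phase 2: stand-in work for "tokenize".
    for (0..500_000) |i| acc +%= i * 2;
    const phase2_ns = timer.lap(); // nanoseconds since the previous lap

    std.debug.print("phase1={d}ns phase2={d}ns (checksum {d})\n", .{ phase1_ns, phase2_ns, acc });
}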

For regressions, rerun with a larger k or a different corpus:

Shell
$ zig run chapters-data/code/53__project-top-k-word-frequency-analyzer/topk_word_frequency.zig -- chapters-data/code/53__project-top-k-word-frequency-analyzer/sample_corpus.txt 10

The optional arguments keep the binary scriptable: drop it into CI, compare output artifacts, and alert when the timing budget shifts by more than a threshold. When integrating into bigger systems, the map-building loop can be swapped to stream from stdin or a TCP socket while preserving the same deterministic ranking rules (see File.zig).
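One way to prepare for that is to factor the counting loop into a chunk-oriented helper. The sketch below reuses the exact calls from the listing; countChunk is a hypothetical name, and it assumes the caller splits the stream so tokens never straddle chunk boundaries (for example, by feeding whole lines):

Zig
const std = @import("std");

const Separators = " \t\r\n,.;:!?\"'()[]{}<>-/\\|`~*_";

// Hypothetical chunk-oriented variant of the analyzer's counting loop.
fn countChunk(
    allocator: std.mem.Allocator,
    frequencies: *std.StringHashMap(usize),
    scratch: *std.ArrayList(u8),
    chunk: []const u8,
) !usize {
    var tokens: usize = 0;
    var it = std.mem.tokenizeAny(u8, chunk, Separators);
    while (it.next()) |raw| {
        scratch.clearRetainingCapacity();
        try scratch.appendSlice(allocator, raw);
        for (scratch.items) |*byte| byte.* = std.ascii.toLower(byte.*);

        const gop = try frequencies.getOrPut(scratch.items);
        if (gop.found_existing) {
            gop.value_ptr.* += 1;
        } else {
            gop.key_ptr.* = try allocator.dupe(u8, scratch.items);
            gop.value_ptr.* = 1;
        }
        tokens += 1;
    }
    return tokens;
}

test "chunks accumulate into one histogram" {
    const allocator = std.testing.allocator;
    var frequencies = std.StringHashMap(usize).init(allocator);
    defer {
        var key_it = frequencies.keyIterator();
        while (key_it.next()) |key| allocator.free(key.*);
        frequencies.deinit();
    }
    var scratch = try std.ArrayList(u8).initCapacity(allocator, 16);
    defer scratch.deinit(allocator);

    _ = try countChunk(allocator, &frequencies, &scratch, "The log, the log");
    _ = try countChunk(allocator, &frequencies, &scratch, "log rotation");

    try std.testing.expectEqual(@as(usize, 3), frequencies.get("log").?);
    try std.testing.expectEqual(@as(usize, 2), frequencies.get("the").?);
}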

Notes & Caveats

  • StringHashMap does not free stored keys automatically; this sample releases them explicitly before dropping the map to keep the general-purpose allocator leak checker happy.
  • The tokenizer is ASCII-focused. For full Unicode segmentation, pair the pipeline with std.unicode.Utf8View and its iterator, or integrate ICU bindings (Chapter 45).
  • Reading the entire corpus into memory simplifies the tutorial but may not suit multi-gigabyte logs. Swap readFileAlloc for chunked readAll loops or memory-mapped files when scaling (Chapter 28).

Exercises

  • Emit the report as JSON by serializing the sorted slice, then compare diff friendliness with the textual version (Chapter 32; see json.zig).
  • Replace the single-threaded analyzer with a two-phase pipeline: shard tokens across threads, then merge hash maps before sorting. Measure the benefit using Timer and summarize the scaling (Chapter 29).
  • Add a --stopwords option that loads a newline-delimited ignore list, removes those tokens before counting, and reports how many candidates were filtered out (Chapter 36).

Caveats, Alternatives, Edge Cases

  • For streaming environments, consider std.PriorityQueue (priority_queue.zig) to maintain the top k incrementally instead of recording the entire histogram before sorting; see the sketch after this list.
  • If performance requirements outgrow heap sort, experiment with std.sort.pdq or bucket-based approaches while keeping the deterministic comparator contract intact.
  • To support multi-locale text, layer in normalization (NFC/NFKC) and use Unicode-aware casing helpers; the comparator may need locale-specific collation to keep results intuitive (Chapter 45).
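For the priority-queue idea in the first bullet, here is a minimal sketch that keeps only the top k entries while counts stream in. It assumes the std.PriorityQueue API of recent releases (allocator passed to init, a comparator returning std.math.Order) and is not part of the analyzer above:

Zig
const std = @import("std");

const Entry = struct { word: []const u8, count: usize };

// Min-heap on count: peek()/remove() yield the smallest count, which is the
// candidate to evict once the queue holds more than k entries.
fn byCount(_: void, a: Entry, b: Entry) std.math.Order {
    return std.math.order(a.count, b.count);
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const k = 3;
    var heap = std.PriorityQueue(Entry, void, byCount).init(allocator, {});
    defer heap.deinit();

    const samples = [_]Entry{
        .{ .word = "the", .count = 6 },     .{ .word = "a", .count = 3 },
        .{ .word = "and", .count = 3 },     .{ .word = "latency", .count = 2 },
        .{ .word = "are", .count = 2 },
    };

    for (samples) |entry| {
        try heap.add(entry);
        if (heap.count() > k) _ = heap.remove(); // evict the current minimum
    }

    // Drains in ascending count order; a production version would also
    // compare words on ties to keep the output deterministic.
    while (heap.removeOrNull()) |entry| {
        std.debug.print("{s} -> {d}\n", .{ entry.word, entry.count });
    }
}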
