Chapter 35: GPU Compute in Zig

Project

Overview

Chapter 34 outlined the GPU execution model, address spaces, and dispatch planning; now we build an end-to-end workload that starts with Zig source and ends with a validated binary dump ready for submission to Vulkan or WebGPU queue families. The project stitches together three pieces: a SPIR-V kernel authored in pure Zig, a host-side orchestration CLI with a CPU fallback, and a diff utility for comparing captured GPU buffers against expected results.

Learning Goals

  • Translate a Zig compute kernel into SPIR-V with the self-hosted backend and understand the resource layouts it expects.
  • Coordinate buffers, dispatch geometry, and validation paths from a host application that can run with or without GPU access.
  • Build lightweight diagnostics that evaluate GPU output against a deterministic CPU reference.

Refs: 34__gpu-fundamentals.xml, Random.zig

Building the Compute Pipeline

Our workload squares elements of a vector. The host creates submission metadata and data buffers, the kernel squares each lane, and the diff tool verifies device captures. The static lane capacity mirrors the GPU storage-buffer layout, while the host enforces logical bounds so the kernel can bail out when extra threads are scheduled.

Topology and Data Flow

The dispatch is intentionally modest (1000 elements in blocks of 64 threads), so you can focus on correctness rather than occupancy tuning. The host injects random floating-point values, records a checksum for observability, and emits a binary blob that downstream tooling—or a real GPU driver—can reuse. Because storage buffers operate on raw bytes, we pair every pointer parameter with an extern struct facade to guarantee layout parity with descriptor tables.
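
The arithmetic is worth pinning down once. The following test is a minimal sketch of that geometry (the full computeDispatch helper appears in the host program below):

Zig
const std = @import("std");

test "dispatch math: 1000 elements in groups of 64" {
    const groups = std.math.divCeil(u32, 1000, 64) catch unreachable;
    try std.testing.expectEqual(@as(u32, 16), groups); // ceil(1000 / 64)
    try std.testing.expectEqual(@as(u32, 1024), groups * 64); // padded invocations
    try std.testing.expectEqual(@as(u32, 24), groups * 64 - 1000); // idle tail lanes
}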

Authoring the SPIR-V Kernel

The kernel receives three storage buffers: a submission header describing the logical length, an input vector, and an output vector. Each invocation reads one lane, squares it, and writes the result back if it falls within bounds. Defensive checks prevent stray workgroups from touching memory past the logical length, a common hazard when the host pads the dispatch to match wavefront granularity.

Zig
//! SPIR-V Kernel: element-wise vector squaring
//!
//! This kernel expects three storage buffers: an input vector (`in_values`), an
//! output vector (`out_values`), and a descriptor struct `Submission` that
//! communicates the logical element count. Each invocation squares one element
//! and writes the result back to `out_values`.

const builtin = @import("builtin");

/// Maximum number of elements the kernel will touch.
pub const lane_capacity: u32 = 1024;

/// Submission header shared between the host and the kernel.
///
/// The `extern` layout ensures the struct matches bindings created by Vulkan or
/// WebGPU descriptor tables.
const Submission = extern struct {
    /// Logical element count requested by the host.
    len: u32,
    _padding: u32 = 0,
};

/// Storage buffer layout expected by the kernel.
const VectorPayload = extern struct {
    values: [lane_capacity]f32,
};

/// Squares each element of `in_values` and writes the result to `out_values`.
///
/// The kernel is written defensively: it checks both the logical length passed
/// by the host and the static `lane_capacity` to avoid out-of-bounds writes when
/// dispatched with more threads than necessary.
pub export fn squareVector(
    submission: *addrspace(.storage_buffer) const Submission,
    in_values: *addrspace(.storage_buffer) const VectorPayload,
    out_values: *addrspace(.storage_buffer) VectorPayload,
) callconv(.kernel) void {
    const group_index = @workGroupId(0);
    const group_width = @workGroupSize(0);
    const local_index = @workItemId(0);
    const linear = group_index * group_width + local_index;

    const logical_len = submission.len;
    if (linear >= logical_len or linear >= lane_capacity) return;

    const value = in_values.values[linear];
    out_values.values[linear] = value * value;
}

// Guard compilation so this file is only compiled when targeting SPIR-V.
comptime {
    switch (builtin.target.cpu.arch) {
        .spirv32, .spirv64 => {},
        else => @compileError("squareVector must be compiled with a SPIR-V target, e.g. -target spirv32-vulkan-none"),
    }
}
Run
Shell
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
    -femit-bin=kernels/vector_square.spv \
    01_vector_square_kernel.zig
Output
Shell
no output (binary module generated)

Delete kernels/vector_square.spv when you finish experimenting so repeated runs always rebuild the shader from source, for example:
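
Shell
$ rm kernels/vector_square.spv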

Host Orchestration and CPU Fallback

The host CLI plans the dispatch, seeds deterministic input, runs a CPU fallback, and—when requested—writes a reference dump to out/reference.bin. It also validates the SPIR-V header (0x07230203) so broken builds surface immediately instead of failing deep inside a graphics API. Optional hooks let you drop in a captured GPU buffer (out/gpu_result.bin) for post-run comparison.

Zig
// Project host pipeline for the vector-square kernel.
//
// This program demonstrates the CPU orchestration that pairs with the
// `squareVector` SPIR-V kernel. It prepares input data, plans a dispatch,
// validates the compiled shader module, and runs a CPU fallback that mirrors the
// GPU algorithm. When requested via `--emit-binary`, it also writes the CPU
// output to `out/reference.bin` so external GPU runs can be compared bit-for-bit.

const std = @import("std");

/// Must match `lane_capacity` in 01_vector_square_kernel.zig.
const lane_capacity: u32 = 1024;
const default_problem_len: u32 = 1000;
const workgroup_size: u32 = 64;
const spirv_path = "kernels/vector_square.spv";
const gpu_dump_path = "out/gpu_result.bin";
const cpu_dump_path = "out/reference.bin";

/// Encapsulates the GPU workgroup dispatch geometry, accounting for padding
/// when the total workload doesn't evenly divide into workgroup boundaries.
const DispatchPlan = struct {
    workgroup_size: u32,
    group_count: u32,
    /// Total invocations including padding to fill complete workgroups
    padded_invocations: u32,
    /// Number of unused lanes in the final workgroup
    tail: u32,
};

/// Tracks a validated SPIR-V module alongside its filesystem path for diagnostics.
const ModuleInfo = struct {
    path: []const u8,
    bytes: []u8,
};

pub fn main() !void {
    // Initialize allocator with leak detection for development builds
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer switch (gpa.deinit()) {
        .ok => {},
        .leak => std.log.err("general-purpose allocator detected a leak", .{}),
    };
    const allocator = gpa.allocator();

    // Parse command-line arguments for optional flags
    var args = try std.process.argsWithAllocator(allocator);
    defer args.deinit();
    _ = args.next(); // skip program name

    var emit_binary = false;
    var logical_len: u32 = default_problem_len;

    while (args.next()) |arg| {
        if (std.mem.eql(u8, arg, "--emit-binary")) {
            emit_binary = true;
        } else if (std.mem.eql(u8, arg, "--length")) {
            const value = args.next() orelse return error.MissingLengthValue;
            logical_len = try std.fmt.parseInt(u32, value, 10);
        } else {
            return error.UnknownFlag;
        }
    }

    // Clamp user-provided length to prevent buffer overruns in the kernel
    if (logical_len == 0 or logical_len > lane_capacity) {
        std.log.warn("clamping problem length to lane capacity ({d})", .{lane_capacity});
        logical_len = @min(lane_capacity, logical_len);
        if (logical_len == 0) logical_len = @min(lane_capacity, default_problem_len);
    }

    // Calculate how many workgroups we need to process this many elements
    const plan = computeDispatch(logical_len, workgroup_size);
    std.debug.print(
        "launch plan: {d} groups × {d} lanes => {d} invocations (tail {d})\n",
        .{ plan.group_count, plan.workgroup_size, plan.padded_invocations, plan.tail },
    );

    // Use deterministic PRNG for reproducible test runs across environments
    var prng = std.Random.DefaultPrng.init(0xBEEFFACE);
    const random = prng.random();

    // Generate input data with a predictable pattern plus random noise
    var input = try allocator.alloc(f32, logical_len);
    defer allocator.free(input);
    for (input, 0..input.len) |*slot, idx| {
        const base: f32 = @floatFromInt(idx);
        slot.* = base * 0.5 + random.float(f32);
    }

    // Execute CPU reference implementation to produce expected results
    var cpu_output = try allocator.alloc(f32, logical_len);
    defer allocator.free(cpu_output);
    runCpuFallback(input, cpu_output);

    // Compute simple checksum for quick sanity verification
    const checksum = checksumSlice(cpu_output);
    std.debug.print("cpu fallback checksum: {d:.6}\n", .{checksum});

    // Attempt to load and validate the compiled SPIR-V shader module
    const module = try loadSpirvIfPresent(allocator, spirv_path);
    defer if (module) |info| allocator.free(info.bytes);

    if (module) |info| {
        std.debug.print(
            "gpu module: {s} ({d} bytes, header ok)\n",
            .{ info.path, info.bytes.len },
        );
    } else {
        std.debug.print(
            "gpu module: missing ({s}); run kernel build command to generate it\n",
            .{spirv_path},
        );
    }

    // Check if a GPU execution captured output for comparison
    const maybe_gpu_dump = try loadBinaryIfPresent(allocator, gpu_dump_path);
    defer if (maybe_gpu_dump) |blob| allocator.free(blob);

    if (maybe_gpu_dump) |blob| {
        // Compare GPU results against CPU reference lane-by-lane
        const mismatches = compareF32Slices(cpu_output, blob);
        std.debug.print(
            "gpu capture diff: {d} mismatched lanes\n",
            .{mismatches},
        );
    } else {
        std.debug.print(
            "gpu capture diff: skipped (no {s} file found)\n",
            .{gpu_dump_path},
        );
    }

    // Display first few lanes for manual inspection
    const sample_count = @min(input.len, 6);
    for (input[0..sample_count], cpu_output[0..sample_count], 0..) |original, squared, idx| {
        std.debug.print(
            "lane {d:>3}: in={d:.5} out={d:.5}\n",
            .{ idx, original, squared },
        );
    }

    // Write reference dump if requested for external GPU validation tools
    if (emit_binary) {
        try emitCpuDump(cpu_output);
        std.debug.print("cpu reference written to {s}\n", .{cpu_dump_path});
    }
}

/// Computes dispatch geometry by rounding up to complete workgroups.
/// Returns the number of groups, total padded invocations, and unused tail lanes.
fn computeDispatch(total_items: u32, group_size: u32) DispatchPlan {
    std.debug.assert(group_size > 0);
    // Divide ceiling to ensure all items are covered
    const groups = std.math.divCeil(u32, total_items, group_size) catch unreachable;
    const padded = groups * group_size;
    return .{
        .workgroup_size = group_size,
        .group_count = groups,
        .padded_invocations = padded,
        .tail = padded - total_items,
    };
}

/// Executes the squaring operation on the CPU, mirroring the GPU kernel logic.
/// Each output element is the square of its corresponding input.
fn runCpuFallback(input: []const f32, output: []f32) void {
    std.debug.assert(input.len == output.len);
    for (input, output) |value, *slot| {
        slot.* = value * value;
    }
}

/// Calculates a simple sum of all f32 values in double precision for observability.
fn checksumSlice(values: []const f32) f64 {
    var total: f64 = 0.0;
    for (values) |value| {
        total += @as(f64, @floatCast(value));
    }
    return total;
}

/// Attempts to read and validate a SPIR-V binary module from disk.
/// Returns null if the file doesn't exist; validates the magic number (0x07230203).
fn loadSpirvIfPresent(allocator: std.mem.Allocator, path: []const u8) !?ModuleInfo {
    var file = std.fs.cwd().openFile(path, .{}) catch |err| switch (err) {
        error.FileNotFound => return null,
        else => return err,
    };
    defer file.close();

    const bytes = try file.readToEndAlloc(allocator, 1 << 20);
    errdefer allocator.free(bytes);

    // Validate minimum size for SPIR-V header
    if (bytes.len < 4) return error.SpirvTooSmall;
    // Check little-endian magic number
    const magic = std.mem.readInt(u32, bytes[0..4], .little);
    if (magic != 0x0723_0203) return error.InvalidSpirvMagic;

    return ModuleInfo{ .path = path, .bytes = bytes };
}

/// Loads raw binary data if the file exists; returns null for missing files.
fn loadBinaryIfPresent(allocator: std.mem.Allocator, path: []const u8) !?[]u8 {
    var file = std.fs.cwd().openFile(path, .{}) catch |err| switch (err) {
        error.FileNotFound => return null,
        else => return err,
    };
    defer file.close();
    const bytes = try file.readToEndAlloc(allocator, 1 << 24);
    return bytes;
}

/// Compares two f32 slices for approximate equality within 1e-6 tolerance.
/// Returns the count of mismatched lanes; returns expected.len if sizes differ.
fn compareF32Slices(expected: []const f32, blob_bytes: []const u8) usize {
    // Ensure blob size aligns with f32 boundaries
    if (blob_bytes.len % @sizeOf(f32) != 0) return expected.len;
    const actual = std.mem.bytesAsSlice(f32, blob_bytes);
    if (actual.len != expected.len) return expected.len;

    var mismatches: usize = 0;
    for (expected, actual) |lhs, rhs| {
        // Use floating-point tolerance to account for minor GPU precision differences
        if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) {
            mismatches += 1;
        }
    }
    return mismatches;
}

/// Writes CPU-computed f32 array to disk as raw bytes for external comparison tools.
fn emitCpuDump(values: []const f32) !void {
    // Ensure output directory exists before writing
    try std.fs.cwd().makePath("out");
    var file = try std.fs.cwd().createFile(cpu_dump_path, .{ .truncate = true });
    defer file.close();
    // Convert f32 slice to raw bytes for binary serialization
    const bytes = std.mem.sliceAsBytes(values);
    try file.writeAll(bytes);
}


Run
Shell
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
    -femit-bin=kernels/vector_square.spv \
    01_vector_square_kernel.zig
$ zig run 02_host_pipeline.zig -- --emit-binary
Output
Shell
launch plan: 16 groups × 64 lanes => 1024 invocations (tail 24)
cpu fallback checksum: 83467485.758038
gpu module: kernels/vector_square.spv (5368 bytes, header ok)
gpu capture diff: skipped (no out/gpu_result.bin file found)
lane   0: in=0.10821 out=0.01171
lane   1: in=1.07972 out=1.16579
lane   2: in=1.03577 out=1.07281
lane   3: in=2.33225 out=5.43938
lane   4: in=2.92146 out=8.53491
lane   5: in=2.89332 out=8.37133
cpu reference written to out/reference.bin

Keep the generated out/reference.bin around if you plan to capture GPU buffers; otherwise, delete it to leave the workspace clean.

Validating Device Dumps

The diff tool consumes two binary dumps (expected versus captured) and reports mismatched lanes, previewing the first few discrepancies to help you spot data-dependent bugs quickly. It assumes little-endian f32 values, matching how most host APIs expose raw mapped buffers.

Zig
// Utility to compare two float32 binary dumps.
//
// The files are expected to be raw little-endian 32-bit float arrays. The
// program prints the number of mismatched lanes (based on absolute tolerance)
// and highlights the first few differences for quick diagnostics.

const std = @import("std");

/// Maximum number of mismatched differences to display in diagnostic output
const max_preview = 5;

pub fn main() !void {
    // Initialize allocator with leak detection for development builds
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer switch (gpa.deinit()) {
        .ok => {},
        .leak => std.log.warn("compare_dump leaked memory", .{}),
    };
    const allocator = gpa.allocator();

    // Parse command-line arguments expecting exactly two file paths
    var args = try std.process.argsWithAllocator(allocator);
    defer args.deinit();
    _ = args.next(); // Skip program name

    const expected_path = args.next() orelse return usageError();
    const actual_path = args.next() orelse return usageError();
    if (args.next()) |_| return usageError(); // Reject extra arguments

    // Load both binary dumps into memory for comparison
    const expected_bytes = try readAll(allocator, expected_path);
    defer allocator.free(expected_bytes);

    const actual_bytes = try readAll(allocator, actual_path);
    defer allocator.free(actual_bytes);

    // Reinterpret raw bytes as f32 slices for element-wise comparison
    const expected = std.mem.bytesAsSlice(f32, expected_bytes);
    const actual = std.mem.bytesAsSlice(f32, actual_bytes);

    // Early exit if array lengths differ
    if (expected.len != actual.len) {
        std.debug.print(
            "length mismatch: expected {d} elements, actual {d} elements\n",
            .{ expected.len, actual.len },
        );
        return;
    }

    // Track total mismatches and collect first few for detailed reporting
    var mismatches: usize = 0;
    var first_few: [max_preview]?Diff = .{null} ** max_preview;

    // Compare each lane using floating-point tolerance to account for minor precision differences
    for (expected, actual, 0..) |lhs, rhs, idx| {
        if (!std.math.approxEqAbs(f32, lhs, rhs, 1e-6)) {
            // Store first N differences for diagnostic display
            if (mismatches < max_preview) {
                first_few[mismatches] = Diff{ .index = idx, .expected = lhs, .actual = rhs };
            }
            mismatches += 1;
        }
    }

    // Print summary of comparison results
    std.debug.print("mismatched lanes: {d}\n", .{mismatches});
    
    // Display detailed information for first few mismatches to aid debugging
    for (first_few) |maybe_diff| {
        if (maybe_diff) |diff| {
            std.debug.print(
                "  lane {d}: expected={d:.6} actual={d:.6}\n",
                .{ diff.index, diff.expected, diff.actual },
            );
        }
    }
}

/// Prints usage information and returns an error when invocation is invalid
fn usageError() !void {
    std.debug.print("usage: compare_dump <expected.bin> <actual.bin>\n", .{});
    return error.InvalidInvocation;
}

/// Reads entire file contents into allocated memory with a 64 MiB size limit
fn readAll(allocator: std.mem.Allocator, path: []const u8) ![]u8 {
    var file = try std.fs.cwd().openFile(path, .{});
    defer file.close();
    return try file.readToEndAlloc(allocator, 1 << 26);
}

/// Captures a single floating-point mismatch with its location and values
const Diff = struct {
    index: usize,
    expected: f32,
    actual: f32,
};
Run
Shell
$ zig run 03_compare_dump.zig -- out/reference.bin out/reference.bin
Output
Shell
mismatched lanes: 0

To validate a real GPU run, save the device buffer as out/gpu_result.bin and rerun 03_compare_dump.zig against that file to surface any divergence:
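
Shell
$ zig run 03_compare_dump.zig -- out/reference.bin out/gpu_result.bin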

Notes & Caveats

  • Storage buffers require explicit alignment; keep your extern struct definitions in lockstep with host descriptor bindings to avoid silent padding bugs (see the layout-assertion sketch after this list).
  • The self-hosted SPIR-V backend rejects unsupported address spaces on CPU targets, so isolate kernel source files from host builds (no @import from CPU binaries).
  • Deterministic PRNG seeding keeps CPU and GPU executions comparable across runs and CI environments.
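
The layout caveat can be enforced at compile time. Here is a minimal sketch, reusing the Submission and VectorPayload definitions from the kernel: comptime assertions pin the sizes and offsets the host descriptors rely on, so an accidental padding change fails the build instead of silently corrupting buffers.

Zig
const std = @import("std");

const lane_capacity: u32 = 1024;

const Submission = extern struct {
    len: u32,
    _padding: u32 = 0,
};

const VectorPayload = extern struct {
    values: [lane_capacity]f32,
};

// Pin the layouts the host-side descriptor tables assume. If a future edit
// introduces padding or reorders fields, compilation fails here rather than
// at runtime on the device.
comptime {
    std.debug.assert(@sizeOf(Submission) == 8);
    std.debug.assert(@offsetOf(Submission, "len") == 0);
    std.debug.assert(@sizeOf(VectorPayload) == lane_capacity * @sizeOf(f32));
}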

Exercises

  • Extend the kernel to fuse multiplication and addition (a * a + b) by binding a second input buffer; update the host and diff tool accordingly.
  • Teach the host CLI to emit JSON metadata describing the dispatch plan, so external profilers can ingest the run configuration.
  • Capture real GPU output (via Vulkan, WebGPU, or wgpu-native) and feed the binary into 03_compare_dump.zig, noting any tolerance adjustments required for your hardware.

Alternatives & Edge Cases

  • Vendors map storage buffers differently; check for required minimum alignments (for example, 16 bytes on some drivers) before assuming f32 arrays are densely packed.
  • For very large buffers, stream comparisons instead of loading entire dumps into memory to avoid allocator pressure on low-end machines (a sketch follows this list).
  • When targeting CUDA (nvptx64), the same .kernel calling convention applies, but you must swap the address spaces (.global/.shared) to satisfy PTX expectations.
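
For the streaming comparison mentioned above, a minimal sketch: it assumes both dumps have the same length and that the length is a multiple of @sizeOf(f32), and the countMismatchesStreaming helper is illustrative rather than part of the project code.

Zig
const std = @import("std");

/// Counts mismatched f32 lanes without loading either file whole.
/// Sketch only: assumes equal file lengths that are multiples of 4 bytes.
fn countMismatchesStreaming(expected: std.fs.File, actual: std.fs.File) !usize {
    var buf_lhs: [4096]u8 = undefined;
    var buf_rhs: [4096]u8 = undefined;
    var mismatches: usize = 0;

    while (true) {
        const n_lhs = try expected.readAll(&buf_lhs);
        const n_rhs = try actual.readAll(&buf_rhs);
        if (n_lhs != n_rhs) return error.LengthMismatch;
        if (n_lhs == 0) break;

        // Reinterpret each chunk as little-endian f32 lanes, mirroring the
        // in-memory comparison performed by 03_compare_dump.zig.
        const lhs = std.mem.bytesAsSlice(f32, buf_lhs[0..n_lhs]);
        const rhs = std.mem.bytesAsSlice(f32, buf_rhs[0..n_rhs]);
        for (lhs, rhs) |a, b| {
            if (!std.math.approxEqAbs(f32, a, b, 1e-6)) mismatches += 1;
        }
    }
    return mismatches;
}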
