Chapter 34: GPU Fundamentals

GPU Fundamentals

Overview

The C interop bridge from the previous chapter lets Zig speak to decades of native code (see Chapter 33); the next frontier is harnessing massively parallel devices without abandoning Zig’s ergonomics. We will map GPU execution models onto Zig’s language primitives, examine how address spaces and calling conventions constrain kernel code, and learn the build flags that tame the still-evolving SPIR-V toolchain (see v0.15.2).

Along the way, we will contrast compute-first design with graphics pipelines, highlight where Zig’s standard library already understands GPU targets, and outline pragmatic host/device coordination patterns for projects that still need to run on pure CPU hardware (see Target.zig).

Learning Goals

  • Relate Zig’s compilation model to GPU execution hierarchies and memory classes.
  • Declare and compile GPU kernels with explicit calling conventions and address spaces.
  • Plan launch parameters that gracefully degrade to CPU fallbacks when accelerators are absent.

See builtin.zig and math.zig for related definitions.

GPU Architecture Foundations

GPUs expose thousands of lightweight threads arranged into hierarchies of work items, work groups, and grids; Zig surfaces those indices through builtins like @workGroupId, @workGroupSize, and @workItemId, keeping the model explicit so kernels remain predictable. Because GPU compilers penalize implicit global state, Zig’s bias toward explicit parameters and result locations naturally fits the deterministic flow demanded by SIMT hardware.

Execution model: SIMT and thread groups

Single-instruction, multiple-thread (SIMT) execution bundles lanes into warps or wavefronts that run the same opcode stream until divergence. When you compile for targets such as .spirv32, .spirv64, .nvptx, or .amdgcn, Zig swaps its default calling convention for specialized GPU variants, so callconv(.kernel) emits code that satisfies each platform’s scheduler expectations. Divergence is handled explicitly: branching on per-lane values results in predicate masks that stall inactive threads, so structuring kernels with coarse branching keeps throughput predictable.
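
As a quick illustration, the sketch below (a minimal example assuming a SPIR-V target; the buffer names are hypothetical) branches on a flag that is uniform across the whole group, so every lane follows the same path; branching on the per-lane index instead would force the hardware to predicate and idle the divergent lanes.

Zig
/// Minimal divergence sketch: the branch condition is the same for every lane.
pub export fn scaleIfEnabled(
    data: *addrspace(.storage_buffer) [256]f32,
    enabled: *addrspace(.storage_buffer) u32,
) callconv(.kernel) void {
    // Per-lane index within dispatch dimension 0.
    const i = @workItemId(0);

    // Uniform branch: every lane reads the same flag word, so no lanes are masked off.
    if (enabled.* != 0) {
        data[i] *= 2.0;
    }
}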

Memory hierarchies and address spaces

Zig models GPU memories through first-class address spaces — .global, .shared, .local, .constant, .storage_buffer, and more — each with its own coherence and lifetime rules. The compiler refuses pointer arithmetic that crosses into disallowed spaces, forcing kernel authors to acknowledge when data lives in shared memory versus device-global buffers. Use explicit casts like @addrSpaceCast only when you can prove the access rules remain valid, and prefer extern struct payloads for data shared with host APIs to guarantee layout stability.
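
Because layout mismatches between host and kernel code are easy to introduce silently, a compile-time layout check on the host side is cheap insurance. The sketch below mirrors the Coordinates payload used later in this chapter and asserts its size and final field offset, which for four u32 fields under C layout rules are 16 and 12 bytes respectively.

Zig
const std = @import("std");

// Host-side mirror of the kernel's storage-buffer payload. `extern` guarantees
// C-compatible layout, so the descriptor the host reads matches what the kernel wrote.
const Coordinates = extern struct {
    group: u32,
    group_size: u32,
    local: u32,
    linear: u32,
};

comptime {
    // Layout checks catch accidental padding before a dispatch ever runs.
    std.debug.assert(@sizeOf(Coordinates) == 16);
    std.debug.assert(@offsetOf(Coordinates, "linear") == 12);
}

pub fn main() void {
    std.debug.print("Coordinates payload: {d} bytes\n", .{@sizeOf(Coordinates)});
}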

Compute vs graphics pipelines

Compute kernels are just SPIR-V or PTX entry points that you enqueue from host code; graphics shaders traverse a fixed pipeline that Zig currently treats as external binaries you author in shading languages or translated SPIR-V blobs. Zig’s @import system does not yet generate render pipelines, but you can embed precompiled SPIR-V and dispatch it through Vulkan or WebGPU hosts written in Zig, integrating with the same allocator and error handling discipline you rely on elsewhere in the standard library.
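
One pragmatic pattern, sketched below, is to embed the precompiled blob into the host binary with @embedFile; the path refers to the capture_coordinates.spv module produced later in this chapter and is otherwise illustrative.

Zig
const std = @import("std");

// Embed a precompiled SPIR-V module so the host can hand it to Vulkan or WebGPU
// at runtime. @embedFile resolves the path relative to this source file.
const shader_bytes = @embedFile("capture_coordinates.spv");

pub fn main() void {
    // The blob is an ordinary byte slice; a Vulkan host passes its pointer and
    // length to vkCreateShaderModule (the length must be a multiple of 4).
    std.debug.print("embedded SPIR-V module: {d} bytes\n", .{shader_bytes.len});
}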

Targeting GPUs with Zig

The compiler’s view of a build is captured by builtin.target, which records the architecture, OS tag, ABI, and permitted address spaces; toggling -target at the CLI level is enough to retarget code for host CPUs, SPIR-V, or CUDA backends. Zig 0.15.2 ships both the self-hosted SPIR-V backend and an LLVM-based fallback selectable with -fllvm, letting you experiment with whichever pipeline better matches your downstream drivers.

Understanding the Target Structure

Before working with GPU-specific compilation, it’s valuable to understand how Zig represents compilation targets internally. The following diagram shows the complete std.Target structure:

graph TB subgraph "std.Target Structure" TARGET["std.Target"] CPU["cpu: Cpu"] OS["os: Os"] ABI["abi: Abi"] OFMT["ofmt: ObjectFormat"] DYNLINKER["dynamic_linker: DynamicLinker"] TARGET --> CPU TARGET --> OS TARGET --> ABI TARGET --> OFMT TARGET --> DYNLINKER end subgraph "Cpu Components" CPU --> ARCH["arch: Cpu.Arch"] CPU --> MODEL["model: *const Cpu.Model"] CPU --> FEATURES["features: Feature.Set"] ARCH --> ARCHEX["x86_64, aarch64, wasm32, etc"] MODEL --> MODELEX["generic, native, specific variants"] FEATURES --> FEATEX["CPU feature flags"] end subgraph "Os Components" OS --> OSTAG["tag: Os.Tag"] OS --> VERSION["version_range: VersionRange"] OSTAG --> OSEX["linux, windows, macos, wasi, etc"] VERSION --> VERUNION["linux: LinuxVersionRange<br/>windows: WindowsVersion.Range<br/>semver: SemanticVersion.Range<br/>none: void"] end subgraph "Abi and Format" ABI --> ABIEX["gnu, musl, msvc, none, etc"] OFMT --> OFMTEX["elf, macho, coff, wasm, c, spirv"] end

This target structure reveals how GPU compilation integrates with Zig’s type system. When you specify -target spirv32-vulkan-none, you’re setting: CPU arch to spirv32 (32-bit SPIR-V), OS tag to vulkan (Vulkan environment), ABI to none (freestanding, no C runtime), and implicitly ObjectFormat to spirv. The target fully determines code generation behavior: builtin.target.cpu.arch.isSpirV() returns true, address space support is enabled, and the compiler selects the SPIR-V backend instead of x86_64 or ARM code generation. This same structure handles all targets—CPU, GPU, WebAssembly, bare metal—with uniform semantics. The ObjectFormat field (ofmt) tells the linker which binary format to produce: elf for Linux executables, macho for Darwin, coff for Windows, wasm for WebAssembly, and spirv for GPU shaders. Understanding this architecture helps you decode target triples, predict which builtins are available (like @workGroupId on GPU targets), and troubleshoot cross-compilation issues.

Inspecting targets and address spaces

This first example introspects the native build target, reports which GPU address spaces the compiler allows, and synthesizes a cross-compilation triple for SPIR-V. Running it on non-GPU hosts still teaches the vocabulary Zig uses to describe accelerators (see Query.zig).

Zig
const std = @import("std");
const builtin = @import("builtin");

pub fn main() !void {
    // Query the compile-time target information to inspect the environment
    // this binary is being compiled for (host or cross-compilation target)
    const target = builtin.target;

    // Display basic target information: CPU architecture, OS, and object format
    std.debug.print("host architecture: {s}\n", .{@tagName(target.cpu.arch)});
    std.debug.print("host operating system: {s}\n", .{@tagName(target.os.tag)});
    std.debug.print("default object format: {s}\n", .{@tagName(target.ofmt)});

    // Check if we're compiling for a GPU backend by examining the target CPU architecture.
    // GPU architectures include AMD GCN, NVIDIA PTX variants, and SPIR-V targets.
    const is_gpu_backend = switch (target.cpu.arch) {
        .amdgcn, .nvptx, .nvptx64, .spirv32, .spirv64 => true,
        else => false,
    };
    std.debug.print("compiling as GPU backend: {}\n", .{is_gpu_backend});

    // Import address space types for querying GPU-specific memory capabilities
    const AddressSpace = std.builtin.AddressSpace;
    const Context = AddressSpace.Context;

    // Query whether the target supports GPU-specific address spaces:
    // - shared: memory shared within a workgroup/threadblock
    // - constant: read-only memory optimized for uniform access across threads
    const shared_ok = target.cpu.supportsAddressSpace(AddressSpace.shared, null);
    const constant_ok = target.cpu.supportsAddressSpace(AddressSpace.constant, Context.constant);

    std.debug.print("supports shared address space: {}\n", .{shared_ok});
    std.debug.print("supports constant address space: {}\n", .{constant_ok});

    // Construct a custom target query for SPIR-V 64-bit targeting Vulkan
    const gpa = std.heap.page_allocator;
    const query = std.Target.Query{
        .cpu_arch = .spirv64,
        .os_tag = .vulkan,
        .abi = .none,
    };
    
    // Convert the target query to a triple string (e.g., "spirv64-vulkan-none")
    const triple = try query.zigTriple(gpa);
    defer gpa.free(triple);
    std.debug.print("example SPIR-V triple: {s}\n", .{triple});
}
Run
Shell
$ zig run 01_target_introspection.zig
Output
Shell
host architecture: x86_64
host operating system: linux
default object format: elf
compiling as GPU backend: false
supports shared address space: false
supports constant address space: false
example SPIR-V triple: spirv64-vulkan-none

Even when the native arch is a CPU, synthesizing a SPIR-V triple helps you wire up build steps that emit GPU binaries without switching machines.

Declaring kernels and dispatch metadata

The kernel below stores its dispatch coordinates in a storage-buffer struct, illustrating how GPU-specific calling conventions, address spaces, and builtins compose. Compiling requires a SPIR-V target and the self-hosted backend (-fno-llvm) so Zig emits binary modules ready for Vulkan or WebGPU queue submission.

Zig
//! GPU Kernel: Coordinate Capture
//!
//! This module demonstrates a minimal SPIR-V compute kernel that captures GPU dispatch
//! coordinates into a storage buffer. It shows how to use Zig's GPU-specific builtins
//! and address space annotations to write kernels that compile to SPIR-V.

const builtin = @import("builtin");

/// Represents GPU dispatch coordinates for a single invocation
/// 
/// Uses `extern` layout to guarantee memory layout matches host-side expectations,
/// ensuring the kernel's output can be safely interpreted by CPU code reading the buffer.
const Coordinates = extern struct {
    /// Work group ID (which group this invocation belongs to)
    group: u32,
    /// Work group size (number of invocations per group in this dimension)
    group_size: u32,
    /// Local invocation ID within the work group (0 to group_size-1)
    local: u32,
    /// Global linear ID across all invocations (group * group_size + local)
    linear: u32,
};

/// GPU kernel entry point that captures dispatch coordinates
///
/// This function must be exported so the SPIR-V compiler generates an entry point.
/// The `callconv(.kernel)` calling convention tells Zig to emit GPU-specific function
/// attributes and handle parameter passing according to compute shader ABI.
///
/// Parameters:
///   - out: Pointer to storage buffer where coordinates will be written.
///          The `.storage_buffer` address space annotation ensures proper
///          memory access patterns for device-visible GPU memory.
pub export fn captureCoordinates(out: *addrspace(.storage_buffer) Coordinates) callconv(.kernel) void {
    // Query the work group ID in the X dimension (first dimension)
    // @workGroupId is a GPU-specific builtin that returns the current work group's coordinate
    const group = @workGroupId(0);
    
    // Query the work group size (how many invocations per group in this dimension)
    // This is set at dispatch time by the host and queried here for completeness
    const group_size = @workGroupSize(0);
    
    // Query the local invocation ID within this work group (0 to group_size-1)
    // @workItemId is the per-work-group thread index
    const local = @workItemId(0);
    
    // Calculate global linear index across all invocations
    // This converts the hierarchical (group, local) pair into a flat 1D index
    const linear = group * group_size + local;

    // Write all captured coordinates to the output buffer
    // The GPU will ensure this write is visible to the host after synchronization
    out.* = .{
        .group = group,
        .group_size = group_size,
        .local = local,
        .linear = linear,
    };
}

// Compile-time validation to ensure this module is only compiled for SPIR-V targets
// This prevents accidental compilation for CPU architectures where GPU builtins are unavailable
comptime {
    switch (builtin.target.cpu.arch) {
        // Accept both 32-bit and 64-bit SPIR-V architectures
        .spirv32, .spirv64 => {},
        // Reject all other architectures with a helpful error message
        else => @compileError("captureCoordinates must be compiled with a SPIR-V target, e.g. -target spirv32-vulkan-none"),
    }
}
Run
Shell
$ zig build-obj -fno-llvm -O ReleaseSmall -target spirv32-vulkan-none \
    -femit-bin=chapters-data/code/34__gpu-fundamentals/capture_coordinates.spv \
    chapters-data/code/34__gpu-fundamentals/02_spirv_fill_kernel.zig
Output
Shell
no output (binary module generated)

The emitted .spv blob slots directly into Vulkan’s vkCreateShaderModule or WebGPU’s wgpuDeviceCreateShaderModule, and the extern struct ensures host descriptors match the kernel’s expected layout.

Toolchain choices and binary formats

Zig’s build system can register GPU artifacts via addObject or addLibrary, allowing you to tuck SPIR-V modules alongside CPU executables in a single workspace. When SPIR-V validation demands specific environments (Vulkan versus OpenCL), set the OS tag in your -target triple accordingly, and pin optimization modes (-O ReleaseSmall for shaders) to control instruction counts and register pressure (see build.zig). Fallbacks like -fllvm unlock vendor-specific features when the self-hosted backend trails the latest SPIR-V extensions.
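
The following build.zig sketch shows one way to wire that up (artifact names and paths are illustrative, and the module-creation calls assume the 0.15-era std.Build API): resolve a SPIR-V target, register the kernel with addObject, and install the emitted module beside the host artifacts.

Zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    // Resolve a SPIR-V target for the kernel, mirroring -target spirv32-vulkan-none.
    const gpu_target = b.resolveTargetQuery(.{
        .cpu_arch = .spirv32,
        .os_tag = .vulkan,
        .abi = .none,
    });

    // Register the kernel as an object; ReleaseSmall keeps instruction counts down.
    const kernel = b.addObject(.{
        .name = "capture_coordinates",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/02_spirv_fill_kernel.zig"),
            .target = gpu_target,
            .optimize = .ReleaseSmall,
        }),
    });

    // Copy the emitted .spv into zig-out so host code can load it at runtime.
    const install_spv = b.addInstallFile(kernel.getEmittedBin(), "shaders/capture_coordinates.spv");
    b.getInstallStep().dependOn(&install_spv.step);
}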

Object Formats and ABI for GPU Targets

SPIR-V is a first-class object format in Zig, sitting alongside traditional executable formats. The following diagram shows how object formats and ABIs are organized:

graph TB subgraph "Common ABIs" ABI["Abi enum"] ABI --> GNU["gnu<br/>GNU toolchain"] ABI --> MUSL["musl<br/>musl libc"] ABI --> MSVC["msvc<br/>Microsoft Visual C++"] ABI --> NONE["none<br/>freestanding"] ABI --> ANDROID["android, gnueabi, etc<br/>platform variants"] end subgraph "Object Formats" OFMT["ObjectFormat enum"] OFMT --> ELF["elf<br/>Linux, BSD"] OFMT --> MACHO["macho<br/>Darwin systems"] OFMT --> COFF["coff<br/>Windows PE"] OFMT --> WASM["wasm<br/>WebAssembly"] OFMT --> C["c<br/>C source output"] OFMT --> SPIRV["spirv<br/>Shaders"] end

GPU kernels typically use abi = none because they run in freestanding environments without a C runtime—no libc, no standard library initialization, just raw compute. The SPIR-V object format produces .spv binaries that bypass traditional linking: instead of resolving relocations and merging sections like ELF or Mach-O linkers do, SPIR-V modules are complete, self-contained shader programs ready for consumption by Vulkan’s vkCreateShaderModule or WebGPU’s shader creation APIs. This is why you don’t need a separate linking step for GPU code—the compiler emits final binaries directly. When you specify -target spirv32-vulkan-none, the none ABI tells Zig to skip all C runtime setup, and the spirv object format ensures the output is valid SPIR-V bytecode rather than an executable with entry points and program headers.

Code Generation Backend Architecture

Zig supports multiple code generation backends, giving you flexibility in how SPIR-V is produced:

graph TB subgraph "Code Generation" CG["Code Generation"] CG --> LLVM["LLVM Backend<br/>use_llvm flag"] CG --> NATIVE["Native Backends<br/>x86_64, aarch64, wasm, riscv64"] CG --> CBACK["C Backend<br/>ofmt == .c"] end

The LLVM Backend (-fllvm) routes through LLVM’s SPIR-V target, which supports vendor-specific extensions and newer SPIR-V versions. Use this when you need features the self-hosted backend hasn’t implemented yet, or when debugging compiler issues—LLVM’s mature SPIR-V support provides a known-good reference. The Native Backends (-fno-llvm, the default) use Zig’s self-hosted code generation for SPIR-V, which is faster to compile and produces smaller binaries but may lag behind LLVM in extension support. For SPIR-V, the self-hosted backend emits bytecode directly without intermediate representations. The C Backend isn’t applicable to GPU targets, but demonstrates Zig’s multi-backend flexibility. When experimenting with GPU code, start with -fno-llvm for faster iteration; switch to -fllvm if you encounter missing SPIR-V features or need to compare output against a reference implementation. The choice affects compilation speed and feature availability but not the API you write—your kernel code remains identical.

Launch Planning and Data Parallel Patterns

Choosing launch sizes involves balancing GPU occupancy with shared-memory budgets, while CPU fallbacks should reuse the same arithmetic so correctness stays identical across devices. Zig’s strong typing makes these calculations explicit, encouraging reusable helpers for both host planners and kernels.

Choosing workgroup sizes

This helper computes how many work groups you need for a problem size, how much padding the final group introduces, and models the same computation for CPU-side chunking. Using one routine eliminates off-by-one desynchronization between host and device scheduling.

Zig
//! GPU Dispatch Planning Utility
//! 
//! This module demonstrates how to calculate workgroup dispatch parameters for GPU compute shaders.
//! It shows the relationship between total work items, workgroup size, and the resulting dispatch
//! configuration, including handling of "tail" elements that don't fill a complete workgroup.

const std = @import("std");

/// Represents a complete dispatch configuration for parallel execution
/// Contains all necessary parameters to launch a compute kernel or parallel task
const DispatchPlan = struct {
    /// Size of each workgroup (number of threads/invocations per group)
    workgroup_size: u32,
    /// Number of workgroups needed to cover all items
    group_count: u32,
    /// Total invocations including padding (always a multiple of workgroup_size)
    padded_invocations: u32,
    /// Number of padded/unused invocations in the last workgroup
    tail: u32,
};

/// Computes optimal dispatch parameters for a given problem size and workgroup configuration
/// 
/// Calculates how many workgroups are needed to process all items, accounting for the fact
/// that the last workgroup may be partially filled. This is essential for GPU compute shaders
/// where work must be dispatched in multiples of the workgroup size.
fn computeDispatch(total_items: u32, workgroup_size: u32) DispatchPlan {
    // Ensure workgroup size is valid (GPU workgroups cannot be empty)
    std.debug.assert(workgroup_size > 0);
    
    // Calculate number of workgroups needed, rounding up to ensure all items are covered
    const groups = std.math.divCeil(u32, total_items, workgroup_size) catch unreachable;
    
    // Calculate total invocations including padding (GPU always launches complete workgroups)
    const padded = groups * workgroup_size;
    
    return .{
        .workgroup_size = workgroup_size,
        .group_count = groups,
        .padded_invocations = padded,
        // Tail represents wasted invocations that must be handled with bounds checks
        .tail = padded - total_items,
    };
}

/// Simulates CPU-side parallel execution planning using the same dispatch logic
/// 
/// Demonstrates that the workgroup dispatch formula applies equally to CPU thread batching,
/// ensuring consistent behavior between GPU and CPU fallback implementations.
fn simulateCpuFallback(total_items: u32, lanes: u32) DispatchPlan {
    // Reuse the GPU formula so host-side chunking matches device scheduling.
    return computeDispatch(total_items, lanes);
}

pub fn main() !void {
    // Define a sample problem: processing 1000 items
    const problem_size: u32 = 1000;
    
    // Typical GPU workgroup size (often 32, 64, or 256 depending on hardware)
    const workgroup_size: u32 = 64;
    
    // Calculate GPU dispatch configuration
    const plan = computeDispatch(problem_size, workgroup_size);
    std.debug.print(
        "gpu dispatch: {d} groups × {d} lanes => {d} invocations (tail {d})\n",
        .{ plan.group_count, plan.workgroup_size, plan.padded_invocations, plan.tail },
    );

    // Simulate CPU fallback with fewer parallel lanes
    const fallback_threads: u32 = 16;
    const cpu = simulateCpuFallback(problem_size, fallback_threads);
    std.debug.print(
        "cpu chunks: {d} batches × {d} lanes => {d} logical tasks (tail {d})\n",
        .{ cpu.group_count, cpu.workgroup_size, cpu.padded_invocations, cpu.tail },
    );
}
Run
Shell
$ zig run 03_dispatch_planner.zig
Output
Shell
gpu dispatch: 16 groups × 64 lanes => 1024 invocations (tail 24)
cpu chunks: 63 batches × 16 lanes => 1008 logical tasks (tail 8)

Feed the planner’s output back into both kernel launch descriptors and CPU task schedulers so instrumentation stays consistent across platforms.
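
The sketch below shows the CPU side of that contract (the plan struct is redeclared so the snippet stands alone, and the doubling workload is a hypothetical stand-in): the loop walks the padded invocation count and guards the tail exactly the way a bounds check inside the kernel would.

Zig
const std = @import("std");

// Redeclared locally so the snippet is self-contained; matches the planner's fields.
const DispatchPlan = struct {
    workgroup_size: u32,
    group_count: u32,
    padded_invocations: u32,
    tail: u32,
};

fn runCpuChunks(plan: DispatchPlan, items: []f32) void {
    var invocation: u32 = 0;
    while (invocation < plan.padded_invocations) : (invocation += 1) {
        // Tail guard: padded lanes do no work, mirroring the kernel's bounds check.
        if (invocation >= items.len) continue;
        items[invocation] *= 2.0;
    }
}

test "tail padding performs no extra work" {
    var items = [_]f32{ 1, 2, 3, 4, 5 };
    const plan = DispatchPlan{
        .workgroup_size = 4,
        .group_count = 2,
        .padded_invocations = 8,
        .tail = 3,
    };
    runCpuChunks(plan, &items);
    try std.testing.expectEqual(@as(f32, 10), items[4]);
}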

CPU fallbacks and unified code paths

Modern applications often ship CPU implementations for capability-limited machines; by sharing dispatch planners and extern payloads, you can reuse validation code that checks GPU outputs against CPU recomputations before trusting results in production. Pair this with Zig’s build options (-Dgpu=false) to conditionally exclude kernel modules when packaging for environments without accelerators.
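
A sketch of that validation step follows (the readback buffer and the scaling rule are hypothetical stand-ins for a real device readback): the CPU recomputes each element and tolerates a small floating-point difference before trusting the GPU results.

Zig
const std = @import("std");

// Hypothetical reference computation shared by the CPU fallback and the validator.
fn cpuReference(index: u32, scale: f32) f32 {
    return @as(f32, @floatFromInt(index)) * scale;
}

fn validate(gpu_results: []const f32, scale: f32) !void {
    for (gpu_results, 0..) |value, i| {
        const expected = cpuReference(@intCast(i), scale);
        // Allow small floating-point drift between device and host arithmetic.
        if (!std.math.approxEqAbs(f32, value, expected, 1e-5)) return error.Mismatch;
    }
}

test "gpu readback matches cpu recomputation" {
    // Stand-in for a buffer mapped back from the device after a dispatch.
    var readback: [4]f32 = undefined;
    for (&readback, 0..) |*slot, i| slot.* = cpuReference(@intCast(i), 2.0);
    try validate(&readback, 2.0);
}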

Notes & Caveats

  • Always gate GPU-specific code behind feature checks so CPU-only builds still pass CI.
  • Vulkan validation layers catch many mistakes early; enable them whenever compiling SPIR-V from Zig until your kernel suite stabilizes.
  • Prefer release-small optimization for kernels: it minimizes instruction count, easing pressure on instruction caches and register files.

Exercises

  • Extend the kernel to write multiple dimensions (XYZ) into the coordinate struct and verify the emitted SPIR-V with spirv-dis.
  • Add a CPU-side validator that maps the SPIR-V output buffer back into Zig and cross-checks runtimes against simulateCpuFallback.
  • Modify the build script to emit both SPIR-V and PTX variants by flipping the -target triple and swapping address-space annotations accordingly.

Alternatives & Edge Cases

  • Some GPU drivers demand specialized calling conventions (e.g., AMD’s .amdgcn.kernel), so parameter order and types must match vendor documentation precisely.
  • @workGroupSize returns compile-time constants only when you mark the function inline and supply size literals; otherwise, assume runtime values and guard dynamic paths.
  • OpenCL targets prefer .param address spaces; when cross-compiling, audit every pointer parameter and adjust addrspace annotations to maintain correctness.

Help make this chapter better.

Found a typo, rough edge, or missing explanation? Open an issue or propose a small improvement on GitHub.