Chapter 29: Threads & Atomics

Overview

Filesystem plumbing from the previous chapter set the stage for applications that produce and consume data in parallel. Now we focus on how Zig launches OS threads, coordinates work across cores, and keeps shared state consistent with atomic operations (see Chapter 28 and Thread.zig).

Zig 0.15.2’s thread primitives combine lightweight spawn APIs with explicit memory ordering, so you decide when a store becomes visible and when contention should block. Understanding these tools now will make the upcoming parallel wordcount project far less mysterious (see atomic.zig and Chapter 30).

Learning Goals

  • Spawn and join worker threads responsibly, selecting stack sizes and allocators only when necessary.
  • Choose memory orderings for atomic loads, stores, and compare-and-swap loops when protecting shared state.
  • Detect single-threaded builds at compile time and fall back to synchronous execution paths.

Orchestrating Work with std.Thread

Zig models kernel threads through std.Thread, exposing helpers to query CPU counts, configure stack sizes, and join handles deterministically. Unlike async I/O, these are real kernel threads—every spawn consumes OS resources, so batching units of work matters.

Thread Pool Pattern

Before diving into manual thread spawning, it’s valuable to understand the thread pool pattern that Zig’s own compiler uses for parallel work. The following diagram shows how std.Thread.Pool distributes work across workers:

graph TB
    ThreadPool["ThreadPool<br/>(std.Thread.Pool)"]
    AstGen1["AstGen<br/>(File 1)"]
    AstGen2["AstGen<br/>(File 2)"]
    AstGen3["AstGen<br/>(File 3)"]
    Sema1["Sema<br/>(Function 1)"]
    Sema2["Sema<br/>(Function 2)"]
    Codegen1["CodeGen<br/>(Function 1)"]
    Codegen2["CodeGen<br/>(Function 2)"]
    ZcuPerThread["Zcu.PerThread<br/>(per-thread state)"]
    ThreadPool --> AstGen1
    ThreadPool --> AstGen2
    ThreadPool --> AstGen3
    ThreadPool --> Sema1
    ThreadPool --> Sema2
    ThreadPool --> Codegen1
    ThreadPool --> Codegen2
    Sema1 -.->|"uses"| ZcuPerThread
    Sema2 -.->|"uses"| ZcuPerThread

A thread pool maintains a fixed number of worker threads that pull work items from a queue, avoiding the overhead of repeatedly spawning and joining threads. The Zig compiler uses this pattern extensively: std.Thread.Pool dispatches AST generation, semantic analysis, and code generation tasks to workers. Each worker has per-thread state (Zcu.PerThread) to minimize synchronization—only the final results need mutex protection when merging into shared data structures like InternPool.shards. This architecture demonstrates key concurrent design principles: work units should be independent, shared state should be sharded or protected by mutexes, and per-thread caches reduce contention. When your workload involves many small tasks, prefer std.Thread.Pool over manual spawning; when you need a few long-running workers with specific responsibilities, manual spawn/join is appropriate.
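As a rough illustration, the sketch below queues eight tiny tasks on a pool and waits for them with a WaitGroup. The task body and counts are invented for the example; the Pool and WaitGroup calls reflect the std.Thread.Pool API as of Zig 0.15.

Zig

// A minimal thread-pool sketch: queue small tasks instead of spawning
// one OS thread per work item.
const std = @import("std");

// Hypothetical work item: mark the slot for this task id as done.
fn task(id: usize, results: []u8) void {
    results[id] = 1;
}

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const results = try allocator.alloc(u8, 8);
    defer allocator.free(results);
    @memset(results, 0);

    // One pool with a worker per CPU; tasks are queued, not spawned.
    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator });
    defer pool.deinit();

    var wait_group: std.Thread.WaitGroup = .{};
    for (0..results.len) |id| {
        pool.spawnWg(&wait_group, task, .{ id, results });
    }
    // The main thread helps drain the queue, then blocks until all tasks finish.
    pool.waitAndWork(&wait_group);

    var done: usize = 0;
    for (results) |r| done += r;
    std.debug.print("completed {d} tasks\n", .{done});
}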

Chunking data with spawn/join

The example below partitions an array of integers across a dynamic number of workers, using an atomic fetch-add to accumulate the total of even numbers without locks. It adapts to the host CPU count but never spawns more threads than there are elements to process.

Zig

// This example demonstrates parallel computation using threads and atomic operations in Zig.
// It calculates the sum of even numbers in an array by distributing work across multiple threads.
const std = @import("std");

// Arguments passed to each worker thread for parallel processing
const WorkerArgs = struct {
    slice: []const u64,                  // The subset of numbers this worker should process
    sum: *std.atomic.Value(u64),         // Shared atomic counter for thread-safe accumulation
};

// Worker function that accumulates even numbers from its assigned slice
// Each thread runs this function independently on its own data partition
fn accumulate(args: WorkerArgs) void {
    // Use a local variable to minimize atomic operations (performance optimization)
    var local_total: u64 = 0;
    for (args.slice) |value| {
        if (value % 2 == 0) {
            local_total += value;
        }
    }

    // Atomically add the local result to the shared sum using sequentially consistent ordering
    // This ensures all threads see a consistent view of the shared state
    _ = args.sum.fetchAdd(local_total, .seq_cst);
}

pub fn main() !void {
    // Set up memory allocator with automatic leak detection
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Allocate array of 64 numbers for demonstration
    var numbers = try allocator.alloc(u64, 64);
    defer allocator.free(numbers);

    // Initialize array with values following the pattern: index * 7 + 3
    for (numbers, 0..) |*slot, index| {
        slot.* = @as(u64, @intCast(index * 7 + 3));
    }

    // Initialize shared atomic counter that all threads will safely update
    var shared_sum = std.atomic.Value(u64).init(0);

    // Determine optimal number of worker threads based on available CPU cores
    const cpu_count = std.Thread.getCpuCount() catch 1;
    const desired = if (cpu_count == 0) 1 else cpu_count;
    // Don't create more threads than we have numbers to process
    const worker_limit = @min(numbers.len, desired);

    // Allocate thread handles for parallel workers
    var threads = try allocator.alloc(std.Thread, worker_limit);
    defer allocator.free(threads);

    // Calculate chunk size, rounding up to ensure all elements are covered
    const chunk = (numbers.len + worker_limit - 1) / worker_limit;

    // Spawn worker threads, distributing the array into roughly equal chunks
    var start: usize = 0;
    var spawned: usize = 0;
    while (start < numbers.len and spawned < worker_limit) : (spawned += 1) {
        const remaining = numbers.len - start;
        // Give the last thread all remaining elements to handle uneven divisions
        const take = if (worker_limit - spawned == 1) remaining else @min(chunk, remaining);
        const end = start + take;

        // Spawn thread with its assigned slice and shared accumulator
        threads[spawned] = try std.Thread.spawn(.{}, accumulate, .{WorkerArgs{
            .slice = numbers[start..end],
            .sum = &shared_sum,
        }});

        start = end;
    }

    // Track how many threads were actually spawned (may be less than worker_limit)
    const used_threads = spawned;

    // Wait for all worker threads to complete their work
    for (threads[0..used_threads]) |thread| {
        thread.join();
    }

    // Read the final accumulated result from the atomic shared sum
    const even_sum = shared_sum.load(.seq_cst);

    // Perform sequential calculation to verify correctness of parallel computation
    var sequential: u64 = 0;
    for (numbers) |value| {
        if (value % 2 == 0) {
            sequential += value;
        }
    }

    // Set up buffered stdout writer for efficient output
    var stdout_buffer: [256]u8 = undefined;
    var stdout_state = std.fs.File.stdout().writer(&stdout_buffer);
    const out = &stdout_state.interface;

    // Display results: thread count and both parallel and sequential sums
    try out.print("spawned {d} worker(s)\n", .{used_threads});
    try out.print("even sum (threads): {d}\n", .{even_sum});
    try out.print("even sum (sequential check): {d}\n", .{sequential});
    try out.flush();
}
Run
Shell
$ zig run 01_parallel_even_sum.zig
Output
Shell
spawned 8 worker(s)
even sum (threads): 7264
even sum (sequential check): 7264

std.atomic.Value wraps plain integers and routes every access through @atomicLoad, @atomicStore, or @atomicRmw, shielding you from accidentally mixing atomic and non-atomic access to the same memory location.
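As a small sketch of the wrapper in action (the flag and helper names here are illustrative), a boolean published with .release is observed with .acquire:

Zig

const std = @import("std");

// Shared flag; every access goes through the atomic wrapper.
var ready_flag = std.atomic.Value(bool).init(false);

fn publish() void {
    // .release publishes all prior writes along with the flag itself.
    ready_flag.store(true, .release);
}

pub fn main() !void {
    var worker = try std.Thread.spawn(.{}, publish, .{});
    worker.join();
    // .acquire pairs with the .release store above.
    std.debug.print("ready: {}\n", .{ready_flag.load(.acquire)});
}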

Spawn configuration and scheduling hints

std.Thread.SpawnConfig lets you override stack sizes or supply a custom allocator if the defaults are unsuitable (for example, deep recursion or pre-allocated arenas). Catch Thread.getCpuCount() errors to provide a safe fallback, and remember to use Thread.yield() or Thread.sleep() when you need cooperative scheduling while waiting on other threads to progress.
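A minimal sketch of overriding the spawn configuration; the 16 MiB figure is an arbitrary example, not a recommendation:

Zig

const std = @import("std");

fn deepWorker() void {
    // Imagine deep recursion here that would overflow the default stack.
}

pub fn main() !void {
    // SpawnConfig lets us request a larger stack for this one thread.
    const config = std.Thread.SpawnConfig{
        .stack_size = 16 * 1024 * 1024, // hypothetical 16 MiB stack
    };
    var worker = try std.Thread.spawn(config, deepWorker, .{});
    worker.join();

    // getCpuCount can fail (e.g. in restricted sandboxes); fall back to 1.
    const cores = std.Thread.getCpuCount() catch 1;
    std.debug.print("cores: {d}\n", .{cores});
}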

Atomic state machines

Zig exposes LLVM’s atomic intrinsics directly: you pick an order such as .acquire, .release, or .seq_cst, and the compiler emits the matching fences. That clarity is valuable when you design small state machines—like a one-time initializer—that multiple threads must observe consistently.

Implementing a once guard with atomic builtins

This program builds a lock-free "call once" helper around @cmpxchgStrong. Threads spin only while another thread is running the initializer, then read the published value via an acquire load.

Zig

// This example demonstrates thread-safe one-time initialization using atomic operations.
// Multiple threads attempt to initialize a shared resource, but only one succeeds in
// performing the expensive initialization exactly once.

const std = @import("std");

// Represents the initialization state using atomic operations
const State = enum(u8) { idle, busy, ready };

// Global state tracking the initialization lifecycle
var once_state: State = .idle;
// The shared configuration value that will be initialized once
var config_value: i32 = 0;
// Counter to verify that initialization only happens once
var init_calls: u32 = 0;

// Simulates an expensive initialization operation that should only run once.
// Uses atomic operations to safely increment the call counter and set the config value.
fn expensiveInit() void {
    // Simulate expensive work with a sleep
    std.Thread.sleep(2 * std.time.ns_per_ms);
    // Atomically increment the initialization call counter
    _ = @atomicRmw(u32, &init_calls, .Add, 1, .seq_cst);
    // Atomically store the initialized value with release semantics
    @atomicStore(i32, &config_value, 9157, .release);
}

// Ensures expensiveInit() is called exactly once across multiple threads.
// Uses a state machine with compare-and-swap to coordinate thread access.
fn callOnce() void {
    while (true) {
        // Check the current state with acquire semantics to see initialization results
        switch (@atomicLoad(State, &once_state, .acquire)) {
            // Initialization complete, return immediately
            .ready => return,
            // Another thread is initializing, yield and retry
            .busy => {
                std.Thread.yield() catch {};
                continue;
            },
            // Not yet initialized, attempt to claim initialization responsibility
            .idle => {
                // Try to atomically transition from idle to busy
                // If successful (returns null), this thread wins and will initialize
                // If it fails (returns the actual value), another thread won, so retry
                if (@cmpxchgStrong(State, &once_state, .idle, .busy, .acq_rel, .acquire)) |_| {
                    continue;
                }
                // This thread successfully claimed the initialization
                break;
            },
        }
    }

    // Perform the one-time initialization
    expensiveInit();
    // Mark initialization as complete with release semantics
    @atomicStore(State, &once_state, .ready, .release);
}

// Arguments passed to each worker thread
const WorkerArgs = struct {
    results: []i32,
    index: usize,
};

// Worker thread function that calls the once-initialization and reads the result.
fn worker(args: WorkerArgs) void {
    // Ensure initialization happens (blocks until complete if another thread is initializing)
    callOnce();
    // Read the initialized value with acquire semantics
    const value = @atomicLoad(i32, &config_value, .acquire);
    // Store the observed value in the thread's result slot
    args.results[args.index] = value;
}

pub fn main() !void {
    // Reset global state for demonstration
    once_state = .idle;
    config_value = 0;
    init_calls = 0;

    // Set up memory allocation
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const worker_count: usize = 4;

    // Allocate array to collect results from each thread
    const results = try allocator.alloc(i32, worker_count);
    defer allocator.free(results);
    // Initialize all result slots to -1 to detect if any thread fails
    for (results) |*slot| slot.* = -1;

    // Allocate array to hold thread handles
    const threads = try allocator.alloc(std.Thread, worker_count);
    defer allocator.free(threads);

    // Spawn all worker threads
    for (threads, 0..) |*thread, index| {
        thread.* = try std.Thread.spawn(.{}, worker, .{WorkerArgs{
            .results = results,
            .index = index,
        }});
    }

    // Wait for all threads to complete
    for (threads) |thread| {
        thread.join();
    }

    // Read final values after all threads complete
    const final_value = @atomicLoad(i32, &config_value, .acquire);
    const called = @atomicLoad(u32, &init_calls, .seq_cst);

    // Set up buffered output
    var stdout_buffer: [256]u8 = undefined;
    var stdout_state = std.fs.File.stdout().writer(&stdout_buffer);
    const out = &stdout_state.interface;

    // Print the value observed by each thread (should all be 9157)
    for (results, 0..) |value, index| {
        try out.print("thread {d} observed {d}\n", .{ index, value });
    }
    // Verify initialization was called exactly once
    try out.print("init calls: {d}\n", .{called});
    // Display the final configuration value
    try out.print("config value: {d}\n", .{final_value});
    try out.flush();
}
Run
Shell
$ zig run 02_atomic_once.zig
Output
Shell
thread 0 observed 9157
thread 1 observed 9157
thread 2 observed 9157
thread 3 observed 9157
init calls: 1
config value: 9157

@cmpxchgStrong returns null on success, so looping while it yields a value is a concise way to retry the CAS without allocating a mutex. Pair the final @atomicStore with .release to publish the results before any waiter performs its .acquire load.
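The weak variant follows the same null-on-success convention and may fail spuriously, which is harmless inside a retry loop. A minimal counter sketch under that assumption:

Zig

const std = @import("std");

var counter: u64 = 0;

fn increment() void {
    var expected = @atomicLoad(u64, &counter, .monotonic);
    // Each failed CAS returns the value actually observed; feed it back in
    // as the next expected value until our increment wins.
    while (@cmpxchgWeak(u64, &counter, expected, expected + 1, .monotonic, .monotonic)) |actual| {
        expected = actual;
    }
}

pub fn main() !void {
    var threads: [4]std.Thread = undefined;
    for (&threads) |*t| t.* = try std.Thread.spawn(.{}, increment, .{});
    for (threads) |t| t.join();
    std.debug.print("counter: {d}\n", .{@atomicLoad(u64, &counter, .monotonic)});
}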

Single-threaded builds & fallbacks

Passing -fsingle-threaded forces the compiler to reject any attempt to spawn OS threads. Code that might run in both configurations should branch on builtin.single_threaded at compile time and substitute an inline execution path. See builtin.zig.

Understanding the Single-Threaded Flag

The single_threaded flag is part of the compiler’s feature configuration system, affecting code generation and optimization:

graph TB
    subgraph "Code Generation Features"
        Features["Feature Flags"]
        Features --> UnwindTables["unwind_tables: bool"]
        Features --> StackProtector["stack_protector: bool"]
        Features --> StackCheck["stack_check: bool"]
        Features --> RedZone["red_zone: ?bool"]
        Features --> OmitFramePointer["omit_frame_pointer: bool"]
        Features --> Valgrind["valgrind: bool"]
        Features --> SingleThreaded["single_threaded: bool"]
        UnwindTables --> EHFrame["Generate .eh_frame<br/>for exception handling"]
        StackProtector --> CanaryCheck["Stack canary checks<br/>buffer overflow detection"]
        StackCheck --> ProbeStack["Stack probing<br/>prevents overflow"]
        RedZone --> RedZoneSpace["Red zone optimization<br/>(x86_64, AArch64)"]
        OmitFramePointer --> NoFP["Omit frame pointer<br/>for performance"]
        Valgrind --> ValgrindSupport["Valgrind client requests<br/>for memory debugging"]
        SingleThreaded --> NoThreading["Assume single-threaded<br/>enable optimizations"]
    end

When single_threaded is true, the compiler assumes no concurrent access to memory, enabling several optimizations: atomic operations can be lowered to plain loads and stores (eliminating fence instructions), thread-local storage becomes regular globals, and synchronization primitives can be elided entirely. This flag is set via -fsingle-threaded at build time and flows through Compilation.Config into code generation. Importantly, this is not just an API restriction; it fundamentally changes the generated code. Atomics compiled in single-threaded mode have weaker guarantees than atomics in multi-threaded builds, so you must ensure your code paths remain consistent across both modes to avoid subtle bugs when toggling the flag.

Gating thread usage at compile time

The guard below resets an atomic state machine, then either spawns a worker or executes the task inline based on the build mode. Because the branch is compile-time, the single-threaded configuration never instantiates Thread.spawn, avoiding a compile error altogether.

Zig
const std = @import("std");
const builtin = @import("builtin");

// Enum representing the possible states of task execution
// Uses explicit u8 backing to ensure consistent size across platforms
const TaskState = enum(u8) { idle, threaded_done, inline_done };

// Global atomic state tracking whether task ran inline or in a separate thread
// Atomics ensure thread-safe access even though single-threaded builds won't spawn threads
var task_state = std.atomic.Value(TaskState).init(.idle);

// Simulates a task that runs in a separate thread
// Includes a small delay to demonstrate asynchronous execution
fn threadedTask() void {
    std.Thread.sleep(1 * std.time.ns_per_ms);
    // Release ordering ensures all prior writes are visible to threads that acquire this value
    task_state.store(.threaded_done, .release);
}

// Simulates a task that runs inline in the main thread
// Used as fallback when threading is disabled at compile time
fn inlineTask() void {
    // Release ordering maintains consistency with the threaded path
    task_state.store(.inline_done, .release);
}

pub fn main() !void {
    // Set up buffered stdout writer for efficient output
    var stdout_buffer: [256]u8 = undefined;
    var stdout_state = std.fs.File.stdout().writer(&stdout_buffer);
    const out = &stdout_state.interface;

    // Reset state to idle with sequential consistency
    // seq_cst provides strongest ordering guarantees for initialization
    task_state.store(.idle, .seq_cst);

    // Check compile-time flag to determine execution strategy
    // builtin.single_threaded is true when compiled with -fsingle-threaded
    if (builtin.single_threaded) {
        try out.print("single-threaded build; running task inline\n", .{});
        // Execute task directly without spawning a thread
        inlineTask();
    } else {
        try out.print("multi-threaded build; spawning worker\n", .{});
        // Spawn separate thread to execute task concurrently
        var worker = try std.Thread.spawn(.{}, threadedTask, .{});
        // Block until worker thread completes
        worker.join();
    }

    // Acquire ordering ensures we observe all writes made before the release store
    const final_state = task_state.load(.acquire);
    
    // Convert enum state to human-readable string for output
    const label = switch (final_state) {
        .idle => "idle",
        .threaded_done => "threaded_done",
        .inline_done => "inline_done",
    };

    // Display final execution state and flush buffer to ensure output is visible
    try out.print("task state: {s}\n", .{label});
    try out.flush();
}
Run
Shell
$ zig run 03_single_thread_guard.zig
Output
Shell
multi-threaded build; spawning worker
task state: threaded_done

When you build with -fsingle-threaded, the inline branch is the only one compiled, so keep the logic symmetrical and make sure any shared state is still set via the same atomic helpers to avoid diverging semantics.
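To see the fallback path, pass the flag directly to zig run (same file name as above); the program should then take the inline branch and report task state: inline_done.

Shell

$ zig run -fsingle-threaded 03_single_thread_guard.zig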

Notes & Caveats

  • Threads must be joined or detached exactly once; leaking handles leads to resource exhaustion. Thread.join consumes the handle, so store it in a slice you can iterate later.
  • Atomics operate on raw memory—never mix atomic and non-atomic accesses to the same location, even if you 'know' the race cannot happen. Wrap shared scalars in std.atomic.Value to keep your intent obvious.
  • Compare-and-swap loops may live-spin; consider Thread.yield() or event primitives like Thread.ResetEvent when a wait might last longer than a few cycles (see the sketch below).
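A minimal ResetEvent sketch, assuming a single producer that publishes one value before signaling (payload and the 42 literal are illustrative):

Zig

const std = @import("std");

var ready = std.Thread.ResetEvent{};
var payload: u32 = 0;

fn producer() void {
    payload = 42; // publish data before signaling
    ready.set(); // wake any waiters
}

pub fn main() !void {
    var worker = try std.Thread.spawn(.{}, producer, .{});
    // Blocks (sleeps) instead of spinning until the producer signals.
    ready.wait();
    std.debug.print("payload: {d}\n", .{payload});
    worker.join();
}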

Debugging Concurrent Code with ThreadSanitizer

Zig provides built-in race detection through ThreadSanitizer, a powerful tool for finding data races, deadlocks, and other concurrency bugs:

Sanitizer          Config Field          Purpose                   Requirements
Thread Sanitizer   any_sanitize_thread   Data race detection       LLVM backend
UBSan              any_sanitize_c        C undefined behavior      LLVM backend, C code
Fuzzing            any_fuzz              Fuzzing instrumentation   libFuzzer integration
graph TB
    subgraph "Sanitizer Configuration"
        Sanitizers["Sanitizer Flags"]
        Sanitizers --> TSan["any_sanitize_thread"]
        Sanitizers --> UBSan["any_sanitize_c"]
        Sanitizers --> Fuzz["any_fuzz"]
        TSan --> TSanLib["tsan_lib: ?CrtFile"]
        TSan --> TSanRuntime["ThreadSanitizer runtime<br/>linked into binary"]
        UBSan --> UBSanLib["ubsan_rt_lib: ?CrtFile<br/>ubsan_rt_obj: ?CrtFile"]
        UBSan --> UBSanRuntime["UBSan runtime<br/>C undefined behavior checks"]
        Fuzz --> FuzzLib["fuzzer_lib: ?CrtFile"]
        Fuzz --> FuzzRuntime["libFuzzer integration<br/>for fuzz testing"]
    end

Enable ThreadSanitizer with -fsanitize-thread when building your program. TSan instruments all memory accesses and synchronization operations, tracking happens-before relationships to detect races. When a race is detected, TSan prints detailed reports showing the conflicting accesses and their stack traces. The instrumentation adds significant runtime overhead (2-5x slowdown, 5-10x memory usage), so use it during development and testing, not in production. TSan is particularly valuable for validating atomic code: even if your logic appears correct, TSan can catch subtle ordering issues or missing synchronization. For the examples in this chapter, try running them with -fsanitize-thread to verify they’re race-free; the parallel sum and atomic once patterns should pass cleanly, demonstrating proper synchronization.
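For example, the parallel sum from earlier in this chapter can be checked like so (a clean run produces no TSan report):

Shell

$ zig run -fsanitize-thread 01_parallel_even_sum.zig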

Exercises

  • Extend the parallel sum to accept a predicate callback so you can swap "even numbers" for any classification you like; measure the effect of .acquire vs .monotonic loads on contention.
  • Rework the callOnce demo to stage errors: have the initializer return !void and store the failure in an atomic slot so callers can rethrow the same error consistently.
  • Introduce a std.Thread.WaitGroup around the once-guard code so you can wait for arbitrary numbers of worker threads without storing handles manually.

Caveats, alternatives, edge cases

  • On platforms without pthreads or Win32 threads Zig emits a compile error; plan to fall back to event loops or synchronous designs when targeting WASI or other environments without thread support.
  • Atomics operate on plain integers and enums; for composite state consider using a mutex or designing an array of atomics to avoid torn updates.
  • Single-threaded builds can still use atomics, but the instructions compile to ordinary loads/stores. Keep the code paths consistent so you do not rely accidentally on the stronger ordering in multi-threaded builds.

Platform-Specific Threading Constraints

Not all platforms support threading, and some have special requirements for thread-local storage:

graph TB subgraph "Threading Configuration" TARG["Target Platform"] TARG --> SINGLETHREAD["defaultSingleThreaded()<br/>WASM, Haiku"] TARG --> EMULATETLS["useEmulatedTls()<br/>OpenBSD, old Android"] SINGLETHREAD --> NOTHREAD["No thread support"] EMULATETLS --> TLSEMU["Software TLS"] end

Certain targets default to single-threaded mode because they lack OS thread support: WebAssembly (without the --threading flag) and Haiku OS both fall into this category. On these platforms, attempting to spawn threads results in a compile error unless you’ve explicitly enabled threading support in your build configuration. A related concern is thread-local storage (TLS): OpenBSD and older Android versions don’t provide native TLS, so Zig uses emulated TLS—a software implementation that’s slower but portable. When writing cross-platform concurrent code, check target.defaultSingleThreaded() and target.useEmulatedTls() to understand platform constraints. For WASM, you can enable threading with the atomics and bulk-memory features plus the --import-memory --shared-memory linker flags, but not all WASM runtimes support this. Design your code to gracefully degrade: use builtin.single_threaded to provide synchronous fallbacks, and avoid assuming TLS is zero-cost on all platforms.

Summary

  • std.Thread offers lightweight spawn/join semantics, but you remain responsible for scheduling and cleanup.
  • Atomic intrinsics such as @atomicLoad, @atomicStore, and @cmpxchgStrong make small lock-free state machines practical when you match the orderings to your invariant.
  • Using builtin.single_threaded keeps shared components working across single-threaded builds and multi-core deployments without forking the codebase.
