Overview
Filesystem plumbing from the previous chapter set the stage for applications that produce and consume data in parallel. Now we focus on how Zig launches OS threads, coordinates work across cores, and keeps shared state consistent with atomic operations (see 28 and Thread.zig).
Zig 0.15.2’s thread primitives combine lightweight spawn APIs with explicit memory ordering, so you decide when a store becomes visible and when contention should block. Understanding these tools now will make the upcoming parallel wordcount project far less mysterious (see atomic.zig and 30).
Learning Goals
- Spawn and join worker threads responsibly, selecting stack sizes and allocators only when necessary.
- Choose memory orderings for atomic loads, stores, and compare-and-swap loops when protecting shared state.
- Detect single-threaded builds at compile time and fall back to synchronous execution paths.
Orchestrating Work with std.Thread
Zig models kernel threads through std.Thread, exposing helpers to query CPU counts, configure stack sizes, and join handles deterministically. Unlike async I/O, these are real kernel threads—every spawn consumes OS resources, so batching units of work matters.
Thread Pool Pattern
Before diving into manual thread spawning, it’s valuable to understand the thread pool pattern that Zig’s own compiler uses for parallel work: std.Thread.Pool keeps a queue of work items and distributes them across a fixed set of workers.
A thread pool maintains a fixed number of worker threads that pull work items from a queue, avoiding the overhead of repeatedly spawning and joining threads. The Zig compiler uses this pattern extensively: std.Thread.Pool dispatches AST generation, semantic analysis, and code generation tasks to workers. Each worker has per-thread state (Zcu.PerThread) to minimize synchronization—only the final results need mutex protection when merging into shared data structures like InternPool.shards. This architecture demonstrates key concurrent design principles: work units should be independent, shared state should be sharded or protected by mutexes, and per-thread caches reduce contention. When your workload involves many small tasks, prefer std.Thread.Pool over manual spawning; when you need a few long-running workers with specific responsibilities, manual spawn/join is appropriate.
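When the pool model fits, the shape is small. The sketch below assumes the std.Thread.Pool and std.Thread.WaitGroup APIs as they exist in Zig 0.15; the squaring task and the array size are purely illustrative.
// Minimal std.Thread.Pool sketch: a fixed set of workers drains a queue of
// small tasks, and a WaitGroup signals when every task has finished.
const std = @import("std");
fn task(wg: *std.Thread.WaitGroup, results: []u64, index: usize) void {
    defer wg.finish();
    results[index] = index * index; // stand-in for real work
}
pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();
    var pool: std.Thread.Pool = undefined;
    try pool.init(.{ .allocator = allocator });
    defer pool.deinit();
    var results: [8]u64 = undefined;
    var wg: std.Thread.WaitGroup = .{};
    for (0..results.len) |index| {
        wg.start();
        try pool.spawn(task, .{ &wg, &results, index });
    }
    wg.wait(); // all tasks done; results is now safe to read
    var total: u64 = 0;
    for (results) |r| total += r;
    std.debug.print("sum of squares: {d}\n", .{total});
}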
Chunking data with spawn/join
The example below partitions an array of integers across a dynamic number of workers, using an atomic fetch-add to accumulate the total of even numbers without locks. It adapts to the host CPU count but never spawns more threads than there are elements to process.
// This example demonstrates parallel computation using threads and atomic operations in Zig.
// It calculates the sum of even numbers in an array by distributing work across multiple threads.
const std = @import("std");
// Arguments passed to each worker thread for parallel processing
const WorkerArgs = struct {
slice: []const u64, // The subset of numbers this worker should process
sum: *std.atomic.Value(u64), // Shared atomic counter for thread-safe accumulation
};
// Worker function that accumulates even numbers from its assigned slice
// Each thread runs this function independently on its own data partition
fn accumulate(args: WorkerArgs) void {
// Use a local variable to minimize atomic operations (performance optimization)
var local_total: u64 = 0;
for (args.slice) |value| {
if (value % 2 == 0) {
local_total += value;
}
}
// Atomically add the local result to the shared sum using sequentially consistent ordering
// This ensures all threads see a consistent view of the shared state
_ = args.sum.fetchAdd(local_total, .seq_cst);
}
pub fn main() !void {
// Set up memory allocator with automatic leak detection
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
// Allocate array of 64 numbers for demonstration
var numbers = try allocator.alloc(u64, 64);
defer allocator.free(numbers);
// Initialize array with values following the pattern: index * 7 + 3
for (numbers, 0..) |*slot, index| {
slot.* = @as(u64, @intCast(index * 7 + 3));
}
// Initialize shared atomic counter that all threads will safely update
var shared_sum = std.atomic.Value(u64).init(0);
// Determine optimal number of worker threads based on available CPU cores
const cpu_count = std.Thread.getCpuCount() catch 1;
const desired = if (cpu_count == 0) 1 else cpu_count;
// Don't create more threads than we have numbers to process
const worker_limit = @min(numbers.len, desired);
// Allocate thread handles for parallel workers
var threads = try allocator.alloc(std.Thread, worker_limit);
defer allocator.free(threads);
// Calculate chunk size, rounding up to ensure all elements are covered
const chunk = (numbers.len + worker_limit - 1) / worker_limit;
// Spawn worker threads, distributing the array into roughly equal chunks
var start: usize = 0;
var spawned: usize = 0;
while (start < numbers.len and spawned < worker_limit) : (spawned += 1) {
const remaining = numbers.len - start;
// Give the last thread all remaining elements to handle uneven divisions
const take = if (worker_limit - spawned == 1) remaining else @min(chunk, remaining);
const end = start + take;
// Spawn thread with its assigned slice and shared accumulator
threads[spawned] = try std.Thread.spawn(.{}, accumulate, .{WorkerArgs{
.slice = numbers[start..end],
.sum = &shared_sum,
}});
start = end;
}
// Track how many threads were actually spawned (may be less than worker_limit)
const used_threads = spawned;
// Wait for all worker threads to complete their work
for (threads[0..used_threads]) |thread| {
thread.join();
}
// Read the final accumulated result from the atomic shared sum
const even_sum = shared_sum.load(.seq_cst);
// Perform sequential calculation to verify correctness of parallel computation
var sequential: u64 = 0;
for (numbers) |value| {
if (value % 2 == 0) {
sequential += value;
}
}
// Set up buffered stdout writer for efficient output
var stdout_buffer: [256]u8 = undefined;
var stdout_state = std.fs.File.stdout().writer(&stdout_buffer);
const out = &stdout_state.interface;
// Display results: thread count and both parallel and sequential sums
try out.print("spawned {d} worker(s)\n", .{used_threads});
try out.print("even sum (threads): {d}\n", .{even_sum});
try out.print("even sum (sequential check): {d}\n", .{sequential});
try out.flush();
}
$ zig run 01_parallel_even_sum.zig
spawned 8 worker(s)
even sum (threads): 7264
even sum (sequential check): 7264
std.atomic.Value wraps plain integers and routes every access through @atomicLoad, @atomicStore, or @atomicRmw, shielding you from accidentally mixing atomic and non-atomic access to the same memory location.
Spawn configuration and scheduling hints
std.Thread.SpawnConfig lets you override stack sizes or supply a custom allocator if the defaults are unsuitable (for example, deep recursion or pre-allocated arenas). Catch Thread.getCpuCount() errors to provide a safe fallback, and remember to use Thread.yield() or Thread.sleep() when you need cooperative scheduling while waiting on other threads to progress.
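As a sketch (the 4 MiB figure is arbitrary, not a recommendation), overriding the default stack for a recursion-heavy worker looks like this:
// Spawning with an explicit SpawnConfig; only override the defaults when
// profiling or deep recursion shows you need to.
const std = @import("std");
fn deepWorker(depth: usize) void {
    // Placeholder for genuinely deep recursion.
    if (depth == 0) return;
    deepWorker(depth - 1);
}
pub fn main() !void {
    const config: std.Thread.SpawnConfig = .{
        .stack_size = 4 * 1024 * 1024, // 4 MiB instead of the default
    };
    const worker = try std.Thread.spawn(config, deepWorker, .{10_000});
    worker.join();
}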
Atomic state machines
Zig exposes LLVM’s atomic intrinsics directly: you pick an order such as .acquire, .release, or .seq_cst, and the compiler emits the matching fences. That clarity is valuable when you design small state machines—like a one-time initializer—that multiple threads must observe consistently.
Implementing a once guard with atomic builtins
This program builds a lock-free "call once" helper around @cmpxchgStrong. Threads spin only while another thread is running the initializer, then read the published value via an acquire load.
// This example demonstrates thread-safe one-time initialization using atomic operations.
// Multiple threads attempt to initialize a shared resource, but only one succeeds in
// performing the expensive initialization exactly once.
const std = @import("std");
// Represents the initialization state using atomic operations
const State = enum(u8) { idle, busy, ready };
// Global state tracking the initialization lifecycle
var once_state: State = .idle;
// The shared configuration value that will be initialized once
var config_value: i32 = 0;
// Counter to verify that initialization only happens once
var init_calls: u32 = 0;
// Simulates an expensive initialization operation that should only run once.
// Uses atomic operations to safely increment the call counter and set the config value.
fn expensiveInit() void {
// Simulate expensive work with a sleep
std.Thread.sleep(2 * std.time.ns_per_ms);
// Atomically increment the initialization call counter
_ = @atomicRmw(u32, &init_calls, .Add, 1, .seq_cst);
// Atomically store the initialized value with release semantics
@atomicStore(i32, &config_value, 9157, .release);
}
// Ensures expensiveInit() is called exactly once across multiple threads.
// Uses a state machine with compare-and-swap to coordinate thread access.
fn callOnce() void {
while (true) {
// Check the current state with acquire semantics to see initialization results
switch (@atomicLoad(State, &once_state, .acquire)) {
// Initialization complete, return immediately
.ready => return,
// Another thread is initializing, yield and retry
.busy => {
std.Thread.yield() catch {};
continue;
},
// Not yet initialized, attempt to claim initialization responsibility
.idle => {
// Try to atomically transition from idle to busy
// If successful (returns null), this thread wins and will initialize
// If it fails (returns the actual value), another thread won, so retry
if (@cmpxchgStrong(State, &once_state, .idle, .busy, .acq_rel, .acquire)) |_| {
continue;
}
// This thread successfully claimed the initialization
break;
},
}
}
// Perform the one-time initialization
expensiveInit();
// Mark initialization as complete with release semantics
@atomicStore(State, &once_state, .ready, .release);
}
// Arguments passed to each worker thread
const WorkerArgs = struct {
results: []i32,
index: usize,
};
// Worker thread function that calls the once-initialization and reads the result.
fn worker(args: WorkerArgs) void {
// Ensure initialization happens (blocks until complete if another thread is initializing)
callOnce();
// Read the initialized value with acquire semantics
const value = @atomicLoad(i32, &config_value, .acquire);
// Store the observed value in the thread's result slot
args.results[args.index] = value;
}
pub fn main() !void {
// Reset global state for demonstration
once_state = .idle;
config_value = 0;
init_calls = 0;
// Set up memory allocation
var gpa = std.heap.GeneralPurposeAllocator(.{}){};
defer _ = gpa.deinit();
const allocator = gpa.allocator();
const worker_count: usize = 4;
// Allocate array to collect results from each thread
const results = try allocator.alloc(i32, worker_count);
defer allocator.free(results);
// Initialize all result slots to -1 to detect if any thread fails
for (results) |*slot| slot.* = -1;
// Allocate array to hold thread handles
const threads = try allocator.alloc(std.Thread, worker_count);
defer allocator.free(threads);
// Spawn all worker threads
for (threads, 0..) |*thread, index| {
thread.* = try std.Thread.spawn(.{}, worker, .{WorkerArgs{
.results = results,
.index = index,
}});
}
// Wait for all threads to complete
for (threads) |thread| {
thread.join();
}
// Read final values after all threads complete
const final_value = @atomicLoad(i32, &config_value, .acquire);
const called = @atomicLoad(u32, &init_calls, .seq_cst);
// Set up buffered output
var stdout_buffer: [256]u8 = undefined;
var stdout_state = std.fs.File.stdout().writer(&stdout_buffer);
const out = &stdout_state.interface;
// Print the value observed by each thread (should all be 9157)
for (results, 0..) |value, index| {
try out.print("thread {d} observed {d}\n", .{ index, value });
}
// Verify initialization was called exactly once
try out.print("init calls: {d}\n", .{called});
// Display the final configuration value
try out.print("config value: {d}\n", .{final_value});
try out.flush();
}
$ zig run 02_atomic_once.zig
thread 0 observed 9157
thread 1 observed 9157
thread 2 observed 9157
thread 3 observed 9157
init calls: 1
config value: 9157
@cmpxchgStrong returns null on success, so looping while it yields a value is a concise way to retry the CAS without allocating a mutex. Pair the final @atomicStore with .release to publish the results before any waiter performs its .acquire load.
Single-threaded builds & fallbacks
Building with -fsingle-threaded (or a build option such as -Dsingle-threaded=true that forwards it) makes the compiler reject any attempt to spawn OS threads. Code that might run in both configurations should branch on builtin.single_threaded at compile time and substitute an inline execution path. See builtin.zig.
Understanding the Single-Threaded Flag
The single_threaded flag is part of the compiler’s feature configuration system, affecting code generation and optimization:
When single_threaded is true, the compiler assumes no concurrent access to memory, enabling several optimizations: atomic operations can be lowered to plain loads and stores (eliminating fence instructions), thread-local storage becomes regular globals, and synchronization primitives can be elided entirely. The flag is set with -fsingle-threaded at build time (or a -Dsingle-threaded=true build option) and flows through Compilation.Config into code generation. Importantly, this is not just an API restriction—it fundamentally changes the generated code. Atomics compiled in single-threaded mode have weaker guarantees than atomics in multi-threaded builds, so you must ensure your code paths remain consistent across both modes to avoid subtle bugs when toggling the flag.
Gating thread usage at compile time
The guard below resets an atomic state machine, then either spawns a worker or executes the task inline based on the build mode. Because the branch is compile-time, the single-threaded configuration never instantiates Thread.spawn, avoiding a compile error altogether.
const std = @import("std");
const builtin = @import("builtin");
// Enum representing the possible states of task execution
// Uses explicit u8 backing to ensure consistent size across platforms
const TaskState = enum(u8) { idle, threaded_done, inline_done };
// Global atomic state tracking whether task ran inline or in a separate thread
// Atomics ensure thread-safe access even though single-threaded builds won't spawn threads
var task_state = std.atomic.Value(TaskState).init(.idle);
// Simulates a task that runs in a separate thread
// Includes a small delay to demonstrate asynchronous execution
fn threadedTask() void {
std.Thread.sleep(1 * std.time.ns_per_ms);
// Release ordering ensures all prior writes are visible to threads that acquire this value
task_state.store(.threaded_done, .release);
}
// Simulates a task that runs inline in the main thread
// Used as fallback when threading is disabled at compile time
fn inlineTask() void {
// Release ordering maintains consistency with the threaded path
task_state.store(.inline_done, .release);
}
pub fn main() !void {
// Set up buffered stdout writer for efficient output
var stdout_buffer: [256]u8 = undefined;
var stdout_state = std.fs.File.stdout().writer(&stdout_buffer);
const out = &stdout_state.interface;
// Reset state to idle with sequential consistency
// seq_cst provides strongest ordering guarantees for initialization
task_state.store(.idle, .seq_cst);
// Check compile-time flag to determine execution strategy
// builtin.single_threaded is true when compiled with -fsingle-threaded
if (builtin.single_threaded) {
try out.print("single-threaded build; running task inline\n", .{});
// Execute task directly without spawning a thread
inlineTask();
} else {
try out.print("multi-threaded build; spawning worker\n", .{});
// Spawn separate thread to execute task concurrently
var worker = try std.Thread.spawn(.{}, threadedTask, .{});
// Block until worker thread completes
worker.join();
}
// Acquire ordering ensures we observe all writes made before the release store
const final_state = task_state.load(.acquire);
// Convert enum state to human-readable string for output
const label = switch (final_state) {
.idle => "idle",
.threaded_done => "threaded_done",
.inline_done => "inline_done",
};
// Display final execution state and flush buffer to ensure output is visible
try out.print("task state: {s}\n", .{label});
try out.flush();
}
$ zig run 03_single_thread_guard.zig
multi-threaded build; spawning worker
task state: threaded_done
When you build with -fsingle-threaded, the inline branch is the only one compiled, so keep the logic symmetrical and make sure any shared state is still set via the same atomic helpers to avoid diverging semantics.
Notes & Caveats
- Threads must be joined or detached exactly once; leaking handles leads to resource exhaustion. Thread.join consumes the handle, so store it in a slice you can iterate later.
- Atomics operate on raw memory—never mix atomic and non-atomic accesses to the same location, even if you 'know' the race cannot happen. Wrap shared scalars in std.atomic.Value to keep your intent obvious.
- Compare-and-swap loops may live-spin; consider Thread.yield() or event primitives like Thread.ResetEvent when a wait might last longer than a few cycles (see the sketch after this list).
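For example, a waiter can block on std.Thread.ResetEvent instead of spinning. This is a minimal sketch; the payload value is illustrative.
// One producer publishes a value and signals a ResetEvent; the consumer
// blocks in wait() rather than burning cycles in a CAS/yield loop.
const std = @import("std");
var ready: std.Thread.ResetEvent = .{};
var payload: u64 = 0;
fn producer() void {
    payload = 42; // written before set(), so visible after wait() returns
    ready.set();
}
pub fn main() !void {
    const t = try std.Thread.spawn(.{}, producer, .{});
    defer t.join();
    ready.wait(); // sleeps until set() is called
    std.debug.print("payload = {d}\n", .{payload});
}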
Debugging Concurrent Code with ThreadSanitizer
Zig provides built-in race detection through ThreadSanitizer, a powerful tool for finding data races, deadlocks, and other concurrency bugs:
| Sanitizer | Config Field | Purpose | Requirements |
|---|---|---|---|
| Thread Sanitizer | any_sanitize_thread | Data race detection | LLVM backend |
| UBSan | any_sanitize_c | C undefined behavior | LLVM backend, C code |
| Fuzzing | any_fuzz | Fuzzing instrumentation | libfuzzer integration |
Enable ThreadSanitizer with -fsanitize-thread when compiling your program (build scripts commonly forward this as an option such as -Dsanitize-thread). TSan instruments all memory accesses and synchronization operations, tracking happens-before relationships to detect races. When a race is detected, TSan prints detailed reports showing the conflicting accesses and their stack traces. The instrumentation adds significant runtime overhead (2-5x slowdown, 5-10x memory usage), so use it during development and testing, not in production. TSan is particularly valuable for validating atomic code: even if your logic appears correct, TSan can catch subtle ordering issues or missing synchronization. For the examples in this chapter, try running them with -fsanitize-thread to verify they’re race-free—the parallel sum and atomic once patterns should pass cleanly, demonstrating proper synchronization.
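To see a report in action, the deliberately racy counter below should trip the detector when built with -fsanitize-thread; the iteration count is arbitrary, and swapping the plain u64 for std.atomic.Value(u64) should silence the report.
// Intentional data race: two threads increment a plain variable with no
// synchronization. Built with ThreadSanitizer enabled, this should produce
// a data-race report pointing at the += line.
const std = @import("std");
var counter: u64 = 0; // shared, unsynchronized
fn bump() void {
    for (0..100_000) |_| counter += 1; // racy read-modify-write
}
pub fn main() !void {
    const a = try std.Thread.spawn(.{}, bump, .{});
    const b = try std.Thread.spawn(.{}, bump, .{});
    a.join();
    b.join();
    // Typically less than 200000 because increments are lost.
    std.debug.print("counter = {d}\n", .{counter});
}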
Exercises
- Extend the parallel sum to accept a predicate callback so you can swap "even numbers" for any classification you like; measure the effect of .acquire vs .monotonic loads on contention.
- Rework the callOnce demo to stage errors: have the initializer return !void and store the failure in an atomic slot so callers can rethrow the same error consistently.
- Introduce a std.Thread.WaitGroup around the once-guard code so you can wait for arbitrary numbers of worker threads without storing handles manually.
Caveats, alternatives, edge cases
- On platforms without pthreads or Win32 threads Zig emits a compile error; plan to fall back to event loops or async when targeting WASI without --threading support.
- Atomics operate on plain integers and enums; for composite state consider using a mutex or designing an array of atomics to avoid torn updates (a sketch follows this list).
- Single-threaded builds can still use atomics, but the instructions compile to ordinary loads/stores. Keep the code paths consistent so you do not accidentally rely on the stronger ordering in multi-threaded builds.
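Here is a minimal sketch of the mutex route for composite state; the Stats type and its fields are illustrative, not a library API.
// Two fields that must change together are guarded by one Mutex; atomics
// alone could tear the pair across threads.
const std = @import("std");
const Stats = struct {
    mutex: std.Thread.Mutex = .{},
    count: u64 = 0,
    total: u64 = 0,
    fn record(self: *Stats, value: u64) void {
        self.mutex.lock();
        defer self.mutex.unlock();
        self.count += 1; // both updates happen under the same lock,
        self.total += value; // so readers never see a half-applied pair
    }
};
pub fn main() !void {
    var stats = Stats{};
    const t1 = try std.Thread.spawn(.{}, Stats.record, .{ &stats, 10 });
    const t2 = try std.Thread.spawn(.{}, Stats.record, .{ &stats, 32 });
    t1.join();
    t2.join();
    std.debug.print("count={d} total={d}\n", .{ stats.count, stats.total });
}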
Platform-Specific Threading Constraints
Not all platforms support threading, and some have special requirements for thread-local storage:
Certain targets default to single-threaded mode because they lack OS thread support: WebAssembly (without the --threading flag) and Haiku OS both fall into this category. On these platforms, attempting to spawn threads results in a compile error unless you’ve explicitly enabled threading support in your build configuration. A related concern is thread-local storage (TLS): OpenBSD and older Android versions don’t provide native TLS, so Zig uses emulated TLS—a software implementation that’s slower but portable. When writing cross-platform concurrent code, check target.defaultSingleThreaded() and target.useEmulatedTls() to understand platform constraints. For WASM, you can enable threading with the atomics and bulk-memory features plus the --import-memory --shared-memory linker flags, but not all WASM runtimes support this. Design your code to gracefully degrade: use builtin.single_threaded to provide synchronous fallbacks, and avoid assuming TLS is zero-cost on all platforms.
Summary
- std.Thread offers lightweight spawn/join semantics, but you remain responsible for scheduling and cleanup.
- Atomic intrinsics such as @atomicLoad, @atomicStore, and @cmpxchgStrong make small lock-free state machines practical when you match the orderings to your invariant.
- Using builtin.single_threaded keeps shared components working across single-threaded builds and multi-core deployments without forking the codebase.