Overview
Our CLI survey (Chapter 38) set the stage for disciplined experimentation. Now we focus on how Zig translates those command-line toggles into machine-level behavior. Semantic inlining, call modifiers, and explicit SIMD all give you levers to shape hot paths—provided you measure carefully and respect the compiler’s defaults.
The next chapter (Chapter 40) formalizes that measurement loop by layering profiling and hardening workflows on top.
Learning Goals
- Force or forbid inlining when compile-time semantics must win over heuristics.
- Sample hot loops with @call and std.time.Timer to compare build modes.
- Use @Vector math as a bridge to portable SIMD before reaching for target-specific intrinsics.
Semantic Inlining vs Optimizer Heuristics
Zig’s inline keyword changes evaluation rules rather than hinting at the optimizer: compile-time known arguments become compile-time constants, allowing you to generate types or precompute values that ordinary calls would defer to runtime.
Inline functions restrict the compiler’s freedom, so reach for them only when semantics matter—propagating comptime data, improving debugging, or satisfying real benchmarks.
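As a minimal sketch of that guarantee (the helper below is illustrative, not part of this chapter’s examples): even though the parameter is not declared comptime, an inline call with a comptime-known argument produces a comptime-known result.

inline fn half(n: usize) usize {
    return n / 2;
}

// Because the call is inlined and 64 is comptime-known, half(64) is a
// compile-time constant and can legally size an array.
var buf: [half(64)]u8 = undefined;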
Understanding Optimization Modes
Before exploring inlining behavior, it’s important to understand the optimization modes that affect how the compiler treats your code.
Zig provides four distinct optimization modes, each making different tradeoffs between safety, speed, and binary size:
- Debug disables optimizations and keeps full runtime safety checks, making it ideal for development and debugging. The compiler preserves stack frames, emits symbol information, and never inlines functions unless semantically required.
- ReleaseSafe enables optimizations while retaining all safety checks (bounds checking, integer overflow detection, etc.), balancing performance with error detection.
- ReleaseFast maximizes speed by disabling runtime safety checks and enabling aggressive optimizations, including heuristic inlining. This is the mode used in the benchmarks throughout this chapter.
- ReleaseSmall prioritizes binary size over speed, often disabling inlining entirely to reduce code duplication.
Additionally, Link Time Optimization (LTO) can be enabled independently via -flto, allowing the linker to perform whole-program optimization across compilation units. When benchmarking inlining behavior, these modes dramatically affect results: inline functions behave identically across modes (semantic guarantee), but heuristic inlining in ReleaseFast may inline functions that Debug or ReleaseSmall would leave as calls. The chapter’s examples use -OReleaseFast to showcase optimizer behavior, but you should test across modes to understand the full performance spectrum.
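For reference, the modes (and LTO) are selected like so on the command line; the file name here is just a placeholder:

$ zig build-exe hot_path.zig -ODebug
$ zig build-exe hot_path.zig -OReleaseSafe
$ zig build-exe hot_path.zig -OReleaseFast -flto
$ zig build-exe hot_path.zig -OReleaseSmall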
Example: compile-time math with inline functions
inline recursion lets us bake small computations into the binary while leaving a fallback runtime path for larger inputs. The @call builtin provides a direct handle to evaluate call sites at compile time when arguments are available.
// This file demonstrates Zig's inline semantics and compile-time execution features.
// It shows how the `inline` keyword and `@call` builtin can control when and how
// functions are evaluated at compile-time versus runtime.
const std = @import("std");
/// Computes the nth Fibonacci number using recursion.
/// The `inline` keyword forces this function to be inlined at all call sites,
/// and the `comptime n` parameter ensures the value can be computed at compile-time.
/// This combination allows the result to be available as a compile-time constant.
inline fn fib(comptime n: usize) usize {
return if (n <= 1) n else fib(n - 1) + fib(n - 2);
}
/// Computes the factorial of n using recursion.
/// Unlike `fib`, this function is not marked `inline`, so the compiler
/// decides whether to inline it based on optimization heuristics.
/// It can be called at either compile-time or runtime.
fn factorial(n: usize) usize {
return if (n <= 1) 1 else n * factorial(n - 1);
}
// Demonstrates that an inline function with comptime parameters
// propagates compile-time execution to its call sites.
// The entire computation happens at compile-time within the comptime block.
test "inline fibonacci propagates comptime" {
comptime {
const value = fib(10);
try std.testing.expectEqual(@as(usize, 55), value);
}
}
// Demonstrates the `@call` builtin with `.compile_time` modifier.
// This forces the function call to be evaluated at compile-time,
// even though `factorial` is not marked `inline` and takes non-comptime parameters.
test "@call compile_time modifier" {
const result = @call(.compile_time, factorial, .{5});
try std.testing.expectEqual(@as(usize, 120), result);
}
// Verifies that a non-inline function can still be called at runtime.
// The input is a runtime value, so the computation happens during execution.
test "runtime factorial still works" {
const input: usize = 6;
const value = factorial(input);
try std.testing.expectEqual(@as(usize, 720), value);
}
$ zig test 01_inline_semantics.zig
All 3 tests passed.
The .compile_time modifier fails if the callee touches runtime-only state. Wrap such experiments in comptime blocks first, then add runtime tests so release builds remain covered.
Directing Calls for Measurement
Zig 0.15.2’s self-hosted backends reward accurate microbenchmarks, and they can deliver dramatic speedups when paired with the new threaded code generation pipeline.
Use @call modifiers to compare inline, default, and never-inline dispatches without refactoring your call sites.
Example: comparing call modifiers under ReleaseFast
This benchmark pins the optimizer (-OReleaseFast) while we toggle call modifiers. Every variant produces the same result, but the timing highlights how never_inline can balloon hot loops when function call overhead dominates.
const std = @import("std");
const builtin = @import("builtin");
// Number of iterations to run each benchmark variant
const iterations: usize = 5_000_000;
/// A simple mixing function that demonstrates the performance impact of inlining.
/// Uses bit rotation and arithmetic operations to create a non-trivial workload
/// that the optimizer might handle differently based on call modifiers.
fn mix(value: u32) u32 {
// Rotate left by 7 bits after XORing with a prime-like constant
const rotated = std.math.rotl(u32, value ^ 0x9e3779b9, 7);
// Apply additional mixing with wrapping arithmetic to prevent compile-time evaluation
return rotated *% 0x85eb_ca6b +% 0xc2b2_ae35;
}
/// Runs the mixing function in a tight loop using the specified call modifier.
/// This allows direct comparison of how different inlining strategies affect performance.
fn run(comptime modifier: std.builtin.CallModifier) u32 {
var acc: u32 = 0;
var i: usize = 0;
while (i < iterations) : (i += 1) {
// The @call builtin lets us explicitly control inlining behavior at the call site
acc = @call(modifier, mix, .{acc});
}
return acc;
}
pub fn main() !void {
// Benchmark 1: Let the compiler decide whether to inline (default heuristics)
var timer = try std.time.Timer.start();
const auto_result = run(.auto);
const auto_ns = timer.read();
// Benchmark 2: Force inlining at every call site
timer = try std.time.Timer.start();
const inline_result = run(.always_inline);
const inline_ns = timer.read();
// Benchmark 3: Prevent inlining, always emit a function call
timer = try std.time.Timer.start();
const never_result = run(.never_inline);
const never_ns = timer.read();
// Verify all three strategies produce identical results
std.debug.assert(auto_result == inline_result);
std.debug.assert(auto_result == never_result);
// Display the optimization mode and iteration count for reproducibility
std.debug.print(
"optimize-mode={s} iterations={}\n",
.{
@tagName(builtin.mode),
iterations,
},
);
// Report timing results for each call modifier
std.debug.print("auto call : {d} ns\n", .{auto_ns});
std.debug.print("always_inline: {d} ns\n", .{inline_ns});
std.debug.print("never_inline : {d} ns\n", .{never_ns});
}
$ zig run 03_call_benchmark.zig -OReleaseFast
optimize-mode=ReleaseFast iterations=5000000
auto call    : 161394 ns
always_inline: 151745 ns
never_inline : 2116797 ns
Portable Vectorization with @Vector
When the compiler cannot infer SIMD usage on its own, @Vector types offer a portable shim that respects safety checks and falls back to scalar execution where the hardware lacks vector support. Paired with @reduce, you can express horizontal reductions without writing target-specific intrinsics.
Example: SIMD-friendly dot product
The scalar and vectorized versions produce identical results. Profiling determines whether the extra vector plumbing pays off on your target.
const std = @import("std");
// Number of parallel operations per vector
const lanes = 4;
// Vector type that processes 4 f32 values simultaneously using SIMD
const Vec = @Vector(lanes, f32);
/// Loads 4 consecutive f32 values from a slice into a SIMD vector.
/// The caller must ensure that start + 3 is within bounds.
fn loadVec(slice: []const f32, start: usize) Vec {
return .{
slice[start + 0],
slice[start + 1],
slice[start + 2],
slice[start + 3],
};
}
/// Computes the dot product of two f32 slices using scalar operations.
/// This is the baseline implementation that processes one element at a time.
fn dotScalar(values_a: []const f32, values_b: []const f32) f32 {
std.debug.assert(values_a.len == values_b.len);
var sum: f32 = 0.0;
// Multiply corresponding elements and accumulate the sum
for (values_a, values_b) |a, b| {
sum += a * b;
}
return sum;
}
/// Computes the dot product using SIMD vectorization for improved performance.
/// Processes 4 elements at a time, then reduces the vector accumulator to a scalar.
/// Requires that the input length is a multiple of the lane count (4).
fn dotVectorized(values_a: []const f32, values_b: []const f32) f32 {
std.debug.assert(values_a.len == values_b.len);
std.debug.assert(values_a.len % lanes == 0);
// Initialize accumulator vector with zeros
var accum: Vec = @splat(0.0);
var index: usize = 0;
// Process 4 elements per iteration using SIMD
while (index < values_a.len) : (index += lanes) {
const lhs = loadVec(values_a, index);
const rhs = loadVec(values_b, index);
// Perform element-wise multiplication and add to accumulator
accum += lhs * rhs;
}
// Sum all lanes of the accumulator vector into a single scalar value
return @reduce(.Add, accum);
}
// Verifies that the vectorized implementation produces the same result as the scalar version.
test "vectorized dot product matches scalar" {
const lhs = [_]f32{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };
const rhs = [_]f32{ 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0 };
const scalar = dotScalar(&lhs, &rhs);
const vector = dotVectorized(&lhs, &rhs);
// Allow small floating-point error tolerance
try std.testing.expectApproxEqAbs(scalar, vector, 0.0001);
}
$ zig test 02_vector_reduction.zig
All 1 tests passed.
Once you start mixing vectors and scalars, use @splat to lift constants and avoid the implicit casts forbidden by the vector rules.
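A minimal sketch of that lift (the helper is ours, reusing a four-lane vector type): a scalar never coerces to a vector implicitly, so broadcast it first.

const Vec4 = @Vector(4, f32);

fn scaleVec(v: Vec4, factor: f32) Vec4 {
    // @splat broadcasts the scalar across all four lanes so the element-wise
    // multiply type-checks; writing `v * factor` would not compile.
    const lifted: Vec4 = @splat(factor);
    return v * lifted;
}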
Notes & Caveats
- Inline recursion counts against the compile-time branch quota. Raise it with @setEvalBranchQuota only when measurements prove the extra compile-time work is worthwhile (see the sketch after this list).
- Switching between @call(.always_inline, …) and the inline keyword matters: the former applies to a single call site, whereas inline modifies the callee definition and every future call.
- Vector lengths other than powers of two may fall back to scalar loops on some targets. Capture the generated assembly with zig build-exe -femit-asm before banking on a win.
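For instance, a sketch of raising the quota for the fib function from the first example (the quota value is an assumption, sized generously for this depth):

const std = @import("std");

inline fn fib(comptime n: usize) usize {
    return if (n <= 1) n else fib(n - 1) + fib(n - 2);
}

test "deeper inline recursion needs a larger quota" {
    comptime {
        // Assumption: the default quota of 1000 backward branches is far too
        // small for exponential inline recursion at this depth.
        @setEvalBranchQuota(200_000);
        try std.testing.expectEqual(@as(usize, 6765), fib(20));
    }
}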
Code Generation Features Affecting Performance
Beyond optimization modes, several code generation features affect runtime performance and debuggability. Understanding these flags helps you reason about performance tradeoffs:
- omit_frame_pointer is particularly relevant for performance work: when enabled (typical in ReleaseFast), the compiler frees the frame pointer register (RBP on x86_64, FP on ARM) for general use, improving register allocation and enabling more aggressive optimizations. The cost is harder stack unwinding: debuggers and profilers may produce incomplete or missing stack traces.
- red_zone (x86_64 and AArch64 only) allows leaf functions to use 128 bytes below the stack pointer without adjusting RSP, reducing prologue/epilogue overhead.
- Stack protection adds canary checks to detect buffer overflows, at a runtime cost; this is why ReleaseFast disables it.
- Stack checking instruments functions to probe the stack and prevent overflow, useful for deep recursion but costly.
- Unwind tables generate .eh_frame sections for exception handling and debugger stack walks. Debug mode always includes them; release modes may omit them for size.
When the exercises suggest measuring allocator hot paths with @call(.never_inline, …), these flags explain why Debug mode shows better stack traces (frame pointers preserved) at the cost of slower execution (extra instructions, no register optimization). Performance-critical code should benchmark with ReleaseFast but validate correctness with Debug to catch issues the optimizer might hide.
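One practical combination, assuming you want to profile the call-modifier benchmark from earlier in this chapter: keep release optimizations but retain frame pointers so profilers can walk the stack.

$ zig build-exe 03_call_benchmark.zig -OReleaseFast -fno-omit-frame-pointer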
Exercises
- Add a --mode flag to the benchmark program so you can flip between Debug, ReleaseSafe, and ReleaseFast runs without editing the code (see Chapter 38).
- Extend the dot-product example with a remainder loop that handles slices whose lengths are not multiples of four. Measure the crossover point where SIMD still wins.
- Experiment with @call(.never_inline, …) on allocator hot paths from Chapter 10 to confirm whether improved stack traces in Debug are worth the runtime cost.
Alternatives & Edge Cases:
- Microbenchmarks that run inside
zig runshare the compilation cache. Warm the cache with a dummy run before comparing timings to avoid skew. #entry points and command structure - The self-hosted x86 backend is fast but not perfect. Fall back to
-fllvmif you notice miscompilations while exploring aggressive inline patterns. - ReleaseSmall often disables inlining entirely to save size. When you need both tiny binaries and tuned hot paths, isolate the hot functions and call them from a ReleaseFast-built shared library.