Chapter 39: Performance & Inlining

Overview

Our CLI survey (Chapter 38) set the stage for disciplined experimentation. Now we focus on how Zig translates those command-line toggles into machine-level behavior. Semantic inlining, call modifiers, and explicit SIMD all give you levers to shape hot paths, provided you measure carefully and respect the compiler's defaults.

The next chapter (Chapter 40) formalizes that measurement loop by layering profiling and hardening workflows on top.

Learning Goals

  • Force or forbid inlining when compile-time semantics must win over heuristics.
  • Sample hot loops with @call and std.time.Timer to compare build modes.
  • Use @Vector math as a bridge to portable SIMD before reaching for target-specific intrinsics.


Semantic Inlining vs Optimizer Heuristics

Zig’s inline keyword changes evaluation rules rather than hinting at the optimizer: compile-time known arguments become compile-time constants, allowing you to generate types or precompute values that ordinary calls would defer to runtime.

Inline functions restrict the compiler’s freedom, so reach for them only when semantics matter—propagating comptime data, improving debugging, or satisfying real benchmarks.
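
As a minimal sketch of that semantic guarantee (this mirrors the inline-call example from the language reference; the function name is illustrative), a comptime-known argument to an inline function yields a comptime-known result even outside a comptime block:

Zig
const std = @import("std");

// Because `add` is inline, a call with comptime-known arguments produces a
// comptime-known result at the call site, even in a runtime scope.
inline fn add(a: i32, b: i32) i32 {
    return a + b;
}

test "inline call folds at compile time" {
    // The condition is comptime-known false, so the @compileError branch is
    // never analyzed; an ordinary call would make this a compile error.
    if (add(1200, 34) != 1234) {
        @compileError("unexpected result");
    }
    try std.testing.expectEqual(@as(i32, 1234), add(1200, 34));
}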

Understanding Optimization Modes

Before exploring inlining behavior, it’s important to understand the optimization modes that affect how the compiler treats your code. The following diagram shows the optimization configuration:

graph TB
    subgraph "Optimization"
        OPTIMIZE["Optimization Settings"]
        OPTIMIZE --> OPTMODE["optimize_mode: OptimizeMode<br/>Debug, ReleaseSafe, ReleaseFast, ReleaseSmall"]
        OPTIMIZE --> LTO["lto: bool<br/>Link Time Optimization"]
    end

Zig provides four distinct optimization modes, each making different tradeoffs between safety, speed, and binary size:

  • Debug disables optimizations and keeps full runtime safety checks, making it ideal for development and debugging. The compiler preserves stack frames, emits symbol information, and never inlines functions unless semantically required.
  • ReleaseSafe enables optimizations while retaining all safety checks (bounds checking, integer overflow detection, etc.), balancing performance with error detection.
  • ReleaseFast maximizes speed by disabling runtime safety checks and enabling aggressive optimizations, including heuristic inlining. This is the mode used in the benchmarks throughout this chapter.
  • ReleaseSmall prioritizes binary size over speed, often disabling inlining entirely to reduce code duplication.

Additionally, Link Time Optimization (LTO) can be enabled independently via -flto, allowing the linker to perform whole-program optimization across compilation units. When benchmarking inlining behavior, these modes dramatically affect results: inline functions behave identically across modes (semantic guarantee), but heuristic inlining in ReleaseFast may inline functions that Debug or ReleaseSmall would leave as calls. The chapter’s examples use -OReleaseFast to showcase optimizer behavior, but you should test across modes to understand the full performance spectrum.
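
As a quick orientation, the same file can be built under each mode from the command line; main.zig is an illustrative file name, and -flto toggles LTO independently of the mode:

Shell
$ zig build-exe main.zig -ODebug
$ zig build-exe main.zig -OReleaseSafe
$ zig build-exe main.zig -OReleaseFast -flto
$ zig build-exe main.zig -OReleaseSmall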

Example: compile-time math with inline functions

Inline recursion lets us bake small computations into the binary while leaving a fallback runtime path for larger inputs. The @call builtin provides a direct handle for evaluating call sites at compile time when arguments are available.

Zig

// This file demonstrates Zig's inline semantics and compile-time execution features.
// It shows how the `inline` keyword and `@call` builtin can control when and how
// functions are evaluated at compile-time versus runtime.

const std = @import("std");

/// Computes the nth Fibonacci number using recursion.
/// The `inline` keyword forces this function to be inlined at all call sites,
/// and the `comptime n` parameter ensures the value can be computed at compile-time.
/// This combination allows the result to be available as a compile-time constant.
inline fn fib(comptime n: usize) usize {
    return if (n <= 1) n else fib(n - 1) + fib(n - 2);
}

/// Computes the factorial of n using recursion.
/// Unlike `fib`, this function is not marked `inline`, so the compiler
/// decides whether to inline it based on optimization heuristics.
/// It can be called at either compile-time or runtime.
fn factorial(n: usize) usize {
    return if (n <= 1) 1 else n * factorial(n - 1);
}

// Demonstrates that an inline function with comptime parameters
// propagates compile-time execution to its call sites.
// The entire computation happens at compile-time within the comptime block.
test "inline fibonacci propagates comptime" {
    comptime {
        const value = fib(10);
        try std.testing.expectEqual(@as(usize, 55), value);
    }
}

// Demonstrates the `@call` builtin with `.compile_time` modifier.
// This forces the function call to be evaluated at compile-time,
// even though `factorial` is not marked `inline` and takes non-comptime parameters.
test "@call compile_time modifier" {
    const result = @call(.compile_time, factorial, .{5});
    try std.testing.expectEqual(@as(usize, 120), result);
}

// Verifies that a non-inline function can still be called at runtime.
// The input is a runtime value, so the computation happens during execution.
test "runtime factorial still works" {
    const input: usize = 6;
    const value = factorial(input);
    try std.testing.expectEqual(@as(usize, 720), value);
}
Run
Shell
$ zig test 01_inline_semantics.zig
Output
Shell
All 3 tests passed.

The .compile_time modifier fails if the callee touches runtime-only state. Wrap such experiments in comptime blocks first, then add runtime tests so release builds remain covered.
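
The safe pattern looks like this (a minimal sketch; cube stands in for your own pure function):

Zig
const std = @import("std");

fn cube(n: u64) u64 {
    return n * n * n;
}

test "comptime block plus runtime coverage" {
    // Compile-time path: the whole computation happens during analysis.
    comptime {
        try std.testing.expectEqual(@as(u64, 27), cube(3));
    }
    // Runtime path: the same logic stays covered in release builds.
    var n: u64 = 4;
    _ = &n; // silence the never-mutated check; n remains runtime-known
    try std.testing.expectEqual(@as(u64, 64), cube(n));
}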

Directing Calls for Measurement

Zig 0.15.2's self-hosted backends reward accurate microbenchmarks: paired with the new threaded code generation pipeline, they can deliver dramatic compile-time speedups, which shortens the measure-and-adjust loop.

Use @call modifiers to compare inline, default, and never-inline dispatches without refactoring your call sites.

Example: comparing call modifiers under ReleaseFast

This benchmark pins the optimizer (-OReleaseFast) while we toggle call modifiers. Every variant produces the same result, but the timing highlights how never_inline can balloon hot loops when function call overhead dominates.

Zig
const std = @import("std");
const builtin = @import("builtin");

// Number of iterations to run each benchmark variant
const iterations: usize = 5_000_000;

/// A simple mixing function that demonstrates the performance impact of inlining.
/// Uses bit rotation and arithmetic operations to create a non-trivial workload
/// that the optimizer might handle differently based on call modifiers.
fn mix(value: u32) u32 {
    // Rotate left by 7 bits after XORing with a prime-like constant
    const rotated = std.math.rotl(u32, value ^ 0x9e3779b9, 7);
    // Apply additional mixing with wrapping arithmetic to prevent compile-time evaluation
    return rotated *% 0x85eb_ca6b +% 0xc2b2_ae35;
}

/// Runs the mixing function in a tight loop using the specified call modifier.
/// This allows direct comparison of how different inlining strategies affect performance.
fn run(comptime modifier: std.builtin.CallModifier) u32 {
    var acc: u32 = 0;
    var i: usize = 0;
    while (i < iterations) : (i += 1) {
        // The @call builtin lets us explicitly control inlining behavior at the call site
        acc = @call(modifier, mix, .{acc});
    }
    return acc;
}

pub fn main() !void {
    // Benchmark 1: Let the compiler decide whether to inline (default heuristics)
    var timer = try std.time.Timer.start();
    const auto_result = run(.auto);
    const auto_ns = timer.read();

    // Benchmark 2: Force inlining at every call site
    timer = try std.time.Timer.start();
    const inline_result = run(.always_inline);
    const inline_ns = timer.read();

    // Benchmark 3: Prevent inlining, always emit a function call
    timer = try std.time.Timer.start();
    const never_result = run(.never_inline);
    const never_ns = timer.read();

    // Verify all three strategies produce identical results
    std.debug.assert(auto_result == inline_result);
    std.debug.assert(auto_result == never_result);

    // Display the optimization mode and iteration count for reproducibility
    std.debug.print(
        "optimize-mode={s} iterations={}\n",
        .{
            @tagName(builtin.mode),
            iterations,
        },
    );
    // Report timing results for each call modifier
    std.debug.print("auto call   : {d} ns\n", .{auto_ns});
    std.debug.print("always_inline: {d} ns\n", .{inline_ns});
    std.debug.print("never_inline : {d} ns\n", .{never_ns});
}
Run
Shell
$ zig run 03_call_benchmark.zig -OReleaseFast
Output
Shell
optimize-mode=ReleaseFast iterations=5000000
auto call   : 161394 ns
always_inline: 151745 ns
never_inline : 2116797 ns

Performing the same run under -OReleaseSafe widens the gap because the additional safety checks amplify the per-call overhead. Use zig run --time-report from the previous chapter (Chapter 38) when you want compiler-side attribution for slow code paths.

Portable Vectorization with @Vector

When the compiler cannot infer SIMD usage on its own, @Vector types offer a portable shim that respects safety checks and falls back to scalar execution where needed. Paired with @reduce, you can express horizontal reductions without writing target-specific intrinsics.

Example: SIMD-friendly dot product

The scalar and vectorized versions produce identical results. Profiling determines whether the extra vector plumbing pays off on your target.

Zig
const std = @import("std");

// Number of parallel operations per vector
const lanes = 4;
// Vector type that processes 4 f32 values simultaneously using SIMD
const Vec = @Vector(lanes, f32);

/// Loads 4 consecutive f32 values from a slice into a SIMD vector.
/// The caller must ensure that start + 3 is within bounds.
fn loadVec(slice: []const f32, start: usize) Vec {
    return .{
        slice[start + 0],
        slice[start + 1],
        slice[start + 2],
        slice[start + 3],
    };
}

/// Computes the dot product of two f32 slices using scalar operations.
/// This is the baseline implementation that processes one element at a time.
fn dotScalar(values_a: []const f32, values_b: []const f32) f32 {
    std.debug.assert(values_a.len == values_b.len);
    var sum: f32 = 0.0;
    // Multiply corresponding elements and accumulate the sum
    for (values_a, values_b) |a, b| {
        sum += a * b;
    }
    return sum;
}

/// Computes the dot product using SIMD vectorization for improved performance.
/// Processes 4 elements at a time, then reduces the vector accumulator to a scalar.
/// Requires that the input length is a multiple of the lane count (4).
fn dotVectorized(values_a: []const f32, values_b: []const f32) f32 {
    std.debug.assert(values_a.len == values_b.len);
    std.debug.assert(values_a.len % lanes == 0);

    // Initialize accumulator vector with zeros
    var accum: Vec = @splat(0.0);
    var index: usize = 0;
    // Process 4 elements per iteration using SIMD
    while (index < values_a.len) : (index += lanes) {
        const lhs = loadVec(values_a, index);
        const rhs = loadVec(values_b, index);
        // Perform element-wise multiplication and add to accumulator
        accum += lhs * rhs;
    }

    // Sum all lanes of the accumulator vector into a single scalar value
    return @reduce(.Add, accum);
}

// Verifies that the vectorized implementation produces the same result as the scalar version.
test "vectorized dot product matches scalar" {
    const lhs = [_]f32{ 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0 };
    const rhs = [_]f32{ 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0 };
    const scalar = dotScalar(&lhs, &rhs);
    const vector = dotVectorized(&lhs, &rhs);
    // Allow small floating-point error tolerance
    try std.testing.expectApproxEqAbs(scalar, vector, 0.0001);
}
Run
Shell
$ zig test 02_vector_reduction.zig
Output
Shell
All 1 tests passed.

Once you start mixing vectors and scalars, remember that scalars never implicitly coerce to vectors: use @splat to lift constants into every lane.
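
A short sketch of the pattern (names are illustrative):

Zig
const std = @import("std");

const Vec4 = @Vector(4, f32);

// Scalars do not implicitly coerce to vectors; @splat broadcasts `factor`
// into every lane so the element-wise multiply type-checks.
fn scale(v: Vec4, factor: f32) Vec4 {
    const k: Vec4 = @splat(factor);
    return v * k;
}

test "splat broadcasts a scalar across lanes" {
    const v: Vec4 = .{ 1.0, 2.0, 3.0, 4.0 };
    const doubled = scale(v, 2.0);
    try std.testing.expectEqual(@as(f32, 8.0), doubled[3]);
}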

Notes & Caveats

  • Inline recursion counts against the compile-time branch quota. Raise it with @setEvalBranchQuota only when measurements prove the extra compile-time work is worthwhile; a sketch follows this list. #setevalbranchquota
  • Switching between @call(.always_inline, …) and the inline keyword matters: the former applies to a single call site, whereas inline modifies the callee definition and every future call.
  • Vector lengths other than powers of two may fall back to scalar loops on some targets. Capture the generated assembly with zig build-exe -femit-asm before banking on a win.
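
Raising the quota looks like this (a minimal sketch that reuses the fib function from the first example; the quota value is arbitrary):

Zig
const std = @import("std");

inline fn fib(comptime n: usize) usize {
    return if (n <= 1) n else fib(n - 1) + fib(n - 2);
}

test "raise the branch quota for deep inline recursion" {
    comptime {
        // Naive recursive fib(20) blows past the default quota of 1000
        // backward branches, so raise it before evaluating.
        @setEvalBranchQuota(100_000);
        try std.testing.expectEqual(@as(usize, 6765), fib(20));
    }
}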

Code Generation Features Affecting Performance

Beyond optimization modes, several code generation features affect runtime performance and debuggability. Understanding these flags helps you reason about performance tradeoffs:

graph TB
    subgraph "Code Generation Features"
        Features["Feature Flags"]
        Features --> UnwindTables["unwind_tables: bool"]
        Features --> StackProtector["stack_protector: bool"]
        Features --> StackCheck["stack_check: bool"]
        Features --> RedZone["red_zone: ?bool"]
        Features --> OmitFramePointer["omit_frame_pointer: bool"]
        Features --> Valgrind["valgrind: bool"]
        Features --> SingleThreaded["single_threaded: bool"]
        UnwindTables --> EHFrame["Generate .eh_frame<br/>for exception handling"]
        StackProtector --> CanaryCheck["Stack canary checks<br/>buffer overflow detection"]
        StackCheck --> ProbeStack["Stack probing<br/>prevents overflow"]
        RedZone --> RedZoneSpace["Red zone optimization<br/>(x86_64, AArch64)"]
        OmitFramePointer --> NoFP["Omit frame pointer<br/>for performance"]
        Valgrind --> ValgrindSupport["Valgrind client requests<br/>for memory debugging"]
        SingleThreaded --> NoThreading["Assume single-threaded<br/>enable optimizations"]
    end

The omit_frame_pointer flag is particularly relevant for performance work: when enabled (typical in ReleaseFast), the compiler frees the frame pointer register (RBP on x86_64, FP on ARM) for general use, improving register allocation and enabling more aggressive optimizations. However, this makes stack unwinding harder. Debuggers and profilers may produce incomplete or missing stack traces.

The red_zone optimization (x86_64 and AArch64 only) allows functions to use 128 bytes below the stack pointer without adjusting RSP, reducing prologue/epilogue overhead in leaf functions. Stack protection adds canary checks to detect buffer overflows but incurs runtime cost, which is why ReleaseFast disables it. Stack checking instruments functions to probe the stack and prevent overflow, useful for deep recursion but costly. Unwind tables generate .eh_frame sections for exception handling and debugger stack walks; Debug mode always includes them, while release modes may omit them for size.
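
These features surface as per-build CLI flags. The spellings below match recent zig build-exe releases, but verify them against zig build-exe --help on your toolchain; hot.zig is an illustrative file name:

Shell
$ zig build-exe hot.zig -OReleaseFast -fno-omit-frame-pointer  # keep profiler-friendly stacks
$ zig build-exe hot.zig -OReleaseFast -mno-red-zone            # e.g. for interrupt handlers
$ zig build-exe hot.zig -OReleaseFast -fsingle-threaded        # assume a single thread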

When the exercises suggest measuring allocator hot paths with @call(.never_inline, …), these flags explain why Debug mode shows better stack traces (frame pointers preserved) at the cost of slower execution (extra instructions, no register optimization). Performance-critical code should benchmark with ReleaseFast but validate correctness with Debug to catch issues the optimizer might hide.

Exercises

  • Add a --mode flag to the benchmark program so you can flip between Debug, ReleaseSafe, and ReleaseFast runs without editing the code (see Chapter 38 for the CLI plumbing).
  • Extend the dot-product example with a remainder loop that handles slices whose lengths are not multiples of four. Measure the crossover point where SIMD still wins.
  • Experiment with @call(.never_inline, …) on allocator hot paths from Chapter 10 to confirm whether improved stack traces in Debug are worth the runtime cost.

Alternatives & Edge Cases

  • Microbenchmarks that run inside zig run share the compilation cache. Warm the cache with a dummy run before comparing timings to avoid skew.
  • The self-hosted x86 backend is fast but not perfect. Fall back to -fllvm if you notice miscompilations while exploring aggressive inline patterns.
  • ReleaseSmall often disables inlining entirely to save size. When you need both tiny binaries and tuned hot paths, isolate the hot functions and call them from a ReleaseFast-built shared library.
