Chapter 45: Text, Formatting, and Unicode

Overview

After mastering collections for structured data (Chapter 44), you now turn to text, the fundamental medium of human-computer interaction. This chapter explores std.fmt for formatting and parsing, std.ascii for ASCII character operations, std.unicode for UTF-8/UTF-16 handling, and encoding utilities like base64. (Sources: fmt.zig, ascii.zig.)

Unlike high-level languages that hide encoding complexities, Zig exposes the mechanics: you choose between []const u8 (byte slices) and proper Unicode code point iteration, control number formatting precision, and handle encoding errors explicitly.

Text processing in Zig demands awareness of byte vs. character boundaries, allocator usage for dynamic formatting, and the performance implications of different string operations. By chapter’s end, you’ll format numbers with custom precision, parse integers and floats safely, manipulate ASCII efficiently, navigate UTF-8 sequences, and encode binary data for transport, all with Zig’s characteristic explicitness and zero hidden costs. (Source: unicode.zig.)

Learning Goals

  • Format values with Writer.print() using format specifiers for integers, floats, and custom types (see Writer.zig).
  • Parse strings into integers (parseInt) and floats (parseFloat) with proper error handling.
  • Use std.ascii for character classification (isDigit, isAlpha, toUpper, toLower).
  • Navigate UTF-8 sequences with std.unicode and understand code point vs. byte distinctions.
  • Encode and decode Base64 data for binary-to-text transformations (see base64.zig).
  • Implement custom formatters for user-defined types using the {f} specifier in Zig 0.15.2.

Formatting with std.fmt

Zig’s formatting revolves around Writer.print(fmt, args), which writes formatted output to any Writer implementation. Format strings use {} placeholders with optional specifiers: {d} for decimal, {x} for hex, {s} for strings, {any} for debug representation, and {f} for custom formatters.

The simplest pattern: capture a buffer with std.io.fixedBufferStream, then print into it.

Zig
const std = @import("std");

pub fn main() !void {
    var buffer: [100]u8 = undefined;
    var fbs = std.io.fixedBufferStream(&buffer);
    const writer = fbs.writer();

    try writer.print("Answer={d}, pi={d:.2}", .{ 42, 3.14159 });

    std.debug.print("Formatted: {s}\n", .{fbs.getWritten()});
}
Build and Run
Shell
$ zig build-exe format_basic.zig && ./format_basic
Output
Shell
Formatted: Answer=42, pi=3.14

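std.io.fixedBufferStream provides a Writer backed by a fixed buffer, so no allocation is needed. For output of unknown size, write into a growable buffer instead, such as the writer of a std.ArrayList(u8); note that in Zig 0.15 the default ArrayList is unmanaged, so its writer() takes the allocator as an argument. (Source: fixed_buffer_stream.zig.)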
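
On a Zig 0.15.x toolchain, the same fixed-buffer pattern can also be written directly against the newer std.Io.Writer interface. The following is a minimal sketch, assuming std.Io.Writer.fixed and buffered() are available in your standard library version; formatInto is a hypothetical helper name used only for illustration.

```zig
const std = @import("std");

// Hypothetical helper: formats into a caller-provided buffer and returns
// the portion that was written. Assumes Zig 0.15.x (std.Io.Writer.fixed).
fn formatInto(buffer: []u8, answer: u32, pi: f64) ![]const u8 {
    var w: std.Io.Writer = .fixed(buffer);
    try w.print("Answer={d}, pi={d:.2}", .{ answer, pi });
    // buffered() returns the bytes written so far; they live in `buffer`.
    return w.buffered();
}

pub fn main() !void {
    var buffer: [100]u8 = undefined;
    const text = try formatInto(&buffer, 42, 3.14159);
    std.debug.print("Formatted: {s}\n", .{text});
}
```

The returned slice points into the caller's buffer, so it stays valid for as long as that buffer does.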

Format Specifiers

Zig’s format specifiers control number bases, precision, alignment, and padding.

Zig
const std = @import("std");

pub fn main() !void {
    const value: i32 = 255;
    const pi = 3.14159;
    const large = 123.0;

    std.debug.print("Decimal: {d}\n", .{value});
    std.debug.print("Hexadecimal (lowercase): {x}\n", .{value});
    std.debug.print("Hexadecimal (uppercase): {X}\n", .{value});
    std.debug.print("Binary: {b}\n", .{value});
    std.debug.print("Octal: {o}\n", .{value});
    std.debug.print("Float with 2 decimals: {d:.2}\n", .{pi});
    std.debug.print("Scientific notation: {e}\n", .{large});
    std.debug.print("Padded: {d:0>5}\n", .{42});
    std.debug.print("Right-aligned: {d:>5}\n", .{42});
}
Build and Run
Shell
$ zig build-exe format_specifiers.zig && ./format_specifiers
Output
Shell
Decimal: 255
Hexadecimal (lowercase): ff
Hexadecimal (uppercase): FF
Binary: 11111111
Octal: 377
Float with 2 decimals: 3.14
Scientific notation: 1.23e2
Padded: 00042
Right-aligned:    42

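Use {d} for decimal, {x} for hex, {b} for binary, {o} for octal. Width, fill, and alignment apply to both integers and floats; precision (.N) sets the number of digits after the decimal point for floats. A fill character of 0 with right alignment (0>) produces zero-padded fields.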
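
The fill, alignment, and width parts of a specifier combine in the order fill, alignment (<, ^, or >), width, then an optional .precision. A short sketch of the combinations, with brackets added only to make the padding visible:

```zig
const std = @import("std");

pub fn main() void {
    // Left, center, and right alignment in a field of width 6.
    std.debug.print("[{d:<6}]\n", .{42}); // [42    ]
    std.debug.print("[{d:^6}]\n", .{42}); // [  42  ]
    std.debug.print("[{d:>6}]\n", .{42}); // [    42]

    // A custom fill character goes before the alignment marker.
    std.debug.print("[{s:*^9}]\n", .{"zig"}); // [***zig***]
    std.debug.print("[{x:0>4}]\n", .{255}); // [00ff]

    // Width and precision together: 8 columns, 3 digits after the point.
    std.debug.print("[{d:>8.3}]\n", .{3.14159}); // [   3.142]
}
```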

Parsing Strings

Zig provides parseInt and parseFloat for converting text to numbers, returning errors for invalid input rather than crashing or silently failing.

Parsing Integers

parseInt(T, buf, base) converts a string to an integer of type T in the specified base (2-36, or 0 for auto-detection).

Zig
const std = @import("std");

pub fn main() !void {
    const decimal = try std.fmt.parseInt(i32, "42", 10);
    std.debug.print("Parsed decimal: {d}\n", .{decimal});

    const hex = try std.fmt.parseInt(i32, "FF", 16);
    std.debug.print("Parsed hex: {d}\n", .{hex});

    const binary = try std.fmt.parseInt(i32, "111", 2);
    std.debug.print("Parsed binary: {d}\n", .{binary});

    // Auto-detect base with prefix
    const auto = try std.fmt.parseInt(i32, "0x1234", 0);
    std.debug.print("Auto-detected (0x): {d}\n", .{auto});

    // Error handling
    const result = std.fmt.parseInt(i32, "not_a_number", 10);
    if (result) |_| {
        std.debug.print("Unexpected success\n", .{});
    } else |err| {
        std.debug.print("Parse error: {}\n", .{err});
    }
}
Build and Run
Shell
$ zig build-exe parse_int.zig && ./parse_int
Output
Shell
Parsed decimal: 42
Parsed hex: 255
Parsed binary: 7
Auto-detected (0x): 4660
Parse error: error.InvalidCharacter

parseInt returns error{Overflow, InvalidCharacter}. Always handle these explicitly or propagate with try. Base 0 auto-detects 0x (hex), 0o (octal), 0b (binary) prefixes.
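
Because the error set is small and closed, a switch can handle each case explicitly. The sketch below shows one way to turn the two errors into user-facing messages; describe is a hypothetical helper, and the message strings are illustrative only.

```zig
const std = @import("std");

// Hypothetical helper: reports why a string is or is not a valid u16.
fn describe(text: []const u8) []const u8 {
    _ = std.fmt.parseInt(u16, text, 10) catch |err| return switch (err) {
        error.Overflow => "out of range for u16",
        error.InvalidCharacter => "not a decimal number",
    };
    return "ok";
}

pub fn main() void {
    const inputs = [_][]const u8{ "8080", "70000", "80a" };
    for (inputs) |input| {
        std.debug.print("{s}: {s}\n", .{ input, describe(input) });
    }
}
```

The switch is exhaustive, so if the error set ever grows, the compiler flags the missing case instead of letting it fall through silently.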

Parsing Floats

parseFloat(T, buf) converts a string to a floating-point number, handling scientific notation and special values (nan, inf).

Zig
const std = @import("std");

pub fn main() !void {
    const pi = try std.fmt.parseFloat(f64, "3.14159");
    std.debug.print("Parsed: {d}\n", .{pi});

    const scientific = try std.fmt.parseFloat(f64, "1.23e5");
    std.debug.print("Scientific: {d}\n", .{scientific});

    const infinity = try std.fmt.parseFloat(f64, "inf");
    std.debug.print("Special (inf): {d}\n", .{infinity});
}
Build and Run
Shell
$ zig build-exe parse_float.zig && ./parse_float
Output
Shell
Parsed: 3.14159
Scientific: 123000
Special (inf): inf

parseFloat supports decimal notation (3.14), scientific notation (1.23e5), hexadecimal floats (0x1.8p3), and special values (nan, inf, -inf). (Source: parse_float.zig.)
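
A brief sketch of the less common inputs mentioned above: a hexadecimal float, a NaN, and a malformed string. The input "12.5kg" is an arbitrary example of trailing garbage.

```zig
const std = @import("std");

pub fn main() !void {
    // 0x1.8 is 1.5 in hexadecimal; p3 scales it by 2^3, giving 12.
    const hex = try std.fmt.parseFloat(f64, "0x1.8p3");
    std.debug.print("hex float: {d}\n", .{hex});

    // NaN never compares equal to itself, so test it with std.math.isNan.
    const nan = try std.fmt.parseFloat(f64, "nan");
    std.debug.print("is NaN: {}\n", .{std.math.isNan(nan)});

    // Trailing garbage is an error, not a partial parse.
    if (std.fmt.parseFloat(f64, "12.5kg")) |value| {
        std.debug.print("unexpected success: {d}\n", .{value});
    } else |err| {
        std.debug.print("parse failed: {s}\n", .{@errorName(err)});
    }
}
```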

ASCII Character Operations

std.ascii provides fast character classification and case conversion for 7-bit ASCII. Functions gracefully handle values outside the ASCII range by returning false or leaving them unchanged.

Character Classification

Test whether characters are digits, letters, whitespace, etc.

Zig
const std = @import("std");

pub fn main() void {
    const chars = [_]u8{ 'A', '5', ' ' };

    for (chars) |c| {
        std.debug.print("'{c}': alpha={}, digit={}, ", .{ c, std.ascii.isAlphabetic(c), std.ascii.isDigit(c) });

        // Report whitespace for the space, and case for everything else.
        if (c == ' ') {
            std.debug.print("whitespace={}\n", .{std.ascii.isWhitespace(c)});
        } else {
            std.debug.print("upper={}\n", .{std.ascii.isUpper(c)});
        }
    }
}
Build and Run
Shell
$ zig build-exe ascii_classify.zig && ./ascii_classify
Output
Shell
'A': alpha=true, digit=false, upper=true
'5': alpha=false, digit=true, upper=false
' ': alpha=false, digit=false, whitespace=true

ASCII functions operate on bytes (u8). Non-ASCII bytes (>127) return false for classification checks.
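
To see the byte-oriented behavior in practice, the following sketch tallies character classes across a slice. Counts and countClasses are hypothetical names; note how the two bytes of the UTF-8 encoded "é" both land in the "other" bucket.

```zig
const std = @import("std");

// Hypothetical tally type for this example.
const Counts = struct {
    letters: usize = 0,
    digits: usize = 0,
    spaces: usize = 0,
    other: usize = 0,
};

// Classifies every byte; non-ASCII bytes fail all three checks.
fn countClasses(bytes: []const u8) Counts {
    var counts = Counts{};
    for (bytes) |c| {
        if (std.ascii.isAlphabetic(c)) {
            counts.letters += 1;
        } else if (std.ascii.isDigit(c)) {
            counts.digits += 1;
        } else if (std.ascii.isWhitespace(c)) {
            counts.spaces += 1;
        } else {
            counts.other += 1;
        }
    }
    return counts;
}

pub fn main() void {
    const c = countClasses("Zig 0.15 é");
    std.debug.print("letters={d} digits={d} spaces={d} other={d}\n", .{ c.letters, c.digits, c.spaces, c.other });
}
```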

Case Conversion

Convert between uppercase and lowercase for ASCII characters.

Zig
const std = @import("std");

pub fn main() void {
    const text = "Hello, World!";
    var upper_buf: [50]u8 = undefined;
    var lower_buf: [50]u8 = undefined;

    // Both functions write into the buffer and return the written slice.
    const upper = std.ascii.upperString(&upper_buf, text);
    const lower = std.ascii.lowerString(&lower_buf, text);

    std.debug.print("Original: {s}\n", .{text});
    std.debug.print("Uppercase: {s}\n", .{upper});
    std.debug.print("Lowercase: {s}\n", .{lower});
}
Build and Run
Shell
$ zig build-exe ascii_case.zig && ./ascii_case
Output
Shell
Original: Hello, World!
Uppercase: HELLO, WORLD!
Lowercase: hello, world!

std.ascii functions operate byte-by-byte and only affect ASCII characters. For full Unicode case mapping, use dedicated Unicode libraries or manually handle UTF-8 sequences.
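
Two consequences of the byte-by-byte behavior, sketched below: std.ascii.eqlIgnoreCase is handy for protocol tokens that are ASCII by definition, while text containing multi-byte characters passes through case conversion with those bytes untouched. The sample strings are arbitrary.

```zig
const std = @import("std");

pub fn main() void {
    // Case-insensitive comparison for ASCII-only tokens such as header names.
    const same = std.ascii.eqlIgnoreCase("Content-Type", "content-type");
    std.debug.print("equal ignoring case: {}\n", .{same});

    // The two bytes that encode "é" are outside ASCII, so they stay as-is.
    var buf: [16]u8 = undefined;
    const upper = std.ascii.upperString(&buf, "héllo");
    std.debug.print("{s}\n", .{upper});
}
```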

Unicode and UTF-8

Zig strings are []const u8 byte slices, typically UTF-8 encoded. std.unicode provides utilities for validating UTF-8, decoding code points, and converting between UTF-8 and UTF-16.

UTF-8 Validation

Check whether a byte sequence is valid UTF-8.

Zig
const std = @import("std");

pub fn main() void {
    const valid = "Hello, 世界";
    const invalid = "\xff\xfe";

    if (std.unicode.utf8ValidateSlice(valid)) {
        std.debug.print("Valid UTF-8: {s}\n", .{valid});
    }

    if (!std.unicode.utf8ValidateSlice(invalid)) {
        std.debug.print("Invalid UTF-8 detected\n", .{});
    }
}
Build and Run
Shell
$ zig build-exe utf8_validate.zig && ./utf8_validate
Output
Shell
Valid UTF-8: Hello, 世界
Invalid UTF-8 detected

Use std.unicode.utf8ValidateSlice to verify entire strings. Invalid UTF-8 can cause undefined behavior in code that assumes well-formed sequences.
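
One common way to apply this is to validate once at the boundary where untrusted bytes enter the program and propagate an error otherwise. A minimal sketch, assuming a hypothetical requireUtf8 helper and an error name chosen for this example:

```zig
const std = @import("std");

// Hypothetical helper: passes valid UTF-8 through unchanged and rejects
// everything else, so later code can rely on well-formed input.
fn requireUtf8(bytes: []const u8) error{InvalidUtf8}![]const u8 {
    if (!std.unicode.utf8ValidateSlice(bytes)) return error.InvalidUtf8;
    return bytes;
}

pub fn main() void {
    const inputs = [_][]const u8{ "Hello, 世界", "\xff\xfe" };
    for (inputs) |input| {
        if (requireUtf8(input)) |text| {
            std.debug.print("accepted: {s}\n", .{text});
        } else |err| {
            std.debug.print("rejected: {s}\n", .{@errorName(err)});
        }
    }
}
```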

Iterating Code Points

Decode UTF-8 byte sequences into Unicode code points using std.unicode.Utf8View.

Zig
const std = @import("std");

pub fn main() !void {
    const text = "Hello, 世界";

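    // `const` is required here: the view is never mutated, and Zig rejects
    // a `var` that is never modified.
    const view = try std.unicode.Utf8View.init(text);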
    var iter = view.iterator();

    var byte_count: usize = 0;
    var codepoint_count: usize = 0;

    while (iter.nextCodepoint()) |codepoint| {
        const len: usize = std.unicode.utf8CodepointSequenceLength(codepoint) catch unreachable;
        const c = iter.bytes[iter.i - len .. iter.i];
        std.debug.print("Code point: U+{X:0>4} ({s})\n", .{ codepoint, c });
        byte_count += c.len;
        codepoint_count += 1;
    }

    std.debug.print("Byte count: {d}, Code point count: {d}\n", .{ byte_count, codepoint_count });
}
Build and Run
Shell
$ zig build-exe utf8_iterate.zig && ./utf8_iterate
Output
Shell
Code point: U+0048 (H)
Code point: U+0065 (e)
Code point: U+006C (l)
Code point: U+006C (l)
Code point: U+006F (o)
Code point: U+002C (,)
Code point: U+0020 ( )
Code point: U+4E16 (世)
Code point: U+754C (界)
Byte count: 13, Code point count: 9

UTF-8 is variable-width: ASCII characters are 1 byte, but many Unicode characters require 2-4 bytes. Always iterate code points when character semantics matter, not bytes.
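
A typical place where the byte/code point distinction bites is truncation: slicing at an arbitrary byte index can cut a multi-byte sequence in half. The sketch below shows one way to back up to a sequence boundary; truncateUtf8 is a hypothetical helper and assumes its input is already valid UTF-8.

```zig
const std = @import("std");

// Hypothetical helper: returns at most `max_bytes` bytes of `text` without
// splitting a code point. Assumes `text` is valid UTF-8.
fn truncateUtf8(text: []const u8, max_bytes: usize) []const u8 {
    if (text.len <= max_bytes) return text;
    var end = max_bytes;
    // Continuation bytes look like 10xxxxxx. If the cut would land on one,
    // step back until it lands on the first byte of a sequence.
    while (end > 0 and (text[end] & 0xC0) == 0x80) end -= 1;
    return text[0..end];
}

pub fn main() !void {
    const text = "Hello, 世界";
    const count = try std.unicode.utf8CountCodepoints(text);
    std.debug.print("bytes={d}, code points={d}\n", .{ text.len, count });
    std.debug.print("truncated to 8 bytes: '{s}'\n", .{truncateUtf8(text, 8)});
}
```

Here a limit of 8 bytes falls inside the three-byte encoding of 世, so the helper returns only the 7 bytes before it.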

Base64 Encoding

Base64 encodes binary data as printable ASCII, useful for embedding binary in text formats (JSON, XML, URLs). Zig provides standard, URL-safe, and custom Base64 variants.

Encoding and Decoding

Encode binary data to Base64 and decode it back.

Zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const original = "Hello, World!";

    // Encode
    const encoded_len = std.base64.standard.Encoder.calcSize(original.len);
    const encoded = try allocator.alloc(u8, encoded_len);
    defer allocator.free(encoded);
    _ = std.base64.standard.Encoder.encode(encoded, original);

    std.debug.print("Original: {s}\n", .{original});
    std.debug.print("Encoded: {s}\n", .{encoded});

    // Decode
    var decoded_buf: [100]u8 = undefined;
    const decoded_len = try std.base64.standard.Decoder.calcSizeForSlice(encoded);
    // Pass a destination slice of exactly the computed size.
    try std.base64.standard.Decoder.decode(decoded_buf[0..decoded_len], encoded);

    std.debug.print("Decoded: {s}\n", .{decoded_buf[0..decoded_len]});
}
Build and Run
Shell
$ zig build-exe base64_basic.zig && ./base64_basic
Output
Shell
Original: Hello, World!
Encoded: SGVsbG8sIFdvcmxkIQ==
Decoded: Hello, World!

std.base64.standard.Encoder and .Decoder provide encode/decode methods. The trailing == is padding; to omit it, use a no-padding codec such as std.base64.standard_no_pad or std.base64.url_safe_no_pad.
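
The difference between variants is easiest to see with input that maps onto the alphabet's last two symbols. In this sketch the two bytes 0xfb 0xff are chosen for that reason; the buffer sizes are simply large enough for the example.

```zig
const std = @import("std");

pub fn main() void {
    const bytes = [_]u8{ 0xfb, 0xff };
    var std_buf: [8]u8 = undefined;
    var url_buf: [8]u8 = undefined;

    // encode() returns the slice of the destination that was written.
    const standard = std.base64.standard.Encoder.encode(&std_buf, &bytes);
    const url_safe = std.base64.url_safe_no_pad.Encoder.encode(&url_buf, &bytes);

    std.debug.print("standard:         {s}\n", .{standard}); // +/8=
    std.debug.print("url_safe_no_pad:  {s}\n", .{url_safe}); // -_8
}
```

The URL-safe output avoids +, /, and =, all of which have special meaning in URLs.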

Custom Formatters

Implement the format function for your types to control how they’re printed with Writer.print().

Zig
const std = @import("std");

const Point = struct {
    x: i32,
    y: i32,

    pub fn format(self: @This(), writer: *std.Io.Writer) std.Io.Writer.Error!void {
        try writer.print("({d}, {d})", .{ self.x, self.y });
    }
};

pub fn main() !void {
    const p = Point{ .x = 10, .y = 20 };
    std.debug.print("Point: {f}\n", .{p});
}
Build and Run
Shell
$ zig build-exe custom_formatter.zig && ./custom_formatter
Output
Shell
Point: (10, 20)

In Zig 0.15.2, the format method signature is simplified to: pub fn format(self: @This(), writer: *std.Io.Writer) std.Io.Writer.Error!void. Use the {f} format specifier to invoke custom formatters (e.g., "{f}", not "{}").
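
Inside format you can use any specifier on the fields, which makes it easy to give a type a compact textual form. A sketch under the same Zig 0.15.2 signature, using a hypothetical Color type that prints as a CSS-style hex triplet:

```zig
const std = @import("std");

// Hypothetical type for illustration.
const Color = struct {
    r: u8,
    g: u8,
    b: u8,

    pub fn format(self: @This(), writer: *std.Io.Writer) std.Io.Writer.Error!void {
        // Two zero-padded lowercase hex digits per channel.
        try writer.print("#{x:0>2}{x:0>2}{x:0>2}", .{ self.r, self.g, self.b });
    }
};

pub fn main() !void {
    const orange = Color{ .r = 255, .g = 128, .b = 0 };
    std.debug.print("Color: {f}\n", .{orange});

    // The same formatter works with any print-style function.
    var buf: [16]u8 = undefined;
    const text = try std.fmt.bufPrint(&buf, "{f}", .{orange});
    std.debug.print("Length: {d}\n", .{text.len});
}
```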

Formatting to Buffers

For stack-allocated formatting without allocation, use std.fmt.bufPrint.

Zig
const std = @import("std");

pub fn main() !void {
    var buffer: [100]u8 = undefined;
    const result = try std.fmt.bufPrint(&buffer, "x={d}, y={d:.2}", .{ 42, 3.14159 });
    std.debug.print("Formatted: {s}\n", .{result});
}
Build and Run
Shell
$ zig build-exe bufprint.zig && ./bufprint
Output
Shell
Formatted: x=42, y=3.14

bufPrint returns error.NoSpaceLeft if the buffer is too small. Always size buffers appropriately or handle the error.
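
When a truncated or placeholder result is acceptable, the error can be handled in place rather than propagated. A minimal sketch, with label as a hypothetical helper and "item-?" as an arbitrary fallback string:

```zig
const std = @import("std");

// Hypothetical helper: formats an item label, substituting a placeholder
// when the caller's buffer is too small.
fn label(buffer: []u8, id: u32) []const u8 {
    return std.fmt.bufPrint(buffer, "item-{d}", .{id}) catch |err| switch (err) {
        error.NoSpaceLeft => "item-?",
    };
}

pub fn main() void {
    var small: [4]u8 = undefined;
    var large: [16]u8 = undefined;
    std.debug.print("{s}\n", .{label(&small, 12345)});
    std.debug.print("{s}\n", .{label(&large, 7)});
}
```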

Dynamic Formatting with Allocation

For dynamically sized output, use std.fmt.allocPrint which allocates and returns a formatted string.

Zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    const result = try std.fmt.allocPrint(allocator, "The answer is {d}", .{42});
    defer allocator.free(result);

    std.debug.print("Dynamic: {s}\n", .{result});
}
Build and Run
Shell
$ zig build-exe allocprint.zig && ./allocprint
Output
Shell
Dynamic: The answer is 42

allocPrint returns a slice you must free with allocator.free(result). Use this when output size is unpredictable.

Exercises

  • Write a CSV parser using std.mem.splitScalar and parseInt to read rows of numbers from a comma-separated file (see mem.zig).
  • Implement a hex dump utility that formats binary data as hexadecimal with ASCII representation (similar to hexdump -C).
  • Create a string validation function that checks if a string contains only ASCII printable characters, rejecting control codes and non-ASCII bytes.
  • Build a simple URL encoder/decoder using Base64 for the encoding portion and custom logic for percent-encoding special characters.

Caveats, alternatives, edge cases

  • UTF-8 vs. bytes: Zig strings are []const u8. Always clarify whether you’re working with bytes (indexing) or code points (semantic characters). Mismatched assumptions cause bugs with multi-byte characters.
  • Locale-sensitive operations: std.ascii and std.unicode don’t handle locale-specific case mapping or collation. For Turkish i vs. I or locale-aware sorting, you need external libraries.
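  • Float formatting precision: Most decimal fractions (0.1, for example) have no exact binary floating-point representation, and formatting with a limited precision such as {d:.2} discards digits, so a value may not survive a round trip through text unchanged. For exact decimal representation, use fixed-point arithmetic or dedicated decimal libraries.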
  • Base64 variants: Standard Base64 uses +/, URL-safe uses -_. Choose the correct encoder/decoder for your use case (std.base64.standard vs. std.base64.url_safe_no_pad).
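  • Format string safety: Format strings must be comptime-known and are checked against the argument tuple at compile time, so print cannot accept a format string built at runtime. For layouts chosen at runtime, assemble the output from several print or writeAll calls instead.
  • Writer interface: In Zig 0.15, formatting goes through the concrete std.Io.Writer interface (the type taken by the custom format method shown earlier), which can front files, sockets, in-memory buffers, or custom destinations. Older generic writers, such as the one returned by fixedBufferStream, are duck-typed and expose write(self, bytes: []const u8) !usize.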
