
chansen/c-utf8-valid


utf8_valid.h

Fast UTF-8 validation and decoding in C header files, conforming to the Unicode and ISO/IEC 10646 specification.

Usage

#include "utf8_dfa32.h"   // or utf8_dfa64.h for the 64-bit variant
#include "utf8_valid.h"

// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);

// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
  // cursor is the length of the maximal valid prefix
}

// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);

API

bool   utf8_valid(const char *src, size_t len);
bool   utf8_valid_ascii(const char *src, size_t len);
bool   utf8_check(const char *src, size_t len, size_t *cursor);
bool   utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);

utf8_valid returns true if src[0..len) is valid UTF-8.

utf8_check returns true if src[0..len) is valid UTF-8. On failure, if cursor is non-NULL, it sets *cursor to the length of the maximal valid prefix, i.e. the byte offset at which the first invalid sequence starts.

utf8_maximal_subpart returns the length of the maximal subpart of the ill-formed sequence starting at src, as defined by Unicode 17.0 Table 3-8. The return value is always at least 1. It is intended to be called after utf8_check reports failure, with src advanced to the cursor position.

utf8_valid_ascii and utf8_check_ascii are variants with an ASCII fast path that skips the DFA for 16-byte chunks containing only ASCII bytes. Their throughput relative to utf8_valid depends on the content mix and microarchitecture; see the Performance section for measured results. They are otherwise identical in behaviour to utf8_valid and utf8_check.

Requirements

  • C99 or later

How it works

utf8_dfa32.h implements a shift-based DFA originally described by Per Vognsen, with 32-bit row packing derived by Dougall Johnson using an SMT solver (see tool/smt_solver.py).

The DFA has 9 states. Each state is assigned a bit offset within a 32-bit integer. Each byte in the input maps to a row in a 256-entry lookup table; the row encodes all 9 state transitions packed into a single uint32_t. The inner loop is a single table lookup and shift:

state = (dfa[byte] >> state) & 31;

The error state is fixed at bit offset 0. Because transitions into the error state are encoded as zero bits, they contribute nothing to the row value. The SMT solver finds offsets for the remaining 8 states so that all transitions fit without collision inside 32 bits.

The fast path scans 16 bytes at a time, skipping the DFA entirely for pure ASCII chunks.

utf8_valid.h implements the validation functions in terms of utf8_dfa_step() and UTF8_DFA_ACCEPT/UTF8_DFA_REJECT, symbols defined by whichever DFA header is included before it. Include exactly one of utf8_dfa32.h or utf8_dfa64.h before utf8_valid.h.

64-bit variant

utf8_dfa64.h provides the same API using a 64-bit table. The error state is fixed at bit offset 0 and contributes nothing to row values. The remaining 8 state offsets are fixed multiples of 6 (6, 12, 18, ..., 48), using 54 bits of the 64-bit row. Bits 56–62 store a per-byte payload mask used to accumulate codepoint bits without a separate lead/tail dispatch:

    cp = (cp << 6) | (byte & (row >> 56))

ASCII rows carry 0x7F, 2/3/4-byte lead rows carry 0x1F/0x0F/0x07, continuation rows carry 0x3F, and error rows carry 0x00.

Performance

Corpus

Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.

| File   | Size   | Code points | Distribution          | Best utf8_valid | Best utf8_valid_ascii |
|--------|--------|-------------|-----------------------|-----------------|-----------------------|
| ar.txt | 25 KB  | 14K         | 19% ASCII, 81% 2-byte | 4110 MB/s       | 5221 MB/s             |
| el.txt | 102 KB | 59K         | 23% ASCII, 77% 2-byte | 4108 MB/s       | 5009 MB/s             |
| en.txt | 80 KB  | 82K         | 99.9% ASCII           | 4108 MB/s       | 33142 MB/s            |
| ja.txt | 176 KB | 65K         | 11% ASCII, 89% 3-byte | 4111 MB/s       | 5221 MB/s             |
| lv.txt | 135 KB | 127K        | 92% ASCII, 7% 2-byte  | 4108 MB/s       | 6164 MB/s             |
| ru.txt | 148 KB | 85K         | 23% ASCII, 77% 2-byte | 4110 MB/s       | 4896 MB/s             |
| sv.txt | 94 KB  | 93K         | 96% ASCII, 4% 2-byte  | 4108 MB/s       | 8776 MB/s             |

Best numbers from Clang 20 -O3 -march=x86-64-v3 (Raptor Lake).


Raptor Lake (Clang 20, x86-64)

| Flags                | utf8_valid | utf8_valid_ascii | Notes                                      |
|----------------------|------------|------------------|--------------------------------------------|
| -O2                  | 2379 MB/s  | 2501 MB/s        | ASCII fast path competitive on multibyte   |
| -O2 -march=x86-64-v3 | 4101 MB/s  | 3977 MB/s        | BMI2 SHRX eliminates CL constraint         |
| -O3 -march=x86-64-v3 | 4105 MB/s  | 5104 MB/s        | fast path profitable on all content types  |
| -O3 -march=native    | 4101 MB/s  | 4953 MB/s        | similar to x86-64-v3                       |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–33 GB/s at -O3.

Raptor Lake (GCC 14, x86-64)

| Flags                | utf8_valid | utf8_valid_ascii | Notes                               |
|----------------------|------------|------------------|-------------------------------------|
| -O2                  | 2178 MB/s  | 1628 MB/s        | ASCII fast path hurts on multibyte  |
| -O2 -march=x86-64-v3 | 2175 MB/s  | 1667 MB/s        | GCC does not generate SHRX at -O2   |
| -O3 -march=x86-64-v3 | 3679 MB/s  | 3480 MB/s        | -O3 needed for SHRX                 |
| -O3 -march=native    | 4101 MB/s  | 3950 MB/s        | native closes gap with Clang        |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 29–33 GB/s at -O3.

Haswell (Clang 22, x86-64)

| Flags                | utf8_valid | utf8_valid_ascii | Notes                               |
|----------------------|------------|------------------|-------------------------------------|
| -O2                  | 2094 MB/s  | 1670 MB/s        | ASCII fast path hurts on multibyte  |
| -O2 -march=x86-64-v3 | 3454 MB/s  | 2569 MB/s        | BMI2 SHRX eliminates CL constraint  |
| -O3 -march=x86-64-v3 | 3452 MB/s  | 3162 MB/s        | gap narrows at -O3                  |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches 14–19 GB/s at all optimization levels.

Apple M1 (Clang 21, AArch64)

| Flags                | utf8_valid | utf8_valid_ascii | Notes                           |
|----------------------|------------|------------------|---------------------------------|
| -O2                  | 2592 MB/s  | 2684 MB/s        | essentially equal on multibyte  |
| -O2 -mtune=apple-m1  | 2779 MB/s  | 2764 MB/s        | essentially equal on multibyte  |
| -O3 -mtune=apple-m1  | 2787 MB/s  | 4970 MB/s        | -O3 unlocks NEON fast path      |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~21 GB/s at -O3.

Apple M1 (GCC 15, AArch64)

| Flags                | utf8_valid | utf8_valid_ascii | Notes                           |
|----------------------|------------|------------------|---------------------------------|
| -O2                  | 1483 MB/s  | 1096 MB/s        | GCC significantly lags Clang    |
| -O2 -mtune=apple-m1  | 1485 MB/s  | 1094 MB/s        | mtune has no effect             |
| -O3 -mtune=apple-m1  | 2771 MB/s  | 2785 MB/s        | -O3 closes gap with Clang -O2   |

Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii reaches ~36 GB/s at -O3, though GCC's non-ASCII throughput lags Clang throughout.

Observations

  • utf8_valid is the most consistent performer across content mixes and compilers in these measurements. Across all tested platforms, utf8_valid processes approximately one byte per clock cycle at peak throughput, consistent with the unrolled DFA loop executing one table lookup and one shift per byte, pipelined by the out-of-order engine across the 16-byte unroll.
  • On x86, -march=x86-64-v3 or -march=native enables BMI2 SHRX, which removes the variable-shift dependency on CL and roughly doubles throughput for utf8_valid. GCC requires -O3 to take advantage of this; Clang does so at -O2.
  • utf8_valid_ascii profitability on multibyte content is microarchitecture dependent. On Haswell it is slower than utf8_valid on multibyte-heavy input at all tested flags. On Raptor Lake with Clang -O3 -march=x86-64-v3 it is faster on all content types, reflecting the wider OOO window and stronger branch predictor on the newer microarchitecture.
  • On Apple M1 with Clang, utf8_valid_ascii is essentially equal to utf8_valid on multibyte content at -O2, and significantly faster at -O3 where the NEON fast path vectorizes aggressively. No penalty is observed.
  • On Apple M1 with GCC 15, both implementations lag Clang at -O2; at -O3, utf8_valid reaches near-Clang throughput, while utf8_valid_ascii remains much stronger on near-pure ASCII input.

Benchmark

The benchmark is in benchmark/bench.c. Compile and run with:

make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench                        # benchmarks benchmark/corpus/
./bench -d <directory>         # benchmarks all .txt files in directory
./bench -f <file>              # benchmarks single file
./bench -s <MB>                # resize input to <MB> before benchmarking

Benchmark mode (mutually exclusive):

| Flag      | Description                                             |
|-----------|---------------------------------------------------------|
| -t <secs> | run each implementation for <secs> seconds (default: 20) |
| -n <reps> | run each implementation for <reps> repetitions          |
| -b <MB>   | run each implementation until <MB> total data processed |

Warmup (-w <n>): before timing, each implementation is run for n iterations to warm up caches and branch predictors. By default the warmup count is derived from input size — targeting approximately 256 MB of warmup data, capped between 1 and 100 iterations. Use -w 0 to disable warmup entirely.

Output format: For each file, the header line shows the filename, byte size, code point count, and average bytes per code point (units/point). The code point distribution breaks down the input by Unicode range. Results are sorted slowest to fastest; the percentage shows improvement over the slowest implementation.

sv.txt: 94 KB; 93K code points; 1.04 units/point
  U+0000..U+007F          90K  96.4%
  U+0080..U+07FF           3K   3.5%
  U+0800..U+FFFF          171   0.2%
  hoehrmann                    422 MB/s
  utf8_valid_old              1746 MB/s  (+314%)
  utf8_valid                  2778 MB/s  (+559%)
  utf8_valid_ascii           10012 MB/s  (+2274%)

units/point is a rough content-mix indicator: 1.00 is near-pure ASCII, ~1.7–1.9 is common for 2-byte-heavy text, and ~2.7–3.0 for CJK-heavy text.

The benchmark includes two reference implementations: hoehrmann, a widely used table-driven DFA implementation, and utf8_valid_old, the previous scalar decoder, to track regressions and quantify the gains from the current DFA approach.

License

BSD 2-Clause License. Copyright (c) 2017–2026 Christian Hansen.

