Fast UTF-8 validation and decoding in C header files, conforming to the Unicode Standard and ISO/IEC 10646.
#include "utf8_dfa32.h" // or utf8_dfa64.h for the 64-bit variant
#include "utf8_valid.h"
// Check if a string is valid UTF-8
bool ok = utf8_valid(src, len);
// Check and find the error position
size_t cursor;
if (!utf8_check(src, len, &cursor)) {
// cursor is the length of the maximal valid prefix
}
// Length of the maximal subpart of an ill-formed sequence
size_t n = utf8_maximal_subpart(src, len);
bool utf8_valid(const char *src, size_t len);
bool utf8_valid_ascii(const char *src, size_t len);
bool utf8_check(const char *src, size_t len, size_t *cursor);
bool utf8_check_ascii(const char *src, size_t len, size_t *cursor);
size_t utf8_maximal_subpart(const char *src, size_t len);
utf8_valid returns true if src[0..len) is valid UTF-8.
utf8_check returns true if valid. On failure, if cursor is
non-NULL, sets *cursor to the length of the maximal valid prefix —
i.e. the byte offset of the first invalid sequence.
utf8_maximal_subpart returns the length of the maximal subpart of
the ill-formed sequence starting at src, as defined by Unicode 17.0
Table 3-8. The return value is always at least 1. This is intended to
be called after utf8_check reports failure, pointing src at the
cursor position.
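The intended replacement-decoding loop — skip the valid prefix, emit one U+FFFD per maximal subpart, resume after it — can be sketched as follows. The stand-in implementations of utf8_check and utf8_maximal_subpart below are deliberately simplified (they validate only sequence structure and truncation, not the range restrictions of Table 3-8) so the sketch compiles on its own; in real use you would include the library headers instead.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for utf8_check / utf8_maximal_subpart so this
 * sketch is self-contained. They check only sequence structure and
 * truncation; the real library functions are stricter. */
static size_t seq_len(unsigned char b) {
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0; /* invalid lead byte or stray continuation */
}

static bool utf8_check(const char *src, size_t len, size_t *cursor) {
    size_t i = 0;
    while (i < len) {
        size_t n = seq_len((unsigned char)src[i]);
        if (n == 0 || i + n > len) goto fail;
        for (size_t k = 1; k < n; k++)
            if (((unsigned char)src[i + k] & 0xC0) != 0x80) goto fail;
        i += n;
    }
    return true;
fail:
    if (cursor) *cursor = i; /* length of the maximal valid prefix */
    return false;
}

static size_t utf8_maximal_subpart(const char *src, size_t len) {
    size_t n = seq_len((unsigned char)src[0]);
    if (n == 0) return 1;
    size_t i = 1;
    while (i < n && i < len && ((unsigned char)src[i] & 0xC0) == 0x80)
        i++;
    return i;
}

/* The driver loop: skip the valid prefix, count one U+FFFD replacement
 * per maximal subpart, then resume validation after it. */
static size_t count_replacements(const char *src, size_t len) {
    size_t count = 0, off = 0, cursor;
    while (!utf8_check(src + off, len - off, &cursor)) {
        off += cursor;
        off += utf8_maximal_subpart(src + off, len - off);
        count++;
    }
    return count;
}
```

A decoder that emits U+FFFD would insert the replacement character at the same point where `count` is incremented.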
utf8_valid_ascii and utf8_check_ascii are variants with an
ASCII fast path that skips the DFA for 16-byte chunks containing only
ASCII bytes. They are otherwise identical in behaviour to utf8_valid
and utf8_check. Throughput relative to utf8_valid depends on content mix
and microarchitecture; see the Performance section for measured results.
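One plausible shape of the 16-byte ASCII test (an assumption for illustration — the header may use SIMD or a different formulation): load two 8-byte words and check the high bit of every byte lane.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative 16-byte all-ASCII check: OR two 8-byte words together
 * and test the top bit of each byte. memcpy keeps the loads
 * alignment-safe; compilers lower it to plain loads. */
static bool chunk_is_ascii(const char *p) {
    uint64_t a, b;
    memcpy(&a, p, 8);
    memcpy(&b, p + 8, 8);
    return ((a | b) & 0x8080808080808080ull) == 0;
}
```

If the check passes, all 16 bytes are valid UTF-8 and the DFA can be skipped for the whole chunk.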
- C99 or later
utf8_dfa32.h implements a shift-based DFA originally described by
Per Vognsen,
with 32-bit row packing derived by
Dougall Johnson
using an SMT solver (see tool/smt_solver.py).
The DFA has 9 states. Each state is assigned a bit offset within a 32-bit integer. Each byte in the input maps to a row in a 256-entry lookup table; the row encodes all 9 state transitions packed into a single uint32_t. The inner loop is a single table lookup and shift:
state = (dfa[byte] >> state) & 31;
The error state is fixed at bit offset 0. Since its value is zero, transitions to it contribute nothing to the row value. The SMT solver finds offsets for the remaining 8 states that fit without collision inside 32 bits.
The fast path scans 16 bytes at a time, skipping the DFA entirely for pure ASCII chunks.
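The packing technique can be illustrated with a toy three-state shift DFA covering only ASCII and 2-byte sequences (the real table has 9 states and full UTF-8 coverage; at this size the offsets can be hand-picked and no solver is needed).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy shift-based DFA. Each state is a bit offset inside a 32-bit row;
 * the error state sits at offset 0, so error transitions contribute
 * nothing to the row value and the error state is sticky. */
enum { ERR = 0, ACCEPT = 6, TAIL = 12 };

static uint32_t row_for(uint8_t b) {
    if (b < 0x80)                 /* ASCII: ACCEPT -> ACCEPT, TAIL -> ERR */
        return (uint32_t)ACCEPT << ACCEPT;
    if ((b & 0xC0) == 0x80)       /* continuation: TAIL -> ACCEPT */
        return (uint32_t)ACCEPT << TAIL;
    if (b >= 0xC2 && b <= 0xDF)   /* 2-byte lead: ACCEPT -> TAIL */
        return (uint32_t)TAIL << ACCEPT;
    return 0;                     /* everything else: ERR from any state */
}

static bool toy_valid(const char *src, size_t len) {
    uint32_t state = ACCEPT;
    for (size_t i = 0; i < len; i++)  /* the whole inner loop: */
        state = (row_for((uint8_t)src[i]) >> state) & 31;
    return state == ACCEPT;
}
```

A table-driven version would precompute `row_for` into a 256-entry array, making the loop body exactly the one-line lookup-and-shift shown above.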
utf8_valid.h implements the validation functions in terms of
utf8_dfa_step() and UTF8_DFA_ACCEPT/UTF8_DFA_REJECT — symbols
defined by whichever DFA header is included before it. Include exactly
one of utf8_dfa32.h or utf8_dfa64.h before utf8_valid.h.
utf8_dfa64.h provides the same API using a 64-bit table. The error
state is fixed at bit offset 0 and contributes nothing to row values.
The remaining 8 state offsets are fixed multiples of 6 (6, 12, 18, ...,
48), using 54 bits of the 64-bit row. Bits 56–62 store a per-byte payload
mask used to accumulate codepoint bits without a separate lead/tail dispatch:
cp = (cp << 6) | (byte & (row >> 56))
ASCII rows carry 0x7F, 2/3/4-byte lead rows carry 0x1F/0x0F/0x07,
continuation rows carry 0x3F, and error rows carry 0x00.
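The payload-mask accumulation can be demonstrated standalone. Here the mask is computed directly from the byte rather than read from bits 56–62 of a table row, but the accumulation step is identical: the same `(cp << 6) | (byte & mask)` for every byte, with no lead/tail dispatch.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-byte payload mask, matching the values the 64-bit rows carry. */
static uint8_t payload_mask(uint8_t b) {
    if (b < 0x80) return 0x7F;            /* ASCII */
    if ((b & 0xC0) == 0x80) return 0x3F;  /* continuation */
    if ((b & 0xE0) == 0xC0) return 0x1F;  /* 2-byte lead */
    if ((b & 0xF0) == 0xE0) return 0x0F;  /* 3-byte lead */
    if ((b & 0xF8) == 0xF0) return 0x07;  /* 4-byte lead */
    return 0x00;                          /* error row */
}

/* Decode one already-validated sequence of n bytes into a code point,
 * applying the identical step to lead and continuation bytes alike. */
static uint32_t decode_seq(const uint8_t *s, size_t n) {
    uint32_t cp = 0;
    for (size_t i = 0; i < n; i++)
        cp = (cp << 6) | (s[i] & payload_mask(s[i]));
    return cp;
}
```

For example, the euro sign E2 82 AC accumulates 0x2 → 0x82 → 0x20AC, yielding U+20AC.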
Documents are retrieved from Wikipedia and converted to plain text, available in benchmark/corpus/.
| File | Size | Code points | Distribution | Best utf8_valid | Best utf8_valid_ascii |
|---|---|---|---|---|---|
| ar.txt | 25 KB | 14K | 19% ASCII, 81% 2-byte | 4110 MB/s | 5221 MB/s |
| el.txt | 102 KB | 59K | 23% ASCII, 77% 2-byte | 4108 MB/s | 5009 MB/s |
| en.txt | 80 KB | 82K | 99.9% ASCII | 4108 MB/s | 33142 MB/s |
| ja.txt | 176 KB | 65K | 11% ASCII, 89% 3-byte | 4111 MB/s | 5221 MB/s |
| lv.txt | 135 KB | 127K | 92% ASCII, 7% 2-byte | 4108 MB/s | 6164 MB/s |
| ru.txt | 148 KB | 85K | 23% ASCII, 77% 2-byte | 4110 MB/s | 4896 MB/s |
| sv.txt | 94 KB | 93K | 96% ASCII, 4% 2-byte | 4108 MB/s | 8776 MB/s |
Best numbers from Clang 20 -O3 -march=x86-64-v3 (Raptor Lake).
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 2379 MB/s | 2501 MB/s | ASCII fast path competitive on multibyte |
| -O2 -march=x86-64-v3 | 4101 MB/s | 3977 MB/s | BMI2 SHRX eliminates CL constraint |
| -O3 -march=x86-64-v3 | 4105 MB/s | 5104 MB/s | fast path profitable on all content types |
| -O3 -march=native | 4101 MB/s | 4953 MB/s | similar to x86-64-v3 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 29–33 GB/s at -O3.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 2178 MB/s | 1628 MB/s | ASCII fast path hurts on multibyte |
| -O2 -march=x86-64-v3 | 2175 MB/s | 1667 MB/s | GCC not generating SHRX at -O2 |
| -O3 -march=x86-64-v3 | 3679 MB/s | 3480 MB/s | -O3 needed for SHRX |
| -O3 -march=native | 4101 MB/s | 3950 MB/s | native closes gap with Clang |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 29–33 GB/s at -O3.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 2094 MB/s | 1670 MB/s | ASCII fast path hurts on multibyte |
| -O2 -march=x86-64-v3 | 3454 MB/s | 2569 MB/s | BMI2 SHRX eliminates CL constraint |
| -O3 -march=x86-64-v3 | 3452 MB/s | 3162 MB/s | gap narrows at -O3 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches 14–19 GB/s at all optimization levels.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 2592 MB/s | 2684 MB/s | essentially equal on multibyte |
| -O2 -mtune=apple-m1 | 2779 MB/s | 2764 MB/s | essentially equal on multibyte |
| -O3 -mtune=apple-m1 | 2787 MB/s | 4970 MB/s | -O3 unlocks NEON fast path |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches ~21 GB/s at -O3.
| Flags | utf8_valid | utf8_valid_ascii | Notes |
|---|---|---|---|
| -O2 | 1483 MB/s | 1096 MB/s | GCC significantly lags Clang |
| -O2 -mtune=apple-m1 | 1485 MB/s | 1094 MB/s | mtune has no effect |
| -O3 -mtune=apple-m1 | 2771 MB/s | 2785 MB/s | -O3 closes gap with Clang -O2 |
Numbers shown for ar.txt (81% 2-byte). On near-pure ASCII (en.txt) utf8_valid_ascii
reaches ~36 GB/s at -O3, though GCC's non-ASCII throughput lags Clang throughout.
- `utf8_valid` is the most consistent performer across content mixes and compilers in these measurements. Across all tested platforms, `utf8_valid` processes approximately one byte per clock cycle at peak throughput, consistent with the unrolled DFA loop executing one table lookup and one shift per byte, pipelined by the out-of-order engine across the 16-byte unroll.
- On x86, `-march=x86-64-v3` or `-march=native` enables BMI2 `SHRX`, which removes the variable-shift dependency on `CL` and roughly doubles throughput for `utf8_valid`. GCC requires `-O3` to take advantage of this; Clang does so at `-O2`.
- `utf8_valid_ascii` profitability on multibyte content is microarchitecture dependent. On Haswell it is slower than `utf8_valid` on multibyte-heavy input at all tested flags. On Raptor Lake with Clang `-O3 -march=x86-64-v3` it is faster on all content types, reflecting the wider OOO window and stronger branch predictor of the newer microarchitecture.
- On Apple M1 with Clang, `utf8_valid_ascii` is essentially equal to `utf8_valid` on multibyte content at `-O2`, and significantly faster at `-O3`, where the NEON fast path vectorizes aggressively. No penalty is observed.
- On Apple M1 with GCC 15, both implementations lag Clang at `-O2`; at `-O3`, `utf8_valid` reaches near-Clang throughput, while `utf8_valid_ascii` remains much stronger on near-pure ASCII input.
The benchmark is in benchmark/bench.c. Compile and run with:
make bench
make bench BENCH_OPTFLAGS="-O3 -march=x86-64-v3"
./bench # benchmarks benchmark/corpus/
./bench -d <directory> # benchmarks all .txt files in directory
./bench -f <file> # benchmarks single file
./bench -s <MB> # resize input to <MB> before benchmarking
Benchmark mode (mutually exclusive):
| Flag | Description |
|---|---|
| -t <secs> | run each implementation for <secs> seconds (default: 20) |
| -n <reps> | run each implementation for <reps> repetitions |
| -b <MB> | run each implementation until <MB> total data processed |
Warmup (-w <n>): before timing, each implementation is run for n
iterations to warm up caches and branch predictors. By default the warmup
count is derived from input size — targeting approximately 256 MB of warmup
data, capped between 1 and 100 iterations. Use -w 0 to disable warmup
entirely.
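The default warmup rule described above amounts to something like the following sketch (the exact logic in bench.c may differ):

```c
#include <stddef.h>

/* Default warmup count: aim for ~256 MB of warmup data, clamped to
 * [1, 100] iterations. A sketch of the documented rule only. */
static size_t default_warmup(size_t input_bytes) {
    const size_t target = (size_t)256 << 20;  /* ~256 MB */
    size_t n = input_bytes ? target / input_bytes : 100;
    if (n < 1) n = 1;
    if (n > 100) n = 100;
    return n;
}
```

For the corpus files here (tens to hundreds of KB), this always lands on the 100-iteration cap; only inputs larger than ~2.5 MB fall below it.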
Output format: For each file, the header line shows the filename,
byte size, code point count, and average bytes per code point
(units/point). The code point distribution breaks down the input by
Unicode range. Results are sorted slowest to fastest; the percentage
shows improvement over the slowest implementation.
sv.txt: 94 KB; 93K code points; 1.04 units/point
U+0000..U+007F 90K 96.4%
U+0080..U+07FF 3K 3.5%
U+0800..U+FFFF 171 0.2%
hoehrmann 422 MB/s
utf8_valid_old 1746 MB/s (+314%)
utf8_valid 2778 MB/s (+559%)
utf8_valid_ascii 10012 MB/s (+2274%)
units/point is a rough content-mix indicator: 1.00 is near-pure ASCII,
~1.7–1.9 is common for 2-byte-heavy text, and ~2.7–3.0 for CJK-heavy text.
The benchmark includes two reference implementations: hoehrmann, a widely used
table-driven DFA implementation, and utf8_valid_old, the previous scalar decoder,
included to track regressions and quantify gains from the current DFA approach.
BSD 2-Clause License. Copyright (c) 2017–2026 Christian Hansen.
- Flexible and Economical UTF-8 Decoder by Björn Höhrmann
- Branchless UTF-8 Decoder by Chris Wellons
- simdutf Unicode validation and transcoding at SIMD speed