
Add micro-benchmark suite (folly Benchmark)#115

Open
gmarzot wants to merge 23 commits into main from benchmark-scaffolding

Conversation

@gmarzot
Contributor

@gmarzot gmarzot commented Apr 5, 2026

Summary

Add micro-benchmark suite using folly Benchmark, modeled after Quicr/libquicr.

  • 9 benchmark files, 63 benchmarks covering:
    • MOQT framer: varint encode/decode (cold + warm), frame writes and parses (Subscribe, Subgroup, StreamObject, PublishNamespace, Fetch, Goaway), Subscribe roundtrip, TrackNamespace/FullTrackName operations
    • MOQT extensions: serialize, deserialize, and roundtrip at 1/10/100/1000 extensions (direct comparison to libquicr)
    • Stats: collector callbacks, snapshot, histogram, Prometheus text formatting at scale
    • Service matcher: exact/wildcard/fallback/no-match routing
    • Config: YAML load, resolve, schema generation
  • folly Benchmark via the existing standalone folly fetch — no new top-level dep. Links Folly::follybenchmark (already built as part of the moxygen standalone build).
  • --benchmark flag for scripts/build.sh
  • CI benchmark jobs on ubuntu-22.04 and macos-latest
  • macOS gflags shared linking fix (GFLAGS_SHARED=ON on Darwin)
  • macOS cmake version check fix (grep -P → sed)

Build & run

./scripts/build.sh --benchmark
./build/benchmark/moqx_benchmark

Test plan

  • Full suite passes locally on WSL2 (linux x86)
  • Full suite passes on argo (macOS ARM64)
  • CI benchmark passes on ubuntu-22.04
  • CI benchmark passes on macos-latest
  • Same-hardware comparison with libquicr on argo (see benchmark report comment)

Future benchmarks

  • Relay forwarding throughput (equivalent to libquicr's PQ_ConnDataForwarding, requires mock sessions)
  • Subgroup object header parse (hot path for object delivery)
  • Multi-session stats aggregation (StatsRegistry::aggregateAsync)


gmarzot added 3 commits April 5, 2026 12:40
8 benchmark files covering MOQT framer (varint, frame writes,
TrackNamespace/FullTrackName ops), stats collector, histogram,
Prometheus formatting, service matcher, and config loader/resolver.

41 benchmarks total, modeled after Quicr/libquicr's benchmark suite.

Build with: ./scripts/build.sh --benchmark
Run with: ./build/benchmark/moqx_benchmark

Also adds a benchmark CI job on ubuntu-22.04 for apples-to-apples
comparison with libquicr's GitHub-hosted runner results.
Contributor

@afrind afrind left a comment


@afrind made 2 comments.
Reviewable status: 0 of 12 files reviewed, 2 unresolved discussions (waiting on gmarzot).


benchmark/CMakeLists.txt line 10 at r1 (raw file):

  FetchContent_Declare(
    benchmark
    GIT_REPOSITORY https://github.com/google/benchmark.git

We can use folly/Benchmark and not add another dep?


benchmark/moqt_framer.cpp line 9 at r1 (raw file):

namespace {

using namespace moxygen;

This likely belongs in moxygen proper?

@afrind
Contributor

afrind commented Apr 5, 2026

Thanks for the data.

The varint encoder might be measuring the test rather than real-world usage -- we typically have an allocated buffer in advance, so this is free in practice.

moxygen's varint parser is designed to work across discontiguous buffers to avoid memcpy -- stream data can arrive e.g. 1 byte at a time on the stream. But maybe we're optimized for the pessimistic rather than optimistic case? Or maybe memcpy is actually faster than setting up cursor infrastructure.

The namespace design is a nice libquicr optimization that we should apply. We've just never looked -- these are only in control plane messages.

gmarzot added 3 commits April 5, 2026 18:29
- ExtensionsDeserialize/N: isolated parse (compare to libquicr)
- ParseSubscribeRequest, ParseFetch, ParsePublishNamespace, ParseGoaway
- SubscribeRoundTrip: full write+parse cycle

62 benchmarks total. Tested on WSL2 (linux) and argo (macOS ARM64).
Split varint encode into Cold (new IOBufQueue per call) and Warm
(reused IOBufQueue). Warm reflects production where the buffer is
pre-allocated. Shows 2-2.7x improvement over cold on both platforms.

Addresses Alan's feedback that the cold benchmark measures allocation
overhead rather than real-world varint encode cost.
@gmarzot
Contributor Author

gmarzot commented Apr 6, 2026

benchmark/CMakeLists.txt line 10 at r1 (raw file):

Previously, afrind wrote…

We can use folly/Benchmark and not add another dep?

Should I make that change before the initial merge? It's your call. Claude says it's doable.

@gmarzot
Contributor Author

gmarzot commented Apr 6, 2026

benchmark/moqt_framer.cpp line 9 at r1 (raw file):

Previously, afrind wrote…

This likely belongs in moxygen proper?

I guess we can phase out benchmarks as they are upstreamed, but the focus is on moqx, right? Can play it however you wish.

@gmarzot
Contributor Author

gmarzot commented Apr 6, 2026

Benchmark Report

9 files, 63 benchmarks. All numbers from argo (Apple M4, 10 cores). libquicr built and run on the same machine for a fair comparison.

Extensions Serialize/Deserialize/RoundTrip

Both serialize MOQT extension key-value pairs per draft-ietf-moq-transport.

Benchmark          moqx      libquicr   Ratio
Serialize/1        50.6 ns   387 ns     moqx 7.6x faster
Serialize/10       234 ns    2390 ns    moqx 10.2x faster
Serialize/100      2575 ns   22800 ns   moqx 8.9x faster
Serialize/1000     27.1 μs   241 μs     moqx 8.9x faster
Deserialize/1      42.4 ns   395 ns     moqx 9.3x faster
Deserialize/10     218 ns    2290 ns    moqx 10.5x faster
Deserialize/100    1751 ns   21900 ns   moqx 12.5x faster
Deserialize/1000   16.8 μs   231 μs     moqx 13.7x faster
RoundTrip/1        96.6 ns   789 ns     moqx 8.1x faster
RoundTrip/10       483 ns    4680 ns    moqx 9.7x faster
RoundTrip/100      4437 ns   45700 ns   moqx 10.3x faster
RoundTrip/1000     42.6 μs   493 μs     moqx 11.6x faster

The implementations use different extension value types and serialization strategies, so these ratios reflect end-to-end framework differences, not a single design choice.

QUIC Varint Encode/Decode

Both encode/decode RFC 9000 variable-length integers through different APIs.

Benchmark                          moqx      libquicr  Notes
Encode (cold — new IOBufQueue)     26.2 ns   0.23 ns   Not comparable — moqx allocates a new IOBufQueue per call; libquicr writes raw bytes to a pre-allocated buffer.
Encode (warm — reused IOBufQueue)  9.67 ns   0.23 ns   Remaining gap is folly cursor/appender setup. In production the queue is pre-allocated and reused across objects.
Decode (small)                     2.07 ns   0.23 ns   moqx uses ContiguousReadCursor, designed for discontiguous buffers; libquicr Decode is a trivial cast.
Decode (real parse)                2.07 ns   2.20 ns   ~equal when both do actual multi-byte parsing (UIntVar_FromBytes).

The encode gap is IOBuf framework overhead. moxygen's parser is designed for the pessimistic case of fragmented stream data arriving across discontiguous buffers, which adds overhead for the contiguous case.

TrackNamespace

Both implement MOQT TrackNamespace with different internal representations.

Benchmark   moqx      libquicr  Notes
Hash        6.49 ns   1.50 ns   folly::hash_range over vector<string> vs custom flat hash
Construct   15.7 ns   5.30 ns   vector<string> heap allocation vs compact/inline storage

This is an architectural difference in moxygen's type design; these types appear only in control-plane messages. A flatter representation is a potential optimization to apply upstream.

Frame Write and Parse (moqx only — no libquicr equivalent)

Benchmark         Write     Parse     RoundTrip
SubscribeRequest  146 ns    33.2 ns   155 ns
Fetch             176 ns    33.3 ns   —
PublishNamespace  88.1 ns   19.2 ns   —
Goaway            78.1 ns   33.2 ns   —
SubgroupHeader    51.0 ns   —         —
StreamObject      19.6 ns   —         —

moqx Application Layer (no libquicr equivalent)

Benchmark                 Time
StatsCollector callback   0.70 ns
StatsCollector Snapshot   28.5 ns
Histogram AddValue        0.92 ns
Prometheus Format         4.3 μs
ServiceMatcher Exact      9.65 ns
ServiceMatcher Wildcard   12.0 ns
Config Load YAML (3 svc)  47.9 μs
Config Resolve (50 svc)   12.8 μs

libquicr-only (no moqx equivalent yet)

Benchmark              Time       Notes
PQ_ConnDataForwarding  874 ns     Relay forwarding — highest-value future addition
PQ_Push/Pop            53-308 ns  Priority queue ops
DataStorage_Push       19.4 ns    Object storage

Future work

  • Relay forwarding throughput benchmark (requires mock MoQ sessions)
  • Subgroup object header parse
  • Multi-session stats aggregation
  • Consider migration to folly/Benchmark (already a dependency)

@gmarzot gmarzot mentioned this pull request Apr 7, 2026
@gmarzot gmarzot self-assigned this Apr 8, 2026
Per review feedback (avoid adding google/benchmark as a new top-level
dep when folly is already pulled in transitively via the moxygen
standalone build).

Translation:
- #include <benchmark/benchmark.h> -> #include <folly/Benchmark.h>
- void BM_X(benchmark::State& state) { for (auto _ : state) {...} }
  -> BENCHMARK(BM_X, iters) { for (unsigned i = 0; i < iters; ++i) {...} }
  with folly::BenchmarkSuspender wrapping the previously-untimed
  setup so semantics match Google's "setup is not timed" model.
- benchmark::DoNotOptimize -> folly::doNotOptimizeAway
- BENCHMARK(F)->Arg(N)->Arg(M) -> BENCHMARK_NAMED_PARAM(F, _N, N)
  / BENCHMARK_NAMED_PARAM(F, _M, M) (one line per arg)
- BENCHMARK_MAIN auto-generated main -> explicit benchmark_main.cpp
  with folly::Init + folly::runBenchmarks.

CMake: drop FetchContent on google/benchmark, link Folly::follybenchmark
(already built as part of the standalone folly fetch — zero new deps).

CI workflows unchanged: bare `./build/benchmark/moqx_benchmark` works
with folly's defaults; no Google-specific CLI flags were used.

Same 63 benchmarks in the same 9 source files.
@gmarzot gmarzot changed the title Add micro-benchmark suite (Google Benchmark) Add micro-benchmark suite (folly Benchmark) May 7, 2026
…-folly

# Conflicts:
#	.github/workflows/ci-main.yml
#	.github/workflows/ci-pr.yml
#	scripts/build.sh
gmarzot added 3 commits May 7, 2026 10:19
main renamed lower_case headers to CamelCase across the 215-commit
drift since this PR was opened, and the moqx headers live at
"$PROJECT/src/<...>" with the include path set to $PROJECT/src — not
under a "moqx/" prefix. Update benchmark sources to match what the
rest of the codebase uses:

- "stats/BoundedHistogram.h"          (was <moqx/stats/BoundedHistogram.h>)
- "stats/StatsRegistry.h"             (was <moqx/stats/StatsRegistry.h>)
- "stats/MoQStatsCollector.h"         (was <moqx/stats/MoQStatsCollector.h>)
- "config/loader/Loader.h"            (was <moqx/config/loader/loader.h>)
- "config/loader/ConfigResolver.h"    (was <moqx/config/loader/config_resolver.h>)
- "config/loader/ParsedConfig.h"      (was <moqx/config/loader/parsed_config.h>)
- "ServiceMatcher.h"                  (was <moqx/ServiceMatcher.h>)
- --bm_json_verbose=bench-results.json — machine-readable output for
  perf-regression detection across PRs and absolute throughput
  computation (vs. raw ns/op which has ~1-5% framework-overhead noise).
- Step summary renders the human-readable table directly in the GitHub
  Actions run UI (no artifact download needed for casual review).
- Both bench-results.json (machine-readable) and bench-output.txt
  (human-readable) uploaded as CI artifacts per platform.

Per-platform artifact name: bench-results-{linux,macos}.
…ATIVE on warm varints

UserCounters[bytes_per_iter] on the four moqt_extensions families
(Serialize, SerializeArray, Deserialize, RoundTrip) — the wire byte
count is the natural normalization point for libquicr comparison and
removes per-iteration framework overhead from the throughput
calculation:

  throughput (bytes/sec) = bytes_per_iter / ns_per_iter * 1e9

That number is framework-independent — comparable apples-to-apples
between this folly Benchmark suite and libquicr's Google Benchmark
output.

BENCHMARK_RELATIVE on BM_VarintEncode_Warm and _WarmLarge so the
output renders the pre-allocation speedup as a percentage relative
to BM_VarintEncode_Cold. Makes the optimization story explicit in
the table.
Contributor

@afrind afrind left a comment


A bunch of feedback about the ways I think these can be improved, but marked non-blocking and approving - feel free to land when you are satisfied:

  1. there's several places where the BM is including more work than we're trying to measure
  2. some benchmarks seem silly / provide little value
  3. still using snake_case filenames

@afrind reviewed 15 files and all commit messages, made 19 comments, and resolved 2 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on akash-a-n, michalhosna, mondain, Oxyd, peterchave, suhasHere, and TimEvens).


benchmark/moqt_framer.cpp line 9 at r1 (raw file):

Previously, gmarzot (Giovanni Marzot) wrote…

I guess we can phase out benchmarks as they are upstreamed, but the focus is on moqx, right? Can play it however you wish.

Could still be in openmoq/moxygen, but it's simpler for now I guess to have a single benchmark.


benchmark/benchmark_main.cpp line 4 at r6 (raw file):

#include <folly/init/Init.h>

int main(int argc, char** argv) {

Update to SnakeCase?


benchmark/config_loader.cpp line 14 at r6 (raw file):

// Write a temporary YAML config file for benchmarking.
static std::string writeTempConfig(int numServices) {

I feel like there's another flavor of this utility in the test code elsewhere we should re-use? It may even create unique file names so it won't fail if e.g. /tmp/moqx_bench_config.yaml is not writable due to perms?


benchmark/config_loader.cpp line 77 at r6 (raw file):

}

BENCHMARK(BM_ConfigGenerateSchema, iters) {

Do we think that having benchmarks for config loading and resolution really justifies their existence? This only happens once, at startup time, and is likely peanuts.


benchmark/moq_stats_collector.cpp line 27 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    pubCb->onSubscribeError(moxygen::RequestErrorCode::INTERNAL_ERROR);

Is this meaningfully different than the test above and if so, how?


benchmark/moq_stats_collector.cpp line 39 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    subCb->recordSubscribeLatency(latency);

This might be more interesting if we used different latency values that hit different parts of the histogram buckets


benchmark/moq_stats_collector.cpp line 56 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    auto snap = collector->snapshot();

This is the one we actually care about - it gets run in the worker thread.


benchmark/moqt_extensions.cpp line 55 at r6 (raw file):

    bool err = false;
    writer.writeExtensions(buf, exts, sz, err);
    folly::doNotOptimizeAway(sz);

I don't think this can be optimized away because of the += ?


benchmark/moqt_extensions.cpp line 106 at r6 (raw file):

  for (unsigned i = 0; i < iters; ++i) {
    MoQFrameParser parser;

I don't think you want to declare the parser under the load test every time?


benchmark/moqt_extensions.cpp line 144 at r6 (raw file):

    MoQFrameParser parser;
    parser.initializeVersion(kVersion);
    auto buf = wireData->clone();

I don't think you need the clone() here -- it's a malloc.

Same comments as above -- would be ideal to declare/initialize the parser only once


benchmark/moqt_extensions.cpp line 150 at r6 (raw file):

    size_t length = buf->computeChainDataLength();
    auto res = parser.parseExtensions(cursor, length, header);
    folly::doNotOptimizeAway(res);

Why not serialize *res below?


benchmark/moqt_framer.cpp line 108 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    folly::IOBufQueue buf;

Unsure if we should declare the buf outside and clear it with buf.move()? Here and elsewhere


benchmark/moqt_framer.cpp line 229 at r6 (raw file):

  for (unsigned i = 0; i < iters; ++i) {
    MoQFrameParser parser;

can remove parser init per loop and clone, here and elsewhere?


benchmark/moqt_framer.cpp line 290 at r6 (raw file):

}

BENCHMARK(BM_ParseGoaway, iters) {

Measuring Goaway perf seems almost comical, but whatev.


benchmark/moqt_framer.cpp line 342 at r6 (raw file):

BENCHMARK(BM_TrackNamespace_Construct, iters) {
  for (unsigned i = 0; i < iters; ++i) {
    std::vector<std::string> parts = {"conference", "room42", "alice", "video"};

Don't you want this outside the loop?


benchmark/moqt_framer.cpp line 373 at r6 (raw file):

}

BENCHMARK(BM_TrackNamespace_Describe, iters) {

Describe perf?


benchmark/stats_registry.cpp line 36 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    auto buf = StatsSnapshot::formatPrometheus(snap);

Is this duplicative of the other Prometheus benchmark above?


benchmark/stats_registry.cpp line 46 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    auto idx = requestErrorCodeIndex(code);

the micro-est of micro benchmarks

@gmarzot
Contributor Author

gmarzot commented May 7, 2026

libquicr ↔ moxygen extensions comparison — argo (M4 mini)

Same-hardware comparison run on argo. Both benchmarks filtered to extensions only, 10s budget. Source-build moxygen (with this PR), prebuilt libquicr binary at ~/Projects/libquicr/build/benchmark/quicr_benchmark.

moxygen (folly Benchmark — this PR)

============================================================================================
benchmark/moqt_extensions.cpp                relative  time/iter   iters/s  bytes_per_iter
============================================================================================
BM_ExtensionsSerialize(_1)                                 49.08ns    20.37M               3
BM_ExtensionsSerialize(_10)                               237.30ns     4.21M              30
BM_ExtensionsSerialize(_100)                                2.45us   408.10K             535
BM_ExtensionsSerialize(_1000)                              25.98us    38.49K            5935
BM_ExtensionsSerializeArray(_1)                            66.19ns    15.11M              30
BM_ExtensionsSerializeArray(_10)                          584.06ns     1.71M             292
BM_ExtensionsSerializeArray(_100)                           5.85us   170.93K            2970
BM_ExtensionsDeserialize(_1)                               41.29ns    24.22M               3
BM_ExtensionsDeserialize(_10)                             211.11ns     4.74M              30
BM_ExtensionsDeserialize(_100)                              1.73us   576.66K             535
BM_ExtensionsDeserialize(_1000)                            17.06us    58.63K            5935
BM_ExtensionsRoundTrip(_1)                                 93.22ns    10.73M               3
BM_ExtensionsRoundTrip(_10)                               482.94ns     2.07M              30
BM_ExtensionsRoundTrip(_100)                                4.35us   230.10K             535
BM_ExtensionsRoundTrip(_1000)                              42.93us    23.30K            5935
============================================================================================

libquicr (Google Benchmark — current main)

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
ExtensionsSerialize/1           0.376 us        0.376 us      1894088 bytes_per_second=27.886Mi/s   items_per_second=5.31646M/s
ExtensionsSerialize/10           2.25 us         2.25 us       308598 bytes_per_second=58.1416Mi/s  items_per_second=8.90012M/s
ExtensionsSerialize/100          21.5 us         21.5 us        32249 bytes_per_second=57.9675Mi/s  items_per_second=9.30119M/s
ExtensionsSerialize/1000          243 us          243 us         2902 bytes_per_second=51.0539Mi/s  items_per_second=8.23155M/s
ExtensionsDeserialize/1         0.384 us        0.384 us      1792395 bytes_per_second=27.2961Mi/s  items_per_second=5.20401M/s
ExtensionsDeserialize/10         2.14 us         2.14 us       319511 bytes_per_second=61.1493Mi/s  items_per_second=9.36055M/s
ExtensionsDeserialize/100        21.2 us         21.2 us        32889 bytes_per_second=58.6847Mi/s  items_per_second=9.41628M/s
ExtensionsDeserialize/1000        222 us          222 us         3164 bytes_per_second=55.9612Mi/s  items_per_second=9.02277M/s
ExtensionsRoundTrip/1           0.777 us        0.777 us       907253 bytes_per_second=13.4929Mi/s  items_per_second=2.57243M/s
ExtensionsRoundTrip/10           4.72 us         4.72 us       156578 bytes_per_second=27.7059Mi/s  items_per_second=4.24113M/s
ExtensionsRoundTrip/100          44.0 us         44.0 us        15786 bytes_per_second=28.315Mi/s   items_per_second=4.5433M/s
ExtensionsRoundTrip/1000          466 us          466 us         1502 bytes_per_second=26.6325Mi/s  items_per_second=4.29402M/s

Per-extension steady-state (N=1000), normalized

⚠️ Caveat. The two suites do not measure identical work:

  • libquicr: each extension carries an 8-byte raw Bytes payload (memcpy'd uint64_t), and SerializeExtensions is called with both a mutable and an immutable extension set — so each call serializes 2N extensions total.
  • moxygen: each extension carries an int value (varint-encoded, ~3-5 bytes), with empty immutable set — so each call serializes N extensions total.

For a rigorous head-to-head we'd need either matching payload shapes or per-byte normalization. Numbers below are normalized for the 2N-vs-N count difference but the payload-shape difference remains.

Operation           libquicr (ns/ext)          moxygen (ns/ext)            Speedup
Serialize           121.5 (243 µs / 2000)      25.98 (25.98 µs / 1000)     4.7×
Deserialize         111.0 (222 µs / 2000)      17.06 (17.06 µs / 1000)     6.5×
RoundTrip (per op)  116.5 (466 µs / 4000 ops)  21.5 (42.93 µs / 2000 ops)  5.4×

Run setup

  • Host: argo (M4 mini, macOS Darwin 24.6.0, arm64)
  • Build profile: RelWithDebInfo, default optimization
  • moxygen build mode: source build (folly + fizz + wangle + mvfst + proxygen + moxygen rebuilt; tarball mode skipped due to known brew Cellar baked-path issue, separate tracking item)
  • moxygen commit: this PR's HEAD (folly Benchmark migration + bytes_per_iter counters + BENCHMARK_RELATIVE)
  • libquicr commit: main @ 320a51b6 ("Dynamic groups for publisher initiated")
  • Filter: --bm_regex='Extensions' (folly) / --benchmark_filter='^Extensions' (Google)
  • Time budget: ~10s per benchmark

Output artifacts

JSON outputs stored on argo at:

  • /tmp/moxygen-bench-argo.json (folly format)
  • /tmp/libquicr-bench-argo.json (Google Benchmark format)

Both have bytes_per_iter / bytes_per_second for framework-independent throughput math.

Headline

Moxygen's extension framer is consistently ~5–6× faster per extension than libquicr's at steady state on the same hardware. Part of that gap comes from the different test data shape (smaller varint payload vs. 8-byte raw memcpy), so the apples-to-apples speedup is somewhat smaller than the raw ratio — but the direction and order of magnitude are clear: this is faster code on a faster wire format, not slower.

gmarzot and others added 5 commits May 7, 2026 17:48
Per Alan's review feedback that the previous comparison conflated
benchmark-harness differences with real encoder/decoder performance,
add two new families that mirror libquicr's
ExtensionsSerialize/Deserialize/RoundTrip benchmark shape:

  _LibquicrShape:           2N extensions per call (mutable + immutable),
                            type values mixed even/odd parity (matches
                            libquicr's CreateTestExtensions input).
                            Even types -> int form (no payload memcpy);
                            odd types -> array form (8-byte IOBuf, encoder
                            memcpy's payload). Matches libquicr's INPUT
                            shape; encoder cost is asymmetric across
                            parity.

  _LikeLibquicrAllArray:    2N extensions per call, every entry odd-typed
                            and array-form with 8-byte IOBuf payload.
                            Forces the moxygen encoder into the same
                            per-extension memcpy work libquicr's
                            SerializeExtensions does for every payload.
                            Apples-to-apples on encoder cost.

The original 10x raw ratio in the first comparison run was inflated by
moxygen's existing benchmarks using int-form extensions (varint-encoded
values, no payload memcpy) versus libquicr always doing 8-byte byte-array
payloads (encoder memcpy per extension). The _LikeLibquicrAllArray
variant strips that wire-format-shape advantage out so the residual
speedup reflects actual encoder/decoder code efficiency rather than
the moq spec choice between int and array forms.

Both variants emit bytes_per_iter and exts_per_iter user counters for
framework-independent throughput math. Names are explicit about the
comparison they support so reviewers don't conflate them with the
existing moxygen-native int-form benchmarks.
Per Alan's review feedback (#115). The rest of the project uses
CamelCase filenames (src/MoqxRelay.cpp, src/ServiceMatcher.cpp,
src/stats/MoQStatsCollector.cpp, src/stats/StatsRegistry.cpp,
src/stats/BoundedHistogram.cpp, src/config/loader/ConfigResolver.cpp,
etc.) — the benchmark/ directory was the lone snake_case holdout.

Renames:
  benchmark_main.cpp        -> BenchmarkMain.cpp
  bounded_histogram.cpp     -> BoundedHistogram.cpp
  config_loader.cpp         -> ConfigLoader.cpp
  config_resolver.cpp       -> ConfigResolver.cpp
  moq_stats_collector.cpp   -> MoQStatsCollector.cpp
  moqt_extensions.cpp       -> MoQTExtensions.cpp
  moqt_framer.cpp           -> MoQFramer.cpp
  prometheus_format.cpp     -> PrometheusFormat.cpp
  service_matcher.cpp       -> ServiceMatcher.cpp
  stats_registry.cpp        -> StatsRegistry.cpp

Each benchmark file's name now mirrors its primary system-under-test
(BoundedHistogram benchmarks src/stats/BoundedHistogram.cpp,
MoQFramer benchmarks moxygen's MoQFramer API, etc.).
…ist parser/writer init

Per Alan's review on PR #115. Three categories of change:

1. Drop benchmarks with low signal-to-noise:
   - benchmark/ConfigLoader.cpp (entire file): config load and resolve
     run once at startup; perf is "peanuts" relative to the worker-thread
     hot paths the benchmark suite is meant to track.
   - benchmark/ConfigResolver.cpp (entire file): same scope, same reasoning.
   - BM_ParseGoaway / BM_WriteGoaway from MoQFramer.cpp: control frames at
     session teardown — perf irrelevant.
   - BM_TrackNamespace_Describe from MoQFramer.cpp: describe() is for logs,
     not a measured workload.
   - BM_StatsSnapshot_FormatPrometheus from StatsRegistry.cpp: duplicate of
     the dedicated PrometheusFormat.cpp benchmark family.
   - BM_RequestErrorCodeIndex from StatsRegistry.cpp: enum-to-index lookup,
     "the micro-est of micro benchmarks" per review.
   - BM_StatsCollector_OnSubscribeError from MoQStatsCollector.cpp: not
     meaningfully different from OnSubscribeSuccess (symmetric implementation).

2. Hoist parser/writer init out of timed iter loops where the parser is
   reusable across calls. Each loop iteration was paying for:
   - A fresh MoQFrameParser construction + initializeVersion(kVersion)
   - An IOBuf clone() for the input wire data — unnecessary because Cursor
     is read-only and doesn't mutate the underlying buffer
   These two costs were dwarfing the actual parse work in some cases. Now
   parser is constructed once before the loop, and Cursor is created fresh
   per iter against the original wireData (no clone). Affects all parse
   benchmarks in MoQTExtensions.cpp and MoQFramer.cpp.

   For roundtrip benchmarks, also hoist the output IOBufQueue and use
   move() per iter to discard prior contents — cheaper than constructing a
   fresh queue on every iteration.

3. BM_StatsCollector_RecordLatency now cycles through 8 latency values
   spanning the full kLatencyBucketsUs range so the bucket-search code
   path is exercised under realistic mixed load (per review: "more
   interesting if we used different latency values that hit different
   parts of the histogram buckets").

Net change: -316 lines / +63 lines across 7 files. The remaining benchmarks
are tighter — each measures a worker-thread-hot path or a specific encoder/
decoder cost we want to track over time.
The earlier hoist of MoQFrameParser construction outside the iter loop
broke measurement: parseExtensions / parseFetch / parseSubscribeRequest /
parsePublishNamespace each carry internal state that persists across
calls, so the second iteration onwards short-circuited and Deserialize/
RoundTrip benchmarks collapsed to ~4ns / ~30ns regardless of N (instead
of the realistic 17µs–70µs at /1000).

Per-iter parser construction is cheap relative to the parse work itself
and is necessary for correct measurement.

Keep the malloc-reduction wins:
- No more wireData->clone() per iter (Cursor reads the IOBuf directly)
- Output IOBufQueue hoisted out of RoundTrip loops with outBuf.move()
  per iter to discard prior contents (cheaper than reconstruction)

Net effect: Deserialize and RoundTrip times will return to realistic
order-of-magnitude (~70µs at /1000 vs. the broken ~5ns) and the
encode-side parser overhead is unchanged from the original. Apologies
for the noise — Alan's hoist suggestion was correct in spirit but
runs into the parser's per-call state machine in practice.
parseExtensions takes length by reference and decrements it as bytes
are consumed. Reusing wireSize across iters meant only the first iter
saw real input — subsequent iters short-circuited at ~10ns regardless
of N (and previously misdiagnosed as a parser-state issue). Pass a
fresh length copy per iter; mark wireSize const to prevent recurrence.

Also anchor header.extensions explicitly via folly::doNotOptimizeAway
to mirror libquicr's benchmark::DoNotOptimize(extensions) pattern, so
the compiler can't elide the parsed-output vector pushbacks even if a
future change drops the length-mutation barrier.

Local sanity (Deserialize_LibquicrShape, ubuntu-22.04, ryzen):
  N=1     101ns
  N=10    749ns
  N=100   7.83µs
  N=1000  62.69µs   — linear, was previously ~10ns flat

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gmarzot
Contributor Author

gmarzot commented May 8, 2026

libquicr ↔ moxygen extensions comparison — argo (M4 mini) — v2 (corrected)

Why a v2: the v1 run's BM_ExtensionsDeserialize* and BM_ExtensionsRoundTrip* numbers were artifacts. MoQFrameParser::parseExtensions(cursor, size_t& length, header) decrements length by reference; the benchmark passed the same wireSize variable every iter, so after iter 1 length was near zero and iter 2+ short-circuited on QuicInteger underflow at ~10ns regardless of N. Symptom matched the parser-state regression we'd debugged earlier, hence the misdiagnosis. Fixed in b68a68a (per-iter length = wireSize copy + folly::doNotOptimizeAway(header.extensions) to anchor the populated container — mirrors libquicr's DoNotOptimize(extensions) pattern). Serialize was always correct; only Deserialize/RoundTrip move in v2.

moxygen (this PR, after fix — argo)

============================================================================================
benchmark/MoQTExtensions.cpp                relative  time/iter   iters/s  bytes_per_iter  exts_per_iter
============================================================================================
BM_ExtensionsSerialize(_1)                                 46.65ns    21.44M               3            NaN
BM_ExtensionsSerialize(_10)                               226.12ns     4.42M              30            NaN
BM_ExtensionsSerialize(_100)                                2.46us   406.63K             535            NaN
BM_ExtensionsSerialize(_1000)                              24.68us    40.52K            5935            NaN
BM_ExtensionsDeserialize(_1)                               24.52ns    40.78M               3            NaN
BM_ExtensionsDeserialize(_10)                             186.18ns     5.37M              30            NaN
BM_ExtensionsDeserialize(_100)                              1.41us   708.69K             535            NaN
BM_ExtensionsDeserialize(_1000)                            12.52us    79.85K            5935            NaN
BM_ExtensionsRoundTrip(_1)                                 72.09ns    13.87M               3            NaN
BM_ExtensionsRoundTrip(_10)                               440.24ns     2.27M              30            NaN
BM_ExtensionsRoundTrip(_100)                                3.94us   254.06K             535            NaN
BM_ExtensionsRoundTrip(_1000)                              38.13us    26.23K            5935            NaN
BM_ExtensionsSerialize_LibquicrShape(_1)                   89.53ns    11.17M              11              2
BM_ExtensionsSerialize_LibquicrShape(_10)                 812.51ns     1.23M             155             20
BM_ExtensionsSerialize_LibquicrShape(_100)                  7.73us   129.44K            1505            200
BM_ExtensionsSerialize_LibquicrShape(_1000)                84.92us    11.78K           15005           2000
BM_ExtensionsSerialize_LikeLibquicrAllArray(_1)           146.08ns     6.85M              25              2
BM_ExtensionsSerialize_LikeLibquicrAllArray(_10)            1.16us   864.92K             225             20
BM_ExtensionsSerialize_LikeLibquicrAllArray(_100)          12.19us    82.06K            2205            200
BM_ExtensionsSerialize_LikeLibquicrAllArray(_1000)        131.05us     7.63K           22007           2000
BM_ExtensionsDeserialize_LibquicrShape(_1)                 60.90ns    16.42M              11              2
BM_ExtensionsDeserialize_LibquicrShape(_10)               657.89ns     1.52M             155             20
BM_ExtensionsDeserialize_LibquicrShape(_100)                4.93us   202.64K            1505            200
BM_ExtensionsDeserialize_LibquicrShape(_1000)              28.51us    35.07K           15005           2000
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_1)          97.73ns    10.23M              25              2
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_10)        884.94ns     1.13M             225             20
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_100)         4.92us   203.33K            2205            200
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_1000)       41.22us    24.26K           22007           2000
BM_ExtensionsRoundTrip_LibquicrShape(_1)                  160.13ns     6.24M              11              2
BM_ExtensionsRoundTrip_LibquicrShape(_10)                   1.58us   634.76K             155             20
BM_ExtensionsRoundTrip_LibquicrShape(_100)                 13.01us    76.87K            1505            200
BM_ExtensionsRoundTrip_LibquicrShape(_1000)                73.86us    13.54K           15005           2000
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_1)           261.29ns     3.83M              25              2
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_10)            2.12us   472.07K             225             20
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_100)          11.90us    84.00K            2205            200
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_1000)        114.65us     8.72K           22007           2000
============================================================================================

libquicr (current main — argo, this run)

ExtensionsSerialize/1           0.383 us   bytes_per_second=27.38Mi/s  items_per_second=5.22M/s
ExtensionsSerialize/10           2.30 us   bytes_per_second=56.71Mi/s  items_per_second=8.68M/s
ExtensionsSerialize/100          21.6 us   bytes_per_second=57.71Mi/s  items_per_second=9.26M/s
ExtensionsSerialize/1000          237 us   bytes_per_second=52.34Mi/s  items_per_second=8.44M/s
ExtensionsDeserialize/1         0.384 us   bytes_per_second=27.35Mi/s  items_per_second=5.21M/s
ExtensionsDeserialize/10         2.16 us   bytes_per_second=60.62Mi/s  items_per_second=9.28M/s
ExtensionsDeserialize/100        21.1 us   bytes_per_second=59.15Mi/s  items_per_second=9.49M/s
ExtensionsDeserialize/1000        226 us   bytes_per_second=55.00Mi/s  items_per_second=8.87M/s
ExtensionsRoundTrip/1           0.791 us   bytes_per_second=13.27Mi/s  items_per_second=2.53M/s
ExtensionsRoundTrip/10           4.52 us   bytes_per_second=28.92Mi/s  items_per_second=4.43M/s
ExtensionsRoundTrip/100          44.7 us   bytes_per_second=27.90Mi/s  items_per_second=4.48M/s
ExtensionsRoundTrip/1000          478 us   bytes_per_second=25.93Mi/s  items_per_second=4.18M/s

Per-extension steady-state (N=1000)

There are now two replica families to make the comparison interpretable. Both call writeExtensions(buf, Extensions(mutable, immutable)) with N + N entries (= 2N extensions per call), matching libquicr's SerializeExtensions(buffer, extensions, immutable) with CreateTestExtensions(N) twice. Both use 8-byte payloads.

  • LikeLibquicrAllArray — every extension is odd-typed array form, forcing the moxygen encoder into the same per-extension memcpy work libquicr does. Apples-to-apples on encoder cost.
  • LibquicrShape — extension types are mixed parity. Even-typed entries use moxygen's int form (varint value, no payload memcpy in encoder); odd-typed entries use array form. Captures moxygen's wire-format advantage where ~half the entries skip the memcpy entirely.
| Operation | libquicr (ns/ext) | moxygen LikeLibquicrAllArray (ns/ext) | speedup | moxygen LibquicrShape (ns/ext) | speedup |
|---|---|---|---|---|---|
| Serialize | 118.5 (237µs / 2000) | 65.5 (131.05µs / 2000) | 1.81× | 42.5 (84.92µs / 2000) | 2.79× |
| Deserialize | 113.0 (226µs / 2000) | 20.6 (41.22µs / 2000) | 5.49× | 14.3 (28.51µs / 2000) | 7.93× |
| RoundTrip (per op) | 119.5 (478µs / 4000) | 28.7 (114.65µs / 4000) | 4.16× | 18.5 (73.86µs / 4000) | 6.47× |

RoundTrip "per op" normalizes by 2×2N = 4N (parse + reserialize) to keep it comparable to one-way costs.

Run setup

  • Host: argo (M4 mini, macOS Darwin 24.6.0, arm64, 10 cores)
  • Build profile: RelWithDebInfo, default optimization
  • moxygen commit: this PR @ b68a68a (with elision/length fix)
  • libquicr commit: main @ 320a51b6 (unchanged from v1)
  • Filter: --bm_regex='Extensions' (folly) / --benchmark_filter='Extensions' (Google)
  • Time budget: 10s per benchmark

Headline

Direction and order of magnitude in v1 were right; the per-extension multipliers below are the corrected numbers:

  • Apples-to-apples (LikeLibquicrAllArray vs libquicr): moxygen is ~1.8× faster on Serialize, ~5.5× on Deserialize, ~4.2× on RoundTrip per extension at N=1000.
  • With wire-format asymmetry (LibquicrShape — half the entries take the int-form fast path): ~2.8× / ~7.9× / ~6.5×.

The wider Deserialize/RoundTrip gap (vs. Serialize) is interesting: moxygen's parser is doing meaningfully less per-byte work than libquicr's for the same wire content.

…e/parts

Per Alan's review on PR #115:

- BM_TrackNamespace_Construct: hoist `parts` vector outside the loop and
  pass by const ref. The benchmark now measures TrackNamespace's constructor
  cost on a pre-built vector rather than the dominant cost of building the
  initializer_list + std::vector each iter.

- 5 write benchmarks (BM_Write{SubscribeRequest,SubgroupHeader,StreamObject,
  PublishNamespace,Fetch}): hoist `folly::IOBufQueue buf` outside the loop
  and call `buf.reset()` per iter. Same pattern applied to extension
  serialize benchmarks (BM_ExtensionsSerialize{,Array,_LibquicrShape,
  _LikeLibquicrAllArray}) and to RoundTrip output queues.

- 4 parse benchmarks (BM_Parse{SubscribeRequest,Fetch,PublishNamespace} +
  BM_SubscribeRoundTrip): hoist MoQFrameParser outside the loop. These
  control-frame parses don't touch the parser's delta-decoding state, so
  no carryover. (The "parser must be fresh per iter" comment in earlier
  commits was load-bearing on a misdiagnosis — see preceding commit.)

- 6 extension parse benchmarks (BM_ExtensionsDeserialize{,_LibquicrShape,
  _LikeLibquicrAllArray} and the matching RoundTrips): hoist parser too.
  parseExtensionKvPairs self-resets previousExtensionType_=0 at the top of
  each call (v16+ delta decoding), so per-iter state doesn't carry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gmarzot commented May 8, 2026

Update — review feedback applied (e4b0684)

Addressed the remaining non-blocking items from your review:

  • BM_TrackNamespace_Construct: hoisted parts outside the loop, pass by const ref. Was measuring initializer_list + std::vector building each iter; now measures TrackNamespace's constructor cost on a pre-built vector.
  • IOBufQueue declaration pattern: hoisted folly::IOBufQueue buf + buf.reset() per iter across all 5 write benchmarks in MoQFramer.cpp and the 4 serialize benchmarks in MoQTExtensions.cpp. Same pattern applied to RoundTrip output queues.
  • MoQFrameParser hoist: hoisted across the 4 control-frame parses (parseSubscribeRequest/Fetch/PublishNamespace + BM_SubscribeRoundTrip) and the 6 extension parse benchmarks. Verified safe: control-frame parses don't touch the parser's delta-decoding state, and parseExtensionKvPairs self-resets previousExtensionType_=0 at the top of each call. The earlier "parser must be fresh per iter" comment was load-bearing on the length-by-ref misdiagnosis from the v2 update.
  • parseExtensions result usage (Why not serialize *res below?): handled in v2 by anchoring header.extensions directly via folly::doNotOptimizeAway, mirroring libquicr's DoNotOptimize(extensions) pattern. Equivalent observability without the extra serialize work in the Deserialize-only benchmark.

Re-ran on argo. Numbers unchanged at the headline level (these were code-quality fixes, not perf wins). Comparison conclusions from v2 stand:

| Operation | libquicr (ns/ext) | moxygen LikeLibquicrAllArray (ns/ext) | speedup | moxygen LibquicrShape (ns/ext) | speedup |
|---|---|---|---|---|---|
| Serialize | 118.5 | 65.0 | 1.82× | 42.4 | 2.79× |
| Deserialize | 113.0 | 21.4 | 5.28× | 14.3 | 7.90× |
| RoundTrip (per op) | 119.5 | 26.8 | 4.46× | 18.5 | 6.46× |

Full v3 argo output [pasted in /tmp on the host; raw numbers within ±2% of v2 across all tests].

