
Add micro-benchmark suite (folly Benchmark)#115

Open
gmarzot wants to merge 23 commits into main from benchmark-scaffolding

Conversation

@gmarzot
Contributor

@gmarzot gmarzot commented Apr 5, 2026

Summary

Add micro-benchmark suite using folly Benchmark, modeled after Quicr/libquicr.

  • 9 benchmark files, 63 benchmarks covering:
    • MOQT framer: varint encode/decode (cold + warm), frame writes and parses (Subscribe, Subgroup, StreamObject, PublishNamespace, Fetch, Goaway), Subscribe roundtrip, TrackNamespace/FullTrackName operations
    • MOQT extensions: serialize, deserialize, and roundtrip at 1/10/100/1000 extensions (direct comparison to libquicr)
    • Stats: collector callbacks, snapshot, histogram, Prometheus text formatting at scale
    • Service matcher: exact/wildcard/fallback/no-match routing
    • Config: YAML load, resolve, schema generation
  • folly Benchmark via the existing standalone folly fetch — no new top-level dep. Links Folly::follybenchmark (already built as part of the moxygen standalone build).
  • --benchmark flag for scripts/build.sh
  • CI benchmark jobs on ubuntu-22.04 and macos-latest
  • macOS gflags shared linking fix (GFLAGS_SHARED=ON on Darwin)
  • macOS cmake version check fix (grep -P → sed)

Build & run

./scripts/build.sh --benchmark
./build/benchmark/moqx_benchmark

Test plan

  • Full suite passes locally on WSL2 (linux x86)
  • Full suite passes on argo (macOS ARM64)
  • CI benchmark passes on ubuntu-22.04
  • CI benchmark passes on macos-latest
  • Same-hardware comparison with libquicr on argo (see benchmark report comment)

Future benchmarks

  • Relay forwarding throughput (equivalent to libquicr's PQ_ConnDataForwarding, requires mock sessions)
  • Subgroup object header parse (hot path for object delivery)
  • Multi-session stats aggregation (StatsRegistry::aggregateAsync)


gmarzot added 3 commits April 5, 2026 12:40
8 benchmark files covering MOQT framer (varint, frame writes,
TrackNamespace/FullTrackName ops), stats collector, histogram,
Prometheus formatting, service matcher, and config loader/resolver.

41 benchmarks total, modeled after Quicr/libquicr's benchmark suite.

Build with: ./scripts/build.sh --benchmark
Run with: ./build/benchmark/moqx_benchmark

Also adds a benchmark CI job on ubuntu-22.04 for apples-to-apples
comparison with libquicr's GitHub-hosted runner results.
Contributor

@afrind afrind left a comment


@afrind made 2 comments.
Reviewable status: 0 of 12 files reviewed, 2 unresolved discussions (waiting on gmarzot).


benchmark/CMakeLists.txt line 10 at r1 (raw file):

  FetchContent_Declare(
    benchmark
    GIT_REPOSITORY https://github.com/google/benchmark.git

We can use folly/Benchmark and not add another dep?


benchmark/moqt_framer.cpp line 9 at r1 (raw file):

namespace {

using namespace moxygen;

This likely belongs in moxygen proper?

@afrind
Contributor

afrind commented Apr 5, 2026

Thanks for the data.

The varint encoder might be measuring the test rather than real-world usage -- we typically have an allocated buffer in advance, so this is free in practice.

moxygen's varint parser is designed to work across discontiguous buffers to avoid memcpy -- stream data can arrive e.g. 1 byte at a time on the stream. But maybe we're optimized for the pessimistic rather than optimistic case? Or maybe memcpy is actually faster than setting up cursor infrastructure.

The namespace design is a nice libquicr optimization that we should apply. We've just never looked -- these are only in control plane messages.

gmarzot added 3 commits April 5, 2026 18:29
- ExtensionsDeserialize/N: isolated parse (compare to libquicr)
- ParseSubscribeRequest, ParseFetch, ParsePublishNamespace, ParseGoaway
- SubscribeRoundTrip: full write+parse cycle

62 benchmarks total. Tested on WSL2 (linux) and argo (macOS ARM64).
Split varint encode into Cold (new IOBufQueue per call) and Warm
(reused IOBufQueue). Warm reflects production where the buffer is
pre-allocated. Shows 2-2.7x improvement over cold on both platforms.

Addresses Alan's feedback that the cold benchmark measures allocation
overhead rather than real-world varint encode cost.
@gmarzot
Contributor Author

gmarzot commented Apr 6, 2026

benchmark/CMakeLists.txt line 10 at r1 (raw file):

Previously, afrind wrote…

We can use folly/Benchmark and not add another dep?

Should I make that change before the initial merge? It's your call. Claude says it's doable.

@gmarzot
Contributor Author

gmarzot commented Apr 6, 2026

benchmark/moqt_framer.cpp line 9 at r1 (raw file):

Previously, afrind wrote…

This likely belongs in moxygen proper?

I guess we can phase out benchmarks as they are upstreamed, but the focus is on moqx, right? Can play it however you wish.

@gmarzot
Contributor Author

gmarzot commented Apr 6, 2026

Benchmark Report

9 files, 63 benchmarks. All numbers from argo (Apple M4, 10 cores). libquicr built and run on the same machine for a fair comparison.

Extensions Serialize/Deserialize/RoundTrip

Both serialize MOQT extension key-value pairs per draft-ietf-moq-transport.

Benchmark          moqx      libquicr   Ratio
Serialize/1        50.6 ns   387 ns     moqx 7.6x faster
Serialize/10       234 ns    2390 ns    moqx 10.2x faster
Serialize/100      2575 ns   22800 ns   moqx 8.9x faster
Serialize/1000     27.1 μs   241 μs     moqx 8.9x faster
Deserialize/1      42.4 ns   395 ns     moqx 9.3x faster
Deserialize/10     218 ns    2290 ns    moqx 10.5x faster
Deserialize/100    1751 ns   21900 ns   moqx 12.5x faster
Deserialize/1000   16.8 μs   231 μs     moqx 13.7x faster
RoundTrip/1        96.6 ns   789 ns     moqx 8.1x faster
RoundTrip/10       483 ns    4680 ns    moqx 9.7x faster
RoundTrip/100      4437 ns   45700 ns   moqx 10.3x faster
RoundTrip/1000     42.6 μs   493 μs     moqx 11.6x faster

The implementations use different extension value types and serialization strategies, so these ratios reflect end-to-end framework differences, not a single design choice.

QUIC Varint Encode/Decode

Both encode/decode RFC 9000 variable-length integers through different APIs.

Benchmark                          moqx      libquicr  Notes
Encode (cold — new IOBufQueue)     26.2 ns   0.23 ns   Not comparable — moqx allocates a new IOBufQueue per call; libquicr writes raw bytes to a pre-allocated buffer.
Encode (warm — reused IOBufQueue)  9.67 ns   0.23 ns   Remaining gap is folly cursor/appender setup. In production the queue is pre-allocated and reused across objects.
Decode (small)                     2.07 ns   0.23 ns   moqx uses ContiguousReadCursor, designed for discontiguous buffers; libquicr Decode is a trivial cast.
Decode (real parse)                2.07 ns   2.20 ns   ~equal when both do actual multi-byte parsing (UIntVar_FromBytes).

The encode gap is IOBuf framework overhead. moxygen's parser is designed for the pessimistic case of fragmented stream data arriving across discontiguous buffers, which adds overhead for the contiguous case.

TrackNamespace

Both implement MOQT TrackNamespace with different internal representations.

Benchmark   moqx      libquicr  Notes
Hash        6.49 ns   1.50 ns   folly::hash_range over vector<string> vs custom flat hash
Construct   15.7 ns   5.30 ns   vector<string> heap allocation vs compact/inline storage

This is an architectural difference in moxygen's type design; these types appear only in control-plane messages. A flatter representation is a potential optimization to apply upstream.

Frame Write and Parse (moqx only — no libquicr equivalent)

Benchmark         Write     Parse     RoundTrip
SubscribeRequest  146 ns    33.2 ns   155 ns
Fetch             176 ns    33.3 ns   —
PublishNamespace  88.1 ns   19.2 ns   —
Goaway            78.1 ns   33.2 ns   —
SubgroupHeader    51.0 ns   —         —
StreamObject      19.6 ns   —         —

moqx Application Layer (no libquicr equivalent)

Benchmark                 Time
StatsCollector callback   0.70 ns
StatsCollector Snapshot   28.5 ns
Histogram AddValue        0.92 ns
Prometheus Format         4.3 μs
ServiceMatcher Exact      9.65 ns
ServiceMatcher Wildcard   12.0 ns
Config Load YAML (3 svc)  47.9 μs
Config Resolve (50 svc)   12.8 μs

libquicr-only (no moqx equivalent yet)

Benchmark              Time       Notes
PQ_ConnDataForwarding  874 ns     Relay forwarding — highest-value future addition
PQ_Push/Pop            53-308 ns  Priority queue ops
DataStorage_Push       19.4 ns    Object storage

Future work

  • Relay forwarding throughput benchmark (requires mock MoQ sessions)
  • Subgroup object header parse
  • Multi-session stats aggregation
  • Consider migration to folly/Benchmark (already a dependency)

@gmarzot gmarzot mentioned this pull request Apr 7, 2026
@gmarzot gmarzot self-assigned this Apr 8, 2026
Per review feedback (avoid adding google/benchmark as a new top-level
dep when folly is already pulled in transitively via the moxygen
standalone build).

Translation:
- #include <benchmark/benchmark.h> -> #include <folly/Benchmark.h>
- void BM_X(benchmark::State& state) { for (auto _ : state) {...} }
  -> BENCHMARK(BM_X, iters) { for (unsigned i = 0; i < iters; ++i) {...} }
  with folly::BenchmarkSuspender wrapping the previously-untimed
  setup so semantics match Google's "setup is not timed" model.
- benchmark::DoNotOptimize -> folly::doNotOptimizeAway
- BENCHMARK(F)->Arg(N)->Arg(M) -> BENCHMARK_NAMED_PARAM(F, _N, N)
  / BENCHMARK_NAMED_PARAM(F, _M, M) (one line per arg)
- BENCHMARK_MAIN auto-generated main -> explicit benchmark_main.cpp
  with folly::Init + folly::runBenchmarks.

CMake: drop FetchContent on google/benchmark, link Folly::follybenchmark
(already built as part of the standalone folly fetch — zero new deps).

CI workflows unchanged: bare `./build/benchmark/moqx_benchmark` works
with folly's defaults; no Google-specific CLI flags were used.

Same 63 benchmarks in the same 9 source files.
@gmarzot gmarzot changed the title Add micro-benchmark suite (Google Benchmark) Add micro-benchmark suite (folly Benchmark) May 7, 2026
…-folly

# Conflicts:
#	.github/workflows/ci-main.yml
#	.github/workflows/ci-pr.yml
#	scripts/build.sh
gmarzot added 3 commits May 7, 2026 10:19
main renamed lower_case headers to CamelCase across the 215-commit
drift since this PR was opened, and the moqx headers live at
"$PROJECT/src/<...>" with the include path set to $PROJECT/src — not
under a "moqx/" prefix. Update benchmark sources to match what the
rest of the codebase uses:

- "stats/BoundedHistogram.h"          (was <moqx/stats/BoundedHistogram.h>)
- "stats/StatsRegistry.h"             (was <moqx/stats/StatsRegistry.h>)
- "stats/MoQStatsCollector.h"         (was <moqx/stats/MoQStatsCollector.h>)
- "config/loader/Loader.h"            (was <moqx/config/loader/loader.h>)
- "config/loader/ConfigResolver.h"    (was <moqx/config/loader/config_resolver.h>)
- "config/loader/ParsedConfig.h"      (was <moqx/config/loader/parsed_config.h>)
- "ServiceMatcher.h"                  (was <moqx/ServiceMatcher.h>)
- --bm_json_verbose=bench-results.json — machine-readable output for
  perf-regression detection across PRs and absolute throughput
  computation (vs. raw ns/op which has ~1-5% framework-overhead noise).
- Step summary renders the human-readable table directly in the GitHub
  Actions run UI (no artifact download needed for casual review).
- Both bench-results.json (machine-readable) and bench-output.txt
  (human-readable) uploaded as CI artifacts per platform.

Per-platform artifact name: bench-results-{linux,macos}.
…ATIVE on warm varints

UserCounters[bytes_per_iter] on the four moqt_extensions families
(Serialize, SerializeArray, Deserialize, RoundTrip) — the wire byte
count is the natural normalization point for libquicr comparison and
removes per-iteration framework overhead from the throughput
calculation:

  throughput (bytes/sec) = bytes_per_iter / ns_per_iter * 1e9

That number is framework-independent — comparable apples-to-apples
between this folly Benchmark suite and libquicr's Google Benchmark
output.

BENCHMARK_RELATIVE on BM_VarintEncode_Warm and _WarmLarge so the
output renders the pre-allocation speedup as a percentage relative
to BM_VarintEncode_Cold. Makes the optimization story explicit in
the table.
Contributor

@afrind afrind left a comment


A bunch of feedback about the ways I think these can be improved, but marked non-blocking and approving - feel free to land when you are satisfied:

  1. there's several places where the BM is including more work than we're trying to measure
  2. some benchmarks seem silly / provide little value
  3. still using snake_case filenames

@afrind reviewed 15 files and all commit messages, made 19 comments, and resolved 2 discussions.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on akash-a-n, michalhosna, mondain, Oxyd, peterchave, suhasHere, and TimEvens).


benchmark/moqt_framer.cpp line 9 at r1 (raw file):

Previously, gmarzot (Giovanni Marzot) wrote…

I guess we can phase out benchmarks as they are upstreamed, but the focus is on moqx, right? Can play it however you wish.

Could still be in openmoq/moxygen, but it's simpler for now I guess to have a single benchmark.


benchmark/benchmark_main.cpp line 4 at r6 (raw file):

#include <folly/init/Init.h>

int main(int argc, char** argv) {

Update to SnakeCase?


benchmark/config_loader.cpp line 14 at r6 (raw file):

// Write a temporary YAML config file for benchmarking.
static std::string writeTempConfig(int numServices) {

I feel like there's another flavor of this utility in the test code elsewhere we should re-use? It may even create unique file names so it won't fail if e.g. /tmp/moqx_bench_config.yaml is not writable due to perms?


benchmark/config_loader.cpp line 77 at r6 (raw file):

}

BENCHMARK(BM_ConfigGenerateSchema, iters) {

Do we think that having benchmarks for config loading and resolution really justifies their existence? This only happens once, at startup time, and is likely peanuts.


benchmark/moq_stats_collector.cpp line 27 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    pubCb->onSubscribeError(moxygen::RequestErrorCode::INTERNAL_ERROR);

Is this meaningfully different than the test above and if so, how?


benchmark/moq_stats_collector.cpp line 39 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    subCb->recordSubscribeLatency(latency);

This might be more interesting if we used different latency values that hit different parts of the histogram buckets


benchmark/moq_stats_collector.cpp line 56 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    auto snap = collector->snapshot();

This is the one we actually care about - it gets run in the worker thread.


benchmark/moqt_extensions.cpp line 55 at r6 (raw file):

    bool err = false;
    writer.writeExtensions(buf, exts, sz, err);
    folly::doNotOptimizeAway(sz);

I don't think this can be optimized away because of the += ?


benchmark/moqt_extensions.cpp line 106 at r6 (raw file):

  for (unsigned i = 0; i < iters; ++i) {
    MoQFrameParser parser;

I don't think you want to declare the parser under the load test every time?


benchmark/moqt_extensions.cpp line 144 at r6 (raw file):

    MoQFrameParser parser;
    parser.initializeVersion(kVersion);
    auto buf = wireData->clone();

I don't think you need the clone() here -- it's a malloc.

Same comments as above -- would be ideal to declare/initialize the parser only once


benchmark/moqt_extensions.cpp line 150 at r6 (raw file):

    size_t length = buf->computeChainDataLength();
    auto res = parser.parseExtensions(cursor, length, header);
    folly::doNotOptimizeAway(res);

Why not serialize *res below?


benchmark/moqt_framer.cpp line 108 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    folly::IOBufQueue buf;

Unsure if we should declare the buf outside and clear it with buf.move()? Here and elsewhere


benchmark/moqt_framer.cpp line 229 at r6 (raw file):

  for (unsigned i = 0; i < iters; ++i) {
    MoQFrameParser parser;

can remove parser init per loop and clone, here and elsewhere?


benchmark/moqt_framer.cpp line 290 at r6 (raw file):

}

BENCHMARK(BM_ParseGoaway, iters) {

Measuring Goaway perf seems almost comical, but whatev.


benchmark/moqt_framer.cpp line 342 at r6 (raw file):

BENCHMARK(BM_TrackNamespace_Construct, iters) {
  for (unsigned i = 0; i < iters; ++i) {
    std::vector<std::string> parts = {"conference", "room42", "alice", "video"};

Don't you want this outside the loop?


benchmark/moqt_framer.cpp line 373 at r6 (raw file):

}

BENCHMARK(BM_TrackNamespace_Describe, iters) {

Describe perf?


benchmark/stats_registry.cpp line 36 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    auto buf = StatsSnapshot::formatPrometheus(snap);

Is this duplicative of the other Prometheus benchmark above?


benchmark/stats_registry.cpp line 46 at r6 (raw file):

  susp.dismiss();
  for (unsigned i = 0; i < iters; ++i) {
    auto idx = requestErrorCodeIndex(code);

the micro-est of micro benchmarks

@gmarzot
Contributor Author

gmarzot commented May 7, 2026

libquicr ↔ moxygen extensions comparison — argo (M4 mini)

Same-hardware comparison run on argo. Both benchmarks filtered to extensions only, 10s budget. Source-build moxygen (with this PR), prebuilt libquicr binary at ~/Projects/libquicr/build/benchmark/quicr_benchmark.

moxygen (folly Benchmark — this PR)

============================================================================================
benchmark/moqt_extensions.cpp                relative  time/iter   iters/s  bytes_per_iter
============================================================================================
BM_ExtensionsSerialize(_1)                                 49.08ns    20.37M               3
BM_ExtensionsSerialize(_10)                               237.30ns     4.21M              30
BM_ExtensionsSerialize(_100)                                2.45us   408.10K             535
BM_ExtensionsSerialize(_1000)                              25.98us    38.49K            5935
BM_ExtensionsSerializeArray(_1)                            66.19ns    15.11M              30
BM_ExtensionsSerializeArray(_10)                          584.06ns     1.71M             292
BM_ExtensionsSerializeArray(_100)                           5.85us   170.93K            2970
BM_ExtensionsDeserialize(_1)                               41.29ns    24.22M               3
BM_ExtensionsDeserialize(_10)                             211.11ns     4.74M              30
BM_ExtensionsDeserialize(_100)                              1.73us   576.66K             535
BM_ExtensionsDeserialize(_1000)                            17.06us    58.63K            5935
BM_ExtensionsRoundTrip(_1)                                 93.22ns    10.73M               3
BM_ExtensionsRoundTrip(_10)                               482.94ns     2.07M              30
BM_ExtensionsRoundTrip(_100)                                4.35us   230.10K             535
BM_ExtensionsRoundTrip(_1000)                              42.93us    23.30K            5935
============================================================================================

libquicr (Google Benchmark — current main)

-------------------------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------
ExtensionsSerialize/1           0.376 us        0.376 us      1894088 bytes_per_second=27.886Mi/s   items_per_second=5.31646M/s
ExtensionsSerialize/10           2.25 us         2.25 us       308598 bytes_per_second=58.1416Mi/s  items_per_second=8.90012M/s
ExtensionsSerialize/100          21.5 us         21.5 us        32249 bytes_per_second=57.9675Mi/s  items_per_second=9.30119M/s
ExtensionsSerialize/1000          243 us          243 us         2902 bytes_per_second=51.0539Mi/s  items_per_second=8.23155M/s
ExtensionsDeserialize/1         0.384 us        0.384 us      1792395 bytes_per_second=27.2961Mi/s  items_per_second=5.20401M/s
ExtensionsDeserialize/10         2.14 us         2.14 us       319511 bytes_per_second=61.1493Mi/s  items_per_second=9.36055M/s
ExtensionsDeserialize/100        21.2 us         21.2 us        32889 bytes_per_second=58.6847Mi/s  items_per_second=9.41628M/s
ExtensionsDeserialize/1000        222 us          222 us         3164 bytes_per_second=55.9612Mi/s  items_per_second=9.02277M/s
ExtensionsRoundTrip/1           0.777 us        0.777 us       907253 bytes_per_second=13.4929Mi/s  items_per_second=2.57243M/s
ExtensionsRoundTrip/10           4.72 us         4.72 us       156578 bytes_per_second=27.7059Mi/s  items_per_second=4.24113M/s
ExtensionsRoundTrip/100          44.0 us         44.0 us        15786 bytes_per_second=28.315Mi/s   items_per_second=4.5433M/s
ExtensionsRoundTrip/1000          466 us          466 us         1502 bytes_per_second=26.6325Mi/s  items_per_second=4.29402M/s

Per-extension steady-state (N=1000), normalized

⚠️ Caveat. The two suites do not measure identical work:

  • libquicr: each extension carries an 8-byte raw Bytes payload (memcpy'd uint64_t), and SerializeExtensions is called with both a mutable and an immutable extension set — so each call serializes 2N extensions total.
  • moxygen: each extension carries an int value (varint-encoded, ~3-5 bytes), with empty immutable set — so each call serializes N extensions total.

For a rigorous head-to-head we'd need either matching payload shapes or per-byte normalization. Numbers below are normalized for the 2N-vs-N count difference but the payload-shape difference remains.

Operation           libquicr (ns/ext)          moxygen (ns/ext)            Speedup
Serialize           121.5 (243 µs / 2000)      25.98 (25.98 µs / 1000)     4.7×
Deserialize         111.0 (222 µs / 2000)      17.06 (17.06 µs / 1000)     6.5×
RoundTrip (per op)  116.5 (466 µs / 4000 ops)  21.5 (42.93 µs / 2000 ops)  5.4×

Run setup

  • Host: argo (M4 mini, macOS Darwin 24.6.0, arm64)
  • Build profile: RelWithDebInfo, default optimization
  • moxygen build mode: source build (folly + fizz + wangle + mvfst + proxygen + moxygen rebuilt; tarball mode skipped due to known brew Cellar baked-path issue, separate tracking item)
  • moxygen commit: this PR's HEAD (folly Benchmark migration + bytes_per_iter counters + BENCHMARK_RELATIVE)
  • libquicr commit: main @ 320a51b6 ("Dynamic groups for publisher initiated")
  • Filter: --bm_regex='Extensions' (folly) / --benchmark_filter='^Extensions' (Google)
  • Time budget: ~10s per benchmark

Output artifacts

JSON outputs stored on argo at:

  • /tmp/moxygen-bench-argo.json (folly format)
  • /tmp/libquicr-bench-argo.json (Google Benchmark format)

Both have bytes_per_iter / bytes_per_second for framework-independent throughput math.

Headline

Moxygen's extension framer is consistently ~5–6× faster per extension than libquicr's at steady state on the same hardware. Part of that gap comes from the different test data shape (smaller varint payload vs. 8-byte raw memcpy), so the apples-to-apples speedup is somewhat smaller than the raw ratio — but the direction and order of magnitude are clear: this is faster code on a faster wire format, not slower.

gmarzot and others added 5 commits May 7, 2026 17:48
Per Alan's review feedback that the previous comparison conflated
benchmark-harness differences with real encoder/decoder performance,
add two new families that mirror libquicr's
ExtensionsSerialize/Deserialize/RoundTrip benchmark shape:

  _LibquicrShape:           2N extensions per call (mutable + immutable),
                            type values mixed even/odd parity (matches
                            libquicr's CreateTestExtensions input).
                            Even types -> int form (no payload memcpy);
                            odd types -> array form (8-byte IOBuf, encoder
                            memcpy's payload). Matches libquicr's INPUT
                            shape; encoder cost is asymmetric across
                            parity.

  _LikeLibquicrAllArray:    2N extensions per call, every entry odd-typed
                            and array-form with 8-byte IOBuf payload.
                            Forces the moxygen encoder into the same
                            per-extension memcpy work libquicr's
                            SerializeExtensions does for every payload.
                            Apples-to-apples on encoder cost.

The original 10x raw ratio in the first comparison run was inflated by
moxygen's existing benchmarks using int-form extensions (varint-encoded
values, no payload memcpy) versus libquicr always doing 8-byte byte-array
payloads (encoder memcpy per extension). The _LikeLibquicrAllArray
variant strips that wire-format-shape advantage out so the residual
speedup reflects actual encoder/decoder code efficiency rather than
the moq spec choice between int and array forms.

Both variants emit bytes_per_iter and exts_per_iter user counters for
framework-independent throughput math. Names are explicit about the
comparison they support so reviewers don't conflate them with the
existing moxygen-native int-form benchmarks.
Per Alan's review feedback (#115). The rest of the project uses
CamelCase filenames (src/MoqxRelay.cpp, src/ServiceMatcher.cpp,
src/stats/MoQStatsCollector.cpp, src/stats/StatsRegistry.cpp,
src/stats/BoundedHistogram.cpp, src/config/loader/ConfigResolver.cpp,
etc.) — the benchmark/ directory was the lone snake_case holdout.

Renames:
  benchmark_main.cpp        -> BenchmarkMain.cpp
  bounded_histogram.cpp     -> BoundedHistogram.cpp
  config_loader.cpp         -> ConfigLoader.cpp
  config_resolver.cpp       -> ConfigResolver.cpp
  moq_stats_collector.cpp   -> MoQStatsCollector.cpp
  moqt_extensions.cpp       -> MoQTExtensions.cpp
  moqt_framer.cpp           -> MoQFramer.cpp
  prometheus_format.cpp     -> PrometheusFormat.cpp
  service_matcher.cpp       -> ServiceMatcher.cpp
  stats_registry.cpp        -> StatsRegistry.cpp

Each benchmark file's name now mirrors its primary system-under-test
(BoundedHistogram benchmarks src/stats/BoundedHistogram.cpp,
MoQFramer benchmarks moxygen's MoQFramer API, etc.).
…ist parser/writer init

Per Alan's review on PR #115. Three categories of change:

1. Drop benchmarks with low signal-to-noise:
   - benchmark/ConfigLoader.cpp (entire file): config load and resolve
     run once at startup; perf is "peanuts" relative to the worker-thread
     hot paths the benchmark suite is meant to track.
   - benchmark/ConfigResolver.cpp (entire file): same scope, same reasoning.
   - BM_ParseGoaway / BM_WriteGoaway from MoQFramer.cpp: control frames at
     session teardown — perf irrelevant.
   - BM_TrackNamespace_Describe from MoQFramer.cpp: describe() is for logs,
     not a measured workload.
   - BM_StatsSnapshot_FormatPrometheus from StatsRegistry.cpp: duplicate of
     the dedicated PrometheusFormat.cpp benchmark family.
   - BM_RequestErrorCodeIndex from StatsRegistry.cpp: enum-to-index lookup,
     "the micro-est of micro benchmarks" per review.
   - BM_StatsCollector_OnSubscribeError from MoQStatsCollector.cpp: not
     meaningfully different from OnSubscribeSuccess (symmetric implementation).

2. Hoist parser/writer init out of timed iter loops where the parser is
   reusable across calls. Each loop iteration was paying for:
   - A fresh MoQFrameParser construction + initializeVersion(kVersion)
   - An IOBuf clone() for the input wire data — unnecessary because Cursor
     is read-only and doesn't mutate the underlying buffer
   These two costs were dwarfing the actual parse work in some cases. Now
   parser is constructed once before the loop, and Cursor is created fresh
   per iter against the original wireData (no clone). Affects all parse
   benchmarks in MoQTExtensions.cpp and MoQFramer.cpp.

   For roundtrip benchmarks, also hoist the output IOBufQueue and use
   move() per iter to discard prior contents — cheaper than constructing a
   fresh queue on every iteration.

3. BM_StatsCollector_RecordLatency now cycles through 8 latency values
   spanning the full kLatencyBucketsUs range so the bucket-search code
   path is exercised under realistic mixed load (per review: "more
   interesting if we used different latency values that hit different
   parts of the histogram buckets").

Net change: -316 lines / +63 lines across 7 files. The remaining benchmarks
are tighter — each measures a worker-thread-hot path or a specific encoder/
decoder cost we want to track over time.
The earlier hoist of MoQFrameParser construction outside the iter loop
broke measurement: parseExtensions / parseFetch / parseSubscribeRequest /
parsePublishNamespace each carry internal state that persists across
calls, so the second iteration onwards short-circuited and Deserialize/
RoundTrip benchmarks collapsed to ~4ns / ~30ns regardless of N (instead
of the realistic 17µs–70µs at /1000).

Per-iter parser construction is cheap relative to the parse work itself
and is necessary for correct measurement.

Keep the malloc-reduction wins:
- No more wireData->clone() per iter (Cursor reads the IOBuf directly)
- Output IOBufQueue hoisted out of RoundTrip loops with outBuf.move()
  per iter to discard prior contents (cheaper than reconstruction)

Net effect: Deserialize and RoundTrip times will return to realistic
order-of-magnitude (~70µs at /1000 vs. the broken ~5ns) and the
encode-side parser overhead is unchanged from the original. Apologies
for the noise — Alan's hoist suggestion was correct in spirit but
runs into the parser's per-call state machine in practice.
parseExtensions takes length by reference and decrements it as bytes
are consumed. Reusing wireSize across iters meant only the first iter
saw real input — subsequent iters short-circuited at ~10ns regardless
of N (and previously misdiagnosed as a parser-state issue). Pass a
fresh length copy per iter; mark wireSize const to prevent recurrence.

Also anchor header.extensions explicitly via folly::doNotOptimizeAway
to mirror libquicr's benchmark::DoNotOptimize(extensions) pattern, so
the compiler can't elide the parsed-output vector pushbacks even if a
future change drops the length-mutation barrier.

Local sanity (Deserialize_LibquicrShape, ubuntu-22.04, ryzen):
  N=1     101ns
  N=10    749ns
  N=100   7.83µs
  N=1000  62.69µs   — linear, was previously ~10ns flat

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gmarzot
Contributor Author

gmarzot commented May 8, 2026

libquicr ↔ moxygen extensions comparison — argo (M4 mini) — v2 (corrected)

Why a v2: the v1 run's BM_ExtensionsDeserialize* and BM_ExtensionsRoundTrip* numbers were artifacts. MoQFrameParser::parseExtensions(cursor, size_t& length, header) decrements length by reference; the benchmark passed the same wireSize variable every iter, so after iter 1 length was near zero and iter 2+ short-circuited on QuicInteger underflow at ~10ns regardless of N. Symptom matched the parser-state regression we'd debugged earlier, hence the misdiagnosis. Fixed in b68a68a (per-iter length = wireSize copy + folly::doNotOptimizeAway(header.extensions) to anchor the populated container — mirrors libquicr's DoNotOptimize(extensions) pattern). Serialize was always correct; only Deserialize/RoundTrip move in v2.

moxygen (this PR, after fix — argo)

============================================================================================
benchmark/MoQTExtensions.cpp                relative  time/iter   iters/s  bytes_per_iter  exts_per_iter
============================================================================================
BM_ExtensionsSerialize(_1)                                 46.65ns    21.44M               3            NaN
BM_ExtensionsSerialize(_10)                               226.12ns     4.42M              30            NaN
BM_ExtensionsSerialize(_100)                                2.46us   406.63K             535            NaN
BM_ExtensionsSerialize(_1000)                              24.68us    40.52K            5935            NaN
BM_ExtensionsDeserialize(_1)                               24.52ns    40.78M               3            NaN
BM_ExtensionsDeserialize(_10)                             186.18ns     5.37M              30            NaN
BM_ExtensionsDeserialize(_100)                              1.41us   708.69K             535            NaN
BM_ExtensionsDeserialize(_1000)                            12.52us    79.85K            5935            NaN
BM_ExtensionsRoundTrip(_1)                                 72.09ns    13.87M               3            NaN
BM_ExtensionsRoundTrip(_10)                               440.24ns     2.27M              30            NaN
BM_ExtensionsRoundTrip(_100)                                3.94us   254.06K             535            NaN
BM_ExtensionsRoundTrip(_1000)                              38.13us    26.23K            5935            NaN
BM_ExtensionsSerialize_LibquicrShape(_1)                   89.53ns    11.17M              11              2
BM_ExtensionsSerialize_LibquicrShape(_10)                 812.51ns     1.23M             155             20
BM_ExtensionsSerialize_LibquicrShape(_100)                  7.73us   129.44K            1505            200
BM_ExtensionsSerialize_LibquicrShape(_1000)                84.92us    11.78K           15005           2000
BM_ExtensionsSerialize_LikeLibquicrAllArray(_1)           146.08ns     6.85M              25              2
BM_ExtensionsSerialize_LikeLibquicrAllArray(_10)            1.16us   864.92K             225             20
BM_ExtensionsSerialize_LikeLibquicrAllArray(_100)          12.19us    82.06K            2205            200
BM_ExtensionsSerialize_LikeLibquicrAllArray(_1000)        131.05us     7.63K           22007           2000
BM_ExtensionsDeserialize_LibquicrShape(_1)                 60.90ns    16.42M              11              2
BM_ExtensionsDeserialize_LibquicrShape(_10)               657.89ns     1.52M             155             20
BM_ExtensionsDeserialize_LibquicrShape(_100)                4.93us   202.64K            1505            200
BM_ExtensionsDeserialize_LibquicrShape(_1000)              28.51us    35.07K           15005           2000
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_1)          97.73ns    10.23M              25              2
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_10)        884.94ns     1.13M             225             20
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_100)         4.92us   203.33K            2205            200
BM_ExtensionsDeserialize_LikeLibquicrAllArray(_1000)       41.22us    24.26K           22007           2000
BM_ExtensionsRoundTrip_LibquicrShape(_1)                  160.13ns     6.24M              11              2
BM_ExtensionsRoundTrip_LibquicrShape(_10)                   1.58us   634.76K             155             20
BM_ExtensionsRoundTrip_LibquicrShape(_100)                 13.01us    76.87K            1505            200
BM_ExtensionsRoundTrip_LibquicrShape(_1000)                73.86us    13.54K           15005           2000
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_1)           261.29ns     3.83M              25              2
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_10)            2.12us   472.07K             225             20
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_100)          11.90us    84.00K            2205            200
BM_ExtensionsRoundTrip_LikeLibquicrAllArray(_1000)        114.65us     8.72K           22007           2000
============================================================================================

libquicr (current main — argo, this run)

ExtensionsSerialize/1           0.383 us   bytes_per_second=27.38Mi/s  items_per_second=5.22M/s
ExtensionsSerialize/10           2.30 us   bytes_per_second=56.71Mi/s  items_per_second=8.68M/s
ExtensionsSerialize/100          21.6 us   bytes_per_second=57.71Mi/s  items_per_second=9.26M/s
ExtensionsSerialize/1000          237 us   bytes_per_second=52.34Mi/s  items_per_second=8.44M/s
ExtensionsDeserialize/1         0.384 us   bytes_per_second=27.35Mi/s  items_per_second=5.21M/s
ExtensionsDeserialize/10         2.16 us   bytes_per_second=60.62Mi/s  items_per_second=9.28M/s
ExtensionsDeserialize/100        21.1 us   bytes_per_second=59.15Mi/s  items_per_second=9.49M/s
ExtensionsDeserialize/1000        226 us   bytes_per_second=55.00Mi/s  items_per_second=8.87M/s
ExtensionsRoundTrip/1           0.791 us   bytes_per_second=13.27Mi/s  items_per_second=2.53M/s
ExtensionsRoundTrip/10           4.52 us   bytes_per_second=28.92Mi/s  items_per_second=4.43M/s
ExtensionsRoundTrip/100          44.7 us   bytes_per_second=27.90Mi/s  items_per_second=4.48M/s
ExtensionsRoundTrip/1000          478 us   bytes_per_second=25.93Mi/s  items_per_second=4.18M/s

Per-extension steady-state (N=1000)

There are now two replica families to make the comparison interpretable. Both call writeExtensions(buf, Extensions(mutable, immutable)) with N + N entries (= 2N extensions per call), matching libquicr's SerializeExtensions(buffer, extensions, immutable) with CreateTestExtensions(N) twice. Both use 8-byte payloads.

  • LikeLibquicrAllArray — every extension is odd-typed array form, forcing the moxygen encoder into the same per-extension memcpy work libquicr does. Apples-to-apples on encoder cost.
  • LibquicrShape — extension types are mixed parity. Even-typed entries use moxygen's int form (varint value, no payload memcpy in encoder); odd-typed entries use array form. Captures moxygen's wire-format advantage where ~half the entries skip the memcpy entirely.
| Operation | libquicr (ns/ext) | moxygen LikeLibquicrAllArray (ns/ext) | speedup | moxygen LibquicrShape (ns/ext) | speedup |
|---|---|---|---|---|---|
| Serialize | 118.5 (237µs / 2000) | 65.5 (131.05µs / 2000) | 1.81× | 42.5 (84.92µs / 2000) | 2.79× |
| Deserialize | 113.0 (226µs / 2000) | 20.6 (41.22µs / 2000) | 5.49× | 14.3 (28.51µs / 2000) | 7.93× |
| RoundTrip (per op) | 119.5 (478µs / 4000) | 28.7 (114.65µs / 4000) | 4.16× | 18.5 (73.86µs / 4000) | 6.47× |

RoundTrip "per op" normalizes by 2×2N = 4N (parse + reserialize) to keep it comparable to one-way costs.

Run setup

  • Host: argo (M4 mini, macOS Darwin 24.6.0, arm64, 10 cores)
  • Build profile: RelWithDebInfo, default optimization
  • moxygen commit: this PR @ b68a68a (with elision/length fix)
  • libquicr commit: main @ 320a51b6 (unchanged from v1)
  • Filter: --bm_regex='Extensions' (folly) / --benchmark_filter='Extensions' (Google)
  • Time budget: 10s per benchmark

Headline

Direction and order of magnitude in v1 were right; the per-extension multipliers below are the corrected numbers:

  • Apples-to-apples (LikeLibquicrAllArray vs libquicr): moxygen is ~1.8× faster on Serialize, ~5.5× on Deserialize, ~4.2× on RoundTrip per extension at N=1000.
  • With wire-format asymmetry (LibquicrShape — half the entries take the int-form fast path): ~2.8× / ~7.9× / ~6.5×.

The wider Deserialize/RoundTrip gap (vs. Serialize) is interesting: moxygen's parser is doing meaningfully less per-byte work than libquicr's for the same wire content.

…e/parts

Per Alan's review on PR #115:

- BM_TrackNamespace_Construct: hoist `parts` vector outside the loop and
  pass by const ref. The benchmark now measures TrackNamespace's constructor
  cost on a pre-built vector rather than the dominant cost of building the
  initializer_list + std::vector each iter.

- 5 write benchmarks (BM_Write{SubscribeRequest,SubgroupHeader,StreamObject,
  PublishNamespace,Fetch}): hoist `folly::IOBufQueue buf` outside the loop
  and call `buf.reset()` per iter. Same pattern applied to extension
  serialize benchmarks (BM_ExtensionsSerialize{,Array,_LibquicrShape,
  _LikeLibquicrAllArray}) and to RoundTrip output queues.

- 4 parse benchmarks (BM_Parse{SubscribeRequest,Fetch,PublishNamespace} +
  BM_SubscribeRoundTrip): hoist MoQFrameParser outside the loop. These
  control-frame parses don't touch the parser's delta-decoding state, so
  no carryover. (The "parser must be fresh per iter" comment in earlier
  commits was load-bearing on a misdiagnosis — see preceding commit.)

- 6 extension parse benchmarks (BM_ExtensionsDeserialize{,_LibquicrShape,
  _LikeLibquicrAllArray} and the matching RoundTrips): hoist parser too.
  parseExtensionKvPairs self-resets previousExtensionType_=0 at the top of
  each call (v16+ delta decoding), so per-iter state doesn't carry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gmarzot commented May 8, 2026

Update — review feedback applied (e4b0684)

Addressed the remaining non-blocking items from your review:

  • BM_TrackNamespace_Construct: hoisted parts outside the loop, pass by const ref. Was measuring initializer_list + std::vector building each iter; now measures TrackNamespace's constructor cost on a pre-built vector.
  • IOBufQueue declaration pattern: hoisted folly::IOBufQueue buf + buf.reset() per iter across all 5 write benchmarks in MoQFramer.cpp and the 4 serialize benchmarks in MoQTExtensions.cpp. Same pattern applied to RoundTrip output queues.
  • MoQFrameParser hoist: hoisted across the 4 control-frame parses (parseSubscribeRequest/Fetch/PublishNamespace + BM_SubscribeRoundTrip) and the 6 extension parse benchmarks. Verified safe: control-frame parses don't touch the parser's delta-decoding state, and parseExtensionKvPairs self-resets previousExtensionType_=0 at the top of each call. The earlier "parser must be fresh per iter" comment was load-bearing on the length-by-ref misdiagnosis from the v2 update.
  • parseExtensions result usage (Why not serialize *res below?): handled in v2 by anchoring header.extensions directly via folly::doNotOptimizeAway, mirroring libquicr's DoNotOptimize(extensions) pattern. Equivalent observability without the extra serialize work in the Deserialize-only benchmark.

Re-ran on argo. Numbers unchanged at the headline level (these were code-quality fixes, not perf wins). Comparison conclusions from v2 stand:

| Operation | libquicr (ns/ext) | moxygen LikeLibquicrAllArray (ns/ext) | speedup | moxygen LibquicrShape (ns/ext) | speedup |
|---|---|---|---|---|---|
| Serialize | 118.5 | 65.0 | 1.82× | 42.4 | 2.79× |
| Deserialize | 113.0 | 21.4 | 5.28× | 14.3 | 7.90× |
| RoundTrip (per op) | 119.5 | 26.8 | 4.46× | 18.5 | 6.46× |

Full v3 argo output [pasted in /tmp on the host; raw numbers within ±2% of v2 across all tests].

