Conversation
8 benchmark files covering the MOQT framer (varint, frame writes, TrackNamespace/FullTrackName ops), stats collector, histogram, Prometheus formatting, service matcher, and config loader/resolver. 41 benchmarks total, modeled after Quicr/libquicr's benchmark suite.

Build with: `./scripts/build.sh --benchmark`
Run with: `./build/benchmark/moqx_benchmark`

Also adds a benchmark CI job on ubuntu-22.04 for apples-to-apples comparison with libquicr's GitHub-hosted runner results.
afrind
left a comment
@afrind made 2 comments.
Reviewable status: 0 of 12 files reviewed, 2 unresolved discussions (waiting on gmarzot).
benchmark/CMakeLists.txt line 10 at r1 (raw file):
FetchContent_Declare(
  benchmark
  GIT_REPOSITORY https://github.com/google/benchmark.git
We can use folly/Benchmark and not add another dep?
benchmark/moqt_framer.cpp line 9 at r1 (raw file):
namespace { using namespace moxygen;
This likely belongs in moxygen proper?
Thanks for the data. The varint encoder might be measuring the test rather than real-world usage -- we typically have an allocated buffer in advance, so this is free in practice. moxygen's varint parser is designed to work across discontiguous buffers to avoid memcpy -- stream data can arrive e.g. 1 byte at a time on the stream. But maybe we're optimized for the pessimistic rather than the optimistic case? Or maybe memcpy is actually faster than setting up cursor infrastructure. The namespace design is a nice libquicr optimization that we should apply. We've just never looked -- these are only in control-plane messages.
- ExtensionsDeserialize/N: isolated parse (compare to libquicr)
- ParseSubscribeRequest, ParseFetch, ParsePublishNamespace, ParseGoaway
- SubscribeRoundTrip: full write+parse cycle

62 benchmarks total. Tested on WSL2 (Linux) and argo (macOS ARM64).
Split varint encode into Cold (new IOBufQueue per call) and Warm (reused IOBufQueue). Warm reflects production where the buffer is pre-allocated. Shows 2-2.7x improvement over cold on both platforms. Addresses Alan's feedback that the cold benchmark measures allocation overhead rather than real-world varint encode cost.
Previously, afrind wrote…
Should I make that change before the initial merge? It's your call. Claude says it's doable.
Previously, afrind wrote…
I guess we can phase out benchmarks as they are upstreamed, but focus is on moqx, right? Can play it however you wish.
Benchmark Report

9 files, 63 benchmarks. All numbers from argo (Apple M4, 10 cores). libquicr built and run on the same machine for a fair comparison.

Extensions Serialize/Deserialize/RoundTrip

Both serialize MOQT extension key-value pairs per draft-ietf-moq-transport. The implementations use different extension value types and serialization strategies, so these ratios reflect end-to-end framework differences, not a single design choice.

QUIC Varint Encode/Decode

Both encode/decode RFC 9000 variable-length integers through different APIs. The encode gap is IOBuf framework overhead. moxygen's parser is designed for the pessimistic case of fragmented stream data arriving across discontiguous buffers, which adds overhead for the contiguous case.

TrackNamespace

Both implement MOQT TrackNamespace with different internal representations. Architectural difference in moxygen's type design. Only in control-plane messages. A flatter representation is a potential optimization to apply upstream.

Frame Write and Parse (moqx only — no libquicr equivalent)
moqx Application Layer (no libquicr equivalent)
libquicr-only (no moqx equivalent yet)
Future work
Per review feedback (avoid adding google/benchmark as a new top-level
dep when folly is already pulled in transitively via the moxygen
standalone build).
Translation:
- #include <benchmark/benchmark.h> -> #include <folly/Benchmark.h>
- void BM_X(benchmark::State& state) { for (auto _ : state) {...} }
-> BENCHMARK(BM_X, iters) { for (unsigned i = 0; i < iters; ++i) {...} }
with folly::BenchmarkSuspender wrapping the previously-untimed
setup so semantics match Google's "setup is not timed" model.
- benchmark::DoNotOptimize -> folly::doNotOptimizeAway
- BENCHMARK(F)->Arg(N)->Arg(M) -> BENCHMARK_NAMED_PARAM(F, _N, N)
/ BENCHMARK_NAMED_PARAM(F, _M, M) (one line per arg)
- BENCHMARK_MAIN auto-generated main -> explicit benchmark_main.cpp
with folly::Init + folly::runBenchmarks.
CMake: drop FetchContent on google/benchmark, link Folly::follybenchmark
(already built as part of the standalone folly fetch — zero new deps).
CI workflows unchanged: bare `./build/benchmark/moqx_benchmark` works
with folly's defaults; no Google-specific CLI flags were used.
Same 63 benchmarks in the same 9 source files.
…-folly

# Conflicts:
#   .github/workflows/ci-main.yml
#   .github/workflows/ci-pr.yml
#   scripts/build.sh
main renamed lower_case headers to CamelCase across the 215-commit drift since this PR was opened, and the moqx headers live at "$PROJECT/src/<...>" with the include path set to $PROJECT/src — not under a "moqx/" prefix. Update benchmark sources to match what the rest of the codebase uses:

- "stats/BoundedHistogram.h" (was <moqx/stats/BoundedHistogram.h>)
- "stats/StatsRegistry.h" (was <moqx/stats/StatsRegistry.h>)
- "stats/MoQStatsCollector.h" (was <moqx/stats/MoQStatsCollector.h>)
- "config/loader/Loader.h" (was <moqx/config/loader/loader.h>)
- "config/loader/ConfigResolver.h" (was <moqx/config/loader/config_resolver.h>)
- "config/loader/ParsedConfig.h" (was <moqx/config/loader/parsed_config.h>)
- "ServiceMatcher.h" (was <moqx/ServiceMatcher.h>)
- --bm_json_verbose=bench-results.json — machine-readable output for
perf-regression detection across PRs and absolute throughput
computation (vs. raw ns/op which has ~1-5% framework-overhead noise).
- Step summary renders the human-readable table directly in the GitHub
Actions run UI (no artifact download needed for casual review).
- Both bench-results.json (machine-readable) and bench-output.txt
(human-readable) uploaded as CI artifacts per platform.
Per-platform artifact name: bench-results-{linux,macos}.
…ATIVE on warm varints

UserCounters[bytes_per_iter] on the four moqt_extensions families (Serialize, SerializeArray, Deserialize, RoundTrip) — the wire byte count is the natural normalization point for libquicr comparison and removes per-iteration framework overhead from the throughput calculation:

    throughput (bytes/sec) = bytes_per_iter / ns_per_iter * 1e9

That number is framework-independent — comparable apples-to-apples between this folly Benchmark suite and libquicr's Google Benchmark output.

BENCHMARK_RELATIVE on BM_VarintEncode_Warm and _WarmLarge so the output renders the pre-allocation speedup as a percentage relative to BM_VarintEncode_Cold. Makes the optimization story explicit in the table.
afrind
left a comment
A bunch of feedback about the ways I think these can be improved, but marked non-blocking and approving - feel free to land when you are satisfied:
- there's several places where the BM is including more work than we're trying to measure
- some benchmarks seem silly / provide little value
- still using snake_case filenames
@afrind reviewed 15 files and all commit messages, made 19 comments, and resolved 2 discussions.
Reviewable status: complete! All files reviewed, all discussions resolved (waiting on akash-a-n, michalhosna, mondain, Oxyd, peterchave, suhasHere, and TimEvens).
benchmark/moqt_framer.cpp line 9 at r1 (raw file):
Previously, gmarzot (Giovanni Marzot) wrote…
I guess we can phase out benchmarks as they are upstreamed, but focus is on moqx, right? Can play it however you wish.
Could still be in openmoq/moxygen, but it's simpler for now I guess to have a single benchmark.
benchmark/benchmark_main.cpp line 4 at r6 (raw file):
#include <folly/init/Init.h>

int main(int argc, char** argv) {
Update to CamelCase?
benchmark/config_loader.cpp line 14 at r6 (raw file):
// Write a temporary YAML config file for benchmarking. static std::string writeTempConfig(int numServices) {
I feel like there's another flavor of this utility in the test code elsewhere we should re-use? It may even create unique file names so it won't fail if e.g. /tmp/moqx_bench_config.yaml is not writable due to perms?
benchmark/config_loader.cpp line 77 at r6 (raw file):
} BENCHMARK(BM_ConfigGenerateSchema, iters) {
Do we think that having benchmarks for config loading and resolution really justifies their existence? This only happens once, at startup time, and is likely peanuts.
benchmark/moq_stats_collector.cpp line 27 at r6 (raw file):
susp.dismiss(); for (unsigned i = 0; i < iters; ++i) { pubCb->onSubscribeError(moxygen::RequestErrorCode::INTERNAL_ERROR);
Is this meaningfully different than the test above and if so, how?
benchmark/moq_stats_collector.cpp line 39 at r6 (raw file):
susp.dismiss(); for (unsigned i = 0; i < iters; ++i) { subCb->recordSubscribeLatency(latency);
This might be more interesting if we used different latency values that hit different parts of the histogram buckets
benchmark/moq_stats_collector.cpp line 56 at r6 (raw file):
susp.dismiss(); for (unsigned i = 0; i < iters; ++i) { auto snap = collector->snapshot();
This is the one we actually care about - it gets run in the worker thread.
benchmark/moqt_extensions.cpp line 55 at r6 (raw file):
bool err = false; writer.writeExtensions(buf, exts, sz, err); folly::doNotOptimizeAway(sz);
I don't think this can be optimized away because of the += ?
benchmark/moqt_extensions.cpp line 106 at r6 (raw file):
for (unsigned i = 0; i < iters; ++i) { MoQFrameParser parser;
I don't think you want to declare the parser under the load test every time?
benchmark/moqt_extensions.cpp line 144 at r6 (raw file):
MoQFrameParser parser; parser.initializeVersion(kVersion); auto buf = wireData->clone();
I don't think you need the clone() here -- it's a malloc.
Same comments as above -- would be ideal to declare/initialize the parser only once
benchmark/moqt_extensions.cpp line 150 at r6 (raw file):
size_t length = buf->computeChainDataLength(); auto res = parser.parseExtensions(cursor, length, header); folly::doNotOptimizeAway(res);
Why not serialize *res below?
benchmark/moqt_framer.cpp line 108 at r6 (raw file):
susp.dismiss(); for (unsigned i = 0; i < iters; ++i) { folly::IOBufQueue buf;
Unsure if we should declare the buf outside and clear it with buf.move()? Here and elsewhere
benchmark/moqt_framer.cpp line 229 at r6 (raw file):
for (unsigned i = 0; i < iters; ++i) { MoQFrameParser parser;
can remove parser init per loop and clone, here and elsewhere?
benchmark/moqt_framer.cpp line 290 at r6 (raw file):
} BENCHMARK(BM_ParseGoaway, iters) {
Measuring Goaway perf seems almost comical, but whatev.
benchmark/moqt_framer.cpp line 342 at r6 (raw file):
BENCHMARK(BM_TrackNamespace_Construct, iters) { for (unsigned i = 0; i < iters; ++i) { std::vector<std::string> parts = {"conference", "room42", "alice", "video"};
Don't you want this outside the loop?
benchmark/moqt_framer.cpp line 373 at r6 (raw file):
} BENCHMARK(BM_TrackNamespace_Describe, iters) {
Describe perf?
benchmark/stats_registry.cpp line 36 at r6 (raw file):
susp.dismiss(); for (unsigned i = 0; i < iters; ++i) { auto buf = StatsSnapshot::formatPrometheus(snap);
Is this duplicative of the other prometheus benchmark above?
benchmark/stats_registry.cpp line 46 at r6 (raw file):
susp.dismiss(); for (unsigned i = 0; i < iters; ++i) { auto idx = requestErrorCodeIndex(code);
the micro-est of micro benchmarks
libquicr ↔ moxygen extensions comparison — argo (M4 mini)

Same-hardware comparison run on argo. Both benchmarks filtered to extensions only, 10s budget. Source-build moxygen (with this PR), prebuilt libquicr binary at

moxygen (folly Benchmark — this PR)

libquicr (Google Benchmark — current main)

Per-extension steady-state (N=1000), normalized
Run setup
Output artifacts

JSON outputs stored on argo at:
Both have

Headline

Moxygen's extension framer is consistently ~5–6× faster per extension than libquicr's at steady state on the same hardware. Part of that gap comes from the different test-data shape (smaller varint payload vs. 8-byte raw memcpy), so the apples-to-apples speedup is somewhat smaller than the raw ratio — but the direction and order of magnitude are clear: this is faster code on a faster wire format, not slower.
Per Alan's review feedback that the previous comparison conflated
benchmark-harness differences with real encoder/decoder performance,
add two new families that mirror libquicr's
ExtensionsSerialize/Deserialize/RoundTrip benchmark shape:
_LibquicrShape: 2N extensions per call (mutable + immutable),
type values mixed even/odd parity (matches
libquicr's CreateTestExtensions input).
Even types -> int form (no payload memcpy);
odd types -> array form (8-byte IOBuf, encoder
memcpy's payload). Matches libquicr's INPUT
shape; encoder cost is asymmetric across
parity.
_LikeLibquicrAllArray: 2N extensions per call, every entry odd-typed
and array-form with 8-byte IOBuf payload.
Forces the moxygen encoder into the same
per-extension memcpy work libquicr's
SerializeExtensions does for every payload.
Apples-to-apples on encoder cost.
The original 10x raw ratio in the first comparison run was inflated by
moxygen's existing benchmarks using int-form extensions (varint-encoded
values, no payload memcpy) versus libquicr always doing 8-byte byte-array
payloads (encoder memcpy per extension). The _LikeLibquicrAllArray
variant strips that wire-format-shape advantage out so the residual
speedup reflects actual encoder/decoder code efficiency rather than
the moq spec choice between int and array forms.
Both variants emit bytes_per_iter and exts_per_iter user counters for
framework-independent throughput math. Names are explicit about the
comparison they support so reviewers don't conflate them with the
existing moxygen-native int-form benchmarks.
Per Alan's review feedback (#115). The rest of the project uses CamelCase filenames (src/MoqxRelay.cpp, src/ServiceMatcher.cpp, src/stats/MoQStatsCollector.cpp, src/stats/StatsRegistry.cpp, src/stats/BoundedHistogram.cpp, src/config/loader/ConfigResolver.cpp, etc.) — the benchmark/ directory was the lone snake_case holdout.

Renames:
  benchmark_main.cpp      -> BenchmarkMain.cpp
  bounded_histogram.cpp   -> BoundedHistogram.cpp
  config_loader.cpp       -> ConfigLoader.cpp
  config_resolver.cpp     -> ConfigResolver.cpp
  moq_stats_collector.cpp -> MoQStatsCollector.cpp
  moqt_extensions.cpp     -> MoQTExtensions.cpp
  moqt_framer.cpp         -> MoQFramer.cpp
  prometheus_format.cpp   -> PrometheusFormat.cpp
  service_matcher.cpp     -> ServiceMatcher.cpp
  stats_registry.cpp      -> StatsRegistry.cpp

Each benchmark file's name now mirrors its primary system-under-test (BoundedHistogram benchmarks src/stats/BoundedHistogram.cpp, MoQFramer benchmarks moxygen's MoQFramer API, etc.).
…ist parser/writer init

Per Alan's review on PR #115. Three categories of change:

1. Drop benchmarks with low signal-to-noise:
   - benchmark/ConfigLoader.cpp (entire file): config load and resolve run once at startup; perf is "peanuts" relative to the worker-thread hot paths the benchmark suite is meant to track.
   - benchmark/ConfigResolver.cpp (entire file): same scope, same reasoning.
   - BM_ParseGoaway / BM_WriteGoaway from MoQFramer.cpp: control frames at session teardown — perf irrelevant.
   - BM_TrackNamespace_Describe from MoQFramer.cpp: describe() is for logs, not a measured workload.
   - BM_StatsSnapshot_FormatPrometheus from StatsRegistry.cpp: duplicate of the dedicated PrometheusFormat.cpp benchmark family.
   - BM_RequestErrorCodeIndex from StatsRegistry.cpp: enum-to-index lookup, "the micro-est of micro benchmarks" per review.
   - BM_StatsCollector_OnSubscribeError from MoQStatsCollector.cpp: not meaningfully different from OnSubscribeSuccess (symmetric implementation).

2. Hoist parser/writer init out of timed iter loops where the parser is reusable across calls. Each loop iteration was paying for:
   - A fresh MoQFrameParser construction + initializeVersion(kVersion)
   - An IOBuf clone() for the input wire data — unnecessary because Cursor is read-only and doesn't mutate the underlying buffer

   These two costs were dwarfing the actual parse work in some cases. Now the parser is constructed once before the loop, and the Cursor is created fresh per iter against the original wireData (no clone). Affects all parse benchmarks in MoQTExtensions.cpp and MoQFramer.cpp. For roundtrip benchmarks, also hoist the output IOBufQueue and use move() per iter to discard prior contents — cheaper than constructing a fresh queue on every iteration.

3. BM_StatsCollector_RecordLatency now cycles through 8 latency values spanning the full kLatencyBucketsUs range so the bucket-search code path is exercised under realistic mixed load (per review: "more interesting if we used different latency values that hit different parts of the histogram buckets").

Net change: -316 lines / +63 lines across 7 files. The remaining benchmarks are tighter — each measures a worker-thread-hot path or a specific encoder/decoder cost we want to track over time.
The earlier hoist of MoQFrameParser construction outside the iter loop broke measurement: parseExtensions / parseFetch / parseSubscribeRequest / parsePublishNamespace each carry internal state that persists across calls, so the second iteration onwards short-circuited and Deserialize/ RoundTrip benchmarks collapsed to ~4ns / ~30ns regardless of N (instead of the realistic 17µs–70µs at /1000). Per-iter parser construction is cheap relative to the parse work itself and is necessary for correct measurement. Keep the malloc-reduction wins: - No more wireData->clone() per iter (Cursor reads the IOBuf directly) - Output IOBufQueue hoisted out of RoundTrip loops with outBuf.move() per iter to discard prior contents (cheaper than reconstruction) Net effect: Deserialize and RoundTrip times will return to realistic order-of-magnitude (~70µs at /1000 vs. the broken ~5ns) and the encode-side parser overhead is unchanged from the original. Apologies for the noise — Alan's hoist suggestion was correct in spirit but runs into the parser's per-call state machine in practice.
parseExtensions takes length by reference and decrements it as bytes are consumed. Reusing wireSize across iters meant only the first iter saw real input — subsequent iters short-circuited at ~10ns regardless of N (and previously misdiagnosed as a parser-state issue). Pass a fresh length copy per iter; mark wireSize const to prevent recurrence. Also anchor header.extensions explicitly via folly::doNotOptimizeAway to mirror libquicr's benchmark::DoNotOptimize(extensions) pattern, so the compiler can't elide the parsed-output vector pushbacks even if a future change drops the length-mutation barrier. Local sanity (Deserialize_LibquicrShape, ubuntu-22.04, ryzen): N=1 101ns N=10 749ns N=100 7.83µs N=1000 62.69µs — linear, was previously ~10ns flat Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
libquicr ↔ moxygen extensions comparison — argo (M4 mini) — v2 (corrected)

Why a v2: the v1 run's

moxygen (this PR, after fix — argo)

libquicr (current main — argo, this run)

Per-extension steady-state (N=1000)

Two replica families now to make the comparison interpretable. Both call
Run setup
Headline

Direction and order of magnitude in v1 were right; the per-extension multipliers below are the corrected numbers:
The wider Deserialize/RoundTrip gap (vs. Serialize) is interesting: moxygen's parser is doing meaningfully less per-byte work than libquicr's for the same wire content.
…e/parts

Per Alan's review on PR #115:

- BM_TrackNamespace_Construct: hoist `parts` vector outside the loop and pass by const ref. The benchmark now measures TrackNamespace's constructor cost on a pre-built vector rather than the dominant cost of building the initializer_list + std::vector each iter.
- 5 write benchmarks (BM_Write{SubscribeRequest,SubgroupHeader,StreamObject,PublishNamespace,Fetch}): hoist `folly::IOBufQueue buf` outside the loop and call `buf.reset()` per iter. Same pattern applied to extension serialize benchmarks (BM_ExtensionsSerialize{,Array,_LibquicrShape,_LikeLibquicrAllArray}) and to RoundTrip output queues.
- 4 parse benchmarks (BM_Parse{SubscribeRequest,Fetch,PublishNamespace} + BM_SubscribeRoundTrip): hoist MoQFrameParser outside the loop. These control-frame parses don't touch the parser's delta-decoding state, so no carryover. (The "parser must be fresh per iter" comment in earlier commits was load-bearing on a misdiagnosis — see preceding commit.)
- 6 extension parse benchmarks (BM_ExtensionsDeserialize{,_LibquicrShape,_LikeLibquicrAllArray} and the matching RoundTrips): hoist the parser too. parseExtensionKvPairs self-resets previousExtensionType_=0 at the top of each call (v16+ delta decoding), so per-iter state doesn't carry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update — review feedback applied (e4b0684)

Addressed the remaining non-blocking items from your review:
Re-ran on argo. Numbers unchanged at the headline level (these were code-quality fixes, not perf wins). Comparison conclusions from v2 stand:
Full v3 argo output [pasted in /tmp on the host; raw numbers within ±2% of v2 across all tests].
Summary
Add micro-benchmark suite using folly Benchmark, modeled after Quicr/libquicr.
- Links `Folly::follybenchmark` (already built as part of the moxygen standalone build)
- `--benchmark` flag for `scripts/build.sh`
- (`GFLAGS_SHARED=ON` on Darwin)
- (`grep -P` → `sed`)

Build & run
Test plan
Future benchmarks
- (`PQ_ConnDataForwarding`, requires mock sessions)
- (`StatsRegistry::aggregateAsync`)