azure monitor exporter gzip batcher and performance improvements. #1490
base: main
Conversation
Codecov Report ❌ (coverage diff and impacted files):

```
@@            Coverage Diff             @@
##             main    #1490      +/-   ##
==========================================
- Coverage   84.12%   84.08%   -0.05%
==========================================
  Files         444      451       +7
  Lines      125704   128189    +2485
==========================================
+ Hits       105749   107782    +2033
- Misses      19421    19873     +452
  Partials      534      534
```
rust/otap-dataflow/crates/otap/src/experimental/azure_monitor_exporter/auth.rs (comment resolved)
```rust
if self.total_uncompressed_size == 0 {
    self.buf
        .write_all(b"[")
        .expect("write to memory buffer failed");
```
With `expect`, any unexpected gzip/memory error crashes the exporter instead of surfacing an error.
This should never happen. I understand the concern, but if it does happen it is an indicator of a critical bug.
Agree with @lalitb. Let's avoid `expect` usage (panic).
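A minimal sketch of the direction being suggested, assuming the batcher's write methods can be changed to return a `Result` and that `self.buf` is an `impl Write` (for example a gzip encoder); the names here are illustrative, not the exporter's actual API:

```rust
use std::io::Write;

// Propagate the I/O error to the caller instead of panicking with `expect`.
fn open_json_array<W: Write>(buf: &mut W) -> Result<(), String> {
    buf.write_all(b"[")
        .map_err(|e| format!("write to memory buffer failed: {e}"))
}
```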
```rust
let json_bytes =
    serde_json::to_vec(&body).map_err(|e| format!("Failed to serialize to JSON: {e}"))?;
// Clone static headers and add the auth header
let mut headers = self.static_headers.clone();
```
These are static headers; we should be able to avoid cloning them in the hot path.
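One way to do that, sketched here under the assumption that the HTTP client is reqwest (as later snippets in this PR suggest): register the static headers as the client's default headers once, so only the per-request auth header is added in the hot path. Function names are illustrative.

```rust
use reqwest::header::{HeaderMap, AUTHORIZATION};

// Build the client once with the static headers attached; reqwest sends
// default headers with every request, so no per-request clone is needed.
fn build_client(static_headers: HeaderMap) -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .default_headers(static_headers)
        .build()
}

// In the hot path only the auth header is set on top of the defaults.
async fn post_batch(
    client: &reqwest::Client,
    url: &str,
    body: Vec<u8>,
    token: &str,
) -> reqwest::Result<reqwest::Response> {
    client
        .post(url)
        .header(AUTHORIZATION, format!("Bearer {token}"))
        .body(body)
        .send()
        .await
}
```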
```rust
    self.current_uncompressed_size += 1;
} else {
    self.buf
        .write_all(b",")
```
We append `,` here, but later, when the size check triggers a flush, the batch is emitted as `[... ,]` without the attempted element. The trailing comma makes the JSON invalid and will most probably fail at ingestion.
The API handles this just fine, but you are right that this is not valid JSON, so I will fix it as you suggested.
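A minimal sketch of one way to fix it (the names are illustrative, not the exporter's actual batcher): only write the `,` separator once it is certain the element will be appended, so a size-triggered flush can never leave a dangling comma in the buffer.

```rust
use std::io::Write;

// Append one serialized element to the in-progress JSON array.
// The separator is written immediately before the element, never speculatively.
fn append_element<W: Write>(
    buf: &mut W,
    element_count: &mut usize,
    json: &[u8],
) -> std::io::Result<()> {
    if *element_count == 0 {
        buf.write_all(b"[")?; // first element opens the array
    } else {
        buf.write_all(b",")?; // subsequent elements are preceded by a comma
    }
    buf.write_all(json)?;
    *element_count += 1;
    Ok(())
}
```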
Force-pushed d0d5368 to 413048d
jmacd left a comment:
Since this is still in draft form, please take these comments as conversation starters! 😀
rust/otap-dataflow/crates/otap/src/experimental/azure_monitor_exporter/auth.rs (comment resolved)
```rust
// TODO - Remove print statements
#[allow(clippy::print_stdout)]
pub struct Auth {
    credential: Arc<dyn TokenCredential>,
    scope: String,
    // Thread-safe shared token cache
    cached_token: Arc<RwLock<Option<AccessToken>>>,
}
```
Suggested change:

```diff
-// TODO - Remove print statements
+// TODO - Consolidate with crates/otap/src/{cloud_auth,object_store}/azure.rs
 #[allow(clippy::print_stdout)]
 pub struct Auth {
     credential: Arc<dyn TokenCredential>,
     scope: String,
     // Thread-safe shared token cache
     cached_token: Arc<RwLock<Option<AccessToken>>>,
 }
```
I recognize this is using things already committed in crates/otap/src/experimental/azure_monitor_exporter/config.rs.
Have a look at rust/otap-dataflow/crates/otap/src/cloud_auth/azure.rs and rust/otap-dataflow/crates/otap/src/object_store/azure.rs, added subsequently in #1517. The struct here looks similar to the object_store struct, and there are two similar Auth configs now. It would be nice to see less code and more re-use as we move forward; not a blocker for this PR.
(@lalitb please review. I believe our position should be that Azure auth code/config belongs in an extension, the extension used by parquet_exporter for object_store and by Azure Monitor here.)
@jmacd That's correct. Instead of each exporter (Azure Monitor, Parquet/object_store, etc.) implementing its own auth config and credential creation logic, they should all reference this shared extension - and any common requirements should be met by extending this module rather than adding parallel implementations.
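As a rough illustration of that direction (not the actual extension API, which does not exist in this form yet), a shared helper could own the credential, scope, and token cache, and both exporters would only hold a handle to it. The trait, types, and refresh margin below are placeholders:

```rust
use std::sync::Arc;
use std::time::{Duration, SystemTime};
use tokio::sync::RwLock;

// Placeholder for whatever credential abstraction the shared extension exposes
// (e.g. azure_identity's TokenCredential); the signature is illustrative only.
#[async_trait::async_trait]
pub trait Credential: Send + Sync {
    async fn fetch_token(&self, scope: &str) -> Result<(String, SystemTime), String>;
}

#[derive(Clone)]
struct CachedToken {
    token: String,
    expires_at: SystemTime,
}

pub struct SharedAzureAuth {
    credential: Arc<dyn Credential>,
    scope: String,
    cached: Arc<RwLock<Option<CachedToken>>>,
}

impl SharedAzureAuth {
    pub async fn token(&self) -> Result<String, String> {
        // Fast path: reuse the cached token while it is still comfortably valid.
        if let Some(t) = self.cached.read().await.as_ref() {
            if t.expires_at > SystemTime::now() + Duration::from_secs(120) {
                return Ok(t.token.clone());
            }
        }
        // Slow path: refresh under the write lock and update the cache.
        let mut guard = self.cached.write().await;
        let (token, expires_at) = self.credential.fetch_token(&self.scope).await?;
        *guard = Some(CachedToken { token: token.clone(), expires_at });
        Ok(token)
    }
}
```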
```rust
    failed_batch_count: f64,
    failed_msg_count: f64,
    average_client_latency_secs: f64,
}
```
Have a look at a few of the components in the crates/otap/src folder, such as retry_processor.rs, and how the existing metrics API already covers the counters here.

As for histogram measurements, I would prefer to hold back. Again, asking @lquerel for opinions: recording histogram instruments could use a thread-local histogram data structure or message-passing of latency measurements (or both), at which point it may as well be a span message aggregated in the OTel SDK or one of our own pipelines for a latency metric.

We shouldn't reinvent this stuff; see https://github.com/open-telemetry/opentelemetry-rust/blob/main/opentelemetry-sdk/src/metrics/internal/exponential_histogram.rs.

I see this as a question of routing histogram measurements (per core) to the central instrumentation collector.
@gouslu I understand that some of this is for your own performance studies. Latency and counters aside, can we remove processing_started_at, last_message_received_at, idle_duration, and average_client_latency_secs?
As for the counters, we aim to standardize pipeline metrics, the topic in #487 and a model RFC in the collector RFC https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/rfcs/component-universal-telemetry.md. The counters here are fine until we have a more-standard solution.
I need these for myself until I am done. I can put a TODO to change this later.
```rust
async fn handle_pdata(
    &mut self,
    effect_handler: &EffectHandler<OtapPdata>,
    request: ExportLogsServiceRequest,
```
It will be a relatively large performance improvement when we avoid constructing ExportLogsServiceRequest objects and use the view instead. @lalitb, could you please help with a pointer and/or an example?
+1. @gouslu, you can find the example and the benchmark here: OtapLogsView.
```rust
self.handle_shutdown(&effect_handler).await?;
return Ok(TerminalState::new(
    deadline,
    std::iter::empty::<otap_df_telemetry::metrics::MetricSetSnapshot>(),
```
As I mentioned in another comment, we should follow the other crates/otap components' example of metrics integration with crates/telemetry; then you'll have a proper MetricsSet at this point.
I can look into this in future updates; currently I am focused on optimizing performance and refactoring the code.
```rust
}

pub struct InFlightExports {
    futures: FuturesUnordered<BoxFuture<'static, CompletedExport>>,
```
👍
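For context, a self-contained sketch of the pattern the struct above uses (assuming the futures crate; the fields of `CompletedExport` are placeholders): FuturesUnordered keeps the in-flight export futures and yields each result as soon as it completes, so the exporter is not blocked on the slowest request.

```rust
use futures::future::BoxFuture;
use futures::stream::{FuturesUnordered, StreamExt};

// Placeholder result type for a finished export.
struct CompletedExport {
    success: bool,
}

struct InFlightExports {
    futures: FuturesUnordered<BoxFuture<'static, CompletedExport>>,
}

impl InFlightExports {
    // Add a new export future to the in-flight set.
    fn push(&mut self, fut: BoxFuture<'static, CompletedExport>) {
        self.futures.push(fut);
    }

    // Await the next export that finishes, in completion order.
    async fn next_done(&mut self) -> Option<CompletedExport> {
        self.futures.next().await
    }
}
```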
```rust
"─────────────── AzureMonitorExporter ──────────────────────────
perf │ th/s={:.2} avg_lat={:.2}ms
success │ rows={:.0} batches={:.0} msgs={:.0}
fail │ rows={:.0} batches={:.0} msgs={:.0}
time │ elapsed={:.1}s active={:.1}s idle={:.1}s
state | batch_to_msg={} msg_to_batch={} msg_to_data={}
───────────────────────────────────────────────────────────────\n",
```
FYI @cijothomas, I've recommended that this component use the crates/telemetry framework, which would mean we could compute performance measurements using the continuous benchmarks. OTOH, we would need a server that accepts gzip-compressed JSON for the testing. Either way, I've recommended using Counter<_> and the existing metrics support for now (for counters), and asked questions here about logging and histogram measurements.
Created a TODO for tracking metrics; will work with Cijo on perf testing.
All of my real questions have to do with instrumentation and the potential for re-use of Azure auth structs. The code looks good!
Force-pushed 9f5cae4 to 29fb7bc
```rust
pub fn new(config: &Config) -> Result<Self, String> {
    let http_client = Client::builder()
        .timeout(Duration::from_secs(30))
        .http2_prior_knowledge() // Use HTTP/2 directly (faster, no upgrade negotiation)
```
A note on HTTP/2: Since Azure Monitor uses HTTPS, ALPN negotiates HTTP/2 as part of the SSL handshake - so there's no extra round trip. Explicitly adding this configuration provides no benefit.
I'd also recommend benchmarking with both HTTP/1.1 and HTTP/2. With HTTP/2, the Geneva backend restricts the client from creating new connections, meaning all requests are multiplexed over a single connection. When payload sizes are large, this single connection can become a bottleneck and lead to timeouts. I ran into this exact issue with the Geneva exporter and had to switch to HTTP/1.1 to resolve it.
Not a blocker for this PR, but it would be good to run some benchmarks.
I was considering this as well and switched back to 1.1. I didn't see any perf difference, but chose 1.1 because it is better for creating one HTTP client per API client, which is what helps improve perf a lot in my case.
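For reference, a hedged sketch of the two client configurations being compared, assuming reqwest (as the snippet above suggests); function names are illustrative:

```rust
use std::time::Duration;

// Force HTTP/1.1: each client gets its own connection(s), avoiding the
// single-multiplexed-connection bottleneck described above.
fn build_http1_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(30))
        .http1_only()
        .build()
}

// No explicit protocol setting: over HTTPS, ALPN negotiates HTTP/2 during the
// TLS handshake when the server supports it, with no extra round trip.
fn build_alpn_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .timeout(Duration::from_secs(30))
        .build()
}
```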
Force-pushed be097bf to c3e4790
```rust
_ = tokio::time::sleep_until(next_stats_print) => {
    next_stats_print = tokio::time::Instant::now() + tokio::time::Duration::from_secs(STATS_PRINT_INTERVAL);
```
OK for experimental. We'll want to use the built-in metrics, and we'll want to make the continuous benchmark support this exporter as we've discussed.