Conversation
Summary of ChangesHello @nirs, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical network stability and performance issue on macOS when using Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request correctly handles the ENOBUFS error on macOS for unixgram sockets by mapping it to WriteError::NothingWritten, which allows the caller to retry sending the frame. This prevents a fatal error when the kernel socket buffer is full. I've added one suggestion to also handle EAGAIN and EWOULDBLOCK in the same way to improve portability and make the error handling more robust across different Unix-like systems.
|
@mtjhrc PTAL |
|
The change LGTM, but what did you mean by this in the commit message:
Wouldn't we get notified via kqueue the socket is writable again? (same as EAGAIN on Linux...) |
I could not find any evidence that we can use non-blocking io to detect if the socket is writable or not. The only info I could find was in FreeBSD mailing list thread explaining that the only way to recover is retrying. Looking at early experiments with vmnet-helper I did not try non-blocking io, or maybe I did not commit this since it did not work. |
|
Hmm, but we are already using a non-blocking socket here. Notably we also use What I am wondering is that, If receiving an |
I'm not sure that edge triggered behavior works for datagram socket. The socket is probably always writable since we never get EAGAIN. But this is a separate issue to investigate, retrying ENOBUFS works. It can be interesting to count retries and be able to get stats so we have some visibility on this issue. Maybe add debug log in this case? |
|
I mean this is definitely good, even if it likely isn't a full fix (like discussed above) - the code change LGTM. It's just the commit message seems misleading: "Map ENOBUFS to WriteError::NothingWritten so the caller retries the frame instead of treating it as a fatal error." - the caller doesn't retry on |
Right, "caller retries" is not a good description. The write is retried when the socket is considered writable (need to test if this happen) or the guest kicks us. I'll try to experiment more and have a more correct description how this is handled. |
|
I feel like we should be getting the EV_WRITE events. I made a simple test and it seemed to work: |
3236fe9 to
c1efb00
Compare
Interesting, but there are no timestamps so it is not clear if we got the event immediately after the read. |
|
@mtjhrc I'm trying the EV_WRITE way, it seems like a simple change. |
c1efb00 to
3a1c999
Compare
|
@mtjhrc Current version should be a complete fix. Tested with vment-helper and podman/gvproxy. I will post updated results later. |
|
Oh, so we were already busy looping on I'll try testing this a bit too, but fix looks good (I guess this now touches unixstream / passt too - fixes the busy loop) |
3a1c999 to
a8f3b9a
Compare
@mtjhrc Yes and no. We get WRITE event, but the next write may fail again with ENOBUFS. I added a log to track this.
Yes, the worker change fixes the same issue for unixtream. I did not test with unixstream . |
Example debug log showing ENUBUFS recoveryLogs created with #564. Timeline
Based on the timeline, we woke up 4 times on WRITE event. The first 3 write_frame attempts failed with ENOBUFS (we log only the first attempt). The 4th attempt suceeded, so we expect to see "resolved after 3 retries", but we see "resolved after 4 retries". So we woke up another time, probably due to guest notification (kevent: READ fd=83 data=8). We woke up twice when available socket buffer was 17996, but we try to write 66560 frame. This can explain why we get a write event but we cannot complete the write. So WRITE event works, but cannot guarantee that the write will be completed. |
c6006f4 to
44907ff
Compare
|
After testing this change with real ramen testing workload I see issues when podman networking is broken containers/gvisor-tap-vsock#617. |
82d5f9b to
a3e6b71
Compare
|
@mtjhrc Please review again. Current version adds a timer-based retry delay, fixing random hangs when we don't get a WRITE event after ENOBUFS. |
|
Ugh... sure a timer could resolve the deadlock, but it doesn't seem great. Thinking about this more I think the problem is that If I am right, using |
The frame size can change. Do you suggest to use the largest possible frame (65550) once when starting the worker? Or maybe 2 frames to ensure we have space for one fi there is additional space needed? |
I mean we could use the |
|
@mtjhrc I tried both:
Both do not work: under load (iperf3 -P 8 --bidir -t 600) we fail after some time when we get ENOBUFS and we never get WRITE event. I cannot explain why it does not work, but it many be related to the fact that with datagram sockets on macOS, the send buffer is not used (it only set the maximum frame size #574). The frames are copied to the peer receive buffer. There is a note that the write check is racy, but I don't see how this an effect our case (2 peers): In summary:
|
27fbde1 to
ac485f1
Compare
|
Need a rebase after #567 was merged, will simplify the code. |
If write_frame() returns NothingWritten, propagate the error to process_tx_loop() and return to the event loop. The event loop will wake us when the socket available space is changed. Previously when the backend could not write anything we enabled notifications and retried process_tx(), creating a busy loop ending when the write complete. Signed-off-by: Nir Soffer <nirsof@gmail.com>
Add add_oneshot_timer() to register a one-shot EVFILT_TIMER with microsecond precision. The caller provides a udata value to identify the timer event, since epoll has no timer concept — timer events are reported with empty event bits and the caller matches on udata. This will be used to implement delayed TX retries after ENOBUFS, where stress testing showed that edge-triggered EVFILT_WRITE does not re-fire reliably after the error. Assisted-by: Cursor/Claude Opus 4.6 Signed-off-by: Nir Soffer <nirsof@gmail.com>
When process_tx() fails with NothingWritten, set tx_has_deferred_frame and schedule a retry using the epoll shim's one-shot timer. The retry delay is provided by the backend via write_retry_delay_us(), which defaults to 0 (no timer). Backends opt in by overriding it. Assisted-by: Cursor/Claude Opus 4.6 Signed-off-by: Nir Soffer <nirsof@gmail.com>
When running iperf3 with gvproxy or vmnet-helper, krunkit breaks
randomly with:
[2026-02-19T02:53:41Z ERROR devices::virtio::net::worker] Failed to process rx:
Backend(Internal(ENOBUFS)) (triggered by backend socket readable)
macOS returns ENOBUFS when the kernel socket buffer is full, rather
than blocking or returning EAGAIN on non-blocking sockets. In
vmnet-helper this is handled by retrying the write after 50 microseconds
sleep.
We use non-blocking I/O, and we get WRITE events when the socket becomes
writeable, but because we need to write a complete frame, the write can
fail with ENOBUFS again after we get a WRITE event. Testing shows that
we typically get several WRITE event and ENOBUFS error is resolved after
few retries.
Stress testing shows that sometimes we don't get a WRITE event after the
network proxy read all data from the socket. The network proxy is
blocked on read(), libkrun blocked on kevent() never getting any event,
and the guest is blocked waiting for interrupt. The only reliable way to
recover from this is to retry the write after a delay.
Map ENOBUFS to WriteError::NothingWritten so the write is retried
instead of treating it as a fatal error. We also log a debug log to
make it easy to understand the events.
Override write_retry_delay_us() to return 50 microseconds delay. The
worker use this to retry the write, ensuring that we make progress when
WRITE event does not fire.
Assisted-by: Cursor/Claude Opus 4.6
Signed-off-by: Nir Soffer <nirsof@gmail.com>
ENOBUFS on macOS does not always resolve on the first writable event
from kevent. The writable event fires when any buffer space is
available, but for datagram sockets the entire message must fit
atomically. When writing large frames (e.g. 65226 bytes), the socket
may report writable with insufficient space (e.g. data=17996), causing
send() to return ENOBUFS until enough buffer drains.
Add a retry counter to log the first ENOBUFS event and the number of
retries on recovery:
[2026-02-26T16:36:24.440890Z INFO devices::virtio::net::unixgram] write_frame: ENOBUFS
[2026-02-26T16:36:24.441544Z INFO devices::virtio::net::unixgram] write_frame: ENOBUFS resolved after 1 retries
We use info level since ENOBUFS events are rare and testing in debug
level is about 3x slower, distorting the results.
ENOBUFS event are rare when testing with vment-helper, and common when
testing with gvproxy. The following stats are from 10 minutes stress
test using `iperf3 -c vm-address -P 8 --bidir -t 600`.
vment-helper stats:
Total ENOBUFS events: 293
Retries Count % Min Avg P99 Max
1 280 95.6% 17us 425us 1274us 1470us
2 12 4.1% 517us 656us 1047us 1047us
3 1 0.3% 624us 624us 624us 624us
Overall: min=17us avg=436us p50=532us p99=1274us max=1470us
Retry count histogram:
1 retries: ################################################## 280
2 retries: ## 12
3 retries: # 1
gvproxy stats:
Total ENOBUFS events: 18346
Retries Count % Min Avg P99 Max
1 17291 94.2% 7us 85us 394us 1566us
2 884 4.8% 14us 188us 650us 923us
3 151 0.8% 29us 271us 1095us 2092us
4 16 0.1% 80us 319us 959us 959us
5 4 0.0% 42us 280us 691us 691us
Overall: min=7us avg=92us p50=66us p99=452us max=2092us
Retry count histogram:
1 retries: ################################################## 17291
2 retries: ## 884
3 retries: # 151
4 retries: # 16
5 retries: # 4
Assisted-by: Cursor/Claude Opus 4.6
Signed-off-by: Nir Soffer <nirsof@gmail.com>
In read_frame we log:
Read eth frame from proxy: 65550 bytes
But in write_frame we log:
Written frame size=65226, written=65226
The frame size is always equal to the return value. There is no partial
write with datagram socket. Unify the log to:
Written eth frame to proxy: 65226 bytes
Signed-off-by: Nir Soffer <nirsof@gmail.com>
When running iperf3 with gvproxy or vmnet-helper, krunkit breaks randomly with:
macOS returns ENOBUFS when the kernel socket buffer is full, rather than blocking or returning EAGAIN on non-blocking sockets. Stress testing showed that edge-triggered EVFILT_WRITE does not re-fire reliably after ENOBUFS, causing a permanent stall where the worker is blocked in kevent(), the network proxy is blocked on read(), and the guest is waiting for an interrupt.
The fix uses a one-shot kqueue timer (50μs) to retry TX after NothingWritten, ensuring progress without busy-looping. The retry delay is backend-specific: only unixgram on macOS enables it, since datagram sockets can report writable with insufficient space for a complete frame.
Changes
Stop tx loop on NothingWritten — When a backend cannot write, propagate the error to process_tx_loop() and return to the event loop instead of busy-looping.
Add one-shot timer to macOS epoll shim — Add
add_oneshot_timer()using EVFILT_TIMER with microsecond precision. Timer events are reported with empty event bits; the caller identifies them by udata.Add timer-based TX retry in worker — When process_tx() fails with NothingWritten, set
tx_has_deferred_frameand schedule a retry via the one-shot timer. The retry delay is provided by the backend viawrite_retry_delay_us()(default 0, no timer).Retry ENOBUFS after delay — Map ENOBUFS to
WriteError::NothingWrittenand overridewrite_retry_delay_us()to return 50μs in the unixgram backend on macOS.Track ENOBUFS retries — Add a retry counter to log the first ENOBUFS event and recovery. 94% of events resolve on the first retry (50μs), worst case 5 retries (~2ms).
Unify read/write logs — Unify read_frame and write_frame debug logs.
Test results
Tested 600 seconds bidirectional iperf3 runs with 8 streams:
Configuration:
Fixes #555