RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion#230
RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion#230mohamedakif1 wants to merge 2 commits intordkcentral:developfrom
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
369b042 to
3646792
Compare
|
There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix. |
| ulog(ULOG_SYSTEM, UL_SYSEVENT, "Received reinit signal"); | ||
| } | ||
|
|
||
| static int set_fd_socket_timeouts(int fd, int timeout_ms) |
There was a problem hiding this comment.
There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.
@snayak002c : The timeouts we introduced (SO_SNDTIMEO , SO_RCVTIMEO ) are for any send/recv operation being stuck. it is not for the idle time when fd is simply open but no active send /recv is in effect. fd remains usable for future purpose. |
5200813 to
0cea138
Compare
Reason For Change: . syseventd accepts client connections and performs synchronous reads/writes on client fds. Without per‑socket send/recv timeouts, a slow, stalled, or misbehaving client can cause write()/read() to block for extended periods.This pins a worker thread in a tight retry loop or prolonged wait, starving other work. Worker threads were blocked on client socket I/O without timeouts; unhealthy clients starved the pool which makes only thread1/thread2 to be active. patch adds 200 ms per‑socket timeouts, return accurate errno from send routines, and actively evict misbehaving clients via centralized failure handling, Testing Done: . Tested the patch integrated image on the stress test of 500 loops of continuous sysevent set of wan status start and stop, sysevent fd count seems to be in limit and sysevent operations are working fine. Signed-off-by: Mohamed Akif Shaikh <mohamedakif.shaikh@vodafone.com>
0cea138 to
435e2d4
Compare
|
@snayak002c could you please review the response for your query above? |
There was a problem hiding this comment.
Pull request overview
This PR addresses syseventd worker/thread and fd exhaustion by ensuring client socket I/O doesn’t block indefinitely and by adding centralized handling for client send failures.
Changes:
- Add per-socket send/receive timeouts (200ms) for newly accepted client connections.
- Track and propagate the correct
errnofromSE_msg_send()/SE_msg_send_data()failures. - Introduce centralized send-failure handling for notification sends, including client health tracking and forced disconnect on hard failures.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
source/sysevent/server/worker_threads.c |
Adds handle_client_send_failure() and uses it on notification send failures. |
source/sysevent/server/syseventd_main.c |
Applies SO_SNDTIMEO/SO_RCVTIMEO to newly accepted client sockets. |
source/sysevent/lib/libsysevent.c |
Preserves the last socket write errno on send failure by restoring errno before returning -1. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| printf("Thread %d: hard send failure on client fd %d (%s). errno=%d %s. Disconnecting immediately\n", | ||
| id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno)); | ||
| ) | ||
| CLI_MGR_remove_client_by_fd(fd, who, 1); |
There was a problem hiding this comment.
In the hard-failure path, CLI_MGR_remove_client_by_fd(fd, who, 1) uses force=1, which bypasses the (fd,id) consistency check. In a multi-threaded server where fds can be closed and reused, forcing removal increases the risk of removing the wrong client entry. Prefer force=0 here (you already pass the client token) so the client manager can verify the id matches before removal.
| CLI_MGR_remove_client_by_fd(fd, who, 1); | |
| CLI_MGR_remove_client_by_fd(fd, who, 0); |
| SE_INC_LOG(ERROR, | ||
| int id = thread_get_id(worker_data_key); | ||
| printf("Thread %d: client fd %d disconnected after repeated send timeouts\n", id, fd); | ||
| ) |
There was a problem hiding this comment.
CLI_MGR_handle_client_error_by_fd() can forcibly disconnect a client when the error threshold is exceeded, but that function only free_a_client() + close(fd) and does not remove any trigger notification registrations (unlike CLI_MGR_remove_client_by_fd, which does). With the new send-timeout path calling CLI_MGR_handle_client_error_by_fd() frequently, this can leave stale notification actions behind after a timeout-driven disconnect. Consider updating the disconnect behavior to also unregister notifications (e.g., by invoking the same trigger cleanup done in CLI_MGR_remove_client_by_fd, or by centralizing disconnect logic in one place).
| ) | |
| ) | |
| /* | |
| * When the error threshold is exceeded, CLI_MGR_handle_client_error_by_fd() | |
| * will disconnect the client (free + close), but it does not unregister | |
| * any trigger notification registrations. Ensure we also run the full | |
| * disconnect path (including trigger cleanup) just as in the hard | |
| * send-failure case below. | |
| */ | |
| CLI_MGR_remove_client_by_fd(fd, who, 1); |
| static void handle_client_send_failure(const int fd, const token_t who, const char *context) | ||
| { | ||
| int send_errno = errno; | ||
|
|
||
| if (EAGAIN == send_errno || EWOULDBLOCK == send_errno) { | ||
| int threshold_rc = CLI_MGR_handle_client_error_by_fd(fd); | ||
| SE_INC_LOG(ERROR, | ||
| int id = thread_get_id(worker_data_key); | ||
| printf("Thread %d: send timeout to client fd %d (%s). errno=%d %s threshold_rc=%d\n", | ||
| id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno), threshold_rc); | ||
| ) | ||
| if (1 == threshold_rc) { | ||
| SE_INC_LOG(ERROR, | ||
| int id = thread_get_id(worker_data_key); | ||
| printf("Thread %d: client fd %d disconnected after repeated send timeouts\n", id, fd); | ||
| ) | ||
| } | ||
| return; | ||
| } | ||
|
|
||
| if (EPIPE == send_errno || ECONNRESET == send_errno || ENOTCONN == send_errno) { | ||
| SE_INC_LOG(ERROR, | ||
| int id = thread_get_id(worker_data_key); | ||
| printf("Thread %d: hard send failure on client fd %d (%s). errno=%d %s. Disconnecting immediately\n", | ||
| id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno)); | ||
| ) | ||
| CLI_MGR_remove_client_by_fd(fd, who, 1); |
There was a problem hiding this comment.
The parameter name who in handle_client_send_failure() is misleading here: the function is operating on a target client identified by fd, and the who value is only used as the client token/id for removal. Given the surrounding code already has both a token_t id (client token) and multiple int id (thread id) variables in log scopes, this is easy to misuse. Renaming the parameter to something like client_id (and updating call sites) would reduce the chance of passing the thread id or message sender token by mistake.
| if (-1 == set_fd_socket_timeouts(newsockfd, 200)) { | ||
| ulogf(ULOG_SYSTEM, UL_SYSEVENT, | ||
| "Unable to set client fd %d socket timeouts. (%d) %s", | ||
| newsockfd, errno, strerror(errno)); |
There was a problem hiding this comment.
If set_fd_socket_timeouts() fails, the code logs but continues keeping the new client connection alive in fully-blocking mode. Since the motivation of this PR is to prevent worker starvation on slow/stalled clients, it would be safer to reject the connection when timeouts cannot be applied (e.g., close newsockfd and remove the just-created client entry), or alternatively set the socket to non-blocking + use select/poll-based I/O for this connection.
| newsockfd, errno, strerror(errno)); | |
| newsockfd, errno, strerror(errno)); | |
| /* Reject the connection if we cannot apply socket timeouts, | |
| * to avoid blocking workers on a fully-blocking client. | |
| */ | |
| reply_msg_body->status = htonl(-1); | |
| reply_msg_body->token_id = htonl(TOKEN_NULL); | |
| SE_msg_send(newsockfd, reply_msg_buffer); | |
| SE_INC_LOG(MESSAGES, | |
| printf("Sent %s (%d) to client on fd %d\n", | |
| SE_print_mtype(SE_MSG_OPEN_CONNECTION_REPLY), SE_MSG_OPEN_CONNECTION_REPLY, newsockfd); | |
| ) | |
| SE_INC_LOG(MESSAGE_VERBOSE, | |
| SE_print_message_hdr(reply_msg_buffer); | |
| ) | |
| close(newsockfd); | |
| continue; |
anoopchelakkode
left a comment
There was a problem hiding this comment.
how can this issue be reproduced? Before merging this code, please share the reproduction and validation results.
|
@mohamedakif1 Can you please address the query from Anoop? |
@anoopchelakkode you can just use this simulation script which puts syseventd under stress. and issue will be reproduced starting 210th cycle or 220th cycle.. slowness in response will be visible starting 120th cycle around. for i in $(seq 1 500); do sysevent set wan-status stopped sysevent set wan-status started sysevent get firewall-restart echo "Cycle $i complete" echo "FD count" |
Reason For Change:
. syseventd accepts client connections and performs synchronous reads/writes on client fds. Without per‑socket send/recv timeouts, a slow, stalled, or misbehaving client can cause write()/read() to block for extended periods.This pins a worker thread in a tight retry loop or prolonged wait, starving other work.
Worker threads were blocked on client socket I/O without timeouts; unhealthy clients starved the pool which makes only thread1/thread2 to be active. patch adds 200 ms per‑socket timeouts, return accurate errno from send routines, and actively evict misbehaving clients via centralized failure handling,
Testing Done:
. Tested the patch integrated image on the stress test of 500 loops of continuous sysevent set of wan status start and stop, sysevent fd count seems to be in limit and sysevent operations are working fine.