RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion by mohamedakif1 · Pull Request #230 · rdkcentral/utopia

mohamedakif1 · 2026-02-18T13:23:37Z

Reason For Change:
. syseventd accepts client connections and performs synchronous reads/writes on client fds. Without per‑socket send/recv timeouts, a slow, stalled, or misbehaving client can cause write()/read() to block for extended periods.This pins a worker thread in a tight retry loop or prolonged wait, starving other work.
Worker threads were blocked on client socket I/O without timeouts; unhealthy clients starved the pool which makes only thread1/thread2 to be active. patch adds 200 ms per‑socket timeouts, return accurate errno from send routines, and actively evict misbehaving clients via centralized failure handling,

Testing Done:
. Tested the patch integrated image on the stress test of 500 loops of continuous sysevent set of wan status start and stop, sysevent fd count seems to be in limit and sysevent operations are working fine.

github-actions · 2026-02-18T13:23:51Z

All contributors have signed the CLA ✍️ ✅
_{Posted by the CLA Assistant Lite bot.}

mohamedakif1 · 2026-02-18T13:26:09Z

I have read the CLA Document and I hereby sign the CLA

snayak002c · 2026-03-11T23:28:23Z

There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.

snayak002c · 2026-03-13T15:47:14Z

source/sysevent/server/syseventd_main.c

   ulog(ULOG_SYSTEM, UL_SYSEVENT, "Received reinit signal");
 }

+static int set_fd_socket_timeouts(int fd, int timeout_ms)


There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.

mohamedakif1 · 2026-03-17T11:16:27Z

There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.

@snayak002c : The timeouts we introduced (SO_SNDTIMEO , SO_RCVTIMEO ) are for any send/recv operation being stuck. it is not for the idle time when fd is simply open but no active send /recv is in effect. fd remains usable for future purpose.
We have an example of wan manager where the fd for sysevent is opened once during the startup and kept open until the life cycle of the process and used for sysevent set/get operations.
We have tested the WAN scenarios thoroughly and have observed NO negative impact of this change .

Reason For Change: . syseventd accepts client connections and performs synchronous reads/writes on client fds. Without per‑socket send/recv timeouts, a slow, stalled, or misbehaving client can cause write()/read() to block for extended periods.This pins a worker thread in a tight retry loop or prolonged wait, starving other work. Worker threads were blocked on client socket I/O without timeouts; unhealthy clients starved the pool which makes only thread1/thread2 to be active. patch adds 200 ms per‑socket timeouts, return accurate errno from send routines, and actively evict misbehaving clients via centralized failure handling, Testing Done: . Tested the patch integrated image on the stress test of 500 loops of continuous sysevent set of wan status start and stop, sysevent fd count seems to be in limit and sysevent operations are working fine. Signed-off-by: Mohamed Akif Shaikh <mohamedakif.shaikh@vodafone.com>

mohamedakif1 · 2026-03-24T07:58:48Z

@snayak002c could you please review the response for your query above?
also, not sure why this PR is moved to "Requested change" is there any review comment on the source code? please let us know.
cc: @cpokuru @pradeeptakdas

Copilot

Pull request overview

This PR addresses syseventd worker/thread and fd exhaustion by ensuring client socket I/O doesn’t block indefinitely and by adding centralized handling for client send failures.

Changes:

Add per-socket send/receive timeouts (200ms) for newly accepted client connections.
Track and propagate the correct errno from SE_msg_send() / SE_msg_send_data() failures.
Introduce centralized send-failure handling for notification sends, including client health tracking and forced disconnect on hard failures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`source/sysevent/server/worker_threads.c`	Adds `handle_client_send_failure()` and uses it on notification send failures.
`source/sysevent/server/syseventd_main.c`	Applies SO_SNDTIMEO/SO_RCVTIMEO to newly accepted client sockets.
`source/sysevent/lib/libsysevent.c`	Preserves the last socket write errno on send failure by restoring `errno` before returning `-1`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-24T23:46:35Z

source/sysevent/server/worker_threads.c

+         printf("Thread %d: hard send failure on client fd %d (%s). errno=%d %s. Disconnecting immediately\n",
+                id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno));
+      )
+      CLI_MGR_remove_client_by_fd(fd, who, 1);


In the hard-failure path, CLI_MGR_remove_client_by_fd(fd, who, 1) uses force=1, which bypasses the (fd,id) consistency check. In a multi-threaded server where fds can be closed and reused, forcing removal increases the risk of removing the wrong client entry. Prefer force=0 here (you already pass the client token) so the client manager can verify the id matches before removal.

Suggested change

CLI_MGR_remove_client_by_fd(fd, who, 1);

CLI_MGR_remove_client_by_fd(fd, who, 0);

Copilot · 2026-03-24T23:46:36Z

source/sysevent/server/worker_threads.c

+         SE_INC_LOG(ERROR,
+            int id = thread_get_id(worker_data_key);
+            printf("Thread %d: client fd %d disconnected after repeated send timeouts\n", id, fd);
+         )


CLI_MGR_handle_client_error_by_fd() can forcibly disconnect a client when the error threshold is exceeded, but that function only free_a_client() + close(fd) and does not remove any trigger notification registrations (unlike CLI_MGR_remove_client_by_fd, which does). With the new send-timeout path calling CLI_MGR_handle_client_error_by_fd() frequently, this can leave stale notification actions behind after a timeout-driven disconnect. Consider updating the disconnect behavior to also unregister notifications (e.g., by invoking the same trigger cleanup done in CLI_MGR_remove_client_by_fd, or by centralizing disconnect logic in one place).

Suggested change

)

)

/*

* When the error threshold is exceeded, CLI_MGR_handle_client_error_by_fd()

* will disconnect the client (free + close), but it does not unregister

* any trigger notification registrations. Ensure we also run the full

* disconnect path (including trigger cleanup) just as in the hard

* send-failure case below.

*/

CLI_MGR_remove_client_by_fd(fd, who, 1);

Copilot · 2026-03-24T23:46:36Z

source/sysevent/server/worker_threads.c

+static void handle_client_send_failure(const int fd, const token_t who, const char *context)
+{
+   int send_errno = errno;
+
+   if (EAGAIN == send_errno || EWOULDBLOCK == send_errno) {
+      int threshold_rc = CLI_MGR_handle_client_error_by_fd(fd);
+      SE_INC_LOG(ERROR,
+         int id = thread_get_id(worker_data_key);
+         printf("Thread %d: send timeout to client fd %d (%s). errno=%d %s threshold_rc=%d\n",
+                id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno), threshold_rc);
+      )
+      if (1 == threshold_rc) {
+         SE_INC_LOG(ERROR,
+            int id = thread_get_id(worker_data_key);
+            printf("Thread %d: client fd %d disconnected after repeated send timeouts\n", id, fd);
+         )
+      }
+      return;
+   }
+
+   if (EPIPE == send_errno || ECONNRESET == send_errno || ENOTCONN == send_errno) {
+      SE_INC_LOG(ERROR,
+         int id = thread_get_id(worker_data_key);
+         printf("Thread %d: hard send failure on client fd %d (%s). errno=%d %s. Disconnecting immediately\n",
+                id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno));
+      )
+      CLI_MGR_remove_client_by_fd(fd, who, 1);


The parameter name who in handle_client_send_failure() is misleading here: the function is operating on a target client identified by fd, and the who value is only used as the client token/id for removal. Given the surrounding code already has both a token_t id (client token) and multiple int id (thread id) variables in log scopes, this is easy to misuse. Renaming the parameter to something like client_id (and updating call sites) would reduce the chance of passing the thread id or message sender token by mistake.

Copilot · 2026-03-24T23:46:36Z

source/sysevent/server/syseventd_main.c

+                   if (-1 == set_fd_socket_timeouts(newsockfd, 200)) {
+                      ulogf(ULOG_SYSTEM, UL_SYSEVENT,
+                            "Unable to set client fd %d socket timeouts. (%d) %s",
+                            newsockfd, errno, strerror(errno));


If set_fd_socket_timeouts() fails, the code logs but continues keeping the new client connection alive in fully-blocking mode. Since the motivation of this PR is to prevent worker starvation on slow/stalled clients, it would be safer to reject the connection when timeouts cannot be applied (e.g., close newsockfd and remove the just-created client entry), or alternatively set the socket to non-blocking + use select/poll-based I/O for this connection.

Suggested change

newsockfd, errno, strerror(errno));

newsockfd, errno, strerror(errno));

/* Reject the connection if we cannot apply socket timeouts,

* to avoid blocking workers on a fully-blocking client.

*/

reply_msg_body->status = htonl(-1);

reply_msg_body->token_id = htonl(TOKEN_NULL);

SE_msg_send(newsockfd, reply_msg_buffer);

SE_INC_LOG(MESSAGES,

printf("Sent %s (%d) to client on fd %d\n",

SE_print_mtype(SE_MSG_OPEN_CONNECTION_REPLY), SE_MSG_OPEN_CONNECTION_REPLY, newsockfd);

)

SE_INC_LOG(MESSAGE_VERBOSE,

SE_print_message_hdr(reply_msg_buffer);

)

close(newsockfd);

continue;

anoopchelakkode

how can this issue be reproduced? Before merging this code, please share the reproduction and validation results.

tinaelizabeth84 · 2026-03-26T20:00:36Z

@mohamedakif1 Can you please address the query from Anoop?

mohamedakif1 · 2026-03-27T06:00:52Z

how can this issue be reproduced? Before merging this code, please share the reproduction and validation results.

@anoopchelakkode you can just use this simulation script which puts syseventd under stress. and issue will be reproduced starting 210th cycle or 220th cycle.. slowness in response will be visible starting 120th cycle around.
the normal fd count is 49-53 in normal conditions... in below test you would see after some cycles, the fd count just keeps on incrementing and then finally will exhaust the total ulimit 1024 fd.
I have verified this script on vanilla build on rpi and was able to reproduce the issue.
`#!/bin/bash

for i in $(seq 1 500); do

sysevent set wan-status stopped

sysevent set wan-status started

sysevent get firewall-restart

echo "Cycle $i complete"

echo "FD count"
ls /proc/$(pidof syseventd)/fd | wc -l`

mohamedakif1 requested review from a team as code owners February 18, 2026 13:23

pradeeptakdas changed the title ~~VODAFONE-45: Sysevent daemon getting stuck due to fd & worker exhaustion~~ RDKBDEV-3385: RDKCOM-5533 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion Feb 25, 2026

pradeeptakdas force-pushed the VODAFONE-45 branch from 369b042 to 3646792 Compare February 26, 2026 13:56

pradeeptakdas changed the title ~~RDKBDEV-3385: RDKCOM-5533 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion~~ RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion Mar 2, 2026

veeraputhiran-thangavel approved these changes Mar 9, 2026

View reviewed changes

snayak002c requested changes Mar 13, 2026

View reviewed changes

AkhilaReddyK7 added the community-contribution Contribution from community label Mar 17, 2026

tinaelizabeth84 force-pushed the VODAFONE-45 branch from 5200813 to 0cea138 Compare March 20, 2026 16:50

tinaelizabeth84 force-pushed the VODAFONE-45 branch from 0cea138 to 435e2d4 Compare March 23, 2026 17:29

Merge branch 'develop' into VODAFONE-45

d15a134

anoopchelakkode requested a review from Copilot March 24, 2026 23:43

Copilot started reviewing on behalf of anoopchelakkode March 24, 2026 23:43 View session

Copilot AI reviewed Mar 24, 2026

View reviewed changes

anoopchelakkode reviewed Mar 24, 2026

View reviewed changes

	CLI_MGR_remove_client_by_fd(fd, who, 1);
	CLI_MGR_remove_client_by_fd(fd, who, 0);

-         )
+         )
+         /*
+          * When the error threshold is exceeded, CLI_MGR_handle_client_error_by_fd()
+          * will disconnect the client (free + close), but it does not unregister
+          * any trigger notification registrations. Ensure we also run the full
+          * disconnect path (including trigger cleanup) just as in the hard
+          * send-failure case below.
+          */
+         CLI_MGR_remove_client_by_fd(fd, who, 1);

-                            newsockfd, errno, strerror(errno));
+                            newsockfd, errno, strerror(errno));
+                      /* Reject the connection if we cannot apply socket timeouts,
+                       * to avoid blocking workers on a fully-blocking client.
+                       */
+                      reply_msg_body->status   = htonl(-1);
+                      reply_msg_body->token_id = htonl(TOKEN_NULL);
+                      SE_msg_send(newsockfd, reply_msg_buffer);
+                      SE_INC_LOG(MESSAGES,
+                         printf("Sent %s (%d) to client on fd %d\n",
+                                SE_print_mtype(SE_MSG_OPEN_CONNECTION_REPLY), SE_MSG_OPEN_CONNECTION_REPLY, newsockfd);
+                      )
+                      SE_INC_LOG(MESSAGE_VERBOSE,
+                         SE_print_message_hdr(reply_msg_buffer);
+                      )
+                      close(newsockfd);
+                      continue;

Conversation

mohamedakif1 commented Feb 18, 2026

Uh oh!

github-actions bot commented Feb 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mohamedakif1 commented Feb 18, 2026

Uh oh!

snayak002c commented Mar 11, 2026

Uh oh!

snayak002c Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

mohamedakif1 commented Mar 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mohamedakif1 commented Mar 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

anoopchelakkode left a comment

Choose a reason for hiding this comment

Uh oh!

tinaelizabeth84 commented Mar 26, 2026

Uh oh!

mohamedakif1 commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

github-actions bot commented Feb 18, 2026 •

edited

Loading

mohamedakif1 commented Mar 17, 2026 •

edited

Loading