Skip to content

RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion#230

Open
mohamedakif1 wants to merge 2 commits intordkcentral:developfrom
mohamedakif1:VODAFONE-45
Open

RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion#230
mohamedakif1 wants to merge 2 commits intordkcentral:developfrom
mohamedakif1:VODAFONE-45

Conversation

@mohamedakif1
Copy link
Copy Markdown

Reason For Change:
. syseventd accepts client connections and performs synchronous reads/writes on client fds. Without per‑socket send/recv timeouts, a slow, stalled, or misbehaving client can cause write()/read() to block for extended periods.This pins a worker thread in a tight retry loop or prolonged wait, starving other work.
Worker threads were blocked on client socket I/O without timeouts; unhealthy clients starved the pool which makes only thread1/thread2 to be active. patch adds 200 ms per‑socket timeouts, return accurate errno from send routines, and actively evict misbehaving clients via centralized failure handling,

Testing Done:
. Tested the patch integrated image on the stress test of 500 loops of continuous sysevent set of wan status start and stop, sysevent fd count seems to be in limit and sysevent operations are working fine.

@mohamedakif1 mohamedakif1 requested review from a team as code owners February 18, 2026 13:23
@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 18, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@mohamedakif1
Copy link
Copy Markdown
Author

I have read the CLA Document and I hereby sign the CLA

@pradeeptakdas pradeeptakdas changed the title VODAFONE-45: Sysevent daemon getting stuck due to fd & worker exhaustion RDKBDEV-3385: RDKCOM-5533 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion Feb 25, 2026
@pradeeptakdas pradeeptakdas changed the title RDKBDEV-3385: RDKCOM-5533 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion RDKCOM-5533: RDKBDEV-3385 VODAFONE-45 Sysevent daemon getting stuck due to fd & worker exhaustion Mar 2, 2026
@snayak002c
Copy link
Copy Markdown
Contributor

There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.

ulog(ULOG_SYSTEM, UL_SYSEVENT, "Received reinit signal");
}

static int set_fd_socket_timeouts(int fd, int timeout_ms)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.

@AkhilaReddyK7 AkhilaReddyK7 added the community-contribution Contribution from community label Mar 17, 2026
@mohamedakif1
Copy link
Copy Markdown
Author

mohamedakif1 commented Mar 17, 2026

There are many components which open sysevent fd and will keep that fd open until process lifetime, what is the behavior of that scenario with this fix.

@snayak002c : The timeouts we introduced (SO_SNDTIMEO , SO_RCVTIMEO ) are for any send/recv operation being stuck. it is not for the idle time when fd is simply open but no active send /recv is in effect. fd remains usable for future purpose.
We have an example of wan manager where the fd for sysevent is opened once during the startup and kept open until the life cycle of the process and used for sysevent set/get operations.
We have tested the WAN scenarios thoroughly and have observed NO negative impact of this change .

Reason For Change:
. syseventd accepts client connections and performs synchronous reads/writes on client fds.
Without per‑socket send/recv timeouts, a slow, stalled, or misbehaving client can cause
write()/read() to block for extended periods.This pins a worker thread in a tight retry
loop or prolonged wait, starving other work.
Worker threads were blocked on client socket I/O without timeouts; unhealthy clients starved
the pool which makes only thread1/thread2 to be active.
patch adds 200 ms per‑socket timeouts, return accurate errno from send routines, and actively
evict misbehaving clients via centralized failure handling,

Testing Done:
. Tested the patch integrated image on the stress test of 500 loops of continuous sysevent set
of wan status start and stop, sysevent fd count seems to be in limit and sysevent operations
are working fine.

Signed-off-by: Mohamed Akif Shaikh <mohamedakif.shaikh@vodafone.com>
@mohamedakif1
Copy link
Copy Markdown
Author

@snayak002c could you please review the response for your query above?
also, not sure why this PR is moved to "Requested change" is there any review comment on the source code? please let us know.
cc: @cpokuru @pradeeptakdas

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses syseventd worker/thread and fd exhaustion by ensuring client socket I/O doesn’t block indefinitely and by adding centralized handling for client send failures.

Changes:

  • Add per-socket send/receive timeouts (200ms) for newly accepted client connections.
  • Track and propagate the correct errno from SE_msg_send() / SE_msg_send_data() failures.
  • Introduce centralized send-failure handling for notification sends, including client health tracking and forced disconnect on hard failures.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
source/sysevent/server/worker_threads.c Adds handle_client_send_failure() and uses it on notification send failures.
source/sysevent/server/syseventd_main.c Applies SO_SNDTIMEO/SO_RCVTIMEO to newly accepted client sockets.
source/sysevent/lib/libsysevent.c Preserves the last socket write errno on send failure by restoring errno before returning -1.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

printf("Thread %d: hard send failure on client fd %d (%s). errno=%d %s. Disconnecting immediately\n",
id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno));
)
CLI_MGR_remove_client_by_fd(fd, who, 1);
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the hard-failure path, CLI_MGR_remove_client_by_fd(fd, who, 1) uses force=1, which bypasses the (fd,id) consistency check. In a multi-threaded server where fds can be closed and reused, forcing removal increases the risk of removing the wrong client entry. Prefer force=0 here (you already pass the client token) so the client manager can verify the id matches before removal.

Suggested change
CLI_MGR_remove_client_by_fd(fd, who, 1);
CLI_MGR_remove_client_by_fd(fd, who, 0);

Copilot uses AI. Check for mistakes.
SE_INC_LOG(ERROR,
int id = thread_get_id(worker_data_key);
printf("Thread %d: client fd %d disconnected after repeated send timeouts\n", id, fd);
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CLI_MGR_handle_client_error_by_fd() can forcibly disconnect a client when the error threshold is exceeded, but that function only free_a_client() + close(fd) and does not remove any trigger notification registrations (unlike CLI_MGR_remove_client_by_fd, which does). With the new send-timeout path calling CLI_MGR_handle_client_error_by_fd() frequently, this can leave stale notification actions behind after a timeout-driven disconnect. Consider updating the disconnect behavior to also unregister notifications (e.g., by invoking the same trigger cleanup done in CLI_MGR_remove_client_by_fd, or by centralizing disconnect logic in one place).

Suggested change
)
)
/*
* When the error threshold is exceeded, CLI_MGR_handle_client_error_by_fd()
* will disconnect the client (free + close), but it does not unregister
* any trigger notification registrations. Ensure we also run the full
* disconnect path (including trigger cleanup) just as in the hard
* send-failure case below.
*/
CLI_MGR_remove_client_by_fd(fd, who, 1);

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +117
static void handle_client_send_failure(const int fd, const token_t who, const char *context)
{
int send_errno = errno;

if (EAGAIN == send_errno || EWOULDBLOCK == send_errno) {
int threshold_rc = CLI_MGR_handle_client_error_by_fd(fd);
SE_INC_LOG(ERROR,
int id = thread_get_id(worker_data_key);
printf("Thread %d: send timeout to client fd %d (%s). errno=%d %s threshold_rc=%d\n",
id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno), threshold_rc);
)
if (1 == threshold_rc) {
SE_INC_LOG(ERROR,
int id = thread_get_id(worker_data_key);
printf("Thread %d: client fd %d disconnected after repeated send timeouts\n", id, fd);
)
}
return;
}

if (EPIPE == send_errno || ECONNRESET == send_errno || ENOTCONN == send_errno) {
SE_INC_LOG(ERROR,
int id = thread_get_id(worker_data_key);
printf("Thread %d: hard send failure on client fd %d (%s). errno=%d %s. Disconnecting immediately\n",
id, fd, (NULL == context) ? "unknown" : context, send_errno, strerror(send_errno));
)
CLI_MGR_remove_client_by_fd(fd, who, 1);
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter name who in handle_client_send_failure() is misleading here: the function is operating on a target client identified by fd, and the who value is only used as the client token/id for removal. Given the surrounding code already has both a token_t id (client token) and multiple int id (thread id) variables in log scopes, this is easy to misuse. Renaming the parameter to something like client_id (and updating call sites) would reduce the chance of passing the thread id or message sender token by mistake.

Copilot uses AI. Check for mistakes.
if (-1 == set_fd_socket_timeouts(newsockfd, 200)) {
ulogf(ULOG_SYSTEM, UL_SYSEVENT,
"Unable to set client fd %d socket timeouts. (%d) %s",
newsockfd, errno, strerror(errno));
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If set_fd_socket_timeouts() fails, the code logs but continues keeping the new client connection alive in fully-blocking mode. Since the motivation of this PR is to prevent worker starvation on slow/stalled clients, it would be safer to reject the connection when timeouts cannot be applied (e.g., close newsockfd and remove the just-created client entry), or alternatively set the socket to non-blocking + use select/poll-based I/O for this connection.

Suggested change
newsockfd, errno, strerror(errno));
newsockfd, errno, strerror(errno));
/* Reject the connection if we cannot apply socket timeouts,
* to avoid blocking workers on a fully-blocking client.
*/
reply_msg_body->status = htonl(-1);
reply_msg_body->token_id = htonl(TOKEN_NULL);
SE_msg_send(newsockfd, reply_msg_buffer);
SE_INC_LOG(MESSAGES,
printf("Sent %s (%d) to client on fd %d\n",
SE_print_mtype(SE_MSG_OPEN_CONNECTION_REPLY), SE_MSG_OPEN_CONNECTION_REPLY, newsockfd);
)
SE_INC_LOG(MESSAGE_VERBOSE,
SE_print_message_hdr(reply_msg_buffer);
)
close(newsockfd);
continue;

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor

@anoopchelakkode anoopchelakkode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can this issue be reproduced? Before merging this code, please share the reproduction and validation results.

@tinaelizabeth84
Copy link
Copy Markdown

@mohamedakif1 Can you please address the query from Anoop?

@mohamedakif1
Copy link
Copy Markdown
Author

how can this issue be reproduced? Before merging this code, please share the reproduction and validation results.

@anoopchelakkode you can just use this simulation script which puts syseventd under stress. and issue will be reproduced starting 210th cycle or 220th cycle.. slowness in response will be visible starting 120th cycle around.
the normal fd count is 49-53 in normal conditions... in below test you would see after some cycles, the fd count just keeps on incrementing and then finally will exhaust the total ulimit 1024 fd.
I have verified this script on vanilla build on rpi and was able to reproduce the issue.
`#!/bin/bash

for i in $(seq 1 500); do

sysevent set wan-status stopped

sysevent set wan-status started

sysevent get firewall-restart

echo "Cycle $i complete"

echo "FD count"
ls /proc/$(pidof syseventd)/fd | wc -l`

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

community-contribution Contribution from community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants