@alessandrobologna
Contributor

Issue #, if available: #245

Description of changes:

  • Move completion event signaling to after execution state is updated from the checkpoint response.

  • Prevent a waiting user thread from running the second status check before new operations are added, avoiding duplicate STARTs and a stalled checkpoint thread.

  • Preserve existing checkpoint API semantics while closing the race window between completion signaling and state refresh.
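The ordering described above can be sketched as follows. This is a minimal illustration, not the SDK's actual code: `ExecutionState`, `process_checkpoint`, `run_once`, and the `"op-1"` / `"STARTED"` values are hypothetical names chosen for the example. It only shows the principle that the state is refreshed from the checkpoint response before the completion event is set, so a user thread that wakes on the event always sees the new operations.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class ExecutionState:
    """Hypothetical stand-in for the execution state (not the SDK's class)."""

    operations: dict = field(default_factory=dict)
    lock: threading.Lock = field(default_factory=threading.Lock)

    def update_from_response(self, new_operations: dict) -> None:
        # Merge the operations returned by the checkpoint call.
        with self.lock:
            self.operations.update(new_operations)

    def status_of(self, op_id: str):
        # Returns None when the operation is not known yet.
        with self.lock:
            return self.operations.get(op_id)


def process_checkpoint(state, response_operations, completion_event):
    # 1. Refresh the execution state from the checkpoint response first.
    state.update_from_response(response_operations)
    # 2. Only then signal completion to the waiting user thread.
    completion_event.set()


def run_once():
    state = ExecutionState()
    done = threading.Event()
    seen = {}

    def user_thread():
        # Blocks until the checkpoint thread signals completion.
        done.wait(timeout=5)
        # Second status check: must already see the refreshed state.
        seen["status"] = state.status_of("op-1")

    t = threading.Thread(target=user_thread)
    t.start()
    process_checkpoint(state, {"op-1": "STARTED"}, done)
    t.join()
    return seen["status"]
```

If the two steps in `process_checkpoint` were swapped, the user thread could wake up between them and read `None` for the operation, which in this sketch corresponds to the situation where a duplicate START would be issued.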

Tests:

  • ./.venv/bin/python -m pytest

Fixes [Bug]: Duplicate START from synchronous checkpoint race causes map to stall and timeout #245

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Move completion event signaling to after the execution state is updated
from the checkpoint response. This prevents a waiting user thread from
running the second status check before new operations are added, which
could lead to a duplicate START and stalled checkpoint thread.

This preserves the existing checkpoint API semantics while closing the
race window between completion signaling and state refresh.
Member

@yaythomas yaythomas left a comment

:shipit:

thank you so much for a great fix! welcome to dex!

@yaythomas yaythomas merged commit afd4083 into aws:main Jan 4, 2026
7 of 9 checks passed
@alessandrobologna
Contributor Author

thank you for the quick review and merge!

