Skip to content

fix: add thread safety to SquidFilterWheel position tracking#520

Closed
hongquanli wants to merge 7 commits intomasterfrom
fix/filter-wheel-thread-safety
Closed

fix: add thread safety to SquidFilterWheel position tracking#520
hongquanli wants to merge 7 commits intomasterfrom
fix/filter-wheel-thread-safety

Conversation

@hongquanli
Copy link
Contributor

Summary

  • The GUI thread and acquisition worker thread both call set_filter_wheel_position without synchronization on _positions. When they race, both read the same stale position, compute wrong relative deltas, and the physical wheel ends up at an unexpected position. Once off by even one slot, every subsequent move is wrong until a manual re-home.
  • Add a threading.Lock to SquidFilterWheel that serializes all position-affecting operations (move, home, step, read) so the read-position → send-move → wait → update-position sequence is atomic.
  • Split _home_wheel into locked/unlocked variants to avoid deadlock when called from _move_to_position's timeout recovery path.

Test plan

  • All 8 existing filter wheel tests pass
  • New test_concurrent_moves_serialize — two threads race to set different positions; verifies deltas sum correctly (no position drift)
  • New test_home_during_move_serializes — home and move race; verifies final position is valid (no interleaving)
  • Manual test on hardware: run multi-channel acquisition while clicking filter wheel widget buttons — wheel should not drift

🤖 Generated with Claude Code

hongquanli and others added 4 commits March 23, 2026 00:31
The GUI thread and acquisition worker thread both call
set_filter_wheel_position without synchronization.  When they race, both
read the same stale position, compute wrong relative deltas, and the
physical wheel ends up at an unexpected position.  Once off by even one
slot, every subsequent move is wrong until a manual re-home.

Add a threading.Lock that serializes all position-affecting operations
(move, home, step, read) so the read → move → wait → update sequence is
atomic.  Split _home_wheel into locked/unlocked variants to avoid
deadlock when called from _move_to_position's timeout recovery path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Refactor _move_to_position into locked wrapper + _move_to_position_unlocked
(same pattern as _home_wheel), so _step_position can do bounds-check and
move under a single lock hold instead of lock-release-relock.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…dead move_w

- set_filter_wheel_position now does validate + move under a single lock
  hold via _move_to_position_unlocked, consistent with all other public
  entry points.
- Remove move_w() — grep confirms zero callers. It bypassed the lock and
  did not update _positions, so any caller would have desynced tracking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hongquanli hongquanli requested a review from Copilot March 24, 2026 07:51
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds thread-safe serialization around SQUID filter wheel position tracking to prevent concurrent GUI/worker operations from corrupting _positions and issuing incorrect relative moves.

Changes:

  • Introduces a threading.Lock to make read→move→wait→update sequences atomic across move/home/step/read operations.
  • Splits move/home helpers into locked vs unlocked variants to avoid deadlocks during timeout recovery.
  • Adds concurrent-access tests to validate serialization and prevent position drift.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
software/squid/filter_wheel_controller/cephla.py Adds a lock and refactors move/home methods to serialize operations and keep position tracking consistent under concurrency.
software/tests/squid/test_filter_wheel.py Adds new multithreaded tests intended to reproduce and prevent race-induced position drift.
software/control/ndviewer_light Updates submodule pointer (unrelated to filter wheel locking logic).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +177 to +183
step_size = 1.0 / 8
expected_total_delta = (final_pos - 1) * step_size
actual_total_delta = sum(move_deltas)
assert abs(actual_total_delta - expected_total_delta) < 1e-9, (
f"Delta mismatch: moves summed to {actual_total_delta}, "
f"but position {final_pos} requires {expected_total_delta}"
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test’s expected delta calculation doesn’t match the production logic: SquidFilterWheel computes delta using SCREW_PITCH_W_MM / slot_count, not 1.0/8. As written, this will only pass if the pitch happens to be 1.0 mm and slot_count is exactly 8; otherwise it will fail or become misleading. Compute step_size the same way as the controller (using the configured min/max indices and the same pitch constant used in _move_to_position_unlocked).

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Code] Fixed in 6e13063 - now uses SCREW_PITCH_W_MM / (config.max_index - config.min_index + 1) matching production logic.

Comment on lines +166 to +169
t1.join(timeout=5)
t2.join(timeout=5)

assert not errors, f"Threads raised: {errors}"
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test can pass even if a thread deadlocks/hangs, because join(timeout=...) is not followed by an assertion that threads actually finished. Add assertions like assert not t1.is_alive() / assert not t2.is_alive() (or join without a timeout) so the test fails deterministically on lock regressions.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Code] Fixed in 6e13063 - added assert not t1.is_alive() / assert not t2.is_alive() after both join calls in both tests.

t1.join(timeout=10)
t2.join(timeout=10)

# The operations must not interleave — home and move regions shouldn't overlap
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion doesn’t actually verify the stated requirement (“operations must not interleave”). It only checks that the final tracked position is one of two values, which could still be true even if home_* and move_* calls overlap. Since you already collect call_order, assert that the sequence is non-overlapping (e.g., home_start, home_end, move_start, move_end or the reverse) rather than only asserting the final position.

Suggested change
# The operations must not interleave — home and move regions shouldn't overlap
# The operations must not interleave — home and move regions shouldn't overlap
# Valid non-overlapping orders:
# home_start, home_end, move_start, move_end
# move_start, move_end, home_start, home_end
assert call_order in (
["home_start", "home_end", "move_start", "move_end"],
["move_start", "move_end", "home_start", "home_end"],
)
# Final position should be the result of either the home or the move

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Code] Fixed in 6e13063 - now asserts call_order matches one of two valid non-interleaving sequences. Note: the expected sequences are 6 elements (not 4) because _home_wheel_unlocked also calls _move_wheel for the offset move, producing an additional move_start/move_end pair.

@@ -311,8 +328,3 @@ def get_delay_ms(self) -> Optional[float]:
def close(self):
"""Close the filter wheel controller (no-op for SQUID)."""
pass
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR removes the move_w(self, delta: float) backward-compatibility method at the end of the class. That’s an API-breaking change for any existing callers relying on it. If this is still needed externally, consider keeping it as a thin wrapper (possibly marked deprecated) or provide a migration path in the same PR.

Suggested change
pass
pass
def move_w(self, delta: float, wheel_id: int = 1):
"""Deprecated backward-compatibility wrapper for moving the wheel.
This method exists for API compatibility with older code.
Prefer using :meth:`next_position`, :meth:`previous_position`,
or :meth:`set_filter_wheel_position` instead.
Args:
delta: Number of steps to move. Positive values move forward,
negative values move backward. Any fractional component is
discarded.
wheel_id: The wheel to move. Defaults to 1 for backward
compatibility with older single-wheel usage.
"""
# Log a deprecation warning to aid migration off this method.
_log.warning(
"SquidFilterWheel.move_w() is deprecated; use next_position(), "
"previous_position(), or set_filter_wheel_position() instead."
)
# No movement requested.
steps = int(delta)
if steps == 0:
return
direction = 1 if steps > 0 else -1
for _ in range(abs(steps)):
self._step_position(wheel_id, direction)

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Claude Code] Skipped - grep confirms zero callers of move_w() in the entire codebase. The suggested replacement also changes semantics (interprets delta as step count rather than mm distance). Adding a deprecated wrapper for dead code adds maintenance burden with no benefit.

hongquanli and others added 3 commits March 24, 2026 00:57
- Use SCREW_PITCH_W_MM and config bounds for step_size calculation
  instead of hardcoded 1.0/8 (comment #1)
- Add assert not t.is_alive() after join(timeout=...) to fail
  deterministically on deadlock (comment #2)
- Assert call_order sequence is non-interleaving, accounting for the
  offset move within _home_wheel_unlocked (comment #3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Critical: _home_wheel_unlocked now resets _positions to min_index on
exception so subsequent moves don't compute deltas from a stale value.
_move_to_position_unlocked wraps the re-home call to log re-home
failures distinctly from the original timeout, and the retry failure
message now warns that tracked position may not reflect physical state.

Tests: add 20ms simulated MCU delay via wait_till_operation_is_completed
side_effect so threads actually have a window to interleave — without
the lock these tests would now fail. Previously MagicMock completed
instantly, making the tests pass regardless of locking.

Cleanup:
- Remove dead _move_to_position() and _home_wheel() locked wrappers
  (zero callers after refactor — all entry points use _unlocked directly)
- Add "Caller must hold self._lock" to _move_wheel docstring
- Add "_configs is immutable after __init__" note in _step_position
- Fix "(smaller)" wording in test docstring

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The locked wrappers are gone, so no name collision. Simpler names.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hongquanli hongquanli closed this Mar 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants