You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
PR #601 attempted to fix the original `AttributeError` from #594 and grew through 17 review iterations into a comprehensive set of fixes for ~30 race conditions and edge cases around autolock timer persistence, cancellation, and config-entry reload. Each fix exposed a narrower interleaving and the review loop never converged. Closed in favor of a clean redesign.
Root cause
`KeymasterTimer` conflates three layers with different lifetimes into one object backed by nullable fields and ad-hoc flags:
Persistent store — `_store`, `_store_lock`, `_timer_id`. Owned by the coordinator; lifetime spans HA restarts.
Scheduled callbacks — `_unsub_events`, `_inflight_tasks`, `_call_action`. Lifetime is one autolock cycle.
kmlock binding — `hass`, `_kmlock`. Lifetime is one config-entry-load cycle.
Layer crossings (e.g. detach during in-flight callback, setup rescue from preserved state, force-persist on detach) aren't atomic; they're scattered through narrative code with state transitions implicit in field-nulling patterns. Every reload race is a layer-crossing window.
Scope of bugs the original PR identified (kept as test cases for the redesign)
`_persist_to_store` / `cancel` race producing `AttributeError` on `_end_time.isoformat()` (the original ISSUE: Auto Lock Timer "Unknown" #594 symptom)
Cross-timer overwrite when multiple timers share `_timer_store`
`setup()` reading stale empty store while outgoing timer's persist is queued
Two concurrent `setup()`s on same timer_id scheduling recovery twice
`_on_expired` racing with `detach()` mid-action, mutating orphaned kmlock
`_on_expired` storage failure dropping the safety-critical action
`detach()` not awaiting in-flight callbacks → mutations lost
`detach()` losing pending `start()` state when queued persist hasn't run
class ScheduledCallback:
"""Just the async_call_later wrapper. Awaitable cancellation."""
async def cancel_and_wait(self) -> None: ... # awaits in-flight
class KeymasterTimer:
"""Just the orchestration. Wraps a TimerStore + at most one ScheduledCallback."""
state: Literal["fresh", "active", "firing", "done", "detached"]
```
Explicit state machine
```
fresh → active (start)
active → firing (callback fires)
firing → done (action+cleanup ok) | active (action raised, replay)
active → done (cancel)
→ detached (handed off to replacement)
```
Invalid transitions raise. No nullable fields proxying for state.
Ownership transfer instead of state-copy in `_update_lock`
Instead of mutating `new` from `old` and detaching the old timer, transfer the `KeymasterTimer` object from `old.autolock_timer` to `new.autolock_timer` and rebind its `_kmlock` reference. Most reload races vanish because there's no "swap the kmlock instance and surgically copy state" moment.
Action firing keyed by config_entry_id, not by reference
`_call_action` becomes a closure that looks up the live kmlock by config_entry_id at fire time, not a `functools.partial` capturing the kmlock instance. Eliminates the "action mutates orphaned kmlock" class of bugs entirely.
Background
PR #601 attempted to fix the original `AttributeError` from #594 and grew through 17 review iterations into a comprehensive set of fixes for ~30 race conditions and edge cases around autolock timer persistence, cancellation, and config-entry reload. Each fix exposed a narrower interleaving and the review loop never converged. Closed in favor of a clean redesign.
Root cause
`KeymasterTimer` conflates three layers with different lifetimes into one object backed by nullable fields and ad-hoc flags:
Layer crossings (e.g. detach during in-flight callback, setup rescue from preserved state, force-persist on detach) aren't atomic; they're scattered through narrative code with state transitions implicit in field-nulling patterns. Every reload race is a layer-crossing window.
Scope of bugs the original PR identified (kept as test cases for the redesign)
Proposed redesign
Layer separation
```python
class TimerStore:
"""Just persistence. Owns the asyncio.Lock. Atomic claim/release/load."""
async def claim(self, timer_id: str) -> TimerEntry | None: ...
async def write(self, timer_id: str, entry: TimerEntry) -> None: ...
async def remove(self, timer_id: str) -> None: ...
class ScheduledCallback:
"""Just the async_call_later wrapper. Awaitable cancellation."""
async def cancel_and_wait(self) -> None: ... # awaits in-flight
class KeymasterTimer:
"""Just the orchestration. Wraps a TimerStore + at most one ScheduledCallback."""
state: Literal["fresh", "active", "firing", "done", "detached"]
```
Explicit state machine
```
fresh → active (start)
active → firing (callback fires)
firing → done (action+cleanup ok) | active (action raised, replay)
active → done (cancel)
```
Invalid transitions raise. No nullable fields proxying for state.
Ownership transfer instead of state-copy in `_update_lock`
Instead of mutating `new` from `old` and detaching the old timer, transfer the `KeymasterTimer` object from `old.autolock_timer` to `new.autolock_timer` and rebind its `_kmlock` reference. Most reload races vanish because there's no "swap the kmlock instance and surgically copy state" moment.
Action firing keyed by config_entry_id, not by reference
`_call_action` becomes a closure that looks up the live kmlock by config_entry_id at fire time, not a `functools.partial` capturing the kmlock instance. Eliminates the "action mutates orphaned kmlock" class of bugs entirely.
Validation
Scope
Single PR. Larger than #601 as a diff but the architecture change is the point — incremental patches won't get there.