Skip to content

RemoteLogDownloadFuture::on_complete lock-order inversion can deadlock #200

@zhaohaidao

Description

@zhaohaidao

Search before asking

  • I searched in the issues and found nothing similar.

Please describe the bug 🐞

Summary

RemoteLogDownloadFuture::on_complete acquires completion_callbacks before result, while the completion task acquires result before completion_callbacks. This lock-order inversion can deadlock when registration and completion race.

Impact

  • Remote log downloads can hang permanently.
  • Read path can stall when waiting for remote segment completion.

Root Cause

Two code paths acquire the same locks in opposite order:

  • on_complete: completion_callbacks -> result
  • completion task: result -> completion_callbacks

Discussion Notes

  • The completion task is the tokio::spawn in RemoteLogDownloadFuture::new that sets result and drains completion_callbacks.
  • Today there is only one call site (scanner.rs) registering a callback, but the list is meant to allow multiple subscribers and keep Java-like onComplete semantics.
  • let is_done = self.result.lock().is_some(); releases the lock at the end of the statement. If we rely on that check, we must hold the lock across registration to avoid missing completion.

Reproduction (Race Scenario)

  1. Completion task locks result and is about to lock completion_callbacks.
  2. Another task calls on_complete, locks completion_callbacks, then tries to lock result.
  3. Both wait on each other, deadlock.

Solution

Fix Options

  1. Consistent lock order (minimal change, recommended).
    Make on_complete lock result first, keep it held while registering, then lock completion_callbacks. This matches the completion task order and avoids both deadlock and missed callbacks.
  2. Single combined mutex (simpler correctness, larger refactor).
    Replace the two mutexes with one Mutex<DownloadState { result, callbacks }>; completion drains callbacks after setting result.

Recommended Fix

Option 1 (consistent lock order) because it is the smallest change and fixes the deadlock without changing API or structure.

Are you willing to submit a PR?

  • I'm willing to submit a PR!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions