Skip to content

fix(core): apply isolation tests for NVMe quirks instead of global pa…#3372

Open
winshaurya wants to merge 3 commits into
Start9Labs:masterfrom
winshaurya:winshaurya-nvme-quirk-fix
Open

fix(core): apply isolation tests for NVMe quirks instead of global pa…#3372
winshaurya wants to merge 3 commits into
Start9Labs:masterfrom
winshaurya:winshaurya-nvme-quirk-fix

Conversation

@winshaurya

Copy link
Copy Markdown

This PR refactors the previous approach in #3338 to directly address the feedback from @helix-nine on issue #3336.

Changes:

  1. Removed Global Overrides: Rather than globally disabling APST and HMB on all StartOS machines (which could cause power-consumption regressions for unaffected users), this PR relies on targeted isolation testing first.
  2. Added Isolation Script: Added projects/start-os/scripts/nvme-isolation-test.sh. This script is designed for the reporter (otech47) and other affected users to easily and safely cycle through kernel parameters (nvme_core.default_ps_max_latency_us=0, nvme.max_host_mem_size_mb=0, pcie_aspm=off) to find the exact root cause of their controller resets.
  3. Fixed Validation Logic: Moved the NVMe quirk validation logic to the new shared-libs layout (shared-libs/crates/start-core/src/system/nvme.rs) and wired it into init.rs.
  4. Fixed HMB Checks: The validation now correctly checks /proc/cmdline for HMB flags rather than /etc/modprobe.d/, because the NVMe driver is built-in and modprobe directives are ignored.
  5. Repo Standards: Adhered strictly to StartOS repository standards (no em dashes in comments, proper rust code formatting).

Next Steps:
Once otech47 runs the isolation script and reports back which specific kernel flag stops the controller resets under load, we can add a targeted PCI-ID quirk just for that drive/controller.

…rams

Refactor the PR Start9Labs#3338 to address feedback from helix-nine on issue Start9Labs#3336.
Instead of globally disabling APST and HMB on all StartOS machines, this adds an isolation test script (scripts/nvme-isolation-test.sh) to help users debug their specific drive issues without impacting other users. The Rust system validation has been updated to check for kernel command line quirks instead of modprobe params, as NVMe is typically built-in.

@helix-nine helix-nine left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reworking this from #3338. The isolation script is genuinely useful — that per-knob cycling is exactly the data we need from the reporter. But the nvme.rs validation should not ship as written, because it encodes a premise we can't support: that HMB is broken / a likely culprit on the Samsung 990 EVO Plus.

I went and reconfirmed this against upstream rather than take either side on faith:

  1. The 990 EVO Plus is DRAM-less and HMB is its primary cache (Samsung Piccolo controller, no onboard DRAM). HMB here isn't an optional power-saving feature — it's how the drive caches its mapping tables. nvme.max_host_mem_size_mb=0 degrades the drive; it's not a free safety knob.

  2. Mainline Linux has no HMB quirk for any Samsung device. In drivers/nvme/host/pci.c, the only quirk touching this drive's PCI ID (144d:a80d, shared by the 990 EVO and EVO Plus) is FORCE_NO_SIMPLE_SUSPEND — an s2idle suspend power-draw fix gated on specific DMI board names. There is no HMB-related quirk for the 990 EVO Plus, or any Samsung drive. If the kernel maintainers believed HMB wedged this controller, there would be one.

  3. The known reset-class issues on this drive family are power-state, not HMB — APST / ASPM / D3cold transitions. That's consistent with the upstream evidence noted on #3336 (disabling APST alone didn't stop the resets, which points at ASPM / power-delivery, not HMB).

  4. The real-world HMB problems are the opposite scenario: allocation failures on memory-constrained ARM boards (Raspberry Pi / low CMA), where the host can't supply the buffer. That's unrelated to a RAM-rich x86 StartOS server wedging under sustained writes.

Concrete asks:

  • Drop the init-time HMB warning. As written, check_nvme_quirks_impl defaults hmb_disabled = Some(false), so the warning fires on every StartOS machine with an NVMe drive that lacks nvme.max_host_mem_size_mb=0 on the cmdline — i.e. nearly all of them — advising users to disable HMB based on an unproven theory. That's a noisy, performance-degrading nudge we shouldn't ship.
  • Hold the APST warning too until the isolation data actually implicates APST on the affected box. Same default-fires-everywhere concern.
  • Keep the script as pure diagnostics (collect + per-knob cycling). That's the right vehicle to find the root cause before we encode any conclusion in the OS.
  • Revert the em-dash -> hyphen edits in system/mod.rs — those are unrelated comment churn, not a repo standard.

tl;dr: HMB works on the 990 EVO Plus, and disabling it is neither indicated by upstream nor supported by the reporter's data so far. Let's let the isolation script tell us which single knob (if any) actually stops the resets before baking a fix — or a warning — into init.

Sources: kernel drivers/nvme/host/pci.c (torvalds/master); Samsung 990 EVO Plus datasheet; StorageReview teardown (Piccolo, DRAM-less/HMB); Framework FW13 990 EVO Plus power-state analysis.

winshaurya added a commit to winshaurya/start-os that referenced this pull request Jul 1, 2026
…ting churn

Addresses feedback on PR Start9Labs#3372 by removing the unproven HMB warning and
deferring the APST warning pending isolation test results. Also reverts the
em-dash -> hyphen changes in system/mod.rs.
@winshaurya winshaurya force-pushed the winshaurya-nvme-quirk-fix branch from 63064fe to 7aa93c8 Compare July 1, 2026 02:38
@winshaurya winshaurya force-pushed the winshaurya-nvme-quirk-fix branch from 7aa93c8 to 4689516 Compare July 1, 2026 02:48
@winshaurya

Copy link
Copy Markdown
Author

@helix-nine Thanks , you're right that HMB is critical for DRAM-less drives and a default warning would just degrade performance for unaffected users.

I've pushed an update to drop the default-firing HMB and APST warnings, while also reverting the accidental formatting changes in mod.rs.

The isolation script is left intact so we can use it purely as a diagnostic tool to gather the actual data from the reporter before baking any fix into the OS

ps: You might notice an orphaned commit in the PR timeline; I did a quick force-push to keep the branch history clean and squashed down to a single commit.

Thank you

@dr-bonez

dr-bonez commented Jul 1, 2026

Copy link
Copy Markdown
Member

as it stands, the check_nvme_quirks on startup does not serve much of a purpose. Let's keep it to just the diagnostic script until we can get to the bottom of what @otech47 was experiencing

@helix-nine

Copy link
Copy Markdown
Contributor

Agreed with @dr-bonez — with the warnings gone, check_nvme_quirks just logs an info line at every boot and doesn't act on anything, so it's not worth carrying in the OS while the root cause is still open. Let's land this as the diagnostic script only.

To do that, please drop the init-time check entirely:

  • Remove the crate::system::nvme::check_nvme_quirks().await.log_err(); line from shared-libs/crates/start-core/src/init.rs
  • Remove pub mod nvme; from shared-libs/crates/start-core/src/system/mod.rs
  • Delete shared-libs/crates/start-core/src/system/nvme.rs

That leaves just projects/start-os/scripts/nvme-isolation-test.sh, which is the useful part — it lets @otech47 cycle each knob (APST / HMB / ASPM) separately and report which one, if any, actually stops the resets. Once we have that data we can add a targeted fix (ideally a PCI-ID quirk for the specific controller) grounded in evidence rather than a global param.

Thanks for the quick turnaround on the warnings and the mod.rs revert.

@winshaurya

Copy link
Copy Markdown
Author

@helix-nine @dr-bonez Thanks for the guidance! I've just pushed an update to drop the init-time checks and warnings entirely.

The OS validation logic (system/nvme.rs and its hooks in init.rs) has been completely removed. This PR now exclusively carries the diagnostic script (projects/start-os/scripts/nvme-isolation-test.sh) so we can safely gather the specific data we need from the reporter before encoding any fix into StartOS.

Let me know if this looks good to merge as a purely diagnostic tool!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants