fix: correct NVML_FI_* field IDs and add runtime v12/v13U1 remapping #137
The NVML_FI_PWR_SMOOTHING_* constants were using CUDA 13.0 Update 1 numbering (starting at 251) despite the crate declaring NVML API v12. This caused silent data corruption on CUDA 12 hosts: querying power smoothing fields would return clock event reason data instead.

NVIDIA broke ABI compatibility for field IDs 251-273 between CUDA 13.0 and 13.0 Update 1 (driver >= 580.82).

This commit:
- Fixes nvml.h and bindings.rs to use correct v12 numbering
- Adds 5 missing constants (CLOCKS_EVENT_REASON_*, POWER_SYNC_*)
- Shifts PWR_SMOOTHING constants from 251-268 to 256-273
- Detects driver version at init and transparently remaps field IDs when running on v13U1+ drivers (>= 580.82)

Callers are unaffected: field_values_for() handles the translation.

Fixes rust-nvml#134

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
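The detection and remapping steps described in this commit could look roughly like the sketch below. It is not the crate's code: `FieldIdScheme`, `parse_major_minor`, `detect_scheme`, and `translate` are hypothetical names. The commit only states that PWR_SMOOTHING moves between 251-268 and 256-273. Placing the five remaining fields at 269-273 in the same order, and falling back to v12 numbering on an unparseable version string, are assumptions made for illustration.

```rust
/// Which numbering the loaded driver uses for field IDs 251-273.
/// Hypothetical type, not taken from the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FieldIdScheme {
    V12,
    V13U1,
}

/// Parse a driver version such as "580.82.07" into (major, minor).
/// Returns None if either component is missing or not a number.
fn parse_major_minor(version: &str) -> Option<(u32, u32)> {
    let mut parts = version.trim().split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// Drivers at or above 580.82 use the v13U1 numbering.
/// Assumption: an unparseable version falls back to v12.
fn detect_scheme(driver_version: &str) -> FieldIdScheme {
    match parse_major_minor(driver_version) {
        // Tuples compare lexicographically: major first, then minor.
        Some(v) if v >= (580, 82) => FieldIdScheme::V13U1,
        _ => FieldIdScheme::V12,
    }
}

/// Translate a canonical v12 field ID into the ID the driver expects.
fn translate(v12_id: u32, scheme: FieldIdScheme) -> u32 {
    match (scheme, v12_id) {
        // Assumed: the five fields at 251-255 sit after the
        // PWR_SMOOTHING block on v13U1 drivers, in the same order.
        (FieldIdScheme::V13U1, 251..=255) => v12_id + 18,
        // From the commit: PWR_SMOOTHING is 256-273 in v12, 251-268 in v13U1.
        (FieldIdScheme::V13U1, 256..=273) => v12_id - 5,
        // v12 drivers and IDs outside the affected range pass through.
        _ => v12_id,
    }
}
```

Keeping the translation a pure function of the ID and the detected scheme means it can be unit tested without a GPU or driver present.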
This looks good and seems like a solid implementation to me. My two cents, for what it's worth: I recommend updating the test(s) for |
Expand translate_field_id_v13u1_remaps_affected_range to check every constant in the 251-273 range by name, and verify the mapping is a bijection (no collisions).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
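A bijection check of the kind this commit describes can be sketched as follows. This is not the crate's test: `is_bijection_on` and `remap_v13u1` are hypothetical names, and the placement of the five non-PWR_SMOOTHING fields at 269-273 is an assumption. The check relies on the fact that a collision-free map from a finite range into itself is necessarily onto.

```rust
use std::collections::HashSet;
use std::ops::RangeInclusive;

/// Returns true if `f` maps every ID in `range` back into `range`
/// without two inputs sharing an output.
fn is_bijection_on(range: RangeInclusive<u32>, f: impl Fn(u32) -> u32) -> bool {
    let mut seen = HashSet::new();
    for id in range.clone() {
        let out = f(id);
        // Fail on an output outside the range or on a repeated output.
        if !range.contains(&out) || !seen.insert(out) {
            return false;
        }
    }
    true
}

/// Hypothetical v12 -> v13U1 remap used only to exercise the check.
/// The 251-255 -> 269-273 placement is assumed, not taken from the source.
fn remap_v13u1(id: u32) -> u32 {
    match id {
        251..=255 => id + 18,
        256..=273 => id - 5,
        _ => id,
    }
}
```

Calling `is_bijection_on(251..=273, remap_v13u1)` then covers the whole affected range in one assertion, complementing the per-constant checks by name.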
Queries all 23 field IDs in the affected 251-273 range against real hardware. On a v13U1+ driver (>= 580.82), this exercises the translate_field_id remapping path end-to-end.

CLOCKS_EVENT_REASON fields should return throttle-reason data on most GPUs. PWR_SMOOTHING fields are Blackwell-only and expected to return NotSupported on older architectures; getting NotSupported (rather than wrong data) confirms the correct field was queried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@blthayer - good suggestions. Added some test coverage.
Given |
Show a table with NAME, V12_ID, DRIVER_ID, and RESULT for each field. Run the query once instead of 3x to keep output clean.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
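One way to render the four-column table named in this commit is sketched below. The `Row` struct and `render_table` function are hypothetical, and any rows passed in are the caller's own data. Nothing here reproduces the actual test output.

```rust
/// One line of the diagnostic table. Hypothetical struct for illustration.
struct Row {
    name: &'static str,
    v12_id: u32,
    driver_id: u32,
    result: String,
}

/// Render rows as a plain-text table with NAME, V12_ID, DRIVER_ID, RESULT.
fn render_table(rows: &[Row]) -> String {
    // Size the NAME column to the longest name, at least as wide as the header.
    let name_w = rows
        .iter()
        .map(|r| r.name.len())
        .max()
        .unwrap_or(0)
        .max("NAME".len());

    let mut out = format!(
        "{:<w$}  {:>6}  {:>9}  {}\n",
        "NAME", "V12_ID", "DRIVER_ID", "RESULT",
        w = name_w
    );
    for r in rows {
        out.push_str(&format!(
            "{:<w$}  {:>6}  {:>9}  {}\n",
            r.name, r.v12_id, r.driver_id, r.result,
            w = name_w
        ));
    }
    out
}
```

Showing the v12 ID next to the ID actually sent to the driver makes it easy to see at a glance whether the remapping was applied.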
Tested against an RTX 4090 running a recent driver; remapping appears to be working.
Summary
- Fix NVML_FI_PWR_SMOOTHING_* constants that used CUDA 13.0 Update 1 numbering despite the crate declaring NVML API v12, causing silent data corruption on CUDA 12 hosts
- Add 5 missing constants (CLOCKS_EVENT_REASON_*, POWER_SYNC_BALANCING_*) at IDs 251-255
- Shift PWR_SMOOTHING_* constants to their correct v12 positions (256-273)
- Detect the driver version in Nvml::init() to transparently remap field IDs when running on v13U1+ drivers (>= 580.82)

Background
NVIDIA broke ABI compatibility for field IDs 251-273 between CUDA 13.0 and 13.0 Update 1 (driver >= 580.82). See NVIDIA's known issues. The previous constants were inadvertently taken from a v13U1 source while the crate declares NVML_API_VERSION = 12.

The remapping is transparent: callers use the canonical v12 constants and field_values_for() handles translation based on the detected driver version.

Test plan
- Tests for detect_field_id_scheme
- Tests for translate_field_id covering v12 no-op, v13U1 remapping, and passthrough for unaffected IDs

Fixes #134
🤖 Generated with Claude Code