Support load FP8 model on HPU #1449
Conversation
Signed-off-by: yiliu30 <yi4.liu@intel.com>
for more information, see https://pre-commit.ci
Pull request overview
Enable loading FP8 models on Habana HPU by faking CUDA capability checks during from_pretrained, and add an HPU-focused FP8 quantization test.
Changes:
- Add a context manager to temporarily report CUDA availability on HPU and override CUDA device capability checks.
- Wrap model loading with these context managers when HPEx is available to support FP8 model loading on HPU.
- Add an HPU test validating FP8 quantization output weights and basic numerics (a sketch of this kind of check follows this list).
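A minimal sketch of the kind of dtype and NaN/Inf check such a test might perform (illustrative only; the helper name and iteration pattern are assumptions, not the PR's actual test code):

```python
import torch

def check_fp8_weights(model) -> None:
    # Illustrative check: every FP8 weight should be finite once upcast to float32.
    for name, param in model.named_parameters():
        if param.dtype == torch.float8_e4m3fn:
            as_float = param.to(torch.float32)
            assert not torch.isnan(as_float).any(), f"NaN found in {name}"
            assert not torch.isinf(as_float).any(), f"Inf found in {name}"
```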
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| test/test_hpu/test_quant_fp8.py | New HPU test that quantizes small models to FP8 and validates dtype + NaN/Inf. |
| auto_round/utils/model.py | Wraps from_pretrained with HPU/CUDA-override contexts to enable FP8 loading on HPU. |
| auto_round/utils/device.py | Introduces fake_cuda_for_hpu context manager to temporarily force torch.cuda.is_available() to return True on HPU. |
| auto_round/compressors/base.py | Removes HPU-specific exclusion of FP8 layers to allow FP8 on HPU. |
def test_small_model_rtn_generation(self, model_name):
    ar = AutoRound(model_name, iters=0, scheme="FP8_STATIC", nsamples=16)
    model, folder = ar.quantize_and_save(output_dir=self.save_dir, format="llm_compressor")
Copilot AI · Feb 12, 2026
This test will likely fail in environments without HPU/HPEx because it unconditionally runs and attempts an FP8/HPU-specific flow. Add a skipif (or importorskip) guard so the test only runs when the HPU runtime is available (e.g., based on is_hpex_available() / HPU availability).
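A guard along these lines would address it (a sketch; the exact import path of is_hpex_available is an assumption):

```python
import pytest

# Assumed import path; adjust to wherever auto_round exposes this helper.
from auto_round.utils import is_hpex_available

# Skip the whole module when no HPU/HPEx runtime is present.
pytestmark = pytest.mark.skipif(
    not is_hpex_available(),
    reason="HPU/HPEx runtime not available",
)
```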
@@ -0,0 +1,35 @@
import os
Copilot AI · Feb 12, 2026
os is imported but not used in this new test file. Please remove it to keep the test minimal and avoid lint warnings.
Suggested change:
- import os
    trust_remote_code=trust_remote_code,
    device_map="auto" if use_auto_mapping else None,
)
if is_hpex_available():
Copilot AI · Feb 12, 2026
The CUDA-faking/capability-override is applied whenever HPEx is available, regardless of the selected device_str. This can unintentionally alter load-time behavior for non-HPU runs on machines that have HPEx installed. Consider additionally gating this block on device_str (e.g., only apply when loading for HPU) so other device paths aren’t affected.
Suggested change:
- if is_hpex_available():
+ if is_hpex_available() and device_str is not None and "hpu" in device_str:
@@ -339,6 +339,25 @@ def __exit__(self, exc_type, exc, exc_tb):
        return False
Copilot AI · Feb 12, 2026
This is a class but is named like a function (lower_snake_case). For clarity and consistency, consider either renaming it to a CapWords class name (e.g., FakeCudaForHpu) or converting it into a @contextmanager function named fake_cuda_for_hpu.
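For example, the @contextmanager spelling could look roughly like this (a sketch assuming the manager only patches torch.cuda.is_available; the import path of is_hpex_available is also an assumption):

```python
from contextlib import contextmanager

import torch

from auto_round.utils import is_hpex_available  # assumed import path

@contextmanager
def fake_cuda_for_hpu():
    """Temporarily report CUDA as available while running on HPU."""
    if not is_hpex_available():
        yield
        return
    orig_is_available = torch.cuda.is_available
    torch.cuda.is_available = lambda: True
    try:
        yield
    finally:
        # Always restore the original function, even if loading raises.
        torch.cuda.is_available = orig_is_available
```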
        if is_hpex_available():
            self._orig_is_available = torch.cuda.is_available
Copilot AI · Feb 12, 2026
This mutates a global function (torch.cuda.is_available) process-wide, which can cause surprising behavior if other threads/tasks call CUDA checks while this context is active. If possible, prefer a safer patching approach (e.g., unittest.mock.patch scoped to the smallest block) and keep the patched window as short as possible.
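A narrower-scoped alternative might look like this (a sketch; the model path is a placeholder, and this still patches a process-global attribute, only for the shortest possible window):

```python
from unittest import mock

from transformers import AutoModelForCausalLM

model_name = "path/to/fp8-checkpoint"  # placeholder, not a real model id

# Patch only around the single call that needs the fake CUDA report;
# mock.patch restores torch.cuda.is_available automatically on exit.
with mock.patch("torch.cuda.is_available", return_value=True):
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
```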
Description
Please briefly describe your main changes and the motivation.
Type of Change
Related Issues
Fixes or relates to #
Checklist Before Submitting