Conversation

Contributor

@yiliu30 yiliu30 commented Feb 12, 2026

Description

Please briefly describe your main changes and the motivation.

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Copilot AI review requested due to automatic review settings February 12, 2026 08:55
@yiliu30 yiliu30 changed the title from "Support load FP8 on HPU" to "Support load FP8 model on HPU" Feb 12, 2026
Contributor

Copilot AI left a comment


Pull request overview

Enable loading FP8 models on Habana HPU by faking CUDA capability checks during from_pretrained, and add an HPU-focused FP8 quantization test.

Changes:

  • Add a context manager to temporarily report CUDA availability on HPU and override CUDA device capability checks.
  • Wrap model loading with these context managers when HPEx is available to support FP8 model loading on HPU (see the sketch after this list).
  • Add an HPU test validating FP8 quantization output weights and basic numerics.
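
A minimal sketch of that wrapping, assuming the fake_cuda_for_hpu context manager from auto_round/utils/device.py and the is_hpex_available() helper used elsewhere in this PR; the import path for is_hpex_available, the wrapper name, and the call site are illustrative assumptions, not the PR's literal code, and the second (device-capability) override is omitted for brevity:

from transformers import AutoModelForCausalLM

from auto_round.utils.device import fake_cuda_for_hpu  # context manager added in this PR
from auto_round.utils import is_hpex_available  # assumed import path


def load_model(model_name, **kwargs):
    if is_hpex_available():
        # Report CUDA as available only for the duration of from_pretrained,
        # so FP8 load-time checks pass on HPU.
        with fake_cuda_for_hpu():
            return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)
    return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)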

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File | Description
test/test_hpu/test_quant_fp8.py | New HPU test that quantizes small models to FP8 and validates dtype + NaN/Inf.
auto_round/utils/model.py | Wraps from_pretrained with HPU/CUDA-override contexts to enable FP8 loading on HPU.
auto_round/utils/device.py | Introduces the fake_cuda_for_hpu context manager to temporarily force torch.cuda.is_available() to return True on HPU.
auto_round/compressors/base.py | Removes the HPU-specific exclusion of FP8 layers to allow FP8 on HPU.

Comment on lines +22 to +24
def test_small_model_rtn_generation(self, model_name):
ar = AutoRound(model_name, iters=0, scheme="FP8_STATIC", nsamples=16)
model, folder = ar.quantize_and_save(output_dir=self.save_dir, format="llm_compressor")

Copilot AI Feb 12, 2026


This test will likely fail in environments without HPU/HPEx because it unconditionally runs and attempts an FP8/HPU-specific flow. Add a skipif (or importorskip) guard so the test only runs when the HPU runtime is available (e.g., based on is_hpex_available() / HPU availability).
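
A minimal sketch of such a guard, assuming is_hpex_available() is importable from auto_round.utils (the exact import path is an assumption):

import pytest

from auto_round.utils import is_hpex_available  # assumed import path

# Skip the whole module when the HPU runtime is not present.
pytestmark = pytest.mark.skipif(
    not is_hpex_available(), reason="requires Habana HPU runtime (HPEx)"
)

A module-level pytest.importorskip("habana_frameworks.torch") would achieve the same effect.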

@@ -0,0 +1,35 @@
import os

Copilot AI Feb 12, 2026


os is imported but not used in this new test file. Please remove it to keep the test minimal and avoid lint warnings.

Suggested change
import os

trust_remote_code=trust_remote_code,
device_map="auto" if use_auto_mapping else None,
)
if is_hpex_available():

Copilot AI Feb 12, 2026


The CUDA-faking/capability-override is applied whenever HPEx is available, regardless of the selected device_str. This can unintentionally alter load-time behavior for non-HPU runs on machines that have HPEx installed. Consider additionally gating this block on device_str (e.g., only apply when loading for HPU) so other device paths aren’t affected.

Suggested change
if is_hpex_available():
if is_hpex_available() and device_str is not None and "hpu" in device_str:

@@ -339,6 +339,25 @@ def __exit__(self, exc_type, exc, exc_tb):
return False



Copilot AI Feb 12, 2026


This is a class but is named like a function (lower_snake_case). For clarity and consistency, consider either renaming it to a CapWords class name (e.g., FakeCudaForHpu) or converting it into a @contextmanager function named fake_cuda_for_hpu.
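
A sketch of the @contextmanager variant suggested here, covering only the torch.cuda.is_available() patch visible in this diff (the capability override would follow the same save/patch/restore pattern); the import path for is_hpex_available is an assumption:

from contextlib import contextmanager

import torch

from auto_round.utils import is_hpex_available  # assumed import path


@contextmanager
def fake_cuda_for_hpu():
    """Temporarily report torch.cuda.is_available() as True when running on HPU."""
    if not is_hpex_available():
        yield  # no-op outside HPU environments
        return
    orig_is_available = torch.cuda.is_available
    torch.cuda.is_available = lambda: True
    try:
        yield
    finally:
        torch.cuda.is_available = orig_is_available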

Comment on lines +349 to +350
if is_hpex_available():
self._orig_is_available = torch.cuda.is_available

Copilot AI Feb 12, 2026


This mutates a global function (torch.cuda.is_available) process-wide, which can cause surprising behavior if other threads/tasks call CUDA checks while this context is active. If possible, prefer a safer patching approach (e.g., unittest.mock.patch scoped to the smallest block) and keep the patched window as short as possible.
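
A sketch of the mock.patch-based alternative, scoping the override to a single from_pretrained call; the wrapper name and call site are illustrative assumptions:

from unittest import mock

import torch
from transformers import AutoModelForCausalLM


def load_fp8_checkpoint_on_hpu(model_name, **kwargs):
    # mock.patch.object restores the original attribute when the block exits,
    # even on exceptions, and limits the patched window to one call.
    with mock.patch.object(torch.cuda, "is_available", return_value=True):
        return AutoModelForCausalLM.from_pretrained(model_name, **kwargs)

The patch is still visible process-wide while active, so the patched window should stay as short as this comment recommends.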

@yiliu30 yiliu30 added the hpu label Feb 12, 2026