
[AMD/ROCm] kimik2.5 int4 mi355x update #909

Closed

seungrokj wants to merge 5 commits into main from srok/kimik2.5int4_mi355x

Conversation

@seungrokj
Collaborator

seungrokj commented Mar 13, 2026

waiting for the optimized upstream docker image.

Regards,
Seungrok

Signed-off-by: seungrokj <seungrok.jung@amd.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

@functionstackx
Contributor

@seungrokj can you add in your perf-changelog that VLLM_ROCM_USE_AITER=1 works now due to PR vllm-project/vllm#35641?

seungrokj and others added 4 commits March 16, 2026 12:11
Signed-off-by: seungrokj <seungrok.jung@amd.com>
@functionstackx
Contributor

@seungrokj any updates on this PR?

@functionstackx
Contributor

@claude create a new PR that ports this PR over & uses vllm/vllm-openai-rocm:v0.18.0 image

@Klaud-Cold
Collaborator

Klaud-Cold commented Mar 26, 2026

Claude finished @functionstackx's task in 2m 59s


Porting PR #909 with vllm/vllm-openai-rocm:v0.18.0 image

Summary

Created PR #950 which ports all changes from this PR with the resolved upstream docker image:

  • Image: v0.15.1 → v0.18.0 in amd-master.yaml
  • AITER MLA: Added export VLLM_ROCM_USE_AITER=1
  • Benchmark script: Added --max-num-seqs 256, removed --disable-log-requests
  • perf-changelog: Added entry documenting all changes
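Put together, the ported configuration amounts to a launch command along these lines. This is a minimal sketch, not the recipe's exact script: the image tag, VLLM_ROCM_USE_AITER=1, and --max-num-seqs 256 come from this PR, while the device mounts are the standard ROCm docker flags and the model ID is an illustrative assumption.

```shell
# Pull the resolved upstream ROCm image referenced in amd-master.yaml (PR #950)
docker pull vllm/vllm-openai-rocm:v0.18.0

# Serve with AITER MLA enabled (VLLM_ROCM_USE_AITER=1 works as of
# vllm-project/vllm#35641). The model ID below is illustrative only;
# substitute the actual Kimi K2.5 int4 checkpoint used by the recipe.
docker run -it --rm \
  --device=/dev/kfd --device=/dev/dri \
  --group-add video --ipc=host \
  -e VLLM_ROCM_USE_AITER=1 \
  vllm/vllm-openai-rocm:v0.18.0 \
  --model moonshotai/Kimi-K2-Instruct \
  --tensor-parallel-size 8 \
  --max-num-seqs 256
```

With TP8 on an 8-GPU MI355x node this matches the configuration the later comments describe as aligned with internal measurements.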


---

Klaud-Cold pushed a commit that referenced this pull request Mar 26, 2026
Port changes from PR #909 with the resolved upstream docker image:
- Upgrade image from v0.15.1 to v0.18.0
- Enable AITER MLA (VLLM_ROCM_USE_AITER=1)
- Add --max-num-seqs 256, remove --disable-log-requests
- Update perf-changelog

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>
@seungrokj
Collaborator Author

hi @functionstackx I tested a few things to figure out the non-deterministic behavior of the kimik2 fp4 case, and it's now resolved. So I will now work on the "int4" model, or I'll update on top of Klaud-Cold's work on #950.

@functionstackx
Contributor

@seungrokj feel free to continue working on this! Appreciate your help!

@functionstackx
Contributor

I started a Klaud-Cold PR because I thought you were busy with other tasks and didn't want to delay adding Kimi K2.5 int4 MI355 AITER, which is already an easy win because it has great perf improvements.

@seungrokj
Collaborator Author

> I started a Klaud-Cold PR because I thought you were busy with other tasks and didn't want to delay adding Kimi K2.5 int4 MI355 AITER, which is already an easy win because it has great perf improvements.

Sure, I will work on this today!

@functionstackx
Contributor

hi @seungrokj

since #909 already passed validation, the improvements are an easy win, and we ideally want to show these improvements on the frontend ASAP, I am going to merge #909. For any additional changes, can you build on top of #909?

functionstackx added a commit that referenced this pull request Mar 27, 2026
…which has the AITER MLA patch for num_heads=8 (#950)

* [AMD/ROCm] kimik2.5 int4 mi355x: upgrade to vllm-openai-rocm:v0.18.0

Port changes from PR #909 with the resolved upstream docker image:
- Upgrade image from v0.15.1 to v0.18.0
- Enable AITER MLA (VLLM_ROCM_USE_AITER=1)
- Add --max-num-seqs 256, remove --disable-log-requests
- Update perf-changelog

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>

* Update perf-changelog PR link to #950

Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>

---------

Co-authored-by: claude[bot] <41898282+claude[bot]@users.noreply.github.com>
Co-authored-by: functionstackx <functionstackx@users.noreply.github.com>
Co-authored-by: seungrokj <seungrokj@users.noreply.github.com>
@seungrokj
Collaborator Author

@functionstackx yes, if we are just using TP8, then #909 should be solid. It also aligned with internal measurements on the vllm/vllm-openai-rocm:v0.18.0 image.

@functionstackx
Contributor

@seungrokj do you see anything better with TP4? If TP4 is on the Pareto frontier, feel free to add it.

@seungrokj
Collaborator Author

@functionstackx based on the fp4 case, int4 (same memory footprint) could have better throughput per GPU. I am testing this internally, and if it looks good I will raise a subsequent PR.

@functionstackx
Contributor

Thanks! Looking forward to your follow-up PR on whether TP4 is better or not.
