
CI: Detect FAILURE in halo exchange test if any #321

Closed
leo-automation wants to merge 1 commit into master from leo/peer-halo-exchange-test-propagate-errors

Conversation

@leo-automation
Collaborator

This change propagates errors from the halo exchange test to CI.
Example of a run where the failures were not detected and the run was marked green: https://github.com/ROCm/apex/actions/runs/23252552706/job/67612063034?pr=320
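For illustration only, here is a minimal sketch of one way a CI step could turn a logged failure into a failing job. It is not the change made in this PR: the helper name `run_and_check` is hypothetical, and it assumes the test prints the literal marker `FAILURE` to its output when something goes wrong.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from this PR): run a command, echo its output,
# and return non-zero if the command fails OR its output contains the
# assumed marker string "FAILURE".
run_and_check() {
    local log status=0
    log="$(mktemp)"

    # Capture stdout and stderr; record the exit status without
    # tripping `set -e` in the calling script.
    "$@" >"$log" 2>&1 || status=$?

    # Keep the full output visible in the CI log.
    cat "$log"

    # A zero exit status is not trusted on its own: scan for the marker.
    if [ "$status" -eq 0 ] && grep -q "FAILURE" "$log"; then
        status=1
    fi

    rm -f "$log"
    return "$status"
}
```

A workflow step could then call `run_and_check` followed by the test command, so that the step fails both on a non-zero exit code and on a `FAILURE` line in the output.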

Comment thread .github/workflows/rocm-ci.yml
Collaborator

@jithunnair-amd left a comment


@amd-sriram https://github.com/ROCm/apex/actions/runs/23303497615/job/67780836893?pr=321 runs the halo exchange tests for 10-11 hours! That's not tenable at all. We need to reduce the runtime of these tests, or disable them in the meantime if a resolution is not straightforward.

@amd-sriram
Collaborator

@jithunnair-amd @leo-amd We could check whether the assert statement reduces the time taken by the halo test. If it doesn't, we can disable the test in the meantime.

@jithunnair-amd
Collaborator

Closing this in favor of #323

3 participants