@davidradl
Contributor

We were getting an intermittent error in our CI pipeline, shown below. This change adds a short sleep so that the 2 slow tasks have time to start before the test checks that they are there. This approach is in line with other parts of this code.

[ERROR] Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 2.068 s <<< FAILURE! -- in org.apache.flink.runtime.scheduler.slowtaskdetector.ExecutionTimeBasedSlowTaskDetectorTest
#24 511.0 [ERROR] org.apache.flink.runtime.scheduler.slowtaskdetector.ExecutionTimeBasedSlowTaskDetectorTest.testBalancedInput -- Time elapsed: 0.011 s <<< FAILURE!
#24 511.0 java.lang.AssertionError:
#24 511.0
#24 511.0 Expected size: 2 but was: 0 in:
#24 511.0 {}
#24 511.0 at org.apache.flink.runtime.scheduler.slowtaskdetector.ExecutionTimeBasedSlowTaskDetectorTest.testBalancedInput(ExecutionTimeBasedSlowTaskDetectorTest.java:269)
#24 511.0 at java.base/java.lang.reflect.Method.invoke(Method.java:586)
#24 511.0 at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
#24 511.0 at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
#24 511.0 at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
#24 511.0 at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
#24 511.0 at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)

…wTaskDetectorTest

Signed-off-by: davidradl <david_radley@uk.ibm.com>
@flinkbot
Collaborator

flinkbot commented Jan 16, 2026

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

Signed-off-by: davidradl <david_radley@uk.ibm.com>
@davidradl
Contributor Author

We have hit 2 more intermittent unit test failures in this area. Fixing those here too.

[ERROR]   ExecutionTimeBasedSlowTaskDetectorTest.testNoFinishedTaskButRatioIsZero:81 
#24 546.3 Expected size: 3 but was: 0 i

and

ExecutionGraphRestartTest.testCancelWhileFailing:217 
#24 541.9 expected: RUNNING
#24 541.9  but was: FAILING

@davidradl
Contributor Author

Also, IBM Bob is adding a .metals folder to GitHub repos, so I have added this folder to .gitignore.

@davidradl changed the title from "[hotfix] Add sleep to stop intermittent failure in ExecutionTimeBasedSlowTaskDetectorTest" to "[hotfix] Add 3 sleeps to address intermittent unit test failures we have hit in ExecutionGraph tests" on Jan 19, 2026
Comment on lines 223 to 225
// Give time for the failure to be processed
Thread.sleep(10);

Contributor

We should not rely on timeouts in tests.
This is also mentioned in the contribution guide: https://flink.apache.org/how-to-contribute/code-style-and-quality-common/#avoid-timeouts-in-junit-tests
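
To illustrate the point, here is a minimal sketch of one alternative to a fixed `Thread.sleep(10)`: the test blocks on an explicit signal that the asynchronous step has completed, so it does not depend on how fast the CI machine is. The class `LatchInsteadOfSleep`, the `Status` enum, and the worker thread are hypothetical stand-ins for illustration only; they are not Flink APIs and not the code in this PR.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class LatchInsteadOfSleep {

    // Hypothetical stand-in for the state the test asserts on.
    public enum Status { RUNNING, FAILING }

    // Starts asynchronous work and waits for an explicit completion signal
    // instead of sleeping for a fixed amount of time.
    public static Status runAndAwaitFailure() throws InterruptedException {
        AtomicReference<Status> status = new AtomicReference<>(Status.RUNNING);
        CountDownLatch failureProcessed = new CountDownLatch(1);

        Thread worker = new Thread(() -> {
            // Simulates the failure being processed on another thread.
            status.set(Status.FAILING);
            // Signals the test thread that the state change is visible.
            failureProcessed.countDown();
        });
        worker.start();

        // Blocks until the worker has signalled; no local timeout is set here.
        failureProcessed.await();
        return status.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runAndAwaitFailure());
    }
}
```

Whether such a hook is available depends on the code under test; where the asynchronous work runs on an executor that the test controls, triggering it manually from the test thread may be a simpler option.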

Contributor Author

@davidradl Jan 19, 2026

OK, I was following the existing approach in the code. I also see that ExecutionGraphTestUtils.waitUntilJobStatus is effectively a method that waits with a timeout. Its comment says: // this is a poor implementation - we may want to improve it eventually

I will investigate to see if there is another way.
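
One possible direction, sketched here under stated assumptions: wait for the condition the test actually asserts on (for example, the expected number of slow tasks being present) rather than for a fixed duration. The helper `waitUntil` and the class `WaitUntilCondition` below are hypothetical and written only for this illustration; they are not the existing ExecutionGraphTestUtils method. The sketch deliberately sets no local deadline, on the assumption that a global CI timeout catches a hang.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BooleanSupplier;

public class WaitUntilCondition {

    // Hypothetical helper: polls until the condition holds.
    // The short sleep only avoids busy-spinning; correctness does not depend on its length.
    public static void waitUntil(BooleanSupplier condition) throws InterruptedException {
        while (!condition.getAsBoolean()) {
            Thread.sleep(1);
        }
    }

    // Simulates tasks that start asynchronously, then waits until all of them are visible.
    public static int countAfterTasksStarted(int expected) throws InterruptedException {
        Map<Integer, String> slowTasks = new ConcurrentHashMap<>();

        Thread starter = new Thread(() -> {
            for (int i = 0; i < expected; i++) {
                slowTasks.put(i, "slow-task-" + i);
            }
        });
        starter.start();

        // Wait for the observable state instead of guessing how long startup takes.
        waitUntil(() -> slowTasks.size() == expected);
        return slowTasks.size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(countAfterTasksStarted(2));
    }
}
```

Polling still involves short sleeps, but a slow machine only makes the test take longer; it does not make the assertion fail.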

@davidradl marked this pull request as draft January 19, 2026 14:40
@davidradl force-pushed the hotfixsleep branch 2 times, most recently from adb379d to e238c2f January 20, 2026 09:39
Signed-off-by: davidradl <david_radley@uk.ibm.com>
@davidradl changed the title from "[hotfix] Add 3 sleeps to address intermittent unit test failures we have hit in ExecutionGraph tests" to "[hotfix]Intermittent unit test failures we have hit in ExecutionGraph tests" on Jan 21, 2026
3 participants