We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:
https://github.com/Rinzwind/OSSubprocessTest
The example:
Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:
- The line
Did #run (m: 2542, pid: 6480) shows that a ‘curl’ process was spawned with PID 6480
- The lines
6480 Z 00:00:00 \_ curl <defunct> from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process)
- There is no corresponding line
Did #waitForExit (m: 2542) showing a return from #waitForExit
- The only remaining output is from the ‘ps’ command
Note there’s also no more line Will #checkFinishedChildren after the last Did #run. As far as I understand, the child watcher in OSSVMProcess should be sending #checkFinishedChildren at least once a second:
|
childWatcher := [[ |
|
"OSProcess authors suspected that there were various ways in which OS signals |
|
could be missed under conditions of heavy load. For that reason, we use |
|
#waitTimeoutMSecs: with the semaphore, so that if a signal is missed, |
|
we time out and rescan occasionally anyway (#updateActiveChildrenAndNotifyDead |
|
sends queryExitStatus which sends waitpid() ) |
|
" |
|
self sigChldSemaphore waitTimeoutMSecs: 1000. |
|
processSynchronizationDelay wait. "Avoids lost signals in heavy process switching" |
|
self checkFinishedChildren. |
|
] repeat] newProcess priority: Processor highIOPriority. |
So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.
Sample output:
[…]
+ /home/packer/smalltalkCI-master/bin/smalltalkci --smalltalk Pharo64-9.0 .smalltalk.ston
[…]
Downloading the latest 90 Image:
http://files.pharo.org/get-files/90/pharoImage-x86_64.zip
[…]
Downloading the latest pharoVM:
http://files.pharo.org/get-files/90/pharo-vm-Linux-x86_64-stable.zip
[…]
Build information: Pharo-9.0.0+build.1575.sha.9bb5f998e8a6d016ec7abde3ed09c4a60c0b4551 (64 Bit)
[…]
Running suite "Pharo64-9.0" with 1 tests.
[…]
Did #run (m: 2541, pid: 6478)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT
Test
Will #checkFinishedChildren
Did #checkFinishedChildren
Did #waitForExit (m: 2541)
Did #run (m: 2542, pid: 6480)
HTTP/1.1 200 OK
Content-Length: 5
Content-Type: text/plain
Server: Zinc HTTP Components 1.0 (Pharo/9.0)
Date: Wed, 03 Aug 2022 19:16:13 GMT
Test
Did sleep (n: 26)
PID STAT TIME COMMAND
1230 Sl 00:00:10 pharo
1245 S 00:00:00 \_ bash
6480 Z 00:00:00 \_ curl <defunct>
Did sleep (n: 27)
PID STAT TIME COMMAND
1230 Sl 00:00:10 pharo
1245 S 00:00:00 \_ bash
6480 Z 00:00:00 \_ curl <defunct>
[…]
We’re having a problem on our Jenkins server: test runs often hang. It seems to be related to OSSubprocess. I tried to capture it in a small example:
https://github.com/Rinzwind/OSSubprocessTest
The example:
TestRequestHandler>>#handleFiltered:)#checkFinishedChildren(see:BaselineOfTest>>#postLoad)TestTestCase>>#test)Below is a sample from the output we get on our Jenkins server. It shows that the test hangs:
Did #run (m: 2542, pid: 6480)shows that a ‘curl’ process was spawned with PID 64806480 Z 00:00:00 \_ curl <defunct>from the ‘ps’ command show that that process has terminated but that its status has not been consumed yet (it’s a zombie process)Did #waitForExit (m: 2542)showing a return from#waitForExitNote there’s also no more line
Will #checkFinishedChildrenafter the lastDid #run. As far as I understand, the child watcher in OSSVMProcess should be sending#checkFinishedChildrenat least once a second:OSSubprocess/repository/OSSubprocess/OSSVMProcess.class.st
Lines 234 to 244 in 67ded10
So far I’ve only been able to reproduce this on our Jenkins server though, making it a little hard to debug.
Sample output: