mpirun can mangle tagged output lines, so use heuristics to fix that. #510
|
A general remark: launching tasks on HPC systems poses a really large and complex problem space, and it does require complex solutions. I am not sure, though, that adding solution complexity to a single shell script is a viable route in the long run. Don't get me wrong - I love shell scripting for its directness, performance and conciseness - but maintainability and readability are not features of shell, and slowly growing the launcher into a single, non-modular and/or large shell script is not a route I would recommend, really. Obviously I am biased, as we have been there and done that in RP, too ;-) Our approach at the moment is to shove the complexities into modular Python code and to generate small, readable and self-contained shell scripts on the fly. I have been wondering for quite some time whether we should try to extract that code from RP, remove all dependencies, and make it usable for psi/j. I'd love to have a discussion about that at some point... On to the problem at hand: Yes, I agree, finding the tag in the middle of a line is unlikely. But even so, it remains messy. Is it worth the effort? The original motivation was that Having said all that: the code seems to be correct and seems to address the stated problem, so I'd probably approve the PR ;-) PS.: XML? I would happily avoid that bottomless pit... |
I am very much in agreement with that statement. I believe that if you seek correctness, you should probably stay as far away as possible from non-formal languages. Even Python is, to me, a step in the wrong direction. We could dismiss worries about sprawling shell scripts by noting that, after testing on a relatively diverse set of machines, the likelihood of needing to extend the script significantly in the future appears low. That said, the file_staging branch does, unfortunately, resort to shell scripts to emulate staging when it's not natively supported at the level defined by PSI/J (which ends up being pretty much everywhere). Even there, with the specification somewhat complete, it is unlikely that significant additions will be needed. All that said, there might be clever solutions that I have not thought of, so suggestions are welcome. PS: Thanks for taking a look at this. |
|
Most of the issues were popping up in the file_staging branch because of its additional tests involving cat/stdout gymnastics. Since merging this branch into file_staging, it looks like those issues are not showing up any more. |
Of course, the saga is not over.
When specifying --tag-output, mpirun is supposed to "tag each line" with [jobid, rank]<stdxxx>:. It mostly does. However, it occasionally does something else. Assume that a.txt and b.txt contain ABCD and EFGH, respectively. Running mpirun --tag-output -n 1 cat a.txt b.txt mostly produces
Occasionally, the following shows up instead:
That is indistinguishable from b.txt having contained [1,0]<stdout>:EFGH. My guess is that this is due to a brief delay that cat introduces between the files. This can be verified by adding more files for cat and seeing all kinds of combinations of tags popping up in the middle of a line.
One solution is to use heuristics: consider an output line to begin with the tag, and assume that it is very unlikely for the application itself to produce the tag in the middle of a line. Hence, we can filter on lines that start with the tag and then remove any other tags that appear in the middle. This should significantly reduce the likelihood of random mishaps, but it trades them for less likely, deterministic mishaps (e.g., running echo "[1, 0]<stdout>:bla" through mpirun).
Another choice is --xml. Unfortunately, parsing XML using only POSIX tools is difficult, and many simplifying assumptions have to be made. Nonetheless, that branch appears to work fine with OpenMPI 4, so, perhaps, the loss in clarity might not outweigh the benefits.
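For reference, the tag-filtering heuristic described above can be sketched in a few lines of POSIX shell. This is only an illustration and not the code in this PR: the function name filter_tagged is made up, the tag pattern is an assumption based on the [jobid, rank]<stdxxx>: form quoted above, and the sample input line is a hypothetical instance of a mid-line tag.

```shell
#!/bin/sh
# Hypothetical helper (not the PR's actual implementation).
# Reads mpirun --tag-output text on stdin and, for one stream
# ("stdout" or "stderr"):
#   1. keeps only lines that START with that stream's tag, and
#   2. deletes every occurrence of the tag on those lines,
#      including tags that ended up in the middle of a line.
filter_tagged() {
    ft_stream="$1"
    # Assumed tag shape: [<digits>,<optional spaces><digits>]<stream>:
    ft_tag='\[[0-9][0-9]*, *[0-9][0-9]*]<'"$ft_stream"'>:'
    sed -n "/^$ft_tag/{s/$ft_tag//g;p;}"
}

# A line with a tag in the middle is joined back together:
printf '%s\n' '[1,0]<stdout>:ABCD[1,0]<stdout>:EFGH' | filter_tagged stdout
# prints: ABCDEFGH
```

The deterministic mishap mentioned above is visible here as well: if the application itself writes [1, 0]<stdout>:bla, the tagged line [1,0]<stdout>:[1, 0]<stdout>:bla comes out as just bla. Note also that this sketch only strips tags of the selected stream; a tag from the other stream in the middle of a line is left untouched.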