Wrappers for iPic3D #223
Conversation
```c
DMTCP_PLUGIN_DISABLE_CKPT();
MPI_Request realRequest = VIRTUAL_TO_REAL_REQUEST(*request);
JUMP_TO_LOWER_HALF(lh_info.fsaddr);
if( realRequest == MPI_REQUEST_NULL ){
```
Isn't the if branch superfluous? Won't the lower half simply return with success if the request is MPI_REQUEST_NULL?
If so, why should MANA duplicate the logic of MPI?
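For reference, a minimal sketch of the pass-through pattern this comment suggests, with no special-casing of MPI_REQUEST_NULL in the upper half. The RETURN_TO_UPPER_HALF, NEXT_FUNC, and DMTCP_PLUGIN_ENABLE_CKPT names are assumed counterparts of the macros quoted above; they are illustrative and not taken from this PR's diff:

```c
// Illustrative sketch only: a pass-through MPI_Cancel wrapper that lets the
// lower-half MPI library decide how to handle the request, including the
// MPI_REQUEST_NULL case.  Macro names not visible in the quoted diff are
// assumptions about MANA's usual wrapper conventions.
int Cancel_wrapper_sketch(MPI_Request *request)
{
  int retval;
  DMTCP_PLUGIN_DISABLE_CKPT();
  MPI_Request realRequest = VIRTUAL_TO_REAL_REQUEST(*request);
  JUMP_TO_LOWER_HALF(lh_info.fsaddr);
  // No upper-half check for MPI_REQUEST_NULL: forward directly.
  retval = NEXT_FUNC(Cancel)(&realRequest);
  RETURN_TO_UPPER_HALF();
  DMTCP_PLUGIN_ENABLE_CKPT();
  return retval;
}
```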
```c
DMTCP_PLUGIN_DISABLE_CKPT();
MPI_Request realRequest = VIRTUAL_TO_REAL_REQUEST(*request);
JUMP_TO_LOWER_HALF(lh_info.fsaddr);
if( realRequest == MPI_REQUEST_NULL ){
```
Same question here, as above. If the lower half will handle MPI_REQUEST_NULL correctly, then why do we need to duplicate that logic in MANA?
Good afternoon!
You are right. The truth is that I encountered a problem and this if-statement seemed to solve it, but I have reverted to not having the if-statement and it works just as well!
@Marc-Miranda , I'll write more later. Please ping me if I forget. I want to do a careful analysis of MPI_Cancel and of where a ckpt-restart might happen. In each situation, either the MPI_Cancel cancels the communication or else the communication succeeds, so that makes four cases to analyze. I'm thinking that we might need to modify the log-and-replay logic to log the MPI_Cancel and replay it after restart. This would be needed to mark the request as cancelled. But I'm not sure if this is required, or whether we can fix things at checkpoint time so that we don't need to worry.

Note that the MPI request will be disposed of (converted to MPI_REQUEST_NULL) only by MPI_Wait and friends. MPI_Cancel does not complete a request. It can only mark a request as "cancelled" for special processing later by MPI_Wait. It's only at the time of MPI_Wait (or MPI_Test, etc.) that the cancelled request is converted into MPI_REQUEST_NULL. And the MANA logic already removes a virtual request at the time of MPI_Wait/MPI_Test, so that probably works.

Note also that MPI_Cancel is a local operation. It does not use the network to cancel a communication by the peer. If a peer initiates an MPI_Isend or MPI_Irecv, then the MPI_Cancel will cancel the local recv (of the remote MPI_Isend) or send (to the remote MPI_Irecv). The result is that there is still a pending MPI_Isend/MPI_Irecv initiated on the remote node, but locally the user will need to post a new, matching recv or send.
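For illustration, a minimal two-rank sketch of the local-only semantics described above (ranks, tags, and the one-second delay are illustrative): rank 0 cancels its own MPI_Irecv, and if the cancellation succeeded it must post a new matching receive, because rank 1's send is still outstanding.

```c
// Sketch: MPI_Cancel is local.  Rank 1 sends one message; rank 0 posts an
// MPI_Irecv, cancels it, and checks MPI_Test_cancelled.  If the cancellation
// succeeded, rank 0 must post a new matching receive, since rank 1's send is
// still pending on the remote side.  Run with 2 ranks (mpirun -np 2).
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  int rank, data = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    MPI_Request req;
    MPI_Status status;
    int cancelled;
    MPI_Irecv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    MPI_Cancel(&req);           // only marks the request for cancellation
    MPI_Wait(&req, &status);    // completes it; req becomes MPI_REQUEST_NULL
    MPI_Test_cancelled(&status, &cancelled);
    if (cancelled) {
      // The local receive was cancelled, but rank 1's send is still in
      // flight, so a new matching receive is needed.
      MPI_Recv(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    printf("rank 0: cancelled=%d, data=%d\n", cancelled, data);
  } else if (rank == 1) {
    data = 42;
    sleep(1);                   // give rank 0 a chance to cancel first
    MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}
```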
@Marc-Miranda , I have a proposal for one more test of correctness. Can you write a simple MPI program with send/recv? But instead of MPI_Isend/MPI_Wait, try MPI_Isend/MPI_Cancel/sleep(20)/MPI_Wait. Then, using your branch, see if ckpt-restart works, and if an ordinary ckpt-resume that finishes the original process works. After that, please try the same game with MPI_Irecv. I'm worried that in your previous tests you may not have been checkpointing between the MPI_Cancel and the MPI_Wait, since landing in that window is unlikely, but possible. If this does uncover a bug, I have a suggestion on how to fix it, but let's first see if it's a problem. If so, I'll work out the details of a proposed fix.
Good afternoon @gc00!
Best,
Good morning! Just as a note, before putting in any MPI_Isend/MPI_Irecv I first tried just an MPI_Cancel. I had first an MPI_Isend (or an MPI_Irecv, I do not recall), then a sleep of 30 seconds, and then MPI_Cancel. Doing a checkpoint during the sleep converted the request to MPI_REQUEST_NULL, and MPI_Cancel then raised an error. If I am reading the Null Handles subsection here correctly, it seems that MPI_Cancel does in fact treat null handles as an error. It seems that only the Wait and Test calls can deal with null handles. Best,
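For illustration, a minimal sketch of the null-handle contrast described above: MPI_Wait accepts MPI_REQUEST_NULL, while MPI_Cancel does not. The use of MPI_ERRORS_RETURN and the printed values are illustrative; the exact error class returned depends on the MPI implementation.

```c
// Sketch: per the MPI standard's "Null Handles" rules, MPI_Wait/MPI_Test
// accept MPI_REQUEST_NULL and return an empty status, but MPI_Cancel on a
// null handle is erroneous.  MPI_ERRORS_RETURN lets us observe the error
// code instead of aborting; the error class may vary by implementation.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  MPI_Init(&argc, &argv);
  MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
  MPI_Comm_set_errhandler(MPI_COMM_SELF, MPI_ERRORS_RETURN);

  MPI_Request req = MPI_REQUEST_NULL;
  MPI_Status status;

  int rc_wait = MPI_Wait(&req, &status);   // OK: returns an empty status
  int rc_cancel = MPI_Cancel(&req);        // erroneous: null handle

  printf("MPI_Wait on null handle:   rc=%d (MPI_SUCCESS=%d)\n",
         rc_wait, MPI_SUCCESS);
  printf("MPI_Cancel on null handle: rc=%d\n", rc_cancel);

  MPI_Finalize();
  return 0;
}
```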
Good afternoon! When testing the wrapper for MPI_Test_cancelled I stumbled onto a problem with the wrapper for MPI_Cancel. The test application does the following: MPI_Isend/MPI_Cancel/sleep(20)/MPI_Wait/sleep(20)/MPI_Test_cancelled. When I do a checkpoint during the second sleep, no problem arises: the request created by MPI_Isend is locally cancelled by MPI_Cancel and "made globally cancelled" by MPI_Wait. So when the checkpoint arrives, the request already has the global status of cancelled, and MPI_Test_cancelled has no problem with that. Checkpoint-resume and checkpoint-restart both work here.

However, when I checkpoint during the first sleep, there is the following problem. Before the checkpoint the request has been locally cancelled, but this cancellation has not yet been made globally visible by MPI_Wait, so during the draining process at checkpoint time the request is set to MPI_REQUEST_NULL. After the checkpoint, the request is not marked as cancelled but is MPI_REQUEST_NULL altogether. How do you think this should be sorted out? I have not yet studied in much detail how the draining process works, so I do not know whether it would be easy to add a check during draining that ignores requests that have been cancelled.

Note: It was quite strange to see that although the real request associated with my application was set to MPI_REQUEST_NULL, its virtual partner was not. Best,
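For reference, a minimal two-rank sketch of that test application. The sleep lengths mark the two checkpoint windows described above, and the flag exchange at the end is only there so that rank 1 drains the message if the cancellation did not succeed; names and tags are illustrative.

```c
// Sketch of the test described above: MPI_Isend / MPI_Cancel / sleep(20) /
// MPI_Wait / sleep(20) / MPI_Test_cancelled.  A checkpoint during the first
// sleep lands between MPI_Cancel and MPI_Wait; during the second sleep it
// lands after MPI_Wait has already completed the cancelled request.
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  int rank, payload = 7, cancelled = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    MPI_Request req;
    MPI_Status status;
    MPI_Isend(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    MPI_Cancel(&req);        // marks the request for cancellation (local)
    sleep(20);               // checkpoint window 1: cancelled, not yet waited on
    MPI_Wait(&req, &status); // completes it; req becomes MPI_REQUEST_NULL
    sleep(20);               // checkpoint window 2: request already completed
    MPI_Test_cancelled(&status, &cancelled);
    printf("rank 0: cancellation %s\n", cancelled ? "succeeded" : "failed");
    // Tell rank 1 whether the original message still needs to be drained.
    MPI_Send(&cancelled, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&cancelled, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (!cancelled) {
      // The send was not cancelled, so receive it to leave no pending message.
      MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
  }

  MPI_Finalize();
  return 0;
}
```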
Good afternoon!
This is a PR with all the modifications and the implemented wrappers for the functions that iPic3D requires, which we have been discussing through several issues. Some comments:
Best,
Marc