
Conversation

@Marc-Miranda

Good afternoon!
This PR contains all the modifications and the implemented wrappers for the functions that iPic3D requires, which we have been discussing across several issues. Some comments:

  1. I am not sure whether the simple modification to the MPI_Wait wrapper introduces any problems.
  2. The wrappers for MPI_Cancel and MPI_Request_free worked for me as written, but I suspect that some further handling of MPI_Request objects may be needed, such as updating the virtual-to-real mapping.

Best,
Marc

@JainTwinkle JainTwinkle requested a review from gc00 July 25, 2022 18:18
DMTCP_PLUGIN_DISABLE_CKPT();
MPI_Request realRequest = VIRTUAL_TO_REAL_REQUEST(*request);
JUMP_TO_LOWER_HALF(lh_info.fsaddr);
if( realRequest == MPI_REQUEST_NULL ){
Collaborator

Isn't the if branch superfluous? Won't the lower half simply return with success if the request is MPI_REQUEST_NULL?
If so, why should MANA duplicate the logic of MPI?

DMTCP_PLUGIN_DISABLE_CKPT();
MPI_Request realRequest = VIRTUAL_TO_REAL_REQUEST(*request);
JUMP_TO_LOWER_HALF(lh_info.fsaddr);
if( realRequest == MPI_REQUEST_NULL ){
Collaborator

Same question here, as above. If the lower half will handle MPI_REQUEST_NULL correctly, then why do we need to duplicate that logic in MANA?

Author

Good afternoon!
You are right. The truth is that I encountered a problem and this if-statement seemed to solve it, but I have reverted to the version without the if-statement and it works just as well!

@gc00
Collaborator

gc00 commented Jul 26, 2022

@Marc-Miranda , I'll write more later. Please ping me if I forget.

I want to do a careful analysis of MPI_Cancel, and where a ckpt-restart might happen.
CASES:
  1. Request for MPI_Isend, with MPI_Cancel and MPI_Wait -- suppose the ckpt-restart happens after MPI_Cancel.
  2. Request for MPI_Irecv, with MPI_Cancel and MPI_Wait -- suppose the ckpt-restart happens after MPI_Cancel.

In each situation, either the MPI_Cancel cancels it or else the communication succeeds. So, that makes four cases to analyze.

I'm thinking that we might need to modify the log-and-replay logic to log the MPI_Cancel and replay it after restart. This would be needed to mark the request as cancelled. But I'm not sure whether this is required, or whether we can fix things at checkpoint time so that we don't need to worry.

Note that the MPI request will be disposed of (converted to MPI_REQUEST_NULL) only by MPI_Wait and friends. MPI_Cancel does not complete a request. It can only mark a request as "cancelled" for special processing later by MPI_Wait. It's only at the time of MPI_Wait (or MPI_Test, etc.) that the cancelled request is converted into MPI_REQUEST_NULL. And the MANA logic already removes a virtual request at the time of MPI_Wait/MPI_Test. So, that probably works.

Note that MPI_Cancel is a local operation. It does not use the network to cancel a communication by the peer. If a peer initiates an MPI_Isend or MPI_Irecv, then the MPI_Cancel will cancel the local recv (of the remote MPI_Isend) or send (to the remote MPI_Irecv). The result is that there is still a pending MPI_Isend/MPI_Irecv initiated on the remote end. But locally, the user will need to post a new, matching recv or send.

====
Finally, I have a question.
QUESTION: Don't we need a wrapper for MPI_Test_cancelled ? It seems like a lot of codes that use MPI_Cancel will eventually decide to also call MPI_Test_cancelled. So, we might as well add that wrapper now.
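Roughly, such a wrapper might look like the sketch below. This is only a sketch, not code from this PR: it assumes MANA's NEXT_FUNC, RETURN_TO_UPPER_HALF, and DMTCP_PLUGIN_ENABLE_CKPT macros, which do not appear in the excerpts above. Since MPI_Test_cancelled only reads an MPI_Status, no virtual-to-real request translation should be needed.

int MPI_Test_cancelled(const MPI_Status *status, int *flag)
{
  // Sketch only: forward to the lower half; no request handle is involved,
  // so there is nothing to translate through the virtual-request tables.
  int retval;
  DMTCP_PLUGIN_DISABLE_CKPT();
  JUMP_TO_LOWER_HALF(lh_info.fsaddr);
  retval = NEXT_FUNC(Test_cancelled)(status, flag);
  RETURN_TO_UPPER_HALF();
  DMTCP_PLUGIN_ENABLE_CKPT();
  return retval;
}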

===
Anyway, thanks very much for this PR! We'll continue to work with you to verify its correctness, and see if we need to suggest any improvements.

@gc00
Collaborator

gc00 commented Jul 26, 2022

@Marc-Miranda , I have a proposal for one more test of correctness. Can you write a simple MPI program with send/recv? But instead of MPI_Isend/MPI_Wait, try MPI_Isend/MPI_Cancel/sleep(20)/MPI_Wait. Then, using your branch, see if ckpt-restart works, and if an ordinary ckpt-resume to finish the original process also works.

After that, please try the same game with MPI_Irecv.
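A minimal sketch of that test (a sketch only, not a file from this PR) is below; it also checks the outcome with MPI_Test_cancelled. Rank 1 never posts the matching receive, so the send can only complete by being cancelled; swapping the MPI_Isend for an MPI_Irecv with no matching send gives the second test.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
  int rank, data = 42, cancelled = 0;
  MPI_Request req = MPI_REQUEST_NULL;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    // No rank ever posts a matching receive, so this send can only
    // complete by being cancelled.
    MPI_Isend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
    MPI_Cancel(&req);
    sleep(20);                    // take the checkpoint in this window
    MPI_Wait(&req, &status);
    MPI_Test_cancelled(&status, &cancelled);
    printf("rank 0: cancelled = %d\n", cancelled);
  } else {
    sleep(25);                    // keep the other rank alive for the ckpt
  }

  MPI_Finalize();
  return 0;
}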

I'm worried that in your previous tests you may not have been checkpointing between the MPI_Cancel and the MPI_Wait; that window is narrow, so hitting it is unlikely but possible.

If this does uncover a bug, I have a suggestion on how to fix it, but let's first see if it's a problem. If so, I'll work out the details of a proposed fix.

@Marc-Miranda
Author

Good afternoon @gc00!
Excuse me for taking some time to answer. I have been running more tests and found some errors that I had not seen before. Basically, when I run iPic3D with MANA, from time to time I get one of the following errors:

  1. Although now almost all MPI_Status.MPI_ERROR fields are set to MPI_SUCCESS after MPI_Waitall, occasionally a couple of them still contain nonsense values, which causes the application to stop. I still have not found the source of the issue.
  2. An assertion failure from MPICH's mpir_request.h when waiting on a request, followed by a segfault.
  3. Apart from these, I also sometimes see the typical problem where MANA gets stuck, but that is an unrelated issue.

I am trying to solve, or at least locate, the issues, but I still have limited knowledge of many tools and concepts (I just graduated in Maths and Physics, not CS).
I'll try writing the wrapper for MPI_Test_cancelled! But first I'll go with your proposal for a test. I have another commitment now, so if it is OK I will get to this by (what is my) tomorrow morning!

Best,
Marc

@Marc-Miranda
Author

Good morning!
I have been playing with the test application you suggested and here are the results. In the MPI_Isend case everything worked perfectly. The Isend request is posted and then cancelled. During the sleep a checkpoint is taken, the run resumes to the end of the application, and MPI_Wait does not raise any issue. The only thing I notice is that during the checkpoint MANA tries to drain the MPI_Isend, but since there is no matching MPI_Irecv it cannot do so; a timeout message appears and the checkpoint completes anyway. Here both checkpoint-resume and checkpoint-restart work.
In the MPI_Irecv case, on the other hand, although checkpoint-resume works, the restart gets stuck. Is this one of the behaviours you anticipated? I wrote the test application in C++, and I now see that all your test applications are written in C, so I will rewrite it in C and commit it so you can have a look.

Just as a note: before the MPI_Isend/MPI_Irecv tests above, I first tried just an MPI_Cancel. I had an MPI_Isend (or an MPI_Irecv, I do not recall) first, then a 30-second sleep, and then MPI_Cancel. Taking a checkpoint during the sleep converted the request to MPI_REQUEST_NULL, and MPI_Cancel then raised an error. If I am reading the Null Handles subsection here correctly, MPI_Cancel does in fact treat null handles as an error; it seems that only the Wait and Test calls can deal with null handles.
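As a small illustration of that rule (not code from this PR, and the helper name is made up), a guard along these lines avoids calling MPI_Cancel on a null handle while still letting MPI_Wait clean up:

#include <mpi.h>

// Illustrative helper only: MPI_Wait/MPI_Test accept MPI_REQUEST_NULL as a
// no-op, but MPI_Cancel treats a null handle as an error, so guard the cancel.
static void cancel_and_complete(MPI_Request *req)
{
  MPI_Status status;
  if (*req != MPI_REQUEST_NULL) {
    MPI_Cancel(req);            // only legal on a live (non-null) request
  }
  MPI_Wait(req, &status);       // fine even when *req == MPI_REQUEST_NULL
}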

Best,
Marc

@Marc-Miranda
Author

Good afternoon!

While testing the wrapper for MPI_Test_cancelled I stumbled into a problem with the wrapper for MPI_Cancel. The test application does the following: MPI_Isend/MPI_Cancel/sleep(20)/MPI_Wait/sleep(20)/MPI_Test_cancelled. When I take a checkpoint during the second sleep no problem arises: the request is created by MPI_Isend, locally cancelled by MPI_Cancel, and "made globally cancelled" by MPI_Wait. So by the time the checkpoint arrives, the request already has the global status of cancelled, and MPI_Test_cancelled has no problem with that. Both checkpoint-resume and checkpoint-restart work here.

However, when I checkpoint during the first sleep, there is the following problem. Before the checkpoint the request has been locally cancelled, but the cancellation has not yet been made globally visible by MPI_Wait, so during the draining process at checkpoint time the request is set to MPI_REQUEST_NULL. After the checkpoint, the request is not marked as cancelled; it is simply MPI_REQUEST_NULL altogether.

How do you think this should be sorted out? I have not yet studied how the draining process works in much detail, so I do not know whether it would be easy to add a check during draining that ignores requests that have already been cancelled.
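At the pure MPI level (I have not looked at how MANA's drain loop is actually structured, and the helper name here is made up), the kind of check I have in mind might look roughly like this: complete the request with MPI_Test and ask the resulting status whether it was cancelled, instead of trying to drain its data.

#include <mpi.h>

// Rough sketch only, not MANA's draining code: if the request completes and
// its status reports cancellation, there is no message to drain for it.
static int request_was_cancelled(MPI_Request *req)
{
  int completed = 0, cancelled = 0;
  MPI_Status st;
  MPI_Test(req, &completed, &st);
  if (completed) {
    MPI_Test_cancelled(&st, &cancelled);
  }
  return cancelled;
}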

Note: it was quite strange to see that, although the real request associated with my application was set to MPI_REQUEST_NULL, its virtual counterpart was not.

Best,
Marc
