
@karya0 karya0 commented Aug 28, 2023

No description provided.

@karya0 karya0 requested review from gc00 and jiamingz9925 August 28, 2023 18:26
@gc00 gc00 left a comment

Please modify the comment and add the extra requested comment. Otherwise, in the future, we will look back at this mysterious special case and wonder why it is there.

Separately, I assume that you're going to squash the two commits together, before pushing this in.

I'd like to wait to see the added comment before approving, just to make sure we're documenting the code well. Thanks.

LhCoreRegions_t *lh_regions_list = NULL;
int total_lh_regions = lh_info->numCoreRegions;

// Don't skip munmap of mtcp_restart regions.

Let's change the comment to remove the double negative:
// Do an munmap of mtcp_restart regions during restart. Don't skip this.

Also, please add a comment about why we need to munmap the mtcp_restart regions within MANA, but we don't need to do that within ordinary DMTCP. Where is the potential address conflict that we're trying to avoid?


I want to add that we should not skip every [heap] region, only the [heap] region right after the mtcp_restart region.


// Don't skip munmap of mtcp_restart regions.
if (mtcp_strendswith(area->name, "/mtcp_restart") ||
    mtcp_strendswith(area->name, "[heap]")) {

We only want to skip the [heap] region right after mtcp_restart. Also, in mpi_plugin.cpp we need to do the same and skip those areas when considering libsStart.
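
A minimal sketch of that narrower check, for illustration only. The `Area` struct, `ends_with()`, and `is_restart_or_adjacent_heap()` are hypothetical names, not the MTCP/MANA definitions; the sketch assumes the areas are visited in address order, so "right after" means the previous entry in the array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the region descriptor; only the name is used. */
typedef struct {
  const char *name;
} Area;

/* Local helper with the same intent as mtcp_strendswith(); not the MTCP one. */
static int ends_with(const char *s, const char *suffix) {
  size_t n = strlen(s), m = strlen(suffix);
  return n >= m && strcmp(s + n - m, suffix) == 0;
}

/*
 * Returns 1 if areas[i] is an mtcp_restart region, or a [heap] region whose
 * immediate predecessor is an mtcp_restart region; returns 0 otherwise.
 * Any other [heap] region is deliberately not matched.
 */
static int is_restart_or_adjacent_heap(const Area *areas, size_t count,
                                       size_t i) {
  assert(areas != NULL && i < count);
  if (ends_with(areas[i].name, "/mtcp_restart")) {
    return 1;
  }
  return i > 0 &&
         ends_with(areas[i].name, "[heap]") &&
         ends_with(areas[i - 1].name, "/mtcp_restart");
}
```

With a predicate like this, a [heap] that follows some other binary is left alone, which is the distinction the current name-only test cannot make.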



Update: For some reason, unmapping the heap region here causes a segfault, but without unmapping it we run into a conflict later as well. @karya0

gc00 commented Aug 29, 2023

@karya0,
This issue is a blocker for MANA development. I hope you can get back to it soon.
Best,
Gene

karya0 commented Aug 30, 2023

@gc00: This PR is insufficient for the fix. The problem lies in how the lower half/lh-proxy account for "core" regions versus the rest of the regions. The current logic in the split process treats all areas up to [heap] as core regions and refuses to munmap them. This includes the mtcp_restart region as well.

Further, the logic in the upper-half plugin, mpi_plugin.cpp, incorrectly labels the heap created by the new lh-proxy process as part of the upper half and saves it as part of the checkpoint. That's why the heap also sees a conflict on the second restart.

We need to come up with a proper fix that handles both cases. This PR can plaster over the mtcp_restart conflict but not the heap conflict.
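
To make the accounting problem concrete, here is a rough sketch, not the actual lower-half code. `ends_with()` and `mark_core_regions()` are hypothetical names, and the sketch assumes that "all areas until [heap]" means every area up to and including the first [heap] entry. The `exempt_mtcp_restart` flag shows the kind of special case this PR adds; it does nothing about the heap mislabeling on the upper-half side.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local helper with the same intent as mtcp_strendswith(); not the MTCP one. */
static int ends_with(const char *s, const char *suffix) {
  size_t n = strlen(s), m = strlen(suffix);
  return n >= m && strcmp(s + n - m, suffix) == 0;
}

/*
 * Sets is_core[i] = 1 for every area up to and including the first [heap]
 * (assumed reading of the current rule), and 0 for everything after it.
 * If exempt_mtcp_restart is nonzero, mtcp_restart areas are never marked
 * core, so they stay eligible for munmap. Returns the number marked core.
 */
static size_t mark_core_regions(const char *const *names, size_t count,
                                int exempt_mtcp_restart, int *is_core) {
  assert(names != NULL && is_core != NULL);
  size_t marked = 0;
  int past_heap = 0;
  for (size_t i = 0; i < count; i++) {
    int core = !past_heap;
    if (core && exempt_mtcp_restart && ends_with(names[i], "/mtcp_restart")) {
      core = 0;
    }
    is_core[i] = core;
    marked += (size_t)core;
    if (ends_with(names[i], "[heap]")) {
      past_heap = 1;
    }
  }
  return marked;
}
```

Without the exemption, an mtcp_restart area that appears before [heap] is counted as core and is never unmapped, which is the first conflict described above.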

@jiamingz9925

@karya0 @gc00 I can try some experiments in my forked repo, based on this PR as well.

gc00 commented Sep 3, 2023

See PR #357 for the continuation of this analysis. We should probably close this PR without committing.

gc00 commented Sep 13, 2023

@karya0, if this PR (#353) is now obsolete, can you close it?
