Skip to content

Conversation

@karya0
Copy link
Collaborator

@karya0 karya0 commented May 31, 2023

No description provided.

@karya0 karya0 force-pushed the dev/karya0/doublefree branch from 50770ec to 336d626 Compare May 31, 2023 10:52
Copy link
Collaborator

@gc00 gc00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving now, in order to get this into MANA quickly. But if I understand the issue properly, then I have a question.

@karya0, You report that you're seeing a double free. In the code in this file alone, there is no obvious double-free. Could you say more about where you are seeing the double-free?

Could it be that we are explicitly calling the FncArg destructor somewhere, and then implicitly calling it again when we destroy a higher level object that uses FncArg? But I don't see any explicitly destructor of FncArg. So, maybe this happens inside a higher-level object. If you can point to where the double-free is, that will be helpful.

@karya0, @jiamingz9925, @dahongli : Can one of you insert some logic inside ~FncArg (the destructor) to see if it is being called twice? An obvious way of testing is to add to the constructor a true field: : inUse=true. And then in ~FncArg, check if inUse is already false (an error), or sel set inUse to be `false'. Thanks.

I can guess how this bad design might have happened. Originally, a FncArg was created only during regular launch/resume, and it was only during restart that we would destroy a FncArg. This probably did not have any malloc bugs. If so, this bug would be exhibited only in a complex use of MPI_Isend/Irecv, etc. Our simple tests would not show this.

Later, @xuyao0127 and I modified the logic of FncArg to create it temporarily during launch, and then destroy it again later during launch, once we could verify that we would never have to replay it. We were doing this for MPI_Send/Recv, or maybe MPI_Isend/Irecv. But I'm not sure about the details. It was a year ago.

@xuyao0127, can you also review our use of FncArg? And I'll try to do that too.

: _data(JALLOC_HELPER_MALLOC(len))
// Default _type set to TYPE_INT_ARRAY because this constructor is also used
// by CREATE_LOG_BUF in MPI_Cart functions.
FncArg(const void *data, size_t len, dmtcp_mpi::TYPE type = TYPE_INT_ARRAY)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for consolidating these two constructors with an optional argument. This is definitely clearer. But wouldn't it be even better if we changed:

#define CREATE_LOG_BUF(buf, len) dmtcp_mpi::FncArg(buf, len)
->
`#define CREATE_LOG_BUF(buf, len) dmtcp_mpi::FncArg(buf, len, TYPE_INT_ARRAY)

Then we don't need an optional argument, and the logic of the code is more local.
`

if (ptr) {
_data = std::shared_ptr<void>(ptr,
[](void *p) { JALLOC_HELPER_FREE(p);});
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the if (ptr), shouldn't we add an else clause to assert or panic? If JALLOC_HELPER_MALLOC() failed, then we want to know this now.


~FncArg()
{
if (!_data) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't matter now that we are revising this code. But isn't if (!_data) a separate original bug (a memory leak)?

First of all, _data will always be a valid pointer. All versions of the constructor allocated a _data buffer. Maybe this logic is left over from an earlier design. So, it seems like !_data will always be false, and so we never free _data. This is a memory leak.

memcpy(_data, data, len);
void *ptr = JALLOC_HELPER_MALLOC(len);
if (ptr) {
_data = std::shared_ptr<void>(ptr,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest an added comment here:

          // FIXME:  This 'shared_ptr' fixes an apparent double-free.
           //   However, this is a temporary fix, until we can discover
           //     why this happens.
```

@gc00
Copy link
Collaborator

gc00 commented Sep 13, 2023

@karya0 , Is this PR #318 still valid, or should it be closed without committing? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants