-
Notifications
You must be signed in to change notification settings - Fork 26
Fixed double-free memory corruption. #318
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
50770ec to
336d626
Compare
gc00
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm approving now, in order to get this into MANA quickly. But if I understand the issue properly, then I have a question.
@karya0, You report that you're seeing a double free. In the code in this file alone, there is no obvious double-free. Could you say more about where you are seeing the double-free?
Could it be that we are explicitly calling the FncArg destructor somewhere, and then implicitly calling it again when we destroy a higher level object that uses FncArg? But I don't see any explicitly destructor of FncArg. So, maybe this happens inside a higher-level object. If you can point to where the double-free is, that will be helpful.
@karya0, @jiamingz9925, @dahongli : Can one of you insert some logic inside ~FncArg (the destructor) to see if it is being called twice? An obvious way of testing is to add to the constructor a true field: : inUse=true. And then in ~FncArg, check if inUse is already false (an error), or sel set inUse to be `false'. Thanks.
I can guess how this bad design might have happened. Originally, a FncArg was created only during regular launch/resume, and it was only during restart that we would destroy a FncArg. This probably did not have any malloc bugs. If so, this bug would be exhibited only in a complex use of MPI_Isend/Irecv, etc. Our simple tests would not show this.
Later, @xuyao0127 and I modified the logic of FncArg to create it temporarily during launch, and then destroy it again later during launch, once we could verify that we would never have to replay it. We were doing this for MPI_Send/Recv, or maybe MPI_Isend/Irecv. But I'm not sure about the details. It was a year ago.
@xuyao0127, can you also review our use of FncArg? And I'll try to do that too.
| : _data(JALLOC_HELPER_MALLOC(len)) | ||
| // Default _type set to TYPE_INT_ARRAY because this constructor is also used | ||
| // by CREATE_LOG_BUF in MPI_Cart functions. | ||
| FncArg(const void *data, size_t len, dmtcp_mpi::TYPE type = TYPE_INT_ARRAY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for consolidating these two constructors with an optional argument. This is definitely clearer. But wouldn't it be even better if we changed:
#define CREATE_LOG_BUF(buf, len) dmtcp_mpi::FncArg(buf, len)
->
`#define CREATE_LOG_BUF(buf, len) dmtcp_mpi::FncArg(buf, len, TYPE_INT_ARRAY)
Then we don't need an optional argument, and the logic of the code is more local.
`
| if (ptr) { | ||
| _data = std::shared_ptr<void>(ptr, | ||
| [](void *p) { JALLOC_HELPER_FREE(p);}); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the if (ptr), shouldn't we add an else clause to assert or panic? If JALLOC_HELPER_MALLOC() failed, then we want to know this now.
|
|
||
| ~FncArg() | ||
| { | ||
| if (!_data) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't matter now that we are revising this code. But isn't if (!_data) a separate original bug (a memory leak)?
First of all, _data will always be a valid pointer. All versions of the constructor allocated a _data buffer. Maybe this logic is left over from an earlier design. So, it seems like !_data will always be false, and so we never free _data. This is a memory leak.
| memcpy(_data, data, len); | ||
| void *ptr = JALLOC_HELPER_MALLOC(len); | ||
| if (ptr) { | ||
| _data = std::shared_ptr<void>(ptr, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest an added comment here:
// FIXME: This 'shared_ptr' fixes an apparent double-free.
// However, this is a temporary fix, until we can discover
// why this happens.
```
No description provided.