Skip to content

Feature/stats_shm#413

Open
prestonpiercey1 wants to merge 18 commits intollnl:v0.10-devfrom
prestonpiercey1:feature/stats_shm
Open

Feature/stats_shm#413
prestonpiercey1 wants to merge 18 commits intollnl:v0.10-devfrom
prestonpiercey1:feature/stats_shm

Conversation

@prestonpiercey1
Copy link
Copy Markdown
Contributor

Implements features to be used in coordination with YGM-top.

  • Adds shared memory backing capability for ygm communicator stats, allowing external reader to see comm stats data during live program runs.

  • Also adds cleanup infrastructure to account for abnormal program exit conditions where possible.

Comment thread include/ygm/detail/comm_stats.hpp Outdated
shm::unregister_path(m_stats_path.c_str());
}

void open_shm(int rank, int comm_size, int local_size, std::string uuid) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is a little confusing with shm_open. Might be better to use comm_stats_shm_open or similar. If this is changed, change close_shm the same way.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, will change to be more descriptive.

Comment thread include/ygm/detail/stats_shm_signal.hpp Outdated
// Tracking array. Mutated by register/unregister_path in main thread;
// read by signal handler to unlink in case of abnormal behavior.
inline int num_active_paths = 0;
inline char active_shm_paths[MAX_SHM_PATHS][kShmPathMaxLen] = {0};
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the live_comm_uuids from ygm_uuids for unlinking files during a crash? The use of a C array of strings with MAX_SHM_PATHS entries is going to be unnecessarily limiting.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manipulation/use of std::set and std::string are very likely to violate signal handler calling context assumptions due to use of dynamic memory, and many associated functions not safe for potential reentry behavior. Current code was intentionally written to not use any operations or structures that are not verifiably async-signal-safe.

It may be overly complicated now that we are considering session removal / recovery in YGM-top. If it is overkill I can roll back to live_comm_uuids.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can work around this in our case. We aren't trying to call malloc in the signal handler. Just read from the set. We can also get away with not removing the UUIDs when we are in the signal handler.

The one case that could be an issue is if the signal was raised during the insertion of the UUID into live_comm_uuids. If we want to be safe in this case, I believe we could use sigprocmask to turn signals off when adding/removing UUIDs, and then turn them back on.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean. I'll change it to read from live_comm_uuid and reconstruct the shm paths in handler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants