Multiprocessing and Performance Improvements #1117
carlsonp wants to merge 2 commits into capitalone:dev from
Conversation
* add downloads tile (capitalone#1085)
* Replace snappy with cramjam
* Delete test_no_snappy

Co-authored-by: Taylor Turner <taylorfturner@gmail.com>
I made a bit more progress in understanding what's going on. No solutions yet, though; maybe someone will have suggestions. The profiling via snakeviz looks like this: it's iterating through the
You'll want a rebase onto

This is related to some of the discussion in #1098
In my testing, I have a single dataset. I am running this in a Docker container. I'm running with the following settings:
When Data Profiler gets to the first `tqdm` loop and displays `Finding the Null values in the columns...`, it's pretty quick. It also lists 19 processes, corresponding to the pool_size available in the Python multiprocessing pool. This works fine.

Then, when it gets to the second `tqdm` loop and displays `Calculating the statistics...`, I noticed that it was only using 4 processes. When I looked at what was running, I saw only a single core being used. Looking at the code, `profile_builder.py` has 4 hard-coded. This doesn't seem right. There's a utility function, `profiler_utils.suggest_pool_size`, that returns a suggested pool size but, as far as I can tell, isn't used anywhere in the codebase. So I swapped that in. Now when I run, it shows 19 processes instead of 4, which seems better. At least we're not leaving potential performance on the table with hard-coding.

However, I'm still seeing only a single core being used. I also checked the CPU affinity after reading some comments on Stack Overflow, and it looks reasonable to me.
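For illustration, here is a minimal sketch of the idea behind sizing the pool dynamically instead of hard-coding 4. The `suggested_pool_size` helper below is a hypothetical stand-in, not the actual `profiler_utils.suggest_pool_size` implementation; it also shows the affinity check mentioned above, since `os.sched_getaffinity` reflects the cores actually available to the process (e.g. inside a container):

```python
import multiprocessing as mp
import os


def suggested_pool_size(max_pool_size=None):
    """Hypothetical helper illustrating the idea behind
    profiler_utils.suggest_pool_size: derive the pool size from the
    cores actually available to this process instead of hard-coding 4.
    """
    try:
        # On Linux this respects CPU affinity (including limits a
        # container imposes through it), unlike mp.cpu_count().
        available = len(os.sched_getaffinity(0))
    except AttributeError:
        # sched_getaffinity is not available on macOS/Windows.
        available = mp.cpu_count()
    if max_pool_size is not None:
        available = min(available, max_pool_size)
    return max(1, available)


if __name__ == "__main__":
    with mp.Pool(processes=suggested_pool_size()) as pool:
        # Trivial workload just to show the pool is usable.
        print(pool.map(abs, [-2, -1, 0, 1, 2]))
```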
I'm going to try profiling a bit more and see if I can figure out where it's hanging. Calculating all the statistics is taxing, but it still seems like it should be faster, particularly on a multi-core machine.
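For anyone wanting to reproduce the profiling, a minimal way to capture a profile that snakeviz can visualize is with the stdlib `cProfile`/`pstats` modules. The `calculate_statistics` function here is a placeholder for the actual profiler call being measured:

```python
import cProfile
import pstats


def calculate_statistics():
    # Placeholder for the expensive per-column work; swap in the
    # actual DataProfiler call you want to inspect.
    return sum(i * i for i in range(200_000))


# Collect a profile without relying on exec-based cProfile.run().
profiler = cProfile.Profile()
profiler.enable()
calculate_statistics()
profiler.disable()

# Dump stats to a file snakeviz can render:  snakeviz stats.prof
profiler.dump_stats("stats.prof")

# Or inspect the hottest calls directly in the terminal.
pstats.Stats("stats.prof").sort_stats("cumulative").print_stats(5)
```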