Perf: 2-4x speedup for ShermanMorrison _solve_2D2 #445
Just for fun, I went ahead and benchmarked against fastshermanmorrison, and this approach is a bit faster on Apple silicon:
But slower on x86:
Same approach for fastshermanmorrison here: nanograv/fastshermanmorrison#10, with some decent speedups. It also explains some of the mystery of why Python beat the C kernel on my Apple silicon.
Hi @jberg5, it is great you are looking at this! Thanks. Can you modify the PR to merge into
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #445 +/- ##
==========================================
+ Coverage 71.58% 71.69% +0.10%
==========================================
Files 13 13
Lines 3245 3243 -2
==========================================
+ Hits 2323 2325 +2
+ Misses 922 918 -4
... and 3 files with indirect coverage changes
Thanks @vhaasteren! Have switched to
2-4x speedup for enterprise/enterprise/signals/signal_base.py, line 1283 in 6335ff7
A single NumPy matrix multiplication is much faster than an outer product plus a subtraction in a loop. We can make the _solve_2D2 correction anywhere from 10-100x faster (the end-to-end speedup will be lower because other things are unchanged) just by accumulating all zn and xn terms into matrices, since the sum of scaled outer products is itself a matrix product, as long as you decompose beta:
where we define:
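One way to write this identity is sketched below. The notation is an assumption made for illustration and is not taken from the PR: K is the number of blocks, Z_n and X_n are the matrices whose k-th rows are the per-block vectors zn and xn, and B is the diagonal matrix of the per-block beta values. The square-root split in the last step is one possible decomposition of beta and needs every beta_k to be non-negative; absorbing beta into only one of the two factors works just as well.

```latex
\sum_{k=1}^{K} \beta_k \, z_k x_k^{T}
  \;=\; Z_n^{T} \, B \, X_n
  \;=\; \left(B^{1/2} Z_n\right)^{T} \left(B^{1/2} X_n\right),
\qquad
B = \mathrm{diag}(\beta_1, \dots, \beta_K),
\quad
Z_n = \begin{pmatrix} z_1^{T} \\ \vdots \\ z_K^{T} \end{pmatrix},
\quad
X_n = \begin{pmatrix} x_1^{T} \\ \vdots \\ x_K^{T} \end{pmatrix}.
```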
Note that this isn't really an algorithmic improvement; the total number of flops is going to be roughly the same. The speedup all comes from being able to use the BLAS dgemm kernel (and the corresponding speedup you see will depend on your hardware). There is a negligible memory overhead from larger intermediate matrices.
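To make the "same flops, one matrix product" point concrete, here is a minimal sketch comparing the two ways of building the correction term. It is not the code from this PR: the function names, the argument names (nvec, jvec, slices) and the formula used for beta are assumptions chosen for illustration, and the data is random. Whether the final product actually reaches a BLAS gemm kernel depends on how your NumPy was built.

```python
import numpy as np


def correction_loop(Z, X, nvec, jvec, slices):
    """Accumulate one scaled outer product per block in a Python loop."""
    corr = np.zeros((Z.shape[1], X.shape[1]))
    for slc, jv in zip(slices, jvec):
        ni = 1.0 / nvec[slc]                 # inverse noise variances in this block
        beta = 1.0 / (ni.sum() + 1.0 / jv)   # assumed form of the per-block scalar
        zn = ni @ Z[slc, :]                  # shape (p,)
        xn = ni @ X[slc, :]                  # shape (q,)
        corr += beta * np.outer(zn, xn)
    return corr


def correction_batched(Z, X, nvec, jvec, slices):
    """Stack zn and xn for every block, then form the sum with one matrix product."""
    nblocks = len(slices)
    Zn = np.empty((nblocks, Z.shape[1]))
    Xn = np.empty((nblocks, X.shape[1]))
    beta = np.empty(nblocks)
    for k, (slc, jv) in enumerate(zip(slices, jvec)):
        ni = 1.0 / nvec[slc]
        beta[k] = 1.0 / (ni.sum() + 1.0 / jv)
        Zn[k] = ni @ Z[slc, :]
        Xn[k] = ni @ X[slc, :]
    # sum_k beta_k * outer(zn_k, xn_k) == (Zn scaled row-wise by beta).T @ Xn
    return (Zn * beta[:, None]).T @ Xn


# Synthetic data: 60 rows split into 12 blocks of 5 (all sizes are arbitrary).
rng = np.random.default_rng(0)
n_rows, p, q, block = 60, 4, 3, 5
Z = rng.standard_normal((n_rows, p))
X = rng.standard_normal((n_rows, q))
nvec = rng.uniform(0.5, 2.0, n_rows)
slices = [slice(i, i + block) for i in range(0, n_rows, block)]
jvec = rng.uniform(0.1, 1.0, len(slices))

loop_result = correction_loop(Z, X, nvec, jvec, slices)
batched_result = correction_batched(Z, X, nvec, jvec, slices)
print(np.allclose(loop_result, batched_result))
```

The batched version still loops over blocks to build zn and xn, so only the outer-product accumulation is replaced; the extra memory is the two stacked matrices, with one row per block.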
Here's a synthetic benchmarking script that you can run standalone to see the boost (thanks Claude for writing it):
Click to expand script
On my macbook I see:
On a gcloud c2-standard-4: