-
Notifications
You must be signed in to change notification settings - Fork 1
Feature/qgpu optimize nonbonded ww using openmm method #68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: feature/qgpu
Are you sure you want to change the base?
Feature/qgpu optimize nonbonded ww using openmm method #68
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the CUDA water-water nonbonded force calculation kernel to use a warp-synchronous algorithm inspired by OpenMM. The new implementation replaces the previous block-based approach with a more efficient warp-level computation that leverages shuffle operations for better performance.
Key Changes:
- Replaced template-based block computation with warp-synchronous tile-based algorithm
- Introduced helper functions for index mapping (idx2xy) and warp shuffle operations for double/coord_t types
- Simplified grid/block configuration from 2D to 1D with tile-based work distribution
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
|
|
||
|
|
||
|
|
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple blank lines should be removed. This appears to be accidental whitespace that reduces code readability.
|
|
||
|
|
||
|
|
||
| // printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT); |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out printf statement should be removed before merging. Debug code should not remain in production code.
| // printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT); |
| coord_t invalid = {-1e9, -1e9, -1e9}; | ||
| coord_t row = (rowIdx < N ? W[rowIdx] : invalid); | ||
|
|
||
| // "col" state that will be rotated around the warp | ||
| int colIdx = colIdx0; | ||
| coord_t col = (colIdx < N ? W[colIdx] : invalid); |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable name 'invalid' is vague and could be more descriptive. Consider renaming to 'invalidCoord' or 'outOfBoundsMarker' to clarify its purpose as a sentinel value for out-of-range atoms.
| coord_t invalid = {-1e9, -1e9, -1e9}; | |
| coord_t row = (rowIdx < N ? W[rowIdx] : invalid); | |
| // "col" state that will be rotated around the warp | |
| int colIdx = colIdx0; | |
| coord_t col = (colIdx < N ? W[colIdx] : invalid); | |
| coord_t invalidCoord = {-1e9, -1e9, -1e9}; | |
| coord_t row = (rowIdx < N ? W[rowIdx] : invalidCoord); | |
| // "col" state that will be rotated around the warp | |
| int colIdx = colIdx0; | |
| coord_t col = (colIdx < N ? W[colIdx] : invalidCoord); |
| __device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) { | ||
| x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f); | ||
| y = t - (x * n - (x * (x - 1) >> 1)); | ||
| if (y < 0) { | ||
| x--; | ||
| y += (n - x); | ||
| } | ||
| y += x; | ||
| } |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idx2xy function lacks documentation explaining its algorithm and parameters. Add a comment describing that this maps a linear tile index to 2D coordinates in the upper triangle of a matrix, and explain the meaning of parameters n (number of blocks per dimension) and t (tile index).
| __device__ __forceinline__ double shfl(double v, int srcLane, unsigned mask = 0xffffffffu) { | ||
| int2 a = *reinterpret_cast<int2*>(&v); | ||
| a.x = __shfl_sync(mask, a.x, srcLane); | ||
| a.y = __shfl_sync(mask, a.y, srcLane); | ||
| return *reinterpret_cast<double*>(&a); | ||
| } | ||
|
|
||
| // Initialize sum arrays | ||
| #pragma unroll | ||
| for (int i = cur_thread_num; i < y_cal_num && block_y_left_begin + i < NY; i += thread_num) { | ||
| sum_row_x[i] = sum_row_x[y_cal_num + i] = 0; | ||
| sum_row_y[i] = sum_row_y[y_cal_num + i] = 0; | ||
| sum_row_z[i] = sum_row_z[y_cal_num + i] = 0; | ||
| } | ||
| __syncthreads(); | ||
| // Initialize column sums | ||
| double sum_col_x[2 * Block_x], sum_col_y[2 * Block_x], sum_col_z[2 * Block_x]; | ||
| #pragma unroll | ||
| for (int i = 0; i < Block_x; i++) { | ||
| sum_col_x[i] = sum_col_x[Block_x + i] = 0; | ||
| sum_col_y[i] = sum_col_y[Block_x + i] = 0; | ||
| sum_col_z[i] = sum_col_z[Block_x + i] = 0; | ||
| } | ||
| __device__ __forceinline__ coord_t shfl_coord(coord_t v, int srcLane, unsigned mask = 0xffffffffu) { | ||
| v.x = shfl(v.x, srcLane, mask); | ||
| v.y = shfl(v.y, srcLane, mask); | ||
| v.z = shfl(v.z, srcLane, mask); | ||
| return v; | ||
| } |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shfl and shfl_coord helper functions lack documentation. Add comments explaining that these functions enable warp shuffle operations for double and coord_t types respectively, which are essential for the warp-synchronous algorithm.
| @@ -1,5 +1,6 @@ | |||
| #include "cuda/include/CudaContext.cuh" | |||
| #include "cuda/include/CudaNonbondedWWForce.cuh" | |||
| #include <iostream> | |||
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The iostream header is included but not used in this file. This adds unnecessary compilation overhead and should be removed.
| #include <iostream> |
| evdw_sum += __shfl_down_sync(0xffffffffu, evdw_sum, offset); | ||
| ecoul_sum += __shfl_down_sync(0xffffffffu, ecoul_sum, offset); | ||
| } | ||
| // printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum); |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Commented-out printf statement should be removed before merging. Debug code should not remain in production code.
| // printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum); |
| p[i] = W[block_x_left_begin + i]; | ||
| p[x_cal_num + i] = W[N - (block_x_left_begin + i)]; | ||
| __device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) { | ||
| x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f); |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idx2xy function uses float precision (floorf, sqrtf) but operates on integer inputs. For large values of n and t, this could lead to precision loss and incorrect index calculations. Consider using double precision (floor, sqrt) or an integer-only algorithm for robustness.
| x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f); | |
| // Use double precision for the quadratic inversion to avoid precision loss for large n, t. | |
| double dn = static_cast<double>(n); | |
| double dt = static_cast<double>(t); | |
| double tmp = 2.0 * dn + 1.0; | |
| double disc = tmp * tmp - 8.0 * dt; | |
| double xd = floor((tmp - sqrt(disc)) * 0.5); | |
| x = static_cast<int>(xd); |
| // Map tile -> (x,y) with y>=x (upper triangle) | ||
| int x, y; | ||
| idx2xy((N + 31) >> 5, tile, x, y); // nBlocks = ceil(N/32) | ||
|
|
||
| // Optimized reduction using warp-level primitives | ||
| #pragma unroll | ||
| for (unsigned w = 16; w >= 1; w /= 2) { | ||
| ecoul_tot += __shfl_down_sync(0xffffffff, ecoul_tot, w); | ||
| evdw_tot += __shfl_down_sync(0xffffffff, evdw_tot, w); | ||
| } | ||
| const int baseX = x << 5; // x*32 | ||
| const int baseY = y << 5; // y*32 |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining the grid decomposition strategy. The calculation (N + 31) >> 5 computes the number of 32-atom blocks (ceiling division by 32), and idx2xy maps the linear tile index to upper-triangle coordinates. This is a key part of the algorithm and deserves documentation for maintainability.
|
|
||
| const int N_ITERATION_Y = 32; // Keep at 32 for better memory access pattern | ||
| const int N_ITERATION_X = 1; // Keep at 1 for better memory coalescing | ||
| const int thread_num = 128; |
Copilot
AI
Jan 6, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The magic number 128 for thread_num should be defined as a named constant (e.g., THREAD_BLOCK_SIZE) to improve code maintainability and make it clear that this value is related to warpsPerBlock calculation in the kernel (line 74).
No description provided.