Skip to content

Conversation

@goodstudyqaq
Copy link
Collaborator

No description provided.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the CUDA water-water nonbonded force calculation kernel to use a warp-synchronous algorithm inspired by OpenMM. The new implementation replaces the previous block-based approach with a more efficient warp-level computation that leverages shuffle operations for better performance.

Key Changes:

  • Replaced template-based block computation with warp-synchronous tile-based algorithm
  • Introduced helper functions for index mapping (idx2xy) and warp shuffle operations for double/coord_t types
  • Simplified grid/block configuration from 2D to 1D with tile-based work distribution

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +203 to +206




Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multiple blank lines should be removed. This appears to be accidental whitespace that reduces code readability.

Suggested change

Copilot uses AI. Check for mistakes.



// printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT);
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out printf statement should be removed before merging. Debug code should not remain in production code.

Suggested change
// printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT);

Copilot uses AI. Check for mistakes.
Comment on lines +91 to +96
coord_t invalid = {-1e9, -1e9, -1e9};
coord_t row = (rowIdx < N ? W[rowIdx] : invalid);

// "col" state that will be rotated around the warp
int colIdx = colIdx0;
coord_t col = (colIdx < N ? W[colIdx] : invalid);
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Variable name 'invalid' is vague and could be more descriptive. Consider renaming to 'invalidCoord' or 'outOfBoundsMarker' to clarify its purpose as a sentinel value for out-of-range atoms.

Suggested change
coord_t invalid = {-1e9, -1e9, -1e9};
coord_t row = (rowIdx < N ? W[rowIdx] : invalid);
// "col" state that will be rotated around the warp
int colIdx = colIdx0;
coord_t col = (colIdx < N ? W[colIdx] : invalid);
coord_t invalidCoord = {-1e9, -1e9, -1e9};
coord_t row = (rowIdx < N ? W[rowIdx] : invalidCoord);
// "col" state that will be rotated around the warp
int colIdx = colIdx0;
coord_t col = (colIdx < N ? W[colIdx] : invalidCoord);

Copilot uses AI. Check for mistakes.
Comment on lines +46 to +54
__device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) {
x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
y = t - (x * n - (x * (x - 1) >> 1));
if (y < 0) {
x--;
y += (n - x);
}
y += x;
}
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idx2xy function lacks documentation explaining its algorithm and parameters. Add a comment describing that this maps a linear tile index to 2D coordinates in the upper triangle of a matrix, and explain the meaning of parameters n (number of blocks per dimension) and t (tile index).

Copilot uses AI. Check for mistakes.
Comment on lines +56 to +68
__device__ __forceinline__ double shfl(double v, int srcLane, unsigned mask = 0xffffffffu) {
int2 a = *reinterpret_cast<int2*>(&v);
a.x = __shfl_sync(mask, a.x, srcLane);
a.y = __shfl_sync(mask, a.y, srcLane);
return *reinterpret_cast<double*>(&a);
}

// Initialize sum arrays
#pragma unroll
for (int i = cur_thread_num; i < y_cal_num && block_y_left_begin + i < NY; i += thread_num) {
sum_row_x[i] = sum_row_x[y_cal_num + i] = 0;
sum_row_y[i] = sum_row_y[y_cal_num + i] = 0;
sum_row_z[i] = sum_row_z[y_cal_num + i] = 0;
}
__syncthreads();
// Initialize column sums
double sum_col_x[2 * Block_x], sum_col_y[2 * Block_x], sum_col_z[2 * Block_x];
#pragma unroll
for (int i = 0; i < Block_x; i++) {
sum_col_x[i] = sum_col_x[Block_x + i] = 0;
sum_col_y[i] = sum_col_y[Block_x + i] = 0;
sum_col_z[i] = sum_col_z[Block_x + i] = 0;
}
__device__ __forceinline__ coord_t shfl_coord(coord_t v, int srcLane, unsigned mask = 0xffffffffu) {
v.x = shfl(v.x, srcLane, mask);
v.y = shfl(v.y, srcLane, mask);
v.z = shfl(v.z, srcLane, mask);
return v;
}
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The shfl and shfl_coord helper functions lack documentation. Add comments explaining that these functions enable warp shuffle operations for double and coord_t types respectively, which are essential for the warp-synchronous algorithm.

Copilot uses AI. Check for mistakes.
@@ -1,5 +1,6 @@
#include "cuda/include/CudaContext.cuh"
#include "cuda/include/CudaNonbondedWWForce.cuh"
#include <iostream>
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iostream header is included but not used in this file. This adds unnecessary compilation overhead and should be removed.

Suggested change
#include <iostream>

Copilot uses AI. Check for mistakes.
evdw_sum += __shfl_down_sync(0xffffffffu, evdw_sum, offset);
ecoul_sum += __shfl_down_sync(0xffffffffu, ecoul_sum, offset);
}
// printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum);
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Commented-out printf statement should be removed before merging. Debug code should not remain in production code.

Suggested change
// printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum);

Copilot uses AI. Check for mistakes.
p[i] = W[block_x_left_begin + i];
p[x_cal_num + i] = W[N - (block_x_left_begin + i)];
__device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) {
x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idx2xy function uses float precision (floorf, sqrtf) but operates on integer inputs. For large values of n and t, this could lead to precision loss and incorrect index calculations. Consider using double precision (floor, sqrt) or an integer-only algorithm for robustness.

Suggested change
x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
// Use double precision for the quadratic inversion to avoid precision loss for large n, t.
double dn = static_cast<double>(n);
double dt = static_cast<double>(t);
double tmp = 2.0 * dn + 1.0;
double disc = tmp * tmp - 8.0 * dt;
double xd = floor((tmp - sqrt(disc)) * 0.5);
x = static_cast<int>(xd);

Copilot uses AI. Check for mistakes.
Comment on lines +81 to +86
// Map tile -> (x,y) with y>=x (upper triangle)
int x, y;
idx2xy((N + 31) >> 5, tile, x, y); // nBlocks = ceil(N/32)

// Optimized reduction using warp-level primitives
#pragma unroll
for (unsigned w = 16; w >= 1; w /= 2) {
ecoul_tot += __shfl_down_sync(0xffffffff, ecoul_tot, w);
evdw_tot += __shfl_down_sync(0xffffffff, evdw_tot, w);
}
const int baseX = x << 5; // x*32
const int baseY = y << 5; // y*32
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining the grid decomposition strategy. The calculation (N + 31) >> 5 computes the number of 32-atom blocks (ceiling division by 32), and idx2xy maps the linear tile index to upper-triangle coordinates. This is a key part of the algorithm and deserves documentation for maintainability.

Copilot uses AI. Check for mistakes.

const int N_ITERATION_Y = 32; // Keep at 32 for better memory access pattern
const int N_ITERATION_X = 1; // Keep at 1 for better memory coalescing
const int thread_num = 128;
Copy link

Copilot AI Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The magic number 128 for thread_num should be defined as a named constant (e.g., THREAD_BLOCK_SIZE) to improve code maintainability and make it clear that this value is related to warpsPerBlock calculation in the kernel (line 74).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants