Feature/qgpu optimize nonbonded ww using openmm method #68

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

goodstudyqaq wants to merge 2 commits into feature/qgpu from feature/qgpu_optimize_nonbonded_ww_using_openmm_method

Collaborator

goodstudyqaq commented Jan 5, 2026

No description provided.

goodstudyqwq added 2 commits

January 5, 2026 14:45


          use openmm method to optimize nonbonded ww force

7352a1f


          remove debug log

ff4d1fb

goodstudyqaq requested a review from Copilot

January 6, 2026 09:14

Copilot started reviewing on behalf of goodstudyqaq

January 6, 2026 09:15

Copilot AI reviewed

View reviewed changes

Copilot AI left a comment

Pull request overview

This PR refactors the CUDA water-water nonbonded force calculation kernel to use a warp-synchronous algorithm inspired by OpenMM. The new implementation replaces the previous block-based approach with a more efficient warp-level computation that leverages shuffle operations for better performance.

Key Changes:

Replaced template-based block computation with warp-synchronous tile-based algorithm
Introduced helper functions for index mapping (idx2xy) and warp shuffle operations for double/coord_t types
Simplified grid/block configuration from 2D to 1D with tile-based work distribution

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

src/core/cuda/src/CudaNonbondedWWForce.cu

Comment on lines +203 to +206

Copilot AI Jan 6, 2026

Multiple blank lines should be removed. This appears to be accidental whitespace that reduces code readability.

Suggested change

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu




		// printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT);

Copilot AI Jan 6, 2026

Commented-out printf statement should be removed before merging. Debug code should not remain in production code.

Suggested change

// printf("WW E_vdw: %f, E_coul: %f\n", WW_evdw_TOT, WW_ecoul_TOT);

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

Comment on lines +91 to +96

    
                  coord_t invalid = {-1e9, -1e9, -1e9};

                  coord_t row = (rowIdx < N ? W[rowIdx] : invalid);

                  // "col" state that will be rotated around the warp

                  int colIdx = colIdx0;

                  coord_t col = (colIdx < N ? W[colIdx] : invalid);

Copilot AI Jan 6, 2026

Variable name 'invalid' is vague and could be more descriptive. Consider renaming to 'invalidCoord' or 'outOfBoundsMarker' to clarify its purpose as a sentinel value for out-of-range atoms.

Suggested change

      
                coord_t invalid = {-1e9, -1e9, -1e9};
          
                coord_t row = (rowIdx < N ? W[rowIdx] : invalid);
          
                // "col" state that will be rotated around the warp
          
                int colIdx = colIdx0;
          
                coord_t col = (colIdx < N ? W[colIdx] : invalid);
          
                coord_t invalidCoord = {-1e9, -1e9, -1e9};
          
                coord_t row = (rowIdx < N ? W[rowIdx] : invalidCoord);
          
                // "col" state that will be rotated around the warp
          
                int colIdx = colIdx0;
          
                coord_t col = (colIdx < N ? W[colIdx] : invalidCoord);

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

Comment on lines +46 to +54

    
              __device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) {

                  x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);

                  y = t - (x * n - (x * (x - 1) >> 1));

                  if (y < 0) {

                      x--;

                      y += (n - x);

                  }

                  y += x;

              }

Copilot AI Jan 6, 2026

The idx2xy function lacks documentation explaining its algorithm and parameters. Add a comment describing that this maps a linear tile index to 2D coordinates in the upper triangle of a matrix, and explain the meaning of parameters n (number of blocks per dimension) and t (tile index).

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

Comment on lines +56 to +68

    
              __device__ __forceinline__ double shfl(double v, int srcLane, unsigned mask = 0xffffffffu) {

                  int2 a = *reinterpret_cast<int2*>(&v);

                  a.x = __shfl_sync(mask, a.x, srcLane);

                  a.y = __shfl_sync(mask, a.y, srcLane);

                  return *reinterpret_cast<double*>(&a);

              }

              // Initialize sum arrays

              #pragma unroll

                  for (int i = cur_thread_num; i < y_cal_num && block_y_left_begin + i < NY; i += thread_num) {

                      sum_row_x[i] = sum_row_x[y_cal_num + i] = 0;

                      sum_row_y[i] = sum_row_y[y_cal_num + i] = 0;

                      sum_row_z[i] = sum_row_z[y_cal_num + i] = 0;

                  }

                  __syncthreads();

                  // Initialize column sums

                  double sum_col_x[2 * Block_x], sum_col_y[2 * Block_x], sum_col_z[2 * Block_x];

              #pragma unroll

                  for (int i = 0; i < Block_x; i++) {

                      sum_col_x[i] = sum_col_x[Block_x + i] = 0;

                      sum_col_y[i] = sum_col_y[Block_x + i] = 0;

                      sum_col_z[i] = sum_col_z[Block_x + i] = 0;

                  }

              __device__ __forceinline__ coord_t shfl_coord(coord_t v, int srcLane, unsigned mask = 0xffffffffu) {

                  v.x = shfl(v.x, srcLane, mask);

                  v.y = shfl(v.y, srcLane, mask);

                  v.z = shfl(v.z, srcLane, mask);

                  return v;

              }

Copilot AI Jan 6, 2026

The shfl and shfl_coord helper functions lack documentation. Add comments explaining that these functions enable warp shuffle operations for double and coord_t types respectively, which are essential for the warp-synchronous algorithm.

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

    
            @@ -1,5 +1,6 @@
          
              #include "cuda/include/CudaContext.cuh"

              #include "cuda/include/CudaNonbondedWWForce.cuh"

              #include <iostream>

Copilot AI Jan 6, 2026

The iostream header is included but not used in this file. This adds unnecessary compilation overhead and should be removed.

Suggested change

#include <iostream>

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

    
                      evdw_sum += __shfl_down_sync(0xffffffffu, evdw_sum, offset);

                      ecoul_sum += __shfl_down_sync(0xffffffffu, ecoul_sum, offset);

                  }

                  // printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum);

Copilot AI Jan 6, 2026

Commented-out printf statement should be removed before merging. Debug code should not remain in production code.

Suggested change

// printf("evdw_sum: %f, ecoul_sum: %f\n", evdw_sum, ecoul_sum);

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

    
                      p[i] = W[block_x_left_begin + i];

                      p[x_cal_num + i] = W[N - (block_x_left_begin + i)];

              __device__ __forceinline__ void idx2xy(int n, int t, int& x, int& y) {

                  x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);

Copilot AI Jan 6, 2026

The idx2xy function uses float precision (floorf, sqrtf) but operates on integer inputs. For large values of n and t, this could lead to precision loss and incorrect index calculations. Consider using double precision (floor, sqrt) or an integer-only algorithm for robustness.

Suggested change

      
                x = (int)floorf((2 * n + 1 - sqrtf((2 * n + 1) * (2 * n + 1) - 8 * t)) * 0.5f);
          
                // Use double precision for the quadratic inversion to avoid precision loss for large n, t.
          
                double dn = static_cast<double>(n);
          
                double dt = static_cast<double>(t);
          
                double tmp = 2.0 * dn + 1.0;
          
                double disc = tmp * tmp - 8.0 * dt;
          
                double xd = floor((tmp - sqrt(disc)) * 0.5);
          
                x = static_cast<int>(xd);

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

Comment on lines +81 to +86

    
                  // Map tile -> (x,y) with y>=x (upper triangle)

                  int x, y;

                  idx2xy((N + 31) >> 5, tile, x, y);  // nBlocks = ceil(N/32)

              // Optimized reduction using warp-level primitives

              #pragma unroll

                  for (unsigned w = 16; w >= 1; w /= 2) {

                      ecoul_tot += __shfl_down_sync(0xffffffff, ecoul_tot, w);

                      evdw_tot += __shfl_down_sync(0xffffffff, evdw_tot, w);

                  }

                  const int baseX = x << 5;  // x*32

                  const int baseY = y << 5;  // y*32

Copilot AI Jan 6, 2026

Add a comment explaining the grid decomposition strategy. The calculation (N + 31) >> 5 computes the number of 32-atom blocks (ceiling division by 32), and idx2xy maps the linear tile index to upper-triangle coordinates. This is a key part of the algorithm and deserves documentation for maintainability.

Copilot uses AI. Check for mistakes.

src/core/cuda/src/CudaNonbondedWWForce.cu

    
                  const int N_ITERATION_Y = 32;  // Keep at 32 for better memory access pattern

                  const int N_ITERATION_X = 1;   // Keep at 1 for better memory coalescing

                  const int thread_num = 128;

Copilot AI Jan 6, 2026

The magic number 128 for thread_num should be defined as a named constant (e.g., THREAD_BLOCK_SIZE) to improve code maintainability and make it clear that this value is related to warpsPerBlock calculation in the kernel (line 74).

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet