perf(BEVFusion): faster bit-identical GPU voxelizer (training) #212
Max-Bin wants to merge 5 commits into
The deterministic C++ hard_voxelize is O(N^2) (point_to_voxelidx scans every earlier point) and effectively single-threaded (determin_voxel_num is launched <<<1, 1>>>). It costs ~70 ms per 250k-point frame and dominates the BEVFusion training data path.

Add voxelize_fast_gpu: a parallel sort + unique + scatter PyTorch path that reproduces the C++ deterministic op's output bit-for-bit (coords, num_points, per-voxel mean) at ~0.7 ms (~100x). Voxelization.forward uses it on GPU in deterministic mode and falls back to the C++ op on CPU, in non-deterministic mode, or when M > max_voxels. Unlike deterministic=False (fast but not reproducible), this is both fast and deterministic.

Deployment is unaffected: Autoware uses the spconv Point2VoxelGPU hash voxelizer, not this op.

test_voxelize_fast.py verifies bit-identical output on synthetic clouds, including dense clusters that trigger the max_num_points truncation.

Signed-off-by: Max-Bin <vborisw@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88f0e926bd
    in_range = ((points[:, :3] >= rmin) & (points[:, :3] < rmax)).all(dim=1)
    pts = points[in_range]
Reject coordinates outside the rounded grid
For configs where (range / voxel_size) is not an exact integer (or for points very close to the upper bound where division rounds up), this filter admits points solely because they are < rmax, while the CUDA dynamic_voxelize_kernel computes c_* and then drops any c_* >= grid_*. The fast path then emits coords such as x == grid_x when M <= max_voxels, which the old op would have discarded and which are outside the pcd_shape derived from the same rounded grid, so downstream sparse encoding can receive invalid voxels. Please apply the same coord < grid check after computing coord rather than using only the raw range comparison.
Fixed in 7fc9d22. The fast path now mirrors dynamic_voxelize_kernel exactly: it computes the voxel index for every point and keeps only 0 <= coord < grid (the complement of the kernel's c < 0 || c >= grid rejection), replacing the raw >= min & < max range filter. So for a non-integer range/voxel_size, or for upper-bound points whose index comes out as c == grid, the same points are now dropped and no out-of-grid voxel is emitted.
Verified bit-identical to the C++ op (coords / num_points / per-voxel mean, max_diff = 0.0) on a non-integer grid (10 / 0.3 → grid 33), exact upper-bound points (x == 10.0), and real frames. The fast output is always within the grid.
Per review (P2): the fast path used a raw range filter (>= min & < max), but the CUDA dynamic_voxelize_kernel computes the voxel index and drops points whose index is outside the rounded grid (c < 0 or c >= grid). For a non-integer range/voxel_size, or for upper-bound points whose index comes out as c == grid, the two diverged and the fast path could emit out-of-grid voxels the encoder would reject.

Now compute coord first and keep only 0 <= coord < grid, exactly like the C++ kernel. Verified bit-identical (coords / num_points / per-voxel mean) vs the C++ op on non-integer-grid + upper-edge synthetic clouds and on real frames; fast output is always within the grid.

Signed-off-by: Max-Bin <vborisw@gmail.com>
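The divergence described in this thread can be reproduced on a single axis with plain Python. This is only an illustrative sketch: `voxel_index_or_none` is a hypothetical helper, it uses double precision rather than the kernel's arithmetic, and the numbers are taken from the 10 / 0.3 → grid 33 case mentioned above.

```python
import math

def voxel_index_or_none(x, range_min, voxel_size, grid):
    """Return the voxel index along one axis, or None if it is outside the grid."""
    c = math.floor((x - range_min) / voxel_size)
    if c < 0 or c >= grid:  # same rejection rule the review asks for
        return None
    return c

range_min, range_max, voxel_size = 0.0, 10.0, 0.3
# Rounded grid size: 10 / 0.3 = 33.33..., which rounds to 33.
grid = math.floor((range_max - range_min) / voxel_size + 0.5)

x = 9.95
# The raw range filter admits the point because 9.95 < 10.0 ...
passes_raw_filter = range_min <= x < range_max
# ... but its index is floor(9.95 / 0.3) = 33, which equals grid, so it is rejected.
index = voxel_index_or_none(x, range_min, voxel_size, grid)
```

Here the raw filter keeps `x = 9.95` while the index check drops it, which is exactly the kind of point that would otherwise produce an out-of-grid voxel.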
Thanks @Max-Bin for the implementation. Do you think it is possible to migrate this implementation to autoware.universe as well?
@KSeangTan Since this is a Torch-based method, it is difficult to make it fully compatible with the existing C++ implementation.
Pull request overview
This PR introduces a new deterministic GPU voxelization path for BEVFusion training to replace the current slow deterministic CUDA implementation of hard_voxelize, while preserving output equivalence and falling back to the existing op when needed.
Changes:
- Added `voxelize_fast_gpu()` in `voxelize.py`, using stable sort + segment logic to deterministically keep the first `max_num_points` per voxel and return `(voxels, coors, num_points_per_voxel)`.
- Updated `Voxelization.forward()` to use the fast path when `deterministic=True` and inputs are on CUDA, with a fallback to the existing C++ op when `M > max_voxels`.
- Added a CUDA-side equivalence + benchmark script `test_voxelize_fast.py`.

Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| projects/BEVFusion/bevfusion/ops/voxel/voxelize.py | Adds the fast deterministic GPU voxelizer and wires it into Voxelization.forward() with safe fallback behavior. |
| projects/BEVFusion/bevfusion/ops/voxel/test_voxelize_fast.py | Adds a verification/benchmark script comparing the new GPU path against the compiled deterministic C++ op. |
KSeangTan left a comment
LGTM overall, would you mind addressing the comments? Thanks
… test

Address review feedback on the fast GPU voxelizer:
- voxelize_fast_gpu: compute the grid dims on the host from the config instead of via a CUDA tensor, removing the per-forward device->host .item() sync on this hot path (Copilot). floor(x+0.5) mirrors the C++ round() exactly; grid values verified identical to the old computation across configs, and the op stays bit-identical on 8 real frames (coords / num_points / full voxel tensors, max_abs_diff = 0).
- Comments: clarify that the z-major linear key is an internal sort/group key only and the output coords are (x, y, z) (matching this repo's compiled hard_voxelize), and why M > max_voxels falls back to the C++ op rather than truncating here (preserves bit-identity) (KSeangTan).
- test: convert to unittest (auto-skips without CUDA) and compare the full per-voxel tensors instead of only per-voxel means, so kept point-set and slot-order differences are caught, not just permutation-invariant means (Copilot).

Signed-off-by: Max-Bin <vborisw@gmail.com>
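A host-side grid computation along the lines of the commit above might look like the following. This is a sketch under assumptions: `host_grid_size` and its argument layout (`[x_min, y_min, z_min, x_max, y_max, z_max]`) are hypothetical, not the PR's code. It also shows why `floor(x + 0.5)` is used rather than Python's built-in `round()`, which rounds halves to the nearest even integer.

```python
import math

def host_grid_size(point_cloud_range, voxel_size):
    """Grid dims computed with plain Python floats, so no device->host sync is needed."""
    mins, maxs = point_cloud_range[:3], point_cloud_range[3:]
    # floor(x + 0.5) rounds halves up, matching C round() for the
    # non-negative values that occur here.
    return [
        math.floor((hi - lo) / vs + 0.5)
        for lo, hi, vs in zip(mins, maxs, voxel_size)
    ]

grid = host_grid_size([0.0, 0.0, 0.0, 10.0, 10.0, 2.5], [0.3, 0.5, 1.0])
# x: 10 / 0.3 = 33.33 -> 33, y: 10 / 0.5 = 20 -> 20, z: 2.5 / 1.0 = 2.5 -> 3

# Python's round() uses banker's rounding, so it would give 2 for the z axis.
python_round_z = round(2.5)
```

Because the inputs come from the config, the result can be computed once per forward call (or cached) without touching a CUDA tensor.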
Summary
This PR adds a fast GPU path for deterministic BEVFusion hard voxelization.
In training, the current deterministic `hard_voxelize` is a major bottleneck, taking around 70 ms per 250k-point LiDAR frame. The new `voxelize_fast_gpu` keeps deterministic behavior while reducing runtime to about 0.7 ms/frame on H100, roughly 100× faster.
The change is backward-compatible and falls back to the existing C++ op on CPU, in non-deterministic mode, or when the voxel count exceeds `max_voxels`.
Why
The existing deterministic CUDA path is slow mainly because of its algorithm:
- `point_to_voxelidx_kernel` scans previous points to find each point's rank inside its voxel.
- `determin_voxel_num` runs with a single CUDA thread and loops over all points.

For large point clouds, this makes voxelization one of the biggest costs in BEVFusion training.
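The cost of the "scan previous points" pattern can be illustrated in plain Python. This is not the kernel's code: both functions are hypothetical stand-ins that only show why computing each point's rank inside its voxel by rescanning earlier points is quadratic, and that the same ranks can be obtained in a single pass.

```python
def rank_in_voxel_quadratic(coords):
    """For each point, count earlier points with the same voxel coord: O(N^2)."""
    ranks = []
    for i, c in enumerate(coords):
        ranks.append(sum(1 for j in range(i) if coords[j] == c))
    return ranks

def rank_in_voxel_linear(coords):
    """Same ranks with a running per-voxel counter: O(N)."""
    seen = {}
    ranks = []
    for c in coords:
        ranks.append(seen.get(c, 0))
        seen[c] = seen.get(c, 0) + 1
    return ranks

coords = [(0, 0, 0), (1, 0, 0), (0, 0, 0), (0, 0, 0), (1, 0, 0)]
quadratic = rank_in_voxel_quadratic(coords)
linear = rank_in_voxel_linear(coords)
```

Both calls return the same ranks for this input; only the amount of work differs, and that difference is what grows with 250k-point frames.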
Using `deterministic=False` is faster, but it is not reproducible because atomic ordering can change which points are kept when a voxel is full. Many training setups therefore keep `deterministic=True` and pay the extra cost.
What changed
`voxelize_fast_gpu` reproduces the deterministic behavior with parallel PyTorch operations (sort + unique + scatter), keeping the first `max_num_points` points per voxel.
The voxel row order is different from the C++ op, but this does not affect BEVFusion because downstream sparse encoding uses voxel coordinates. The final BEV output was verified to be identical.
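As a rough CPU reference for the idea (stable sort by a linear voxel key, group equal keys, keep the first `max_num_points` points in input order), a pure-Python sketch could look like this. It is not the PR's PyTorch code: `voxelize_reference` is a hypothetical name, the z-major key follows the description in the commit message, and unlike a tensor output it does not zero-pad voxels.

```python
import math

def voxelize_reference(points, voxel_size, range_min, grid, max_num_points):
    """Group points into voxels deterministically; returns (voxels, coords, num_points)."""
    keyed = []
    for idx, p in enumerate(points):
        c = [math.floor((p[d] - range_min[d]) / voxel_size[d]) for d in range(3)]
        if all(0 <= c[d] < grid[d] for d in range(3)):
            # z-major linear key, used only for sorting and grouping.
            key = (c[2] * grid[1] + c[1]) * grid[0] + c[0]
            keyed.append((key, idx))
    # Python's sort is stable, so points with equal keys stay in input order.
    keyed.sort(key=lambda t: t[0])

    voxels, coords, num_points = [], [], []
    prev_key = None
    for key, idx in keyed:
        if key != prev_key:  # first point of a new voxel
            x = key % grid[0]
            y = (key // grid[0]) % grid[1]
            z = key // (grid[0] * grid[1])
            coords.append((x, y, z))  # output coords are (x, y, z)
            voxels.append([])
            num_points.append(0)
            prev_key = key
        if num_points[-1] < max_num_points:  # keep only the first points
            voxels[-1].append(points[idx])
            num_points[-1] += 1
    return voxels, coords, num_points

points = [(0.1, 0.1, 0.1), (1.2, 0.1, 0.1), (0.2, 0.2, 0.2),
          (0.3, 0.3, 0.3), (5.0, 0.0, 0.0)]
voxels, coords, num_points = voxelize_reference(
    points, voxel_size=(1.0, 1.0, 1.0), range_min=(0.0, 0.0, 0.0),
    grid=(2, 2, 2), max_num_points=2)
```

In this toy input the last point falls outside the grid and is dropped, and the third point that lands in voxel (0, 0, 0) is truncated because the voxel is already full.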
Verification
Verified against the freshly compiled C++ `hard_voxelize(deterministic=True)` op from this repo on 8 real 120 m LiDAR frames with around 250k points and 103k voxels per frame.
The following matched exactly:
- `HardSimpleVFE`
- `extract_feat` output

Result:
Speed
Benchmark setup: single H100, torch 2.12 + cu130, freshly JIT-compiled op, 5 warm-up runs + 30 measured runs, voxelization only.
In a downstream E2E trainer, this reduced per-step time from 2.61 s to 0.38 s, about 6.9× faster.
Test
    cd projects/BEVFusion
    python -m bevfusion.ops.voxel.test_voxelize_fast

The test checks output equivalence and prints benchmark numbers for the current GPU/data.
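Because the fast path and the C++ op may emit voxels in a different row order, an equivalence check has to align rows before comparing. The following is a minimal sketch of that idea on plain Python lists, not the actual test: `canonicalize` and `outputs_match` are hypothetical helpers, and the real script compares tensors.

```python
def canonicalize(voxels, coords, num_points):
    """Reorder all three outputs by voxel coordinate so row order no longer matters."""
    order = sorted(range(len(coords)), key=lambda i: coords[i])
    return (
        [voxels[i] for i in order],
        [coords[i] for i in order],
        [num_points[i] for i in order],
    )

def outputs_match(a, b):
    """Exact comparison of full per-voxel contents after aligning rows."""
    return canonicalize(*a) == canonicalize(*b)

p0, p1, p2 = (0.1, 0.1, 0.1), (1.2, 0.1, 0.1), (0.2, 0.2, 0.2)
reference = ([[p0, p2], [p1]], [(0, 0, 0), (1, 0, 0)], [2, 1])
# Same voxels, emitted in a different row order.
permuted = ([[p1], [p0, p2]], [(1, 0, 0), (0, 0, 0)], [1, 2])
# Same point set, but the slots inside the first voxel are swapped.
slot_swapped = ([[p2, p0], [p1]], [(0, 0, 0), (1, 0, 0)], [2, 1])

same_up_to_row_order = outputs_match(reference, permuted)
slot_order_detected = not outputs_match(reference, slot_swapped)
```

Comparing full per-voxel contents, rather than per-voxel means, is what lets the check flag the slot-swapped case, since a mean is unchanged by reordering points inside a voxel.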