CUDA errors during cross entropy loss in train_step #4

Thank you for sharing this code. Great work! I can reach the final step of the pipeline, where the matched segmentation masks are used to train the instance NeRF field. However, running `main_nerf_mask.py` fails at this point: https://github.com/zymk9/torch-ngp/blob/6be6af198f1092e8d75574727a030ae15e199fe8/nerf/utils.py#L1312. I have followed all the previous steps in the README.

```
  0% 0/161 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [192,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [193,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [194,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [195,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
.....
.....
.....
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [829,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [830,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [831,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "main_nerf_mask.py", line 220, in <module>
    trainer.train(train_loader, valid_loader, max_epoch)
  File "/scratch/vv15/dec_5_instance_nerf/instance_nerf/nerf/utils.py", line 716, in train
    self.train_one_epoch(train_loader)
  File "/scratch/vv15/dec_5_instance_nerf/instance_nerf/nerf/utils.py", line 932, in train_one_epoch
    preds, truths, loss = self.train_step(data)
  File "/scratch/vv15/dec_5_instance_nerf/instance_nerf/nerf/utils.py", line 1320, in train_step
    loss = self.criterion(pred_masks_labeled, gt_masks_labeled) # [B*N], loss fn with reduction='none'
  File "/home/vv15/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vv15/.local/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/vv15/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

```
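The assertion in the log, `cur_target >= 0 && cur_target < n_classes`, fires when a target label lies outside the range of classes the loss expects. Below is a minimal sketch of a check that could be run on the mask labels before the loss call to see which values are out of range. The helper name `out_of_range_labels` and the example values are hypothetical, not part of the repository.

```python
def out_of_range_labels(labels, n_classes):
    """Return the sorted distinct labels that fall outside [0, n_classes)."""
    bad = {int(v) for v in labels if not 0 <= int(v) < n_classes}
    return sorted(bad)


# Hypothetical flattened mask labels; with n_classes = 2 only 0 and 1 are valid.
example_labels = [0, 1, 1, 0, 2, 5, 2]
print(out_of_range_labels(example_labels, n_classes=2))  # prints [2, 5]
```

In `train_step` this could be called as `out_of_range_labels(gt_masks_labeled.flatten().tolist(), pred_masks_labeled.shape[-1])` right before `self.criterion(...)`, assuming `pred_masks_labeled` has shape `[B*N, n_classes]`. A non-empty result would mean the masks contain more instance IDs than the model has output classes.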

When I train with zero loss (that is, with the cross-entropy lines commented out), NeRF training proceeds without issue.

Before this error, I also found that `num_instances` was missing from the `transforms.json` file that came with the NeRF data, at this point in the code: https://github.com/zymk9/torch-ngp/blob/6be6af198f1092e8d75574727a030ae15e199fe8/nerf/provider.py#L429. I therefore temporarily removed all references to `num_instances`, assuming it would take the default value of 2 from the constructors. I am not sure whether this could be the cause here.
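As an alternative to deleting the references, the missing key could be handled with an explicit fallback, so it is visible when the default is being used. This is only a sketch: the function name `read_num_instances` is hypothetical, and the fallback of 2 is my assumption about the constructor default, not something confirmed from the code.

```python
import json


def read_num_instances(transforms_path, default=2):
    """Read `num_instances` from a transforms.json file, falling back to `default`."""
    with open(transforms_path, "r", encoding="utf-8") as f:
        transforms = json.load(f)
    if "num_instances" not in transforms:
        # Assumed fallback; it must still be larger than every label in the masks.
        print(f"num_instances missing in {transforms_path}, using default={default}")
        return default
    return int(transforms["num_instances"])
```

If the masks contain labels at or above the returned value, the cross entropy loss would still see targets outside its class range.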

Please let me know if you have any solution for the CUDA error. Thanks in advance.
