CUDA errors during cross entropy loss in train_step #4

@vsuresh95

Thank you for sharing this code. Great work! I can reach the final step of the pipeline, where the matched segmentation masks are used to train the instance NeRF field. However, running main_nerf_mask.py fails with the error below at this line: https://github.com/zymk9/torch-ngp/blob/6be6af198f1092e8d75574727a030ae15e199fe8/nerf/utils.py#L1312. I have followed all the previous steps in the README.

  0% 0/161 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [192,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [193,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [194,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [3,0,0], thread: [195,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
.....
.....
.....
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [829,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [830,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
../aten/src/ATen/native/cuda/Loss.cu:176: nll_loss_forward_no_reduce_cuda_kernel: block: [1,0,0], thread: [831,0,0] Assertion `cur_target >= 0 && cur_target < n_classes` failed.
Traceback (most recent call last):
  File "main_nerf_mask.py", line 220, in <module>
    trainer.train(train_loader, valid_loader, max_epoch)
  File "/scratch/vv15/dec_5_instance_nerf/instance_nerf/nerf/utils.py", line 716, in train
    self.train_one_epoch(train_loader)
  File "/scratch/vv15/dec_5_instance_nerf/instance_nerf/nerf/utils.py", line 932, in train_one_epoch
    preds, truths, loss = self.train_step(data)
  File "/scratch/vv15/dec_5_instance_nerf/instance_nerf/nerf/utils.py", line 1320, in train_step
    loss = self.criterion(pred_masks_labeled, gt_masks_labeled) # [B*N], loss fn with reduction='none'
  File "/home/vv15/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vv15/.local/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/home/vv15/.local/lib/python3.8/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

When I train with zero loss (that is, with the cross-entropy lines commented out), NeRF training progresses without issue.

Before this error, I also found that num_instances was missing from the transforms.json file that came with the NeRF data; it is read at this point in the code: https://github.com/zymk9/torch-ngp/blob/6be6af198f1092e8d75574727a030ae15e199fe8/nerf/provider.py#L429. I therefore temporarily removed all references to num_instances, assuming it would fall back to the default value of 2 from the constructors. I am not sure whether this is the cause of the error.
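
The assertion in the log, `cur_target >= 0 && cur_target < n_classes`, fails when a target label lies outside the range of classes the loss expects, so a class count that is too small for the mask labels is one plausible trigger. The following is a minimal sketch of how that could be checked before the loss call. It is not the repository's code: the helper names `read_num_instances` and `out_of_range_labels` are hypothetical, the fallback of 2 is the assumption stated above, and the sample labels are made up for illustration.

```python
import json
from collections import Counter


def read_num_instances(transforms_path, default=2):
    """Return `num_instances` from a transforms.json-style file.

    Falls back to `default` when the key is absent (assumed fallback, see text).
    """
    with open(transforms_path) as f:
        meta = json.load(f)
    return int(meta.get("num_instances", default))


def out_of_range_labels(labels, n_classes):
    """Map each label violating 0 <= label < n_classes to its occurrence count."""
    counts = Counter(int(v) for v in labels)
    return {v: c for v, c in counts.items() if v < 0 or v >= n_classes}


# Made-up mask labels with ids 0..3, checked against a class count of 2.
bad = out_of_range_labels([0, 1, 1, 3, 2, 3], n_classes=2)
print(bad)  # a non-empty result means cross entropy would hit the assertion
```

With PyTorch tensors, the labels could be passed as `gt_masks_labeled.flatten().tolist()` just before the `self.criterion(...)` call shown in the traceback. A non-empty result would suggest that the class count is smaller than the number of instance ids in the masks.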

Please let me know if you have any solution for the CUDA error. Thanks in advance.
