CUDA strange behavior

Hi !
I’m trying to build an object detection model. I tested it on the CPU and everything worked, but when I set my device to cuda, everything broke. First I got RuntimeError: CUDA error: device-side assert triggered, caused by the line image[0].to(device) (where image is a valid tensor obtained from image, target = next(iter(dataloader))). I got past this error by setting CUDA_LAUNCH_BLOCKING=1 and restarting the kernel.
Then I ran imgs = [im.to(device) for im in image]. It works, but when I evaluate imgs on its own, I get RuntimeError: CUDA error: invalid argument
More specifically:

RuntimeError                              Traceback (most recent call last)
~/anaconda3/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    375                 if cls in self.type_pprinters:
    376                     # printer registered in self.type_pprinters
--> 377                     return self.type_pprinters[cls](obj, self, cycle)
    378                 else:
    379                     # deferred printer

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    553                 p.text(',')
    554                 p.breakable()
--> 555             p.pretty(x)
    556         if len(obj) == 1 and type(obj) is tuple:
    557             # Special case for 1-item tuples.

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

~/anaconda3/lib/python3.8/site-packages/torch/tensor.py in __repr__(self)
    191             return handle_torch_function(Tensor.__repr__, (self,), self)
    192         # All strings are unicode in Python 3.
--> 193         return torch._tensor_str._str(self)
    194 
    195     def backward(self, gradient=None, retain_graph=None, create_graph=False, inputs=None):

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in _str(self)
    381 def _str(self):
    382     with torch.no_grad():
--> 383         return _str_intern(self)

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in _str_intern(inp)
    356                     tensor_str = _tensor_str(self.to_dense(), indent)
    357                 else:
--> 358                     tensor_str = _tensor_str(self, indent)
    359 
    360     if self.layout != torch.strided:

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in _tensor_str(self, indent)
    240         return _tensor_str_with_formatter(self, indent, summarize, real_formatter, imag_formatter)
    241     else:
--> 242         formatter = _Formatter(get_summarized_data(self) if summarize else self)
    243         return _tensor_str_with_formatter(self, indent, summarize, formatter)
    244 

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in __init__(self, tensor)
     88 
     89         else:
---> 90             nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
     91 
     92             if nonzero_finite_vals.numel() == 0:

RuntimeError: CUDA error: invalid argument 

Then, when I ran this line:

targets = [{k: v.to(device) for k, v in tgt.items()} for tgt in target]

I got this error:

--------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-27-6d1a7f1a9662> in <module>
----> 1 targets = [{k: v.to(device) for k, v in tgt.items()} for tgt in target]

<ipython-input-27-6d1a7f1a9662> in <listcomp>(.0)
----> 1 targets = [{k: v.to(device) for k, v in tgt.items()} for tgt in target]

<ipython-input-27-6d1a7f1a9662> in <dictcomp>(.0)
----> 1 targets = [{k: v.to(device) for k, v in tgt.items()} for tgt in target]

RuntimeError: CUDA error: invalid argument

Does anyone have an idea how to fix it?
Thanks!

If you are using an older PyTorch version, could you update to the latest stable release (1.9.0) or the nightly one?
In case you are still hitting this issue, could you post an executable code snippet as well as the output of python -m torch.utils.collect_env here, please?

Hi, thanks for your answer!
I updated my PyTorch version. Here is the output of python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.2 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8 (64-bit runtime)
Python platform: Linux-5.8.0-53-generic-x86_64-with-glibc2.10
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 3060 Laptop GPU
Nvidia driver version: 460.73.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.20.2
[pip3] numpydoc==1.1.0
[pip3] torch==1.9.0
[pip3] torchaudio==0.8.0a0+e4e171a
[pip3] torchvision==0.9.1
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] mkl                       2021.2.0           h06a4308_296  
[conda] mkl-service               2.4.0            py38h497a2fe_0    conda-forge
[conda] mkl_fft                   1.3.0            py38h42c9631_2  
[conda] mkl_random                1.2.2            py38h1abd341_0    conda-forge
[conda] numpy                     1.18.5                   pypi_0    pypi
[conda] numpy-base                1.20.2           py38hfae3a4d_0  
[conda] numpydoc                  1.1.0                      py_1    conda-forge
[conda] torch                     1.9.0                    pypi_0    pypi
[conda] torchaudio                0.8.1                      py38    pytorch
[conda] torchvision               0.9.1                py38_cu111    pytorch

I still have an issue when I evaluate imgs as before, but the error has changed. I now get:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/anaconda3/lib/python3.8/site-packages/IPython/core/formatters.py in __call__(self, obj)
    700                 type_pprinters=self.type_printers,
    701                 deferred_pprinters=self.deferred_printers)
--> 702             printer.pretty(obj)
    703             printer.flush()
    704             return stream.getvalue()

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    375                 if cls in self.type_pprinters:
    376                     # printer registered in self.type_pprinters
--> 377                     return self.type_pprinters[cls](obj, self, cycle)
    378                 else:
    379                     # deferred printer

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in inner(obj, p, cycle)
    553                 p.text(',')
    554                 p.breakable()
--> 555             p.pretty(x)
    556         if len(obj) == 1 and type(obj) is tuple:
    557             # Special case for 1-item tuples.

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in pretty(self, obj)
    392                         if cls is not object \
    393                                 and callable(cls.__dict__.get('__repr__')):
--> 394                             return _repr_pprint(obj, self, cycle)
    395 
    396             return _default_pprint(obj, self, cycle)

~/anaconda3/lib/python3.8/site-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    698     """A pprint that just redirects to the normal repr function."""
    699     # Find newlines and replace them with p.break_()
--> 700     output = repr(obj)
    701     lines = output.splitlines()
    702     with p.group():

~/anaconda3/lib/python3.8/site-packages/torch/_tensor.py in __repr__(self)
    201             return handle_torch_function(Tensor.__repr__, (self,), self)
    202         # All strings are unicode in Python 3.
--> 203         return torch._tensor_str._str(self)
    204 
    205     def backward(self, gradient=None, retain_graph=None, create_graph=False, inputs=None):

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in _str(self)
    404 def _str(self):
    405     with torch.no_grad():
--> 406         return _str_intern(self)

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in _str_intern(inp)
    379                     tensor_str = _tensor_str(self.to_dense(), indent)
    380                 else:
--> 381                     tensor_str = _tensor_str(self, indent)
    382 
    383     if self.layout != torch.strided:

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in _tensor_str(self, indent)
    240         return _tensor_str_with_formatter(self, indent, summarize, real_formatter, imag_formatter)
    241     else:
--> 242         formatter = _Formatter(get_summarized_data(self) if summarize else self)
    243         return _tensor_str_with_formatter(self, indent, summarize, formatter)
    244 

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in get_summarized_data(self)
    274         return torch.stack([get_summarized_data(x) for x in (start + end)])
    275     else:
--> 276         return torch.stack([get_summarized_data(x) for x in self])
    277 
    278 def _str_intern(inp):

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in <listcomp>(.0)
    274         return torch.stack([get_summarized_data(x) for x in (start + end)])
    275     else:
--> 276         return torch.stack([get_summarized_data(x) for x in self])
    277 
    278 def _str_intern(inp):

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in get_summarized_data(self)
    272         end = ([self[i]
    273                for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))])
--> 274         return torch.stack([get_summarized_data(x) for x in (start + end)])
    275     else:
    276         return torch.stack([get_summarized_data(x) for x in self])

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in <listcomp>(.0)
    272         end = ([self[i]
    273                for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))])
--> 274         return torch.stack([get_summarized_data(x) for x in (start + end)])
    275     else:
    276         return torch.stack([get_summarized_data(x) for x in self])

~/anaconda3/lib/python3.8/site-packages/torch/_tensor_str.py in get_summarized_data(self)
    265     if dim == 1:
    266         if self.size(0) > 2 * PRINT_OPTS.edgeitems:
--> 267             return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
    268         else:
    269             return self

RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

I have already set CUDA_LAUNCH_BLOCKING=1, but it doesn’t change anything.

Thank you very much for your help !

Thanks for the update!
You are using an Ampere GPU (3060), which needs CUDA>=11, while you’ve installed the PyTorch wheel with the CUDA10.2 runtime: PyTorch version: 1.9.0+cu102.
The error message also shows this error:

CUDA error: no kernel image is available for execution on the device

Install the PyTorch pip wheel or conda binary with CUDA11.1 and it should work.
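A quick way to spot this kind of mismatch is to compare the GPU's compute capability with the architectures the installed binary was compiled for. The sketch below is only an illustration: `is_arch_supported` and `report_installed_build` are hypothetical helper names, and the compatibility rule is a simplification (an `sm_XY` binary is assumed to run on a device with the same major version and an equal or higher minor version, and a `compute_XY` PTX entry is assumed to be JIT-compilable for any equal or newer capability). The PyTorch calls used are `torch.version.cuda`, `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`.

```python
import re


def is_arch_supported(capability, arch_list):
    """Rough check: can a build compiled for `arch_list` run on a GPU with `capability`?

    capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3060.
    arch_list:  strings such as "sm_70" or "compute_80".
    """
    major, minor = capability
    for arch in arch_list:
        match = re.match(r"^(sm|compute)_(\d{2,})", arch)
        if match is None:
            continue  # skip entries this sketch does not understand
        kind, digits = match.groups()
        arch_major, arch_minor = int(digits[:-1]), int(digits[-1])
        if kind == "sm":
            # Native binaries: same major version, device minor >= build minor.
            if arch_major == major and arch_minor <= minor:
                return True
        elif (arch_major, arch_minor) <= (major, minor):
            # PTX can be JIT-compiled for an equal or newer GPU.
            return True
    return False


def report_installed_build():
    """Print what the installed PyTorch build targets (needs PyTorch and a visible GPU)."""
    import torch  # imported lazily so the helper above works without PyTorch

    capability = torch.cuda.get_device_capability(0)
    arch_list = torch.cuda.get_arch_list()
    print("PyTorch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
    print("GPU compute capability:", capability)
    print("Architectures in this build:", arch_list)
    print("Looks compatible:", is_arch_supported(capability, arch_list))
```

An RTX 3060 reports capability (8, 6), so a build whose architecture list contains nothing newer than `sm_70` fails this check, while a build that lists `sm_80` or `sm_86` passes it.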

Thank you very much, it worked for me !!

But I have another issue. I want to train my model on a GPU with 6 GB of memory, and my images have size [3, 1080, 1920]. I get a CUDA out of memory error with a batch size of 2, and sometimes also with a batch size of 1. I have now trained my model on the GPU and saved it. After rebooting my computer and loading the model, I just tried to run it in model.eval() mode on one image, and I get the same error, followed by a very similar one:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-20-78cfb42ac7f0> in <module>
----> 1 list_images, list_predictions = inference_gpu([25000], path_to_video = path_to_video_lab7, draw_bbox = True, threshold = 0.5)

<ipython-input-18-ff953f2b711e> in inference_gpu(list_path_inf, path_to_video, draw_bbox, threshold)
     17             data_transform = transforms.Compose([transforms.Resize((img_array.shape[0],img_array.shape[1] )), transforms.ToTensor()])
     18 
---> 19             image = data_transform(image).cuda() # gives a shape of torch.Size([3, 905, 662])
     20             model.cuda()
     21             model.eval()

RuntimeError: CUDA out of memory. Tried to allocate 48.00 MiB (GPU 0; 5.81 GiB total capacity; 4.44 GiB already allocated; 6.38 MiB free; 4.50 GiB reserved in total by PyTorch)
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-15-0c2804b316ac> in <module>
----> 1 list_images, list_predictions = inference_gpu([25000,18000], path_to_video = path_to_video_lab7, draw_bbox = True, threshold = 0.5)

<ipython-input-14-81c5908686f8> in inference_gpu(list_path_inf, path_to_video, draw_bbox, threshold)
     20             model.cuda()
     21             model.eval()
---> 22             out = model([image])
     23             list_images.append(image)
     24             list_predictions.append(out)

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py in forward(self, images, targets)
     94         if isinstance(features, torch.Tensor):
     95             features = OrderedDict([('0', features)])
---> 96         proposals, proposal_losses = self.rpn(images, features, targets)
     97         detections, detector_losses = self.roi_heads(features, proposals, images.image_sizes, targets)
     98         detections = self.transform.postprocess(detections, images.image_sizes, original_image_sizes)

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/rpn.py in forward(self, images, features, targets)
    341         # RPN uses all feature maps that are available
    342         features = list(features.values())
--> 343         objectness, pred_bbox_deltas = self.head(features)
    344         anchors = self.anchor_generator(images, features)
    345 

~/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1050                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1051             return forward_call(*input, **kwargs)
   1052         # Do not call functions when jit is used
   1053         full_backward_hooks, non_full_backward_hooks = [], []

~/anaconda3/lib/python3.8/site-packages/torchvision/models/detection/rpn.py in forward(self, x)
     55         bbox_reg = []
     56         for feature in x:
---> 57             t = F.relu(self.conv(feature))
     58             logits.append(self.cls_logits(t))
     59             bbox_reg.append(self.bbox_pred(t))

~/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py in relu(input, inplace)
   1296         result = torch.relu_(input)
   1297     else:
-> 1298         result = torch.relu(input)
   1299     return result
   1300 

RuntimeError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 5.81 GiB total capacity; 4.44 GiB already allocated; 6.38 MiB free; 4.50 GiB reserved in total by PyTorch)

Thanks for your help! Do you have a LinkedIn profile or a website where you publish interesting material?
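One thing that may help with the inference error: `model.eval()` only switches layers such as dropout and batch norm to evaluation mode; it does not stop autograd from recording the forward pass. Running the model under `torch.no_grad()` and moving results to the CPU before appending them to a list usually lowers GPU memory use. The sketch below is an assumption about how the forward pass in `inference_gpu` could be reorganized, not the original function; `detach_to_cpu` and `predict_one` are hypothetical names.

```python
def detach_to_cpu(obj):
    """Recursively move tensor-like objects in nested dicts/lists/tuples to the CPU.

    Anything with both .detach() and .cpu() methods is treated as a tensor
    (duck typing), so this helper itself does not import PyTorch.
    """
    if hasattr(obj, "detach") and hasattr(obj, "cpu"):
        return obj.detach().cpu()
    if isinstance(obj, dict):
        return {key: detach_to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [detach_to_cpu(item) for item in obj]
    if isinstance(obj, tuple):
        return tuple(detach_to_cpu(item) for item in obj)
    return obj


def predict_one(model, image, device="cuda"):
    """Run a detection model on one image tensor without building an autograd graph."""
    import torch  # imported lazily; requires PyTorch to be installed

    model.to(device)
    model.eval()
    with torch.no_grad():  # no activations are kept for a backward pass
        output = model([image.to(device)])
    # Return CPU copies so a list of results does not pin GPU memory.
    return detach_to_cpu(output)
```

In a notebook, memory held by an earlier failed cell can also stay allocated until the kernel is restarted, so it is worth checking `nvidia-smi` for leftover usage before concluding that the model does not fit.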

[23-09 17:29] Mitesh Vyas

Traceback (most recent call last):
  File "/home/hardik/Downloads/STRNN-master/train_torch.py", line 202, in <module>
    total_loss += run(batch_user, batch_td, batch_ld, batch_loc, batch_dst, step=1)
  File "/home/hardik/Downloads/STRNN-master/train_torch.py", line 183, in run
    J.backward()
  File "/home/hardik/Downloads/STRNN-master/myvenv/lib/python3.6/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/hardik/Downloads/STRNN-master/myvenv/lib/python3.6/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [0,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [1,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [2,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [3,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [4,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [5,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [6,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [7,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [8,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [9,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [10,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [11,0,0] Assertion srcIndex < srcSelectDimSize failed.
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:649: indexSelectSmallIndex: block: [0,0,0], thread: [12,0,0] Assertion srcIndex < srcSelectDimSize failed.

Please help me with this error

The stacktrace points to an invalid indexing operation. I don’t see which line of code fails, so you would have to get the full stacktrace by re-running the code either with CUDA_LAUNCH_BLOCKING=1 or on the CPU.
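The assertion `srcIndex < srcSelectDimSize` in the log comes from an index-select style lookup (an embedding lookup is a typical example) that received an index outside the table. A minimal sketch of a pre-check that could be run on the CPU before the lookup, assuming the indices are available as plain integers (for a tensor, something like `batch_loc.flatten().tolist()`) and that the table size is known. `check_index_range` and the sizes used below are hypothetical, and the log does not say which of the batch tensors is the culprit.

```python
def check_index_range(indices, table_size, name="indices"):
    """Raise a readable error if any index falls outside [0, table_size)."""
    bad = [i for i in indices if not 0 <= i < table_size]
    if bad:
        raise IndexError(
            f"{name}: {len(bad)} value(s) outside [0, {table_size}); "
            f"min={min(bad)}, max={max(bad)}"
        )


# Hypothetical example: a table with 5 rows accepts indices 0..4 only.
check_index_range([0, 3, 4], table_size=5, name="batch_loc")  # passes silently
```

A common cause of this assertion is a table that was sized with the number of distinct IDs while the IDs themselves start at 1 or contain gaps, so the largest ID is not a valid row.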

Okay, thank you, sir. I will try this out.

Sir, can you tell me where I have to write CUDA_LAUNCH_BLOCKING=1?

Where exactly do I have to put it in my code?

You should set it in the terminal via:

CUDA_LAUNCH_BLOCKING=1 python script.py args

and rerun the code.

You would have to replace script.py with the file name of your training script and args with the (optional) arguments passed to it.
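If launching from a terminal is inconvenient (for example in a Jupyter notebook), a commonly used alternative is to set the variable from Python. This is a sketch under the assumption that the assignment runs before CUDA is initialized, so it belongs at the very top of the script or in the first notebook cell after a kernel restart; setting it later in the session is not expected to take effect.

```python
import os

# Must run before the first CUDA call; placing it above `import torch` is the safe choice.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import PyTorch only after the variable is set
```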