CUDA error: device-side assert triggered after a number of epochs

My model was running fine, but after 89 epochs I got this error:

Traceback (most recent call last):
  File "main.py", line 36, in <module>
    main()
  File "main.py", line 32, in main
    method.method(settings, job_id)
  File "/scratch/project_2002806/tranan11/AC-hybrid-transformer/processes/method.py", line 574, in method
    nb_classes=nb_classes)
  File "/scratch/project_2002806/tranan11/AC-hybrid-transformer/processes/method.py", line 328, in _do_training
    optimizer=None)
  File "/scratch/project_2002806/tranan11/AC-hybrid-transformer/tools/model.py", line 156, in module_epoch_passing
    y_hat, y, f_names_tmp = module_forward_passing(example, module, use_y)
  File "/scratch/project_2002806/tranan11/AC-hybrid-transformer/tools/model.py", line 217, in module_forward_passing
    return module(x, None), y, f_names
  File "/appl/soft/ai/miniconda3/envs/pytorch-1.3.1-1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/.../models/wavenet_rnn.py", line 94, in forward
    return self._inference(x)
  File "/.../models/wavenet_rnn.py", line 139, in _inference
    self.max_length,
  File "/.../modules/decode_utils.py", line 22, in greedy_decode
    attention_mask = None
  File "/appl/soft/ai/miniconda3/envs/pytorch-1.3.1-1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/..../AC-hybrid-transformer/modules/transformer.py", line 31, in forward
    s_mask = subsequent_mask(word_embed.size(0)).to(device)
RuntimeError: CUDA error: device-side assert triggered

This is the code where the error is triggered:

import torch

def subsequent_mask(sz):
    # Upper-triangular causal mask: each position may only attend to itself and earlier positions.
    mask = (torch.triu(torch.ones(sz, sz)) == 1).transpose(0, 1).float()
    # Future positions get -inf, allowed positions get 0.
    mask = mask.masked_fill(mask == 0, float('-inf')).masked_fill(mask == 1, float(0.0))
    return mask
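
For reference, a quick sanity check of what this function should produce (a minimal example run on the CPU):

print(subsequent_mask(3))
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])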

Hi,

Did you get any prints (in stderr) before the error that would hint at what the issue is?

Also, you can rerun your code with the CUDA_LAUNCH_BLOCKING=1 environment variable set to make sure that the line shown by the stack trace is the relevant one (the CUDA API is asynchronous by default, so right now it most likely points to the wrong line).
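
For example, one way to set it (a minimal sketch; the variable must be set before the first CUDA call, so doing it before importing torch is safest, or set it on the command line when launching the script):

import os
# Must be set before any CUDA work happens, ideally before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch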

Hi, the code was run with CUDA_LAUNCH_BLOCKING=1. It worked normally until epoch 90 in my case, which is weird…

It worked normally until epoch 90 in my case, which is weird…

Actually this is quite common, as these asserts often come from indexing out of bounds.
So maybe a label in your dataset is incorrect. Or you use a learned value to index something, and that ends up producing an out-of-bounds value.
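
For example, a quick (hypothetical) sanity check you could run on the CPU over your dataset; dataset and nb_classes here are placeholders for your own objects:

import torch

def check_labels(dataset, nb_classes):
    # Flag any sample whose label falls outside [0, nb_classes).
    for i, (_, y) in enumerate(dataset):
        y = torch.as_tensor(y)
        if y.min().item() < 0 or y.max().item() >= nb_classes:
            print(f"Sample {i} has an out-of-range label: min={y.min().item()}, max={y.max().item()}")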

the code was run with CUDA_LAUNCH_BLOCKING=1

Given that the stack trace points to line 31 in transformer.py, and not inside the subsequent_mask() function, I think the error comes either from the .size() call or the .to() call.
The first one does not access the GPU (a GPU Tensor's metadata is stored on the CPU side).
So it is most likely the second one. .to() is a known sync point for the GPU, and so it is usually where the error gets raised in the asynchronous case.
I am surprised, though, that it still points here with CUDA_LAUNCH_BLOCKING=1. Can you double-check that you set this before doing anything else and that it is actually taken into account? (You can try a simple program that indexes out of bounds on the GPU to make sure it does what you expect.)
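
For example, something along these lines (a rough sketch) should fail with a device-side assert, and with CUDA_LAUNCH_BLOCKING=1 the traceback should point exactly at the indexing line:

import torch

x = torch.zeros(10, device="cuda")
bad_idx = torch.tensor([100], device="cuda")  # deliberately out of bounds
y = x[bad_idx]  # the device-side assert should be reported here when launches are blocking
print(y)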

Also, do you use multithreading or multiprocessing? These errors put the GPU in an unrecoverable state and the process needs to be restarted. So one thread doing something bad could lead another thread to throw this kind of error at a random place.
