RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.c:32

josueortc · November 8, 2017, 6:37pm

I am having a problem while training my network in the first epoch. The model starts training but it throws this error:

/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [9,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 322, in <module>
    main()
  File "main.py", line 158, in main
    train(train_loader, model, optimizer, epoch, criterion)
  File "main.py", line 206, in train
    losses.update(loss.data[0], input_var.size(0))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.c:32

This comes from this snippet of code:

    input_var = torch.autograd.Variable(input).cuda()
    target_var = torch.autograd.Variable(target)
    #optimizer.zero_grad()
    #print(input_var.volatile)
    # compute output for the number of timesteps selected by train loader
    output = model.forward(x=input_var)
    #print(output.volatile)
    # Clean the gradient
    #optimizer.zero_grad()
    # Calculate the loss function based on the criterion. For example, UCF-101 is CrossEntropy
    loss = criterion(output, target_var)
    #print(loss.volatile)
    # measure accuracy and record loss
    prec1, prec5 = accuracy(output.data, target, topk=(1, 5))
    #print(loss.data.size())
    losses.update(loss.data[0], input_var.size(0))
    top1.update(prec1[0], input.size(0))
    top5.update(prec5[0], input.size(0))

I am not sure what’s happening, because the model runs a couple of iterations before throwing this error. Is it related to GPU memory?

Thank you.

richard · November 8, 2017, 8:35pm

It’s a little hard to tell from your snippet. I don’t think you should be calling model.forward: it should be model(x) (I’m not sure if those two are equivalent or if that has anything to do with this).

The cuda runtime error (59) is triggered most commonly when a memory location that is out of bounds is accessed (something like an array out of bounds error).
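
Because CUDA kernels are launched asynchronously, the Python line shown in the traceback is often just the first place that synchronized with the GPU (here `loss.data[0]`), not the operation that actually failed. A minimal sketch of one way to make the traceback point at the failing call, assuming the script can set environment variables before CUDA is initialized. `CUDA_LAUNCH_BLOCKING` is a standard CUDA environment variable; the helper name `enable_blocking_cuda_launches` is made up for illustration:

```python
import os


def enable_blocking_cuda_launches(environ=os.environ):
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    This has to run before the first CUDA call; the safest place is
    before `import torch`.
    """
    environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return environ["CUDA_LAUNCH_BLOCKING"]


enable_blocking_cuda_launches()
# import torch  # import only after the variable is set
```

The same effect is available from the shell with `CUDA_LAUNCH_BLOCKING=1 python main.py`. Running the failing step on the CPU is another option, since CPU errors are raised synchronously and usually carry a clearer message.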

josueortc · November 8, 2017, 8:40pm

This is the full error:

/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [7,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/generic/ClassNLLCriterion.cu line=87 error=59 : device-side assert triggered
Traceback (most recent call last):
  File "main.py", line 321, in <module>
    main()
  File "main.py", line 158, in main
    train(train_loader, model, optimizer, epoch, criterion)
  File "main.py", line 201, in train
    loss = criterion(output, target_var)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 482, in forward
    self.ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 746, in cross_entropy
    return nll_loss(log_softmax(input), target, weight, size_average, ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 672, in nll_loss
    return _functions.thnn.NLLLoss.apply(input, target, weight, size_average, ignore_index)
  File "/home/josueortc/anaconda3/lib/python3.6/site-packages/torch/nn/_functions/thnn/auto.py", line 47, in forward
    output, *ctx.additional_args)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/generic/Clas

I tried it with output = model(x) and it throws the same error. I don’t think that makes a difference, or does it?

richard · November 8, 2017, 8:40pm

Sorry, didn’t read this. The error is saying that your target value is out of bounds (it should be in [0, n_classes) ). Can you check your targets?

josueortc · November 8, 2017, 8:46pm

My current output has shape [num_classes], but why isn’t it throwing an error before that? Also, isn’t NLLLoss taking care of converting the target into a one-hot vector?

richard · November 8, 2017, 8:51pm

target_var should be a Variable wrapping a 1-D tensor of size n_batch. Each element t of target_var should be in [0, num_classes). If it isn’t, the assertion that you’re seeing is triggered.
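
To make the indexing concrete, here is a minimal pure-Python sketch of what an NLL-style loss does with class-index targets. It is not PyTorch's implementation; `nll_loss_sketch` and its list-based inputs are assumptions for illustration. Each target `t` is used directly as an index into a row of log-probabilities, so no one-hot vector is involved and an out-of-range `t` amounts to an out-of-bounds read:

```python
import math


def nll_loss_sketch(log_probs, targets):
    """Mean negative log-likelihood for class-index targets.

    log_probs: list of rows, one per sample, each of length n_classes.
    targets:   list of integer class indices, one per sample.
    """
    total = 0.0
    for row, t in zip(log_probs, targets):
        n_classes = len(row)
        # Same condition as the device-side assertion in the error message.
        if not (0 <= t < n_classes):
            raise IndexError(f"target {t} is outside [0, {n_classes})")
        total += -row[t]  # the target is an index, not a one-hot vector
    return total / len(targets)


# Two samples, three classes.
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]
loss = nll_loss_sketch(log_probs, [0, 2])
```

Passing `[0, 3]` as targets to this sketch raises `IndexError`, which is the CPU-side analogue of the assertion failing on the GPU.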

One thing you can do to debug this is insert the following before your criterion call (pseudocode):

if torch.sum( (target_var.data >= num_classes).long() + (target_var.data < 0).long()) > 0:
    import pdb; pdb.set_trace()
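
A standalone version of the same check, as a sketch: it takes a plain list of labels (in recent PyTorch versions, something like `target.tolist()` would produce one) and reports which positions are out of range. `find_bad_targets` is a hypothetical helper, not a PyTorch function:

```python
def find_bad_targets(targets, num_classes):
    """Return (position, value) pairs for targets outside [0, num_classes)."""
    return [(i, t) for i, t in enumerate(targets) if not (0 <= t < num_classes)]


# Example: labels stored 1-based for a 5-class problem; the label 5 is invalid.
bad = find_bad_targets([1, 5, 3, 2], num_classes=5)
print(bad)
```

One common way to end up with such values is a dataset whose labels start at 1 instead of 0. In that case the error only appears once a batch happens to contain the largest label, which could explain a failure after a few iterations rather than on the first one.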

SKYHOWIE25 · November 8, 2017, 11:04pm

Hi

Please check the output of your model to make sure that the dimension of the output matches the number of classes.
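
That check can be sketched without any framework. Assuming the dataset labels are available as integers and the width of the model's final layer is known, the smallest valid output width is the largest label plus one. `check_output_width` is a hypothetical helper:

```python
def check_output_width(labels, output_width):
    """Raise if the model's output width cannot cover every label.

    labels:       iterable of integer class indices from the dataset.
    output_width: size of the class dimension of the model output.
    """
    labels = list(labels)
    if min(labels) < 0:
        raise ValueError(f"negative label found: {min(labels)}")
    needed = max(labels) + 1
    if output_width < needed:
        raise ValueError(
            f"model outputs {output_width} classes but labels need at least {needed}"
        )
    return needed


# UCF-101 has 101 classes, so labels 0..100 need an output width of 101.
needed = check_output_width(range(101), output_width=101)
```

Running this once over the whole dataset before training is cheaper than waiting for a batch that happens to contain an offending label.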

tghoshmo · December 16, 2017, 4:55am

Hi,

I am getting a similar error. Therefore I am curious to know if you have solved the problem and if so, then how.

Thanks.

taiky · October 8, 2018, 7:33pm

Maybe you are using HDF5?
In my case, num_workers has to be 0 (no worker processes) when loading from HDF5; otherwise I get this error.

surojit_sengupta · November 20, 2018, 2:27pm

Upon running the following code snippet,

import torch
import torch.nn as nn

class CNN_NER(nn.Module):
    def __init__(self,vocab_size,embedding_size):
        # vocab_size, embedding_size, window_size, hidden_size, output_size

        super(CNN_NER, self).__init__()
        
        self.embed = nn.Embedding(vocab_size, embedding_size)
        self.cnn1 = nn.Conv2d(in_channels=1,padding=2,out_channels=100,kernel_size=(11,54),stride=1,dilation=1)
        self.cnn2 = nn.Conv2d(in_channels=1,padding=2,out_channels=100,kernel_size=(10,54),stride=1,dilation=1)
        self.cnn3 = nn.Conv2d(in_channels=1,padding=2,out_channels=100,kernel_size=(9,54),stride=1,dilation=1)
        
        self.relu = nn.ReLU()
        self.maxpool3 = nn.MaxPool2d(kernel_size=(1,3))
        self.maxpool4 = nn.MaxPool2d(kernel_size=(1,4))
        self.maxpool5 = nn.MaxPool2d(kernel_size=(1,5))
        self.linear = nn.Linear(60,9)
#        self.dropout = nn.Dropout(0.2)
        
    def forward(self,sent_grams,is_training):
        embeds = self.embed(sent_grams)
        embeds = embeds.unsqueeze(1)
        print('inputs',embeds.size())
        l1 = self.cnn1(embeds)
        l1 = self.relu(l1)
        print('cnn1',l1.size())
        l1 = l1.squeeze(3)
        l1 = self.maxpool3(l1)
        print('maxpool3',l1.size())
        l2 = self.cnn2(embeds)
        l2 = self.relu(l2)
        print('cnn2',l2.size())
        l2 = l2.squeeze(3)
        l2 = self.maxpool4(l2)
        print('maxpool4',l2.size())
        l3 = self.cnn3(embeds)
        l3 = self.relu(l3)
        print('cnn3',l3.size())
        l3 = l3.squeeze(3)
        l3 = self.maxpool5(l3)
        print('maxpool5',l3.size())
        
        l4 = torch.cat((l1,l2,l3),1)
        print('concatenated',l4.size())
        l4 = l4.view(l4.size(0),5,60)
        print('squished before linear layer',l4.size())
        l5 = self.linear(l4)
        print('output',l5.size())
        return l5

model = CNN_NER(len(word2index), EMBEDDING_SIZE)
model.cuda()

I get the CUDA error below:

RuntimeError Traceback (most recent call last)
in <module>()
      5 LEARNING_RATE = 0.001
      6 model = CNN_NER(len(word2index), EMBEDDING_SIZE)
----> 7 model.cuda()
      8 # Visualize the model paramters
      9 for name,params in model.named_parameters():

/opt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in cuda(self, device)
    256             Module: self
    257         """
--> 258         return self._apply(lambda t: t.cuda(device))
    259
    260     def cpu(self):

/opt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    183     def _apply(self, fn):
    184         for module in self.children():
--> 185             module._apply(fn)
    186
    187         for param in self._parameters.values():

/opt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    189                 # Tensors stored in modules are graph leaves, and we don't
    190                 # want to create copy nodes, so we have to unpack the data.
--> 191                 param.data = fn(param.data)
    192                 if param._grad is not None:
    193                     param._grad.data = fn(param._grad.data)

/opt/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py in <lambda>(t)
    256             Module: self
    257         """
--> 258         return self._apply(lambda t: t.cuda(device))
    259
    260     def cpu(self):

RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1535491974311/work/aten/src/THC/generic/THCTensorCopy.cpp:20

I am unable to make sense of this error. Any help would be much appreciated!
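
One possibility worth ruling out, offered as a guess rather than a diagnosis: a device-side assert leaves the CUDA context in an error state, so every later CUDA call in the same process fails with error 59, including an otherwise harmless `model.cuda()`. If an earlier cell in the same session triggered the assert (for example an `nn.Embedding` lookup with an index outside `[0, vocab_size)`), restarting the process and validating the indices first may help. A sketch of such a validation over plain nested lists, with `find_bad_token_ids` as a hypothetical helper and made-up example data:

```python
def find_bad_token_ids(sentences, vocab_size):
    """Return (sentence_idx, position, token_id) for ids outside [0, vocab_size)."""
    bad = []
    for s, sentence in enumerate(sentences):
        for p, token_id in enumerate(sentence):
            if not (0 <= token_id < vocab_size):
                bad.append((s, p, token_id))
    return bad


# Hypothetical data: a vocabulary of 4 words and one id equal to vocab_size.
batch = [[0, 1, 2], [3, 4, 1]]
bad_ids = find_bad_token_ids(batch, vocab_size=4)
print(bad_ids)
```

An empty result here would suggest the embedding lookup is not the culprit, and the search would need to move to other indexing operations that ran before `model.cuda()`.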