RuntimeError: cuda runtime error (710)

antgr · November 21, 2019, 3:11pm

error:
RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/THCCachingHostAllocator.cpp:278

How to trigger it:
print (logits)

about this tensor:
print (logits.shape)
torch.Size([32, 80, 7])

type (logits)
torch.Tensor

Full stack trace:
RuntimeError Traceback (most recent call last)
in ()
----> 1 print (logits)

7 frames
/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __repr__(self)
128 # characters to replace unicode characters with.
129 if sys.version_info > (3,):
--> 130 return torch._tensor_str._str(self)
131 else:
132 if hasattr(sys.stdout, 'encoding'):

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in _str(self)
309 tensor_str = _tensor_str(self.to_dense(), indent)
310 else:
--> 311 tensor_str = _tensor_str(self, indent)
312
313 if self.layout != torch.strided:

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in _tensor_str(self, indent)
207 if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
208 self = self.float()
--> 209 formatter = _Formatter(get_summarized_data(self) if summarize else self)
210 return _tensor_str_with_formatter(self, indent, formatter, summarize)
211

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in get_summarized_data(self)
240 end = ([self[i]
241 for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))])
--> 242 return torch.stack([get_summarized_data(x) for x in (start + end)])
243 else:
244 return torch.stack([get_summarized_data(x) for x in self])

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in (.0)
240 end = ([self[i]
241 for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))])
--> 242 return torch.stack([get_summarized_data(x) for x in (start + end)])
243 else:
244 return torch.stack([get_summarized_data(x) for x in self])

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in get_summarized_data(self)
240 end = ([self[i]
241 for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))])
--> 242 return torch.stack([get_summarized_data(x) for x in (start + end)])
243 else:
244 return torch.stack([get_summarized_data(x) for x in self])

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in (.0)
240 end = ([self[i]
241 for i in range(len(self) - PRINT_OPTS.edgeitems, len(self))])
--> 242 return torch.stack([get_summarized_data(x) for x in (start + end)])
243 else:
244 return torch.stack([get_summarized_data(x) for x in self])

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in get_summarized_data(self)
233 if dim == 1:
234 if self.size(0) > 2 * PRINT_OPTS.edgeitems:
--> 235 return torch.cat((self[:PRINT_OPTS.edgeitems], self[-PRINT_OPTS.edgeitems:]))
236 else:
237 return self

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/THCGeneral.cpp:371

This is triggered in evaluation mode. I can provide the code if needed.

albanD · November 21, 2019, 3:29pm

Hi,

Was there anything else printed before the error is raised?
Also, you can run your code with CUDA_LAUNCH_BLOCKING=1. Otherwise the Python stack trace is not reliable, because CUDA calls are asynchronous and the error can be reported at a later line than the one that caused it.

Also, if you're in an interpreter: after one such assert is raised, you need to restart the interpreter, as the assert puts the GPU in a bad state and any further attempt to use the GPU will throw an error.
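In a notebook such as Colab, a shell-level `!export ...` runs in a temporary subshell and does not change the environment of the Python process that owns the GPU. One option is to set the variable from Python itself. A minimal sketch, assuming it runs in a fresh session before anything touches CUDA (the safest place is before `import torch`):

```python
import os

# Set the flag inside the Python process itself. This has to happen before
# CUDA is initialized, so the safest place is before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import only after the variable is set

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

In IPython or Colab, the `%env CUDA_LAUNCH_BLOCKING=1` magic should have the same effect.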

antgr · November 22, 2019, 6:48am

Thanks for the reply! I use a Colab session.
I restarted the session between failures.
I have the following line in the code:
!export CUDA_LAUNCH_BLOCKING=1
and I get two different errors depending on whether I have a print or not. I think the problem is with the logits tensor.
The error is:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-43-9064aecd8313> in <module>()
     18         #print("[evaluation] tmp_eval_loss: ", tmp_eval_loss.shape)
     19         logits = model(b_input_ids, token_type_ids=None,
---> 20                        attention_mask=b_input_mask)[0]
     21         #print("logits: ", logits.shape)
     22         #print("logits: ", logits.detach().cpu().numpy())

5 frames
/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in __init__(self, tensor)
     85 
     86         else:
---> 87             nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
     88 
     89             if nonzero_finite_vals.numel() == 0:

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

and the full stack trace:

RuntimeError                              Traceback (most recent call last)
<ipython-input-43-9064aecd8313> in <module>()
     18         #print("[evaluation] tmp_eval_loss: ", tmp_eval_loss.shape)
     19         logits = model(b_input_ids, token_type_ids=None,
---> 20                        attention_mask=b_input_mask)[0]
     21         #print("logits: ", logits.shape)
     22         #print("logits: ", logits.detach().cpu().numpy())

5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

<ipython-input-28-7837bc2054a4> in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, labels)
    196             outputs = (loss,) + outputs
    197         # <------- end of loss 3 ------->
--> 198         print("outputs: ", outputs)
    199         return outputs  # (loss), scores, (hidden_states), (attentions)

/usr/local/lib/python3.6/dist-packages/torch/tensor.py in __repr__(self)
    128         # characters to replace unicode characters with.
    129         if sys.version_info > (3,):
--> 130             return torch._tensor_str._str(self)
    131         else:
    132             if hasattr(sys.stdout, 'encoding'):

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in _str(self)
    309                 tensor_str = _tensor_str(self.to_dense(), indent)
    310             else:
--> 311                 tensor_str = _tensor_str(self, indent)
    312 
    313     if self.layout != torch.strided:

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in _tensor_str(self, indent)
    207     if self.dtype is torch.float16 or self.dtype is torch.bfloat16:
    208         self = self.float()
--> 209     formatter = _Formatter(get_summarized_data(self) if summarize else self)
    210     return _tensor_str_with_formatter(self, indent, formatter, summarize)
    211 

/usr/local/lib/python3.6/dist-packages/torch/_tensor_str.py in __init__(self, tensor)
     85 
     86         else:
---> 87             nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
     88 
     89             if nonzero_finite_vals.numel() == 0:

RuntimeError: cuda runtime error (710) : device-side assert triggered at /pytorch/aten/src/THC/THCReduceAll.cuh:327

And if I remove the print, I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-43-9064aecd8313> in <module>()
     26     #print("it:", it)
     27     #print("[evaluation] logits: ", logits)
---> 28     logits = logits.cpu().data.numpy()
     29     #print("logits: ", logits)
     30 

RuntimeError: CUDA error: device-side assert triggered
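
Both tracebacks point at the first line that forces Python to wait for the GPU (`print` or `.cpu()`), which is not necessarily the line that launched the failing kernel. As a rough sketch, assuming the model code can be edited, an explicit synchronization point after a suspicious step should make a pending assert surface there instead. `sync_point` is a hypothetical helper name, not a PyTorch function:

```python
import torch


def sync_point(tag: str) -> str:
    """Wait for all queued CUDA work; a pending device-side assert is raised here."""
    if torch.cuda.is_available():
        # Blocks until every kernel launched so far has finished.
        torch.cuda.synchronize()
    return tag


# Usage: call it right after each step you suspect, e.g. after an embedding lookup.
x = torch.arange(6).reshape(2, 3)
y = x * 2
print(sync_point("after multiply"), y.shape)
```

On a machine without a GPU the helper does nothing, so it can stay in the code while debugging on CPU.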

antgr · November 22, 2019, 6:50am

and the code:

    model.eval()
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0
    predictions , true_labels = [], []
    it = 0
    for batch in valid_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels, b_tags, b_adu = batch
        #print("[evaluation] b_input_ids: ", b_input_ids.shape)
        #print("[evaluation] b_input_mask: ", b_input_mask.shape)
        #print("[evaluation] b_labels: ", b_labels.shape)
        #print("[evaluation] b_tags: ", b_tags.shape)
        #print("[evaluation] b_adu: ", b_adu.shape)

        with torch.no_grad():
            tmp_eval_loss = model(b_input_ids, token_type_ids=None,
                                  attention_mask=b_input_mask, labels=[b_adu, b_tags, b_labels])[0]
            #print("[evaluation] tmp_eval_loss: ", tmp_eval_loss.shape)
            logits = model(b_input_ids, token_type_ids=None,
                           attention_mask=b_input_mask)[0]
            #print("logits: ", logits.shape)
            #print("logits: ", logits.detach().cpu().numpy())
        #logits = logits.detach().cpu().numpy()
        #logits = logits.cpu().numpy()
        it = it + 1
        #print("it:", it)
        #print("[evaluation] logits: ", logits)
        logits = logits.cpu().data.numpy()
        #print("logits: ", logits)

        label_ids = b_labels.to('cpu').numpy()
        predictions.extend([list(p) for p in np.argmax(logits, axis=2)])
        true_labels.append(label_ids)

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_loss += tmp_eval_loss.mean().item()
        eval_accuracy += tmp_eval_accuracy

        nb_eval_examples += b_input_ids.size(0)
        nb_eval_steps += 1
    eval_loss = eval_loss/nb_eval_steps
    print("Validation loss: {}".format(eval_loss))
    print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))
    pred_bios = [bio_vals[p_i] for p in predictions for p_i in p]
    valid_bios = [bio_vals[l_ii] for l in true_labels for l_i in l for l_ii in l_i]
    print("F1-Score: {}".format(f1_score(pred_bios, valid_bios)))
    print(confusion_matrix(valid_bios, pred_bios))
    print(classification_report(valid_bios, pred_bios, digits=4, target_names=bio_vals))

albanD · November 22, 2019, 2:54pm

Hi,

I reformatted your post to make it readable.
For your future posts, you can use triple backticks before and after your code to do it:

```
# Your code
```

This is really hard to say, but it looks like some error occurs and is raised whenever you try to use your logits.
Could you try reducing the size of your code to isolate which part creates the error please?
By removing the data loading and just using random data for example.
Or removing part of your model.
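
One way to do this reduction, sketched under assumptions: run the suspect layer on the CPU with synthetic inputs, where an invalid index raises an ordinary Python exception instead of a device-side assert. The embedding size (7 rows) and the bad index (9) are made up for illustration and are not taken from the model in this thread:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for one layer of the real model.
layer = nn.Embedding(num_embeddings=7, embedding_dim=4).cpu()

# Synthetic input instead of the DataLoader; 9 is deliberately out of range.
ids = torch.tensor([[0, 3, 6], [1, 9, 2]])

try:
    layer(ids)
    error_message = None
except (IndexError, RuntimeError) as err:
    # The exception type differs between PyTorch versions.
    error_message = str(err)

print("CPU error:", error_message)
```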

Thanks

antgr · November 24, 2019, 10:54am

Thank you!
I found the buggy code:
I use label embeddings, but the lookup fails during evaluation (the else branch), and that is where the error occurs.
It works during training (the if branch), where the labels variable has a value.

        if labels is not None:
          lab2_emb = self.label2_embedding(labels2)
        else:
          pred_tag = torch.argmax(logits2_normed, dim=2)
          lab2_emb = self.label1_embedding(pred_tag)

I cannot understand the difference though and why it does not work…
For now I have removed this part of code.

antgr · November 24, 2019, 11:07am

In the else branch I use label1_embedding! It should be self.label2_embedding.
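
A plausible reading of this bug, stated as an assumption since the layer sizes are not shown: `label1_embedding` has fewer rows than `logits2_normed` has classes, so `torch.argmax` can return an index the table does not contain, and on the GPU that shows up as a device-side assert. A small sketch of a guard that fails with a readable message instead; `checked_lookup` and all sizes (3 and 7 rows, 8-dim embeddings) are hypothetical:

```python
import torch
import torch.nn as nn


def checked_lookup(embedding: nn.Embedding, indices: torch.Tensor) -> torch.Tensor:
    """Look up embeddings, validating the index range first."""
    lo, hi = int(indices.min()), int(indices.max())
    if lo < 0 or hi >= embedding.num_embeddings:
        raise ValueError(
            f"indices span [{lo}, {hi}] but the table has "
            f"{embedding.num_embeddings} rows"
        )
    return embedding(indices)


# Hypothetical sizes: 3 classes for task 1, 7 classes for task 2.
label1_embedding = nn.Embedding(3, 8)
label2_embedding = nn.Embedding(7, 8)

# Fake logits of shape (batch=2, seq=5, classes=7) where class 6 always wins.
logits2 = torch.zeros(2, 5, 7)
logits2[..., 6] = 1.0
pred_tag = torch.argmax(logits2, dim=2)

ok = checked_lookup(label2_embedding, pred_tag)  # matching table: works
try:
    checked_lookup(label1_embedding, pred_tag)   # wrong table: index 6 >= 3
    guard_error = None
except ValueError as err:
    guard_error = str(err)
print(ok.shape, guard_error)
```

The range check forces a GPU-to-host sync, so it is probably best kept as a debugging aid rather than left in a training loop.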

Shariq_Ali · January 6, 2020, 9:12am

I am getting the same error. Please help me. My code is below.

for epoch in range(1, args.num_epochs+1):
    epoch_loss = []

    for step, (images, labels) in enumerate(loader):
        if args.cuda:
            images = images.cuda()
            labels = labels.cuda()

        inputs = Variable(images)
        targets = Variable(labels)
        outputs = model(inputs)
        optimizer.zero_grad()
        loss = criterion(outputs, targets[:, 0])
        loss.backward()
        optimizer.step()

        epoch_loss.append(loss.data[0])
        if args.steps_plot > 0 and step % args.steps_plot == 0:
            image = inputs[0].cpu().data
            image[0] = image[0] * .229 + .485
            image[1] = image[1] * .224 + .456
            image[2] = image[2] * .225 + .406
            board.image(image,
                f'input (epoch: {epoch}, step: {step})')
            board.image(color_transform(outputs[0].cpu().max(0)[1].data),
                f'output (epoch: {epoch}, step: {step})')
            board.image(color_transform(targets[0].cpu().data),
                f'target (epoch: {epoch}, step: {step})')
        if args.steps_loss > 0 and step % args.steps_loss == 0:
            average = sum(epoch_loss) / len(epoch_loss)
            print(f'loss: {average} (epoch: {epoch}, step: {step})')
        if args.steps_save > 0 and step % args.steps_save == 0:
            filename = f'{args.model}-{epoch:03}-{step:04}.pth'
            torch.save(model.state_dict(), filename)
            print(f'save: {filename} (epoch: {epoch}, step: {step})')

antgr · January 6, 2020, 9:59am

Is the following correct?
loss = criterion(outputs, targets[:, 0])

Shariq_Ali · January 6, 2020, 11:13am

I don't know. When I print them, I get this:

Outputs tensor([[[[0.0454, 0.0453, 0.0452, …, 0.0459, 0.0459, 0.0460],
[0.0455, 0.0454, 0.0452, …, 0.0460, 0.0460, 0.0461],
[0.0457, 0.0455, 0.0452, …, 0.0461, 0.0462, 0.0462],
…,
[0.0460, 0.0459, 0.0459, …, 0.0457, 0.0457, 0.0458],
[0.0463, 0.0462, 0.0461, …, 0.0461, 0.0462, 0.0463],
[0.0466, 0.0464, 0.0463, …, 0.0464, 0.0466, 0.0468]]]],
device='cuda:0', grad_fn=)
targets[:, 0] tensor([[[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
…,
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0],
[0, 0, 0, …, 0, 0, 0]]], device='cuda:0')
Outputs Size torch.Size([1, 1, 256, 256])
targets[:, 0] Size torch.Size([1, 256, 256])

antgr · January 7, 2020, 10:13am

Usually this error comes from a shape mismatch. In my case, I used the wrong embedding layer (with a different output size), so again a kind of shape mismatch. Try to break your code into simple steps and figure out which step triggers the error. For example, unit test your code, or run it interactively. I don't have much time to help you out. Maybe the PyTorch experts on the forum can provide some better ideas.
@albanD any idea?

albanD · January 7, 2020, 3:21pm

I did not look at the code in detail, but from the print and your comment:

Outputs Size torch.Size([1, 1, 256, 256])
targets[:, 0] Size torch.Size([1, 256, 256])

The sizes don't seem to be the same, so maybe this is the reason?
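
For what it's worth, if `criterion` is a class-index loss such as `nn.CrossEntropyLoss` (an assumption, the thread does not show it), then outputs of shape `[N, C, H, W]` with targets of shape `[N, H, W]` is the expected pairing, and the thing to check is that every target value lies in `[0, C-1]`. With `C = 1`, only the value 0 would be valid. A sketch of such a check; `check_class_targets` is a hypothetical helper and the tensors are synthetic, built with the shapes printed above:

```python
import torch


def check_class_targets(outputs: torch.Tensor, targets: torch.Tensor) -> None:
    """Raise ValueError if any target is outside [0, C-1], with C = outputs.size(1)."""
    num_classes = outputs.size(1)
    lo, hi = int(targets.min()), int(targets.max())
    if lo < 0 or hi >= num_classes:
        raise ValueError(
            f"target values span [{lo}, {hi}] but the model outputs "
            f"{num_classes} class channel(s)"
        )


# Synthetic tensors with the same shapes as in the printout.
outputs = torch.rand(1, 1, 256, 256)
targets = torch.zeros(1, 256, 256, dtype=torch.long)
check_class_targets(outputs, targets)  # all zeros: passes

targets[0, 10, 10] = 1  # a single label 1 is out of range when C == 1
try:
    check_class_targets(outputs, targets)
    target_error = None
except ValueError as err:
    target_error = str(err)
print(target_error)
```

Calling such a check on the real `outputs` and `targets[:, 0]` just before `criterion(...)` would show whether the labels in the dataset exceed the number of output channels.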