CUDA Invalid Configuration Error on GPU Only

Hi everyone, long-time reader, first-time poster!

I seem to be running into an odd bug when training my model. I am trying to train an embedding model using the ResNet-18 architecture, from which I have essentially cut off the last linear layer (I can post the architecture if needed). I have written my own loss function, which implements the “Batch-Hard” loss found in this paper. The training loop (the standard zero_grad, forward, loss, backward, step) works fine when running on CPU, but throws the following error when running on GPU:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-11-8cf2cfddbcc6> in <module>()
      3     output = learner.model(imgs)
      4     loss = criterion(output, labels)
----> 5     loss.backward()
      6     optimizer.step()
      7     break

.../anaconda/anaconda3-2019a/lib/python3.6/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
    105                 products. Defaults to ``False``.
    106         """
--> 107         torch.autograd.backward(self, gradient, retain_graph, create_graph)
    108 
    109     def register_hook(self, hook):

.../anaconda/anaconda3-2019a/lib/python3.6/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     91     Variable._execution_engine.run_backward(
     92         tensors, grad_tensors, retain_graph, create_graph,
---> 93         allow_unreachable=True)  # allow_unreachable flag
     94 
     95 

RuntimeError: CUDA error: invalid configuration argument

Does anyone know why a backward pass would work on a CPU but not on a GPU? I can confirm that my GPU works in general; I have trained other models with other custom loss functions on it.
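
In case it helps, this is roughly how I build the embedding model (a simplified sketch; in my actual code the model is wrapped in the learner object you see in the traceback):

import torch
import torchvision

# Sketch: torchvision's ResNet-18 with the final fully-connected layer
# replaced by an identity, so the pooled 512-d features come out directly
# as the embedding.
model = torchvision.models.resnet18(pretrained=False)  # pretrained or not shouldn't matter here
model.fc = torch.nn.Identity()
model = model.cuda()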

For reference, this is the implementation of the “Batch-Hard” loss that I’ve written:

import numpy as np
import torch


class BatchHardLoss(torch.nn.modules.loss._Loss):
    def __init__(self, margin, k=5, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.margin = margin
        self._k = k  # Maximum number of anchors sampled per class

    def forward(self, x, targets):
        distances = torch.cdist(x, x)  # Pairwise distances between embeddings
        losses = []
        for t in torch.unique(targets):
            pos_idxs = (targets == t).nonzero().reshape(-1)  # Positive indices
            neg_idxs = (targets != t).nonzero().reshape(-1)  # Negative indices

            # Sample up to k anchors from the positives of this class
            choice = np.random.choice(
                len(pos_idxs), min(self._k, len(pos_idxs))
            ).tolist()
            anchors = pos_idxs[choice]  # Chosen anchors

            # Only compute the loss if there is at least one other positive
            # and at least one negative, for simplicity
            if len(anchors) > 1 and len(neg_idxs) > 0:
                # Hardest positive per anchor: farthest sample of the same class
                pos_distances = torch.index_select(
                    torch.index_select(distances, 0, anchors), 1, pos_idxs
                )
                pos_loss = pos_distances.max(dim=1)[0].sum()

                # Hardest negative per anchor: closest sample of a different class
                neg_distances = torch.index_select(
                    torch.index_select(distances, 0, anchors), 1, neg_idxs
                )
                neg_loss = neg_distances.min(dim=1)[0].sum()

                marg_loss = self.margin * len(anchors)

                losses.append(marg_loss + pos_loss - neg_loss)
            else:
                losses.append(torch.tensor(0.0, device=x.device, requires_grad=True))
        return torch.stack(losses).sum()
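
and I run the training step roughly like this (a simplified sketch; the margin and k here are placeholder values, not necessarily what I train with):

criterion = BatchHardLoss(margin=1.0, k=5)  # placeholder hyperparameters

optimizer.zero_grad()
output = learner.model(imgs)
loss = criterion(output, labels)
loss.backward()  # <-- this is the call that fails on the GPU
optimizer.step()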

If anyone sees any glaring errors or has any insight, I’d really appreciate it.

Thanks!

P.S. I have seen this post but can’t seem to find a similar issue when debugging.

I don’t see anything obviously wrong in your code.
Could you print the output and labels tensors before calling loss.backward()?
Maybe we could isolate this issue to these values and reproduce the error.

If these tensors do not yield the same error, could you also store the state_dict of your model and the current imgs?
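
Something like this would be enough to capture everything (just a sketch; use whatever file names are convenient):

# Save the inputs to the loss, so the failing backward can be reproduced in isolation
torch.save({'output': output.detach().cpu(),
            'labels': labels.detach().cpu()}, 'loss_inputs.pt')

# If these tensors alone don't reproduce it, additionally save the model state and the images
torch.save(learner.model.state_dict(), 'model_state.pt')
torch.save(imgs.detach().cpu(), 'imgs.pt')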

Thanks for replying @ptrblck! While trying to reproduce the error at home (I was at work when I first hit it), I started with a smaller batch size (I had been using 1024), and lo and behold, it worked! I did some investigating: a batch size of 1023 works, but any batch_size >= 1024 causes that error. It seems to me that it’s a GPU memory issue, but I don’t know why I’m not getting the traditional RuntimeError: CUDA out of memory. I’ll double-check that this is also the case on my work setup and reply back tomorrow.
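
In case it’s useful to anyone else, this is roughly how I narrowed it down, re-running a single training step at different batch sizes (a sketch; the random tensors, 224x224 input size, and 10 classes are just stand-ins for my real data):

for bs in (256, 512, 1023, 1024):
    imgs = torch.randn(bs, 3, 224, 224, device='cuda')   # stand-in for real images
    labels = torch.randint(0, 10, (bs,), device='cuda')  # stand-in for real labels
    try:
        optimizer.zero_grad()
        output = learner.model(imgs)
        loss = criterion(output, labels)
        loss.backward()
        print(bs, 'ok')
    except RuntimeError as e:
        print(bs, 'failed:', e)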

That does seem to have been the problem. I’m still not sure why the error wasn’t the usual out-of-memory one, but I’m glad it’s fixed!

Good to hear it’s fixed now!
That’s indeed a bit weird, but sometimes OOM errors are masked by other errors, e.g. from cuDNN.
I haven’t seen this invalid configuration error yet.
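
If you want to double-check how close you are to the memory limit, you could print the allocator stats right before the backward call, e.g. (quick sketch):

# Rough check of GPU memory usage right before loss.backward()
print('allocated:     %.1f MB' % (torch.cuda.memory_allocated() / 1024**2))
print('max allocated: %.1f MB' % (torch.cuda.max_memory_allocated() / 1024**2))
print('total:         %.1f MB' % (torch.cuda.get_device_properties(0).total_memory / 1024**2))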
