NaN returned while training a model with custom layers

Hi,

I made a custom layer that has a weight parameter of size (Cin, Cout, rank).
In the forward method, I need to permute the weight as follows:

weight_col = self.weight.permute(1, 0, 2).reshape(self.in_channels, self.out_channels * self.rank)

During training, the model sometimes produces odd errors with NaN values.
From my reading, I thought it might be related to the permute function, so I added an assertion as follows:
assert not torch.isnan(weight_col).any(), "weight_col tensor is nan"
In the training loop, this assertion sometimes fires.

To debug this, I did the following:

torch.autograd.set_detect_anomaly(True)
while epoch < 100:
    train(epoch,  train_loader, model, criterion, optimizer, scheduler)
    _, valid_top1_acc, valid_top5_acc = validate(val_loader, model, criterion)

Here’s the log:

So my questions are:

  1. Can the permute function yield NaN values in the autograd flow?
    If so, how can I fix this?
  2. In my custom layer, I only need to permute the weight once, so I'm thinking of doing the permutation outside of the forward method, e.g. permuting the weight before instantiating the custom layer. Is this good practice?
    # permute before instantiating 
    weight = weight.permute(1, 0, 2)
    myLayer = MyLayer(Cin, Cout, rank, padding, weight)
    

Any recommendation is appreciated!

  1. No, this should not be the case, as only the metadata of the tensor is changed.
  2. Yes, this might work, but I would still recommend trying to narrow down where the NaN values come from (see the sketch below).
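
One minimal way to narrow it down (a sketch; add_nan_checks is just an illustrative helper, not an existing API) is to register forward hooks that raise as soon as any module produces a NaN, so the first offending layer is named directly:

import torch
import torch.nn as nn

def add_nan_checks(model: nn.Module):
    # Raise as soon as any submodule outputs NaN and name the offending module.
    def make_hook(name):
        def hook(module, inputs, output):
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for out in outputs:
                if torch.is_tensor(out) and torch.isnan(out).any():
                    raise RuntimeError(f"NaN detected in the output of {name}")
        return hook

    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

Calling add_nan_checks(model) before the training loop, together with torch.autograd.set_detect_anomaly(True) for the backward pass, usually localizes the first NaN quickly.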

The permute function should not return NaN values. Can you provide reproducible code that can be run in its entirety with torch.rand() data as inputs?


Thank you for your quick responses.
It was my fault; the error doesn't come from the permute function.
In my case, it came from another source (tensorly's parafac, when the values are too small; noted here for future reference).
Thank you so much again.
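
For future reference, a minimal sanity check for that case might look like this (a sketch assuming tensorly's parafac with the PyTorch backend, which returns a CP tensor with weights and factor matrices; the shapes and scale are arbitrary):

import torch
import tensorly as tl
from tensorly.decomposition import parafac

tl.set_backend("pytorch")

# Hypothetical weight tensor with very small values, similar to the failure case.
weight = torch.rand(8, 3, 3) * 1e-8

cp = parafac(weight, rank=2)
for i, factor in enumerate(cp.factors):
    if not torch.isfinite(factor).all():
        raise RuntimeError(f"parafac factor {i} contains NaN/Inf values")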

Excuse me, may I ask a not-very-related question? Is it good practice to put the permutation outside of the forward method, so that the forward and backward passes are faster? I'm worried this might affect the gradient flow. Here's the full forward method of my custom layer:

    def forward(self, input):
        # Add padding to input
        batch_size, Cin, h, w = input.shape
        padded_I = nn.functional.pad(input, [self.padding]*4)
        padded_I = padded_I.permute(0, 2, 3, 1)
        device = input.device
        Cout, _, r = self.C.shape
        # Calculate output size after padding
        padded_h = h + 2 * self.padding
        padded_w = w + 2 * self.padding

        # Step 1: Compute Oc
        padded_I_col = padded_I.reshape(batch_size * padded_h * padded_w, Cin)

        # Should I put the permutation outside this method?
        C_col = self.C.permute(1, 0, 2).reshape(Cin, Cout * r)

        # Compute matrix multiplication and reshape output
        output = torch.matmul(padded_I_col, C_col).reshape(batch_size, padded_h, padded_w, Cout, r)

        return output
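
For completeness, the forward method above can be wrapped into a self-contained snippet with torch.rand() inputs, as suggested earlier (the __init__, the initialization, and the concrete sizes below are sketched assumptions, not my actual layer):

import torch
import torch.nn as nn

class MyLayer(nn.Module):
    def __init__(self, Cin, Cout, rank, padding):
        super().__init__()
        # Assumed initialization; the real layer gets its factors elsewhere.
        self.padding = padding
        self.C = nn.Parameter(torch.randn(Cout, Cin, rank) * 0.1)

    def forward(self, input):
        batch_size, Cin, h, w = input.shape
        padded_I = nn.functional.pad(input, [self.padding] * 4)
        padded_I = padded_I.permute(0, 2, 3, 1)
        Cout, _, r = self.C.shape
        padded_h = h + 2 * self.padding
        padded_w = w + 2 * self.padding
        padded_I_col = padded_I.reshape(batch_size * padded_h * padded_w, Cin)
        C_col = self.C.permute(1, 0, 2).reshape(Cin, Cout * r)
        output = torch.matmul(padded_I_col, C_col)
        return output.reshape(batch_size, padded_h, padded_w, Cout, r)

layer = MyLayer(Cin=3, Cout=8, rank=4, padding=1)
x = torch.rand(2, 3, 16, 16)
out = layer(x)
out.sum().backward()
print(torch.isnan(out).any(), torch.isnan(layer.C.grad).any())
# expected: tensor(False) tensor(False)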

Applying a .permute operation to your tensor won't break the gradient flow and will route the gradients back properly, as seen in this small example:

x = torch.arange(16).float().view(2, 2, 2, 2).requires_grad_(True)
print(x)
# tensor([[[[ 0.,  1.],
#           [ 2.,  3.]],

#          [[ 4.,  5.],
#           [ 6.,  7.]]],


#         [[[ 8.,  9.],
#           [10., 11.]],

#          [[12., 13.],
#           [14., 15.]]]], requires_grad=True)
y = x.permute(0, 2, 3, 1)
print(y)
# tensor([[[[ 0.,  4.],
#           [ 1.,  5.]],

#          [[ 2.,  6.],
#           [ 3.,  7.]]],


#         [[[ 8., 12.],
#           [ 9., 13.]],

#          [[10., 14.],
#           [11., 15.]]]], grad_fn=<PermuteBackward0>)
grad = torch.arange(y.nelement()).float().view_as(y)
print(grad)
# tensor([[[[ 0.,  1.],
#           [ 2.,  3.]],

#          [[ 4.,  5.],
#           [ 6.,  7.]]],


#         [[[ 8.,  9.],
#           [10., 11.]],

#          [[12., 13.],
#           [14., 15.]]]])
y.backward(gradient=grad)
print(x.grad)
# tensor([[[[ 0.,  2.],
#           [ 4.,  6.]],

#          [[ 1.,  3.],
#           [ 5.,  7.]]],


#         [[[ 8., 10.],
#           [12., 14.]],

#          [[ 9., 11.],
#           [13., 15.]]]])

However, permuting a tensor could make the data non-contiguous:

x = torch.arange(16).float().view(2, 2, 2, 2).requires_grad_(True)
print(x.is_contiguous())
# True
y = x.permute(0, 2, 3, 1)
print(y.is_contiguous())
# False

If a layer needs contiguous inputs for its internal operations, it will call input = input.contiguous() on them, which could then trigger a copy (this is a no-op if the tensor is already contiguous).
Because of this, I would recommend creating the tensors in the desired shape and with a contiguous memory layout to avoid these copies inside your model.
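
For the layer above, one possible refactor along these lines (a sketch; MyLayerFlat and its initialization are illustrative, not the original code) is to register the weight directly as the flattened (Cin, Cout * r) matrix that the matmul consumes, so no permute/reshape is needed in forward and the parameter stays contiguous:

import torch
import torch.nn as nn

class MyLayerFlat(nn.Module):
    def __init__(self, Cin, Cout, rank, padding):
        super().__init__()
        self.padding = padding
        self.Cout, self.rank = Cout, rank
        # Store the weight in the layout used by the matmul below.
        self.C_col = nn.Parameter(torch.randn(Cin, Cout * rank) * 0.1)

    def forward(self, input):
        batch_size, Cin, h, w = input.shape
        padded_I = nn.functional.pad(input, [self.padding] * 4).permute(0, 2, 3, 1)
        padded_h, padded_w = h + 2 * self.padding, w + 2 * self.padding
        padded_I_col = padded_I.reshape(batch_size * padded_h * padded_w, Cin)
        out = torch.matmul(padded_I_col, self.C_col)
        return out.reshape(batch_size, padded_h, padded_w, self.Cout, self.rank)

If the factors come from an existing decomposition, they can be permuted, reshaped, and made contiguous once before being copied into C_col, instead of on every forward pass.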
