Using repeat() on differentiable layer parameters

I have a custom nonlinearity whose initialization is given by

def __init__(self):
    super().__init__()
    bias = torch.empty(100)      # one learnable bias per vector in a sample
    self.bias = nn.Parameter(bias)
    nn.init.normal_(self.bias)

and whose forward pass is given by

def forward(self, x):
    batch_size = x.shape[0]
    # x: (batch_size, 100, 2) -> per-vector L2 norms of shape (batch_size, 100)
    frob_norms = torch.linalg.norm(x, dim=2)
    # tile the (100,)-shaped bias so it lines up with frob_norms
    bias = self.bias.repeat(batch_size).view(batch_size, -1)
    relative = frob_norms - bias
    ...
    return out

The shape of x is (batch_size, 100, 2), i.e. each sample is an array of 100 two-dimensional vectors. For each sample, I’d like to subtract the learnable self.bias parameter from the vector norms, using the same biases for every sample. Since this operation needs to happen across the whole batch, I figured I could repeat self.bias as shown above and then compute the difference. However, when I printed the contents of self.bias during training, I found that it consists purely of NaNs, which prevents the model from learning. How can I correct this? I’m hesitant to use detach(), as it’s my understanding that it requires moving the data to the CPU, which hurts efficiency.
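As a side note, I believe plain broadcasting should give the same result without repeat(), since frob_norms has shape (batch_size, 100) and self.bias has shape (100,):

relative = frob_norms - self.bias

but that shouldn’t change anything about the NaN issue either way.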

This should not happen and could point towards invalid gradients, overflows, etc., so you should check why the bias is being updated with invalid values.

detach() does not decrease efficiency and does not move the data to the CPU. However, I also don’t know how it would help in this case, as detaching a tensor from the computation graph stops gradient propagation at that point.
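For example (assuming a CUDA device is available; the point is just that the tensor stays on its device):

x = torch.randn(4, device="cuda", requires_grad=True)
y = x.detach()
print(y.device)          # still cuda:0, no copy to the CPU
print(y.requires_grad)   # False, so gradient propagation stops here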

Hi @ptrblck. Thanks for your response. How would you suggest I check what’s causing the invalid gradients?

I would start by making sure the bias contains valid values at the beginning and then observe when the first NaN or Inf shows up in it. Afterwards, I would check its .grad attribute to see whether the gradient is invalid or what’s causing the invalid update.
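A rough sketch of what I mean, inside the training loop (here layer, loss, and optimizer are placeholders for your module, loss tensor, and optimizer):

loss.backward()
# check the parameter and its gradient before the update is applied
if not torch.isfinite(layer.bias).all():
    print("bias already contains NaN/Inf before the update")
if layer.bias.grad is not None and not torch.isfinite(layer.bias.grad).all():
    print("bias gradient contains NaN/Inf")
optimizer.step()

You could also enable torch.autograd.set_detect_anomaly(True), which raises an error with a stack trace pointing at the operation that produced an invalid gradient.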

I found the issue! It’s a divide-by-zero error. I have a line further down that reads out = torch.divide(relative, frob_norms), but I’m not sure how to handle it. Since relative depends on bias and out depends on relative, out is part of the computation graph. I tried clamping frob_norms to a minimum value of 1e-20, but this still ends up introducing NaNs, probably because I’m now getting huge outputs. Ultimately, I would just like torch.divide() to insert a 0 whenever a divide-by-zero would occur, but clearly I cannot naively replace NaN with 0 after calling torch.divide().

The common approach is to add a small eps value, such as 1e-6, to the denominator to avoid dividing by zero.
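For example, assuming the division you quoted:

eps = 1e-6
out = torch.divide(relative, frob_norms + eps)

Since frob_norms is a norm and therefore non-negative, adding eps keeps the denominator strictly positive. If you really need exact zeros where the norm is zero, masking the denominator and the result with torch.where is an alternative, but the eps version is usually sufficient.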