I have a custom nonlinearity whose initialization is given by

    bias = torch.Tensor(100)
    self.bias = nn.Parameter(bias)
and whose forward pass is given by

    def forward(self, x):
        batch_size = x.shape[0]
        frob_norms = torch.linalg.norm(x, dim=2)
        bias = self.bias.repeat(batch_size).view(batch_size, -1)
        relative = frob_norms - bias
The shape of x is (batch_size, 100, 2), i.e. each sample is an array of 100 vectors of dimension 2. For each sample, I'd like to subtract the learnable self.bias parameter from the vector norms, using the same biases for every sample. Since I'd like this operation to occur across a batch, I figured I could repeat my self.bias parameter as shown above and then compute the difference. However, when I printed the contents of self.bias during training, I found that it consists purely of NaN values, which prevents the model from learning. How can I correct this issue? I'm hesitant to use detach(), as it's my understanding that this requires passing data to the CPU, which diminishes efficiency.
This should not happen and could point towards invalid gradients etc., so you should check why the bias is being updated with invalid values or overflows.
detach() does not decrease efficiency and does not move the data to the CPU. However, I also don't know how it would help in this case, as detaching a tensor from the computation graph would stop gradient propagation at that point.
Hi @ptrblck. Thanks for your response. How would you suggest I check what’s causing the invalid gradients?
I would start by making sure the bias contains valid values at the beginning, and then observe when the first NaN or Inf is seen in it. Afterwards, I would check its .grad attribute to see whether the gradient is invalid or what's causing the invalid update.
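One way to carry out these checks (a minimal sketch on a stand-in model, not the module from the question) is to enable autograd anomaly detection, which raises an error at the operation that produced a NaN gradient, and to scan parameter values and .grad attributes after each step:

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Anomaly detection pinpoints the op that produced a NaN in backward,
# at the cost of slower backward passes (use only while debugging).
torch.autograd.set_detect_anomaly(True)

x = torch.randn(8, 4)
target = torch.randn(8, 1)

for step in range(3):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    # Check gradients before the update...
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"step {step}: invalid gradient in {name}")
    optimizer.step()
    # ...and parameter values after it.
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            print(f"step {step}: invalid values in {name}")
```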
I found the issue! It's a divide-by-zero error. I have a line further down that reads out = torch.divide(relative, frob_norms). But I'm not sure how to handle this. Since relative depends on self.bias, and out depends on relative, out is added to the computation graph. I tried clamping frob_norms to a minimum value of 1e-20, but this still ends up introducing nan, probably because I'm now getting huge outputs. Ultimately, I would just like torch.divide() to insert a 0 whenever a divide-by-zero would occur, but clearly I cannot naively replace the resulting values with 0 after using torch.divide().
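A gradient-safe way to get a 0 wherever the denominator is 0 is to mask the denominator before dividing, rather than patching NaNs afterwards (a sketch; safe_divide is an illustrative helper, not part of the original code):

```python
import torch

def safe_divide(num, den):
    # Where den == 0, substitute 1 in the denominator and 0 in the result.
    # Because the division itself never sees a zero denominator, the
    # backward pass stays free of NaN gradients.
    mask = den == 0
    safe_den = torch.where(mask, torch.ones_like(den), den)
    return torch.where(mask, torch.zeros_like(num), num / safe_den)

relative = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
frob_norms = torch.tensor([2.0, 0.0, 4.0])

out = safe_divide(relative, frob_norms)
out.sum().backward()
print(out.detach())    # values: [0.5, 0.0, 0.75]
print(relative.grad)   # values: [0.5, 0.0, 0.25]
```

Zeroed entries contribute a zero gradient to self.bias, so the parameter simply receives no update from those positions instead of becoming NaN.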
The common approach is to add a small eps value, such as 1e-6, to the denominator to avoid dividing by zero.
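Applied to the division in question, that looks like the following (a minimal sketch; the eps value is illustrative and a modeling choice):

```python
import torch

eps = 1e-6  # small constant keeping the denominator strictly positive

relative = torch.tensor([1.0, -2.0], requires_grad=True)
frob_norms = torch.tensor([0.0, 4.0])  # norms are non-negative

out = relative / (frob_norms + eps)
out.sum().backward()
# Both the output and the gradient are finite, though entries where the
# norm is near zero can still be large (~num / eps).
```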