After this step, all the gradients and the loss became NaN (and the scale is still 0).
What’s wrong with my model, loss function or training data? How can I fix it?
Is your model generally working fine without using amp?
The loss scaler might run into this “death spiral” of decreasing the scale value, if the model output or loss contains NaN values.
These NaN values in the loss would thus create NaN gradients and the loss scaler will decrease the scale factor as it thinks the gradients are overflowing.
However, the gradients are not actually overflowing; your model is producing invalid outputs.
Could you check the output and loss for NaNs and see if they are also created without amp?
If the model output or losses do not contain NaN values and the scaler is constantly decreasing the scale value, some operation might create large gradients which are still overflowing. Could you post an executable code snippet that reproduces this error?
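For reference, a minimal check could look like this (just a sketch; model, criterion, data and target stand in for your own modules and tensors):

def check_finite(name, tensor):
    # count NaN/Inf entries so an invalid output can be told apart from a real gradient overflow
    n_nan = torch.isnan(tensor).sum().item()
    n_inf = torch.isinf(tensor).sum().item()
    if n_nan or n_inf:
        print(f"{name}: {n_nan} NaNs, {n_inf} Infs")

# run one iteration in full FP32 (no autocast) and compare with the amp run
output = model(data)
loss = criterion(output, target)
check_finite("output", output)
check_finite("loss", loss)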
# in the training loop:
### Network forward, generate SR
with autocast():
    self.fake_H = self.netG(self.var_L)

l_g_total = 0
with autocast():  # this is separated, because there are some conditionals in the middle
    # multiple losses can be used here, but I'm testing with just the one loss in the link above
    loss_results = self.generatorlosses(self.fake_H, self.var_H, self.log_dict, self.f_low)
    l_g_total += sum(loss_results) / self.accumulations

self.amp_scaler.scale(l_g_total).backward()

if (step + 1) % self.accumulations == 0:
    self.amp_scaler.step(self.optimizer_G)
    self.amp_scaler.update()
    self.optimizer_G.zero_grad()
I’m checking isnan() and isinf() on the loss and output right after the loss is calculated. This might be difficult to debug as-is, but any pointers on what I should look at or how to catch the potential overflow would be awesome. It’s the only loss with this behaviour ATM.
Note that a NaN output or loss would be a different issue than a valid output or loss and a constantly reduced scaling factor (and thus also the debugging steps). Which case are you dealing with?
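If it’s the second case (valid loss, shrinking scale), inspecting the unscaled gradients right before the optimizer step would show whether they really overflow. A rough sketch using the names from your snippet:

if (step + 1) % self.accumulations == 0:
    # unscale_ divides the gradients by the current scale so the real values can be inspected;
    # the following step() will not unscale them a second time
    self.amp_scaler.unscale_(self.optimizer_G)
    for name, p in self.netG.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print(f"non-finite gradients in {name}")
    print("current scale:", self.amp_scaler.get_scale())
    self.amp_scaler.step(self.optimizer_G)  # skipped internally if grads are non-finite
    self.amp_scaler.update()
    self.optimizer_G.zero_grad()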
Same as the OP: my scaler’s scale halves each iteration until it reaches a magnitude of 1e-45 and then zeroes out, so all of the model’s outputs become black. There are no NaNs or Infs in the loss or output.
I know it may not be easy to reproduce like this, but maybe you could help with some guidance? I already tried changing the default GradScaler() parameter initialization; increasing the backoff_factor only delays the inevitable and the other parameters don’t really change anything.
After more testing I found out that the problem is in the autocasting of the losses. If I only autocast the forward pass, everything works fine (of course, minus the proper loss handling). After diagnosing the losses, I also found that SSIM causes a similar behavior. I think the important detail is that while the model output is converted into a HalfTensor, the target (the ground truth image) is not; it stays a FloatTensor.
Update: I fiddled around with the targets and that’s not the issue. The types are inconsistent, but it doesn’t change anything if they are the same type.
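For reference, the working variant (loss computed outside autocast) looks roughly like this; the explicit .float() casts are an assumption of mine to keep the dtypes consistent and are not from the original code:

with autocast():
    self.fake_H = self.netG(self.var_L)

# loss computed outside autocast, in full FP32
loss_results = self.generatorlosses(self.fake_H.float(), self.var_H.float(),
                                    self.log_dict, self.f_low)
l_g_total = sum(loss_results) / self.accumulations
self.amp_scaler.scale(l_g_total).backward()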
Which self.distanceType are you using?
I see that meshgrid is used, which could be numerically unstable (but we haven’t seen a failed use case so far).
Could you disable autocast for meshgrid, while enabling it for the general loss calculation, and rerun the script?
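Roughly what I mean, as a toy sketch (the real CX loss internals will of course differ; only the placement of autocast(enabled=False) matters here):

import torch
from torch.cuda.amp import autocast

def toy_loss(output, target):
    # hypothetical stand-in for the real loss, just to show where autocast is disabled
    with autocast(enabled=False):
        # everything in this block runs in FP32; cast any FP16 inputs back up explicitly
        h, w = output.shape[-2:]
        ys, xs = torch.meshgrid(
            torch.arange(h, device=output.device, dtype=torch.float32),
            torch.arange(w, device=output.device, dtype=torch.float32))
        coord_term = (ys - xs).abs().mean()  # placeholder for the meshgrid-based term
    # the rest of the loss still runs under the outer autocast region
    return (output.float() - target.float()).abs().mean() + 0.0 * coord_term

output = torch.randn(1, 3, 32, 32, device="cuda")
target = torch.randn(1, 3, 32, 32, device="cuda")
with autocast():
    loss = toy_loss(output, target)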
I’m using the default distanceType (‘cosine’) and the regular CX loss (‘calculate_CX_Loss()’), so the meshgrid computation is not being used. _random_sampling(), _random_pooling() and _crop_quarters() are also not being used, only _create_using_dotP (cosine distance) and calculate_CX_Loss, in principle.
So far, only taking the whole loss calculation out of autocast allows the scale to remain stable. If I only autocast parts of the loss, wouldn’t the calculations become inaccurate?
In the case of SSIM, the scale also drops heavily, but stabilizes at a scale of 2. The only similarity I see between the two loss functions is that both are in principle distance metrics, but I have no idea if that is related.
That shouldn’t be the case, since autocast should take care of using FP16 where it’s safe to do so and fall back to FP32 if necessary.
Could you post the setup for the loss function, so that I could run it in isolation?
The initial loss scale might drop and stabilize, which is the expected behavior.
However, the outputs (and thus loss) should never contain any NaN values.
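To confirm that, logging the scale after each update is enough; a minimal sketch, assuming the amp_scaler name from the earlier snippet:

if (step + 1) % self.accumulations == 0:
    self.amp_scaler.step(self.optimizer_G)
    self.amp_scaler.update()
    self.optimizer_G.zero_grad()
    # get_scale() returns the current scale factor; in a healthy run it drops and then levels off
    print(f"step {step}: scale = {self.amp_scaler.get_scale()}")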
It is indeed a common operation.
Could you check the stats (min, max, mean, median, shape) of I_features_i and T_features_i, which cause the NaNs?
If possible, could you upload them so that we could use them to debug?
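Something like this, placed right before the failing operation inside the loss, would give both the stats and the raw tensors (sketch; the file name is arbitrary):

for name, t in {"I_features_i": I_features_i, "T_features_i": T_features_i}.items():
    t32 = t.detach().float()
    print(name, "shape:", tuple(t32.shape),
          "min:", t32.min().item(), "max:", t32.max().item(),
          "mean:", t32.mean().item(), "median:", t32.median().item())

# save the raw tensors so the loss can be debugged in isolation
torch.save({"I_features_i": I_features_i.detach().cpu(),
            "T_features_i": T_features_i.detach().cpu()}, "cx_features_debug.pt")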
I also included the GradScaler state_dict() output, in case it is useful. I wouldn’t know what to say about the stats, but there are no NaNs. The values are small, but I don’t know if that could cause problems. It’s at that convolution where the problem is happening.