Why did the scale become zero when using torch.cuda.amp.GradScaler?

I print the scale after each step’s update as follows:

print(scaler.get_scale())
...
65536.0
32768.0
32768.0
16384.0
8192.0
4096.0
...
1e-xxx
...
0.0
0.0
0.0

And all the gradients and the loss became NaN after this step (the scale is still 0).
What’s wrong with my model, loss function, or training data? How can I fix it?
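
For context, my loop follows the standard AMP pattern, roughly like this (a minimal sketch with a toy model, not my actual training code):

import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# toy model and data just to show the loop structure; my real model is different
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
scaler = GradScaler()
for _ in range(10):
    data = torch.randn(8, 16, device="cuda")
    target = torch.randn(8, 1, device="cuda")
    optimizer.zero_grad()
    with autocast():
        output = model(data)
        loss = criterion(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(scaler.get_scale())  # the values printed above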


Is your model generally working fine without using amp?
The loss scaler might run into this “death spiral” of decreasing the scale value, if the model output or loss contains NaN values.
These NaN values in the loss would then create NaN gradients, and the loss scaler decreases the scale factor because it thinks the gradients are overflowing.

In fact, however, the gradients are not overflowing; your model is producing invalid outputs.
Could you check the output and loss for NaNs, and verify whether they are also created without amp?
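
Something like this small helper after the forward pass (run with and without amp for comparison) would narrow it down; check_finite is just an illustrative name:

import torch

def check_finite(name, tensor):
    # print whether a tensor contains any NaN or Inf values
    print(name,
          "nan:", torch.isnan(tensor).any().item(),
          "inf:", torch.isinf(tensor).any().item())

# e.g. after the forward pass:
# check_finite("output", output)
# check_finite("loss", loss)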

Hello! I’m facing the exact same situation as the OP: the scale just halves after each iteration until it becomes zero.

In my case, not only does the model work fine without AMP, but only one out of multiple losses is having this issue. The others work fine with AMP.

I’m also checking both the loss and the output with:

torch.isnan(l_g_total).any() or torch.isinf(l_g_total).any()

And there are no NaNs or Infs.

What could be causing this behavior?

If the model output or losses do not contain NaN values and the scaler is constantly decreasing the scale value, some operation might be creating large gradients, which are still overflowing. Could you post an executable code snippet that reproduces this error?
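
You could also unscale the gradients manually and inspect them directly before the step; here is a rough sketch (check_grad_overflow is just an illustrative name):

import torch

def check_grad_overflow(scaler, optimizer, model):
    # Unscale the gradients in-place, then look for non-finite values.
    # Call this after scaler.scale(loss).backward() and before scaler.step(optimizer).
    scaler.unscale_(optimizer)
    for name, p in model.named_parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            print("non-finite gradient in", name)
    # scaler.step(optimizer) can still be called afterwards; it will skip
    # the parameter update if non-finite gradients were found.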

Hello again @ptrblck, thanks for your reply! :smiley:

A working snippet will be difficult, because the training is spread across multiple files and uses a custom loss (https://github.com/victorca25/BasicSR/blob/3fb5851af3a411df153f0b0e5873f3b2de324a3d/codes/models/modules/loss.py#L767), but I think there’s nothing out of the ordinary. This is how the code looks:

from torch.cuda.amp import autocast, GradScaler
# In the init:
self.amp_scaler = GradScaler()
# In the training loop:
### Network forward, generate SR
with autocast():
    self.fake_H = self.netG(self.var_L)
l_g_total = 0
with autocast():  # separated, because there are some conditionals in the middle
    # multiple losses can be used here, but I'm testing with just the one loss in the link above
    loss_results = self.generatorlosses(self.fake_H, self.var_H, self.log_dict, self.f_low)
    l_g_total += sum(loss_results) / self.accumulations
# gradient accumulation: backward on every step, optimizer step every `accumulations` steps
self.amp_scaler.scale(l_g_total).backward()
if (step + 1) % self.accumulations == 0:
    self.amp_scaler.step(self.optimizer_G)
    self.amp_scaler.update()
    self.optimizer_G.zero_grad()

I’m checking isnan() and isinf() for the loss and the output right after the loss calculation. This might be difficult to debug as-is, but any pointers on what I should look at or how to catch the potential overflow would be awesome. It’s the only loss with this behaviour ATM.

The code looks alright.

Note that a NaN output or loss is a different issue from a valid output or loss with a constantly decreasing scaling factor (and the debugging steps differ accordingly). Which case are you dealing with?

The latter, same as the OP: my scaler’s scale halves every iteration until it reaches a magnitude of about 1e-45 and then drops to zero, so all of the model’s outputs become black. There are no NaNs or Infs in the loss or output.

Thanks for the follow-up.
How could we reproduce this issue, i.e. which input shapes and values are you using for which model and loss function?

I don’t currently have the AMP code in the repo (because of this bug), but I’m using the ESRGAN model for 4x SISR (RRDBNet: https://github.com/victorca25/BasicSR/blob/3fb5851af3a411df153f0b0e5873f3b2de324a3d/codes/models/modules/architectures/RRDBNet_arch.py), the inputs are images of size 32x32, and the target patch size is 128x128. The loss function is the contextual loss (https://github.com/victorca25/BasicSR/blob/3fb5851af3a411df153f0b0e5873f3b2de324a3d/codes/models/modules/loss.py#L767). I’m not using other losses or a discriminator, only a generator. The model “trainer” is here: https://github.com/victorca25/BasicSR/blob/3fb5851af3a411df153f0b0e5873f3b2de324a3d/codes/models/SRRaGAN_model.py (minus the AMP parts), but it’s really no different from the official documentation.

I know it may not be easy to reproduce like this, but maybe I could help with some guidance? I already tried changing the default GradScaler() parameters: increasing the backoff_factor only delays the inevitable, and the other parameters don’t really change anything.
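
For reference, these are the GradScaler knobs I was fiddling with (the values shown here are just the defaults, not my exact experiments):

from torch.cuda.amp import GradScaler

scaler = GradScaler(
    init_scale=2.**16,     # starting scale (65536.0, matching the first printed value)
    growth_factor=2.0,     # scale is multiplied by this after enough good steps
    backoff_factor=0.5,    # scale is multiplied by this when inf/NaN grads are found
    growth_interval=2000,  # number of consecutive good steps before growing the scale
)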

Hello again!

After more testing I found out that the problem is in the autocasting of the losses. If I only autocast the forward pass, everything works fine (minus the proper loss handling, of course). While diagnosing the losses, I also found that SSIM caused a similar behavior, but I think the important thing is that while the model output is converted into a HalfTensor, the target (the ground-truth image) is not; it stays a FloatTensor.

Update: I fiddled around with the targets and that’s not the issue. The types are inconsistent, but it doesn’t change anything if they are made the same type.

Which self.distanceType are you using?
I see that meshgrid is used, which could be numerically unstable (but we haven’t seen a failed use case so far).
Could you disable autocast for meshgrid, while enabling it for the general loss calculation, and rerun the script?
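
The pattern would look roughly like this (a minimal sketch with dummy tensors; the exp call only stands in for the meshgrid part of the loss):

import torch
from torch.cuda.amp import autocast

a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
with autocast():
    c = a @ b                      # computed in float16 under autocast
    with autocast(enabled=False):  # autocast is disabled inside this block
        # tensors coming from the autocast region may be float16,
        # so cast them back to float32 explicitly
        d = torch.exp(c.float())   # stand-in for the meshgrid part of the loss
    loss = d.sum()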

Hello!

I’m using the default distanceType (‘cosine’) and the regular CX loss (‘calculate_CX_Loss()’), so the meshgrid computation is not being used. _random_sampling(), _random_pooling() and _crop_quarters() are also not being used, only _create_using_dotP (cosine distance) and calculate_CX_Loss, in principle.

So far, only taking the whole loss calculation out of autocast allows the scale to remain stable. If I only autocast parts of the loss, wouldn’t the calculations become inaccurate?

In the case of SSIM, the scale also drops heavily, but stabilizes at a scale of 2. The only similarity I find between the two loss functions is that both are in principle distance metrics, but I have no idea if that is related.

That shouldn’t be the case, since autocast should take care of using FP16 where it’s safe to do so, and fall back to FP32 where necessary.
Could you post the setup for the loss function, so that I could run it in isolation?

The initial loss scale might drop and stabilize, which is the expected behavior.
However, the outputs (and thus loss) should never contain any NaN values.

Awesome! I’ll be testing that today then.

Yes, for normal use it should be like this:

layers = {"conv_3_2": 1.0, "conv_4_2": 1.0}
Contextual_Loss(layers, crop_quarter=False, max_1d_size=64, distance_type='cosine', b=1.0, band_width=0.5, use_vgg=True, net='vgg19', calc_type='regular')

Perfect! That makes sense.

Thanks a lot for your help!

Hello again!

Thanks to your suggestion of autocasting only parts of the loss, I found the exact line causing the scale to drop; it’s this one: https://github.com/victorca25/BasicSR/blob/3fb5851af3a411df153f0b0e5873f3b2de324a3d/codes/models/modules/loss.py#L704

dist = F.conv2d(I_features_i, T_features_i).permute(0, 2, 3, 1).contiguous()

If just this line is put inside an autocast context, the scale is halved every iteration.

Any suggestions on what could be the reason? It doesn’t appear to be anything special there.

It is indeed a common operation.
Could you check the stats (min, max, mean, median, shape) of I_features_i and T_features_i, which cause the NaNs?
If possible, could you upload them so that we could use them to debug?
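
Something along these lines right before the conv2d call would give us the numbers (tensor_stats is just an illustrative helper name):

import torch

def tensor_stats(name, t):
    # print shape, dtype, and basic statistics (computed in float32)
    t32 = t.detach().float()
    print(name, tuple(t.shape), t.dtype,
          "min:", t32.min().item(),
          "max:", t32.max().item(),
          "mean:", t32.mean().item(),
          "median:", t32.median().item())

# e.g. right before the F.conv2d call:
# tensor_stats("I_features_i", I_features_i)
# tensor_stats("T_features_i", T_features_i)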

I don’t know if this works, but here is the output from the first few iterations: https://pastebin.com/YgH70rth

I also included the GradScaler state_dict output, in case it is useful. I wouldn’t know what to say about the stats, but there are no NaNs. The values are small, but I don’t know if that could cause problems? It’s at that convolution where the problem happens.

I’ve now committed the code with AMP to the branch (https://github.com/victorca25/BasicSR/tree/dev2/codes).

I’m giving up on making AMP work properly past that convolution in the CX loss. I can’t find the cause of the problem, so as a workaround I’m forcing the tensors to be converted to FP32 right after that convolution for now (https://github.com/victorca25/BasicSR/blob/57577667067c99194098ce034a97d0bef028620d/codes/models/modules/loss.py#L707).

It runs and the other losses work fine. The forward pass also works fine in the autocast context, so at least it’s better than nothing.
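
In case it helps someone else, the workaround boils down to something like this (a rough sketch, not the exact code in the repo; cosine_dist_fp32 is just an illustrative name):

import torch
import torch.nn.functional as F

def cosine_dist_fp32(I_features_i, T_features_i):
    # run the conv, then cast the result back to FP32 so the rest of the
    # CX loss runs in full precision even inside an autocast region
    dist = F.conv2d(I_features_i, T_features_i).permute(0, 2, 3, 1).contiguous()
    return dist.float()  # force FP32 right after the problematic convolution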