Hi,
I noticed today that when I freeze my BatchNorm2d layers and use torch.cuda.amp.GradScaler, my losses explode after just 3 or 4 batches. The same code and parameters give very good results with unfrozen BN layers. I have to scale down the learning rate to get a functioning training process again.
This is obviously not a bug report; I just cannot come up with the reason behind this. I am using a DeepLabV3 with 107 BatchNorm layers. To freeze them, I simply set them to eval mode, roughly as in the sketch below. Maybe someone can help me understand why this is happening.
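To make this concrete, here is a minimal sketch of the freezing step. The DeepLabV3 constructor call and module names are just placeholders, not my exact setup:

```python
import torch
import torchvision

# placeholder model; in my case it is a DeepLabV3 with 107 BN layers
model = torchvision.models.segmentation.deeplabv3_resnet101(num_classes=21)
model.train()  # everything else stays in training mode

for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm2d):
        # eval() freezes the running_mean/running_var updates;
        # note that the affine weight/bias would still receive gradients
        # unless requires_grad is also set to False
        module.eval()
```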
Yes, the model is also needed, as I won’t be able to execute the code otherwise.
Your current code snippet doesn’t show any autocast usage at all, so I’m currently also unsure why you are scaling the gradients in the first place.
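For reference, a typical mixed-precision training loop wraps the forward pass in autocast and then uses the scaler for the backward pass and optimizer step, roughly like this (the model, loss, and data are placeholders, not your code):

```python
import torch

# placeholder model/optimizer/data just to make the snippet self-contained;
# substitute your DeepLabV3 and real inputs
model = torch.nn.Conv2d(3, 1, 3, padding=1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
criterion = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    inputs = torch.randn(2, 3, 32, 32, device="cuda")
    targets = torch.randn(2, 1, 32, 32, device="cuda")

    optimizer.zero_grad()
    # the forward pass and loss should run under autocast; otherwise
    # GradScaler has no mixed-precision computation to protect
    with torch.cuda.amp.autocast():
        outputs = model(inputs)
        loss = criterion(outputs, targets)

    # scale the loss, run backward, then unscale and step via the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```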