Overflow in Mixed Precision training using Apex

MaveriQ · January 21, 2019, 3:24pm

Hello.

I am trying to use Apex mixed precision training (NVIDIA/apex on GitHub, a PyTorch extension with tools for mixed precision and distributed training) on a network made of convolution and batchnorm layers. I get an overflow in optimizer.step() on every iteration until the dynamic loss scale drops to 1.0, and then it stays at 1.0:

“OVERFLOW! Skipping step. Attempted loss scale: 1.0, reducing to 1”

I am using the following network:

ResnetDownBlock(
  (bn1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (lrelu1): LeakyReLU(negative_slope=0.1)
  (conv1): Conv2d(3, 3, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
  (bn2): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (lrelu2): LeakyReLU(negative_slope=0.1)
  (conv2): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  (main): Sequential(
    (0): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (1): LeakyReLU(negative_slope=0.1)
    (2): Conv2d(3, 3, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (3): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (4): LeakyReLU(negative_slope=0.1)
    (5): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
  )
  (shortcut): Sequential(
    (0): Conv2d(3, 3, kernel_size=(1, 1), stride=(2, 2), bias=False)
    (1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): LeakyReLU(negative_slope=0.1)
  )
)

Then I convert the network to mixed precision taking care of the batchnorm:

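import apex
import torch
import torch.nn as nn
import torch.optim as optim

# Cast the model to FP16, then convert the batchnorm layers back to FP32
net = apex.fp16_utils.BN_convert_float(ResnetDownBlock(3, 1).cuda().half())
opt = optim.Adam(net.parameters())
opt = apex.fp16_utils.FP16_Optimizer(opt, dynamic_loss_scale=True)

pred = net(torch.rand(8, 3, 2, 2).cuda().half()).squeeze()
loss = nn.MSELoss()(pred.float(), torch.ones(8).cuda())
opt.backward(loss)  # FP16_Optimizer applies the loss scale here
opt.step()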

The last step produces the warning “OVERFLOW! Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0”, and the loss scale keeps being reduced until it reaches 1.0.
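For context, the log message above is consistent with a backoff rule that halves the loss scale after every overflow. The sketch below is a toy model of that rule in plain Python, not Apex's actual implementation: the class `ToyDynamicLossScaler`, its parameters, and the helper functions are made up for illustration, and the usual growth of the scale after a run of clean steps is left out.

```python
FP16_MAX = 65504.0  # largest finite float16 value


class ToyDynamicLossScaler:
    """Toy scaler: halve the scale after an overflow, never below min_scale."""

    def __init__(self, init_scale=2.0 ** 32, backoff=2.0, min_scale=1.0):
        self.scale = init_scale
        self.backoff = backoff
        self.min_scale = min_scale

    def update(self, overflow):
        """Return True if the optimizer step would be applied."""
        if overflow:
            # Skip the step and back off, clamped at the floor.
            self.scale = max(self.scale / self.backoff, self.min_scale)
            return False
        return True


def steps_until_floor(scaler):
    """Count skipped steps when every backward pass overflows."""
    skipped = 0
    while scaler.scale > scaler.min_scale:
        scaler.update(overflow=True)
        skipped += 1
    return skipped


def largest_safe_gradient(scale):
    """Largest unscaled gradient magnitude whose scaled value fits in float16."""
    return FP16_MAX / scale


if __name__ == "__main__":
    scaler = ToyDynamicLossScaler()
    print("skipped steps:", steps_until_floor(scaler))
    print("final scale:", scaler.scale)
    print("safe gradient at 2**32:", largest_safe_gradient(2.0 ** 32))
```

Under these assumptions, an overflow at a scale of 2**32 is unremarkable, since any gradient above roughly 1.5e-5 already exceeds the float16 range once scaled. An overflow that persists at a scale of 1.0 is different: it suggests the unscaled gradients themselves are non-finite or out of range.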

I checked whether the network parameters were converted to the appropriate precision and got the following:

for tag, param in net.named_parameters():
    print('Parameter: {}\t Datatype: {}'.format(tag, param.dtype))

Parameter: bn1.weight          Datatype: torch.float32
Parameter: bn1.bias            Datatype: torch.float32
Parameter: conv1.weight        Datatype: torch.float16
Parameter: bn2.weight          Datatype: torch.float32
Parameter: bn2.bias            Datatype: torch.float32
Parameter: conv2.weight        Datatype: torch.float16
Parameter: shortcut.0.weight   Datatype: torch.float16
Parameter: shortcut.1.weight   Datatype: torch.float32
Parameter: shortcut.1.bias     Datatype: torch.float32

This looks correct, but I can’t get rid of the overflow.

I would appreciate any help.

Thanks!

MaveriQ · April 3, 2019, 3:39pm

So for anyone who reaches this thread, the matter was solved with the new Apex API.

Reference: https://github.com/NVIDIA/apex/issues/238
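Assuming "the new Apex API" means the unified `apex.amp` interface (`amp.initialize` plus `amp.scale_loss`), the setup from the first post might be rewritten roughly as follows. This is a sketch under that assumption, not the exact code from the linked issue. The helper names `setup_amp` and `amp_step` are made up for illustration, and calling them requires Apex and a CUDA GPU.

```python
def setup_amp(net, opt_level="O1"):
    """Wrap an FP32 model and a fresh Adam optimizer with apex.amp."""
    # Imports are kept inside the functions so this file can be loaded
    # without Apex installed; calling the functions needs Apex and CUDA.
    import torch.optim as optim
    from apex import amp

    # Leave the model in FP32: no .half() and no BN_convert_float.
    net = net.cuda()
    opt = optim.Adam(net.parameters())

    # amp.initialize returns the patched model and optimizer.
    # With opt_level="O1", dynamic loss scaling is the default.
    net, opt = amp.initialize(net, opt, opt_level=opt_level)
    return net, opt


def amp_step(net, opt, inputs, targets):
    """Run one training step and return the unscaled loss value."""
    import torch.nn as nn
    from apex import amp

    pred = net(inputs.cuda()).squeeze()
    loss = nn.MSELoss()(pred.float(), targets.cuda())

    opt.zero_grad()
    # scale_loss takes the place of FP16_Optimizer.backward(): backward
    # runs on the scaled loss inside the context manager.
    with amp.scale_loss(loss, opt) as scaled_loss:
        scaled_loss.backward()
    opt.step()
    return loss.item()
```

With the network from the first post, usage would look like `net, opt = setup_amp(ResnetDownBlock(3, 1))` followed by `amp_step(net, opt, torch.rand(8, 3, 2, 2), torch.ones(8))`, passing FP32 inputs and leaving the casting to amp.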