Without AMP, training converges as expected:
import torch
from torch.cuda.amp import autocast, GradScaler
from torchvision import models

model = models.mobilenet_v2(pretrained=True).cuda()
loss_fnc = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

X = torch.randn((32, 3, 300, 300), dtype=torch.float32).cuda()
y = torch.randint(0, 1000, (32,), dtype=torch.long).cuda()

model.train()
for j in range(30):
    optimizer.zero_grad()
    y_hat = model(X)
    loss = loss_fnc(y_hat, y)
    loss.backward()
    optimizer.step()
    print(loss.item())
Output:
8.039933204650879
5.690041542053223
3.4787116050720215
1.607206106185913
0.6231755614280701
0.23825135827064514
0.08544095605611801
0.04335329309105873
0.016259444877505302
0.01174827478826046
0.0069425650872290134
0.004459714516997337
0.003734807949513197
0.0024659112095832825
0.0027059323620051146
........................
But with AMP it gives NaNs:
model = models.mobilenet_v2(pretrained=True).cuda()
loss_fnc = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)

X = torch.randn((32, 3, 300, 300), dtype=torch.float32).cuda()
y = torch.randint(0, 1000, (32,), dtype=torch.long).cuda()

scaler = GradScaler()

model.train()
for j in range(30):
    optimizer.zero_grad()
    # forward pass and loss in mixed precision
    with autocast():
        y_hat = model(X)
        loss = loss_fnc(y_hat, y)
    # scaled backward, unscaled optimizer step, then update the loss scale
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    print(loss.item())
Output:
8.393239974975586
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
.......................
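As a rough, untested sketch, one way to narrow this down is to check whether non-finite values already appear in the autocast forward pass (before any scaled backward or optimizer step), and to watch the loss scale reported by the scaler:

# Untested sketch: locate where NaN/Inf first appears under autocast
with autocast():
    y_hat = model(X)
    loss = loss_fnc(y_hat, y)
print(torch.isfinite(y_hat).all().item())  # False -> NaN/Inf already in the fp16 logits
print(torch.isfinite(loss).item())         # False -> NaN/Inf in the loss itself
print(scaler.get_scale())                  # current loss scale maintained by GradScaler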