[amp] Automatic mixed precision training slower than the normal model

I'm trying to use automatic mixed precision training to speed up training, but it seems I get the opposite result. PyTorch is 1.6, CUDA is 10.1, cuDNN is 8.0.2, and the GPUs are a Tesla K80 and a Tesla T4.

The baseline version of the code is:

import torch

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The time is:
real 0m3.402s
user 0m2.926s
sys 0m4.955s

The amp version of the code is:

import torch
from torch.cuda.amp import autocast, GradScaler
N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
scaler = GradScaler()
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(500):
    with autocast():
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

The time is:
real 0m3.584s
user 0m3.131s
sys 0m4.832s

Is it because the Tesla K80 does not have Tensor Cores, and amp needs Tensor Core support?
I changed to a Tesla T4 and got the same result.

I would recommend profiling the code directly by synchronizing via torch.cuda.synchronize() before starting and stopping the timer, and calculating the average iteration time. Currently you are also measuring the startup time etc.
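
Something like this minimal sketch should give a cleaner number (it reuses model, x, y and optimizer from your snippets above):

import time
import torch

torch.cuda.synchronize()                 # wait for all queued GPU work before starting the timer
start = time.perf_counter()
for _ in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()                 # wait for the GPU to finish before stopping the timer
elapsed = time.perf_counter() - start
print(f"avg iteration time: {elapsed / 500 * 1e3:.3f} ms")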

Yes, this would explain the slowdown on the K80 due to the added overhead in amp.

I’ll try to grab a node with a T4 to reproduce it.

@ptrblck thanks for your reply. The T4 is also slow; I guess it is because the batch_size is too small. I reproduced the code from https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html .
The conclusion is that mixed precision is slower on the Tesla K80 and faster on the T4.
I want to speed up inference, and I only have a K80 dev environment and a T4 production environment. Is it OK to train on the Tesla K80 and run inference on the T4 in amp?

Yes, that should be OK. There is no requirement to use the same GPU for inference and training.
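
As a minimal sketch of what the inference side could look like (the Linear architecture and the model.pt checkpoint path are just placeholders for your real model and checkpoint):

import torch
from torch.cuda.amp import autocast

model = torch.nn.Linear(1024, 512).cuda()        # placeholder architecture
model.load_state_dict(torch.load("model.pt"))    # hypothetical checkpoint saved from the K80 training run
model.eval()

x = torch.randn(64, 1024, device="cuda")
with torch.no_grad(), autocast():                # forward pass runs in float16 where it is considered safe
    y_pred = model(x)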

But the K80 does not support fp16.

It shouldn't matter, or are you seeing different results on the T4 in comparison to the K80?

During inference the results are bad when I use torch.cuda.amp.autocast();
if I remove it, the results are correct.
I guess it is because I can't use mixed precision training on the Tesla K80.

Could you compare the absolute and relative error on this device with and without autocast?
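
For example, something along these lines (a sketch; model and x stand in for your actual model and input):

import torch
from torch.cuda.amp import autocast

with torch.no_grad():
    ref = model(x)                       # float32 reference output
    with autocast():
        out = model(x)                   # mixed-precision output
    out = out.float()
    abs_err = (out - ref).abs().max()
    rel_err = ((out - ref).abs() / ref.abs().clamp_min(1e-12)).max()
print(f"max abs err: {abs_err.item():.3e}, max rel err: {rel_err.item():.3e}")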

Actually I'm running a GPT2LM model, and the difference is large.