[amp] Automatic mixed precision training slower than the normal model

I'm trying to use automatic mixed precision training to speed up training, but it seems I get the opposite result. PyTorch is 1.6, CUDA is 10.1, cuDNN is 8.0.2, and the GPUs are a Tesla K80 and a Tesla T4.

The baseline version of the code is:

import torch

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

The time is:
real 0m3.402s
user 0m2.926s
sys 0m4.955s

The amp version of the code is:

import torch
from torch.cuda.amp import autocast, GradScaler
N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
scaler = GradScaler()
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for _ in range(500):
    with autocast():
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

The time is:
real 0m3.584s
user 0m3.131s
sys 0m4.832s

Is it because the Tesla K80 does not have Tensor Cores, and amp needs Tensor Core support?
I changed to a Tesla T4 and got the same result.

I would recommend profiling the code directly by synchronizing via torch.cuda.synchronize() before starting and stopping the timer, and calculating the average iteration time. Currently you are also measuring the startup time etc.
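
Something like this minimal sketch should give a cleaner number (it reuses model, x, y and optimizer from your snippets above):

import time
import torch

torch.cuda.synchronize()                 # wait for all queued GPU work before starting the timer
start = time.perf_counter()
for _ in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()                 # wait for the GPU to finish before stopping the timer
elapsed = time.perf_counter() - start
print(f"avg iteration time: {elapsed / 500 * 1e3:.3f} ms")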

Yes, this would explain the slowdown on the K80 due to the added overhead in amp.

I’ll try to grab a node with a T4 to reproduce it.

@ptrblck thanks for your reply. The T4 is also slow; I guess it is because the batch_size is too small. I reproduced the code from https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html .
The conclusion is that mixed precision is slower on the Tesla K80 and faster on the T4.
I want to speed up inference, and I only have a K80 dev environment and a T4 production environment. Is it OK to train on the Tesla K80 and run inference on the T4 in amp?

Yes, that should be OK. There is no requirement to use the same GPU for inference and training.
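
As a minimal sketch of what the inference side could look like (the Linear architecture and the model.pt checkpoint path are just placeholders for your real model and checkpoint):

import torch
from torch.cuda.amp import autocast

model = torch.nn.Linear(1024, 512).cuda()        # placeholder architecture
model.load_state_dict(torch.load("model.pt"))    # hypothetical checkpoint saved from the K80 training run
model.eval()

x = torch.randn(64, 1024, device="cuda")
with torch.no_grad(), autocast():                # forward pass runs in float16 where it is considered safe
    y_pred = model(x)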

But the K80 does not support fp16.

It shouldn't matter, or are you seeing different results on the T4 in comparison to the K80?

During inference the results are bad when I use torch.cuda.amp.autocast();
if I remove it, the results are correct.
I guess it is because I can't use mixed precision training on the Tesla K80.

Could you compare the absolute and relative error on this device with and without autocast?
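
For example, something along these lines (a sketch; model and x stand in for your actual model and input):

import torch
from torch.cuda.amp import autocast

with torch.no_grad():
    ref = model(x)                       # float32 reference output
    with autocast():
        out = model(x)                   # mixed-precision output
    out = out.float()
    abs_err = (out - ref).abs().max()
    rel_err = ((out - ref).abs() / ref.abs().clamp_min(1e-12)).max()
print(f"max abs err: {abs_err.item():.3e}, max rel err: {rel_err.item():.3e}")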

Actually I'm running a GPT2LM model, and the difference is large.