I'm trying to use automatic mixed precision (AMP) training to speed up training, but I seem to get the opposite result. I'm on PyTorch 1.6, CUDA 10.1, cuDNN 8.0.2, with a Tesla K80 and a Tesla T4.
The baseline code is:
import torch
N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for t in range(500):
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
The timing is:
real 0m3.402s
user 0m2.926s
sys 0m4.955s
The AMP version is:
import torch
from torch.cuda.amp import autocast, GradScaler
N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
scaler = GradScaler()
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
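for t in range(500):
    # Clear old gradients before the backward pass; calling zero_grad()
    # between backward() and step() would wipe the fresh gradients.
    optimizer.zero_grad()
    with autocast():
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()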
The timing is:
real 0m3.584s
user 0m3.131s
sys 0m4.832s
I would recommend profiling the code directly: call torch.cuda.synchronize() before starting and before stopping the timer, and calculate the average time per iteration. Currently you are also profiling the startup time etc.
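A minimal sketch of that kind of measurement, using the baseline model from the question. The helper name `average_step_time`, the warm-up count, and the CPU fallback are assumptions made for illustration, not part of the original code.

```python
import time
import torch

def average_step_time(step_fn, n_warmup=10, n_iters=100):
    """Return the mean wall-clock seconds per call of step_fn (hypothetical helper)."""
    use_cuda = torch.cuda.is_available()
    # Warm-up iterations keep one-time costs (CUDA init, kernel selection) out of the timing.
    for _ in range(n_warmup):
        step_fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
    start = time.perf_counter()
    for _ in range(n_iters):
        step_fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait for the timed GPU work before stopping the timer
    return (time.perf_counter() - start) / n_iters

# Falls back to the CPU so the sketch also runs on a machine without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
model = torch.nn.Linear(D_in, D_out).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

def train_step():
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

avg_seconds = average_step_time(train_step)
print(f"average step time: {avg_seconds * 1e3:.3f} ms")
```

The same helper can time the AMP loop body, which gives a per-iteration comparison that excludes process startup.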
Yes, this would explain the slowdown on the K80 due to the added overhead in amp.
I’ll try to grab a node with a T4 to reproduce it.
@ptrblck Thanks for your reply. The T4 is slow in my example; I guess that is because the batch size is too small. I reproduced the code from https://pytorch.org/tutorials/recipes/recipes/amp_recipe.html.
The conclusion is that mixed precision is slower on the Tesla K80 and faster on the T4.
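One likely reason for this difference is hardware support: Tensor Cores were introduced with compute capability 7.0, the T4 reports 7.5, and the K80 reports 3.7. A small sketch for checking the current device follows; the helper `has_tensor_cores` is a hypothetical name, and the threshold is only a rough indicator, not a guarantee of a speedup.

```python
import torch

def has_tensor_cores(capability):
    """Rough check: Tensor Cores first appeared with compute capability 7.0."""
    major, _minor = capability
    return major >= 7

if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)   # e.g. (7, 5) on a T4
    name = torch.cuda.get_device_name(0)
    print(f"{name}: capability {cap}, Tensor Cores: {has_tensor_cores(cap)}")
else:
    print("No CUDA device available")
```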
I want to speed up inference, and I only have a K80 in my dev environment and a T4 in production. Is it OK to train on the Tesla K80 and run inference on the T4 with AMP?
At inference time the results are bad when I use torch.cuda.amp.autocast();
if I remove it, the results are correct.
I guess that is because I can't use mixed precision training on the Tesla K80.
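To narrow down where the bad results come from, it may help to compare the FP32 and autocast outputs on the same input. The sketch below is only a starting point: the `Linear` layer is a stand-in for the real model (an assumption), and it falls back to plain FP32 on a machine without a GPU. Note that autocast casts inputs per operation and leaves the stored parameters in FP32, so a model trained in FP32 can be run under autocast.

```python
import torch
from torch.cuda.amp import autocast

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

torch.manual_seed(0)
# Stand-in for the trained model (assumption); load your own model here instead.
model = torch.nn.Linear(1024, 512).to(device).eval()
x = torch.randn(64, 1024, device=device)

with torch.no_grad():
    out_fp32 = model(x)                  # reference result in full precision
    with autocast(enabled=use_cuda):     # autocast is disabled when no GPU is present
        out_amp = model(x)               # float16 output on a GPU

# Cast back to float32 before comparing so both tensors share a dtype.
max_abs_diff = (out_fp32 - out_amp.float()).abs().max().item()
print(f"autocast output dtype: {out_amp.dtype}, max abs difference: {max_abs_diff:.6f}")
```

A small difference is expected from the reduced precision. A large one would suggest looking for overflow or an operation that is sensitive to float16 in the real model. Newer PyTorch releases also expose the same context manager as `torch.autocast(device_type="cuda")`.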