I would like to accelerate my model training (mainly linear / conv2d / deconv2d layers) with PyTorch Lightning (1.1.6), so I tried to use the AMP feature. However, I didn't notice any real speedup. So I tried the same thing on a very small sample, without Lightning:
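For context, "using the AMP feature" in Lightning 1.1.6 just means passing the precision flag to the Trainer; roughly what I had (the other Trainer arguments here are placeholders for my actual setup):

import pytorch_lightning as pl

# Native AMP is switched on via precision=16; gpus/max_epochs are placeholders
trainer = pl.Trainer(gpus=1, precision=16, max_epochs=10)
# trainer.fit(my_lightning_module)  # my model: mainly linear / conv2d / deconv2d layers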
import torch
import time

N, D_in, D_out = 64, 1024, 512

x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")

model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

start_time = time.time()
for t in range(500):
    # autocast only wraps the forward pass and the loss computation
    with torch.cuda.amp.autocast(enabled=True):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
# wait for all CUDA work to finish before stopping the timer
torch.cuda.synchronize()
compute_time = time.time() - start_time
print(compute_time)
Results:
autocast enabled=True  ==> ~0.20 s
autocast enabled=False ==> ~0.15 s
Remarks:
I always get the same kind of result over many runs
Worse results when using GradScaler (see the sketch below)
Same kind of results with Nvidia apex amp (here, slide 28)
I tried to profile my Tensor Core usage, but Nvidia Nsight Compute didn't work
Same results if x and y are cast to half() (not recommended, by the way)
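For reference, the GradScaler variant I timed followed the standard pattern from the AMP docs, roughly (same model / x / y / optimizer as in the script above):

scaler = torch.cuda.amp.GradScaler()

for t in range(500):
    with torch.cuda.amp.autocast():
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    # scale the loss before backward, then step and update through the scaler
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()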
Hi, I believe casting to half precision on such small tensors is mostly overhead; that is why you are not seeing any speed improvement.
I would suggest you try plain PyTorch in your environment with somewhat bigger tensors and a larger computational graph. Something like this should give you a more reliable answer:
import torch
import torchvision
import time

N, D_in, D_out = 64, 224, 1000

x = torch.randn(N, 3, D_in, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")

model = torchvision.models.resnet18(pretrained=False).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

start_time = time.time()
for t in range(500):
    # toggle enabled=True / enabled=False to compare the two runs
    with torch.cuda.amp.autocast(enabled=False):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()
compute_time = time.time() - start_time
print(compute_time)
I am currently on a Colab Tesla P100 (which doesn't have Tensor Cores). But at least I can see that the AMP version is not worse (amp=True ~ 47.02 s, amp=False ~ 49.44 s).
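If you want to double-check which GPU you were assigned and whether it has Tensor Cores (compute capability 7.0 or higher), a quick check is:

import torch

name = torch.cuda.get_device_name(0)
major, minor = torch.cuda.get_device_capability(0)
print(name, (major, minor))
# P100 reports (6, 0): it can do FP16 arithmetic but has no Tensor Cores,
# so autocast mostly saves memory/bandwidth rather than raw compute time.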
Hi @vmacheret, no worries. Regarding the marketing - I would say those are special cases with all the highly optimized techniques from smart Nvidia engineers you can imagine. So it is rarely the case that one sees a 4x-6x speedup in real life.
Back to the point. ResNet18 is still only a toy example. I believe mixed precision really shines on heavy GPU tasks (given that all CPU and I/O bottlenecks are resolved). Keep in mind that AMP training also reduces the memory footprint, so you can try training with a larger batch size, which means you can iterate over one epoch of your data faster. My suggestion: always try to keep GPU utilization as close to 100% as possible. That is how you make sure the most expensive part of your build doesn't just sit idle waiting for data.
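A quick way to see the memory headroom mixed precision gives you, so you can judge how much larger the batch could be (just a sketch, the numbers depend entirely on your model):

import torch

torch.cuda.reset_peak_memory_stats()
with torch.cuda.amp.autocast():
    y_pred = model(x)
    loss = torch.nn.functional.mse_loss(y_pred, y)
loss.backward()
print(f"peak memory with autocast: {torch.cuda.max_memory_allocated() / 1e6:.0f} MB")
# run the same forward/backward with autocast(enabled=False) and compare;
# the difference is roughly the room you have to grow the batch size.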