Slow AMP (apex/native) on 3060 Ti

Hello there,

I would like to accelerate my model training (mainly linear / conv2d / deconv2d layers) with PyTorch Lightning (1.1.6), so I tried the AMP feature. However, I didn't notice any improvement. So I tried the same thing on a very small sample (without Lightning):

import torch
import time

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
start_time = time.time()

for _ in range(500):
    # Forward pass under autocast so eligible ops run in FP16
    with torch.cuda.amp.autocast(enabled=True):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)

    # Backward pass outside autocast; no GradScaler used in this test
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Wait for queued GPU work to finish before stopping the clock
torch.cuda.synchronize()

compute_time = time.time() - start_time

print(compute_time)

Results:
If autocast(enabled=True) ==> ~0.20 s
If autocast(enabled=False) ==> ~0.15 s

Remarks:

  • I always got the same kind of result after many, many runs
  • Worse results when using GradScaler (roughly the pattern sketched after this list)
  • Same kind of results with NVIDIA Apex AMP (here, slide 28)
  • I tried to profile my Tensor Core usage, but NVIDIA Nsight Compute didn't work
  • Same results if x and y are cast to half() (not recommended, by the way)
  • I followed this tutorial quite carefully
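
For reference, the GradScaler variant follows the standard autocast + GradScaler pattern; here is a minimal self-contained sketch of it on the same toy setup (a sketch of the pattern, not my exact code):

import torch

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(500):
    # Forward pass under autocast
    with torch.cuda.amp.autocast():
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)

    # Scale the loss before backward, step via the scaler, then update its scale
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()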

Can somebody help me with this situation: why is there no speedup?

Thanks a lot in advance for any answer!

My setup:

  • PyTorch 1.7.1
  • GPU: 3060 Ti (Ampere)
  • CUDA 11.2 (EDIT: I tried to downgrade to 11.0 to match the PyTorch build, but the corresponding NVIDIA driver 450 seems too old to support a 3060 GPU)
  • Driver 460.27.04

Hi :slight_smile: I believe that, for such a small tensor, casting to half precision is pure overhead; that is why you are not noticing any speed improvement.
I would suggest you try simple pure PyTorch in your environment with somewhat bigger tensors and a larger computational graph. Something like this could give you a more reliable answer:

import torch
import torchvision
import time

N, D_in, D_out = 64, 224, 1000
x = torch.randn(N, 3, D_in, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torchvision.models.resnet18(pretrained=False).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
start_time = time.time()

for _ in range(500):
    # Toggle enabled=True / enabled=False to compare AMP vs. FP32
    with torch.cuda.amp.autocast(enabled=False):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Wait for queued GPU work to finish before stopping the clock
torch.cuda.synchronize()

compute_time = time.time() - start_time

print(compute_time)

I am currently on a Colab Tesla P100 (which doesn't have Tensor Cores). But at least I can see that the AMP version is not worse (amp=True ~47.02 s, amp=False ~49.44 s).
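
Side note: for timings like these, it can help to do a warm-up iteration and use CUDA events rather than time.time(), so that one-time CUDA/cuDNN setup and queued work don't skew the numbers. A minimal sketch, assuming the same resnet18 setup as above:

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=False).cuda()
x = torch.randn(64, 3, 224, 224, device="cuda")
y = torch.randn(64, 1000, device="cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# Warm-up iteration so one-time CUDA/cuDNN setup is not included in the timing
with torch.cuda.amp.autocast():
    torch.nn.functional.mse_loss(model(x), y).backward()
optimizer.zero_grad()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(500):
    with torch.cuda.amp.autocast():  # set enabled=False for the FP32 baseline
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
end.record()
torch.cuda.synchronize()

# elapsed_time returns milliseconds
print(start.elapsed_time(end) / 1000.0, "s")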

Hi @Alexey_Demyanchuk! Thank you for taking a moment to look at my situation.

I tried your resnet sample (as is); indeed, I got ~45 s without AMP and ~41 s with AMP.

But isn't it supposed to be REALLY faster with Tensor Cores? NVIDIA marketing claims something like a 4x-6x speed gain. Maybe resnet is not the best sample?

Hi @vmacheret, no worries :wink: Regarding the marketing: I would say those are special cases with all the highly optimized techniques from smart NVIDIA engineers you can imagine. So it is rarely the case that one sees a 4x-6x speedup in real life.

Back to the point. ResNet18 is still only a toy example. I believe mixed precision shines on heavy GPU tasks (given that all CPU and I/O bottlenecks are resolved). Keep in mind that AMP training also reduces the memory footprint, so you can try to train with a larger batch size, which means you can iterate over one epoch of your data faster. My suggestion: always try to keep GPU utilization as close to 100% as possible. That is how you make sure the most expensive part of your build doesn't just sit idle waiting for data.
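
For example, here is a minimal sketch of checking how much peak memory a forward/backward pass needs with and without autocast (the helper name peak_memory_mb and the batch size of 128 are just illustrative), which gives you an idea of how much extra batch size AMP could buy you:

import torch
import torchvision

def peak_memory_mb(batch_size, use_amp):
    # Clear cached blocks and reset the peak-memory counter for a clean reading
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    model = torchvision.models.resnet18(pretrained=False).cuda()
    x = torch.randn(batch_size, 3, 224, 224, device="cuda")
    y = torch.randn(batch_size, 1000, device="cuda")

    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    torch.cuda.synchronize()

    # Peak allocated memory during this forward/backward, in MiB
    return torch.cuda.max_memory_allocated() / 1024 ** 2

for use_amp in (False, True):
    print(f"amp={use_amp}: {peak_memory_mb(128, use_amp):.0f} MiB peak")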