Hello there,

I would like to accelerate my model training (mainly linear / conv2d / deconv2d) with PyTorch Lightning (1.1.6), so I tried to use the AMP feature. However, I didn't notice any improvement. So I tried the same thing on a very small sample (without Lightning):

```
import time

import torch

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device="cuda")
y = torch.randn(N, D_out, device="cuda")
model = torch.nn.Linear(D_in, D_out).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

start_time = time.time()
for _ in range(500):
    # Forward pass and loss under autocast; backward and step outside.
    with torch.cuda.amp.autocast(enabled=True):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()
compute_time = time.time() - start_time
print(compute_time)
```

**Results:**

**If autocast enabled = True ==> ~0.20 s**

**If autocast enabled = False ==> ~0.15 s**

Remarks:

- I always get the same kind of result after many runs
- Results are worse when using GradScaler
- Same kind of results with NVIDIA Apex AMP (here, slide 28)
- I tried to profile my tensor core usage, but NVIDIA Nsight Compute didn't work
- Same results if x and y are cast to half() (not recommended, by the way)
- I followed this tutorial quite carefully
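For completeness, this is the GradScaler variant of the loop I tried, following the pattern from the torch.cuda.amp docs (a minimal sketch; the `enabled=use_cuda` fallback is only there so the snippet also runs on a machine without a GPU):

```
import time

import torch

# Disable the AMP machinery when no GPU is present so the sketch runs anywhere.
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

N, D_in, D_out = 64, 1024, 512
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)
model = torch.nn.Linear(D_in, D_out).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

start_time = time.time()
for _ in range(500):
    with torch.cuda.amp.autocast(enabled=use_cuda):
        y_pred = model(x)
        loss = torch.nn.functional.mse_loss(y_pred, y)
    # Scale the loss before backward, step via the scaler, then update its scale.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
if use_cuda:
    torch.cuda.synchronize()
print(time.time() - start_time)
```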

**Can somebody help me with this situation: why is there no speedup?**

Thanks a lot in advance for any answers!

My setup:

- PyTorch 1.7.1
- GPU 3060 Ti (Ampere)
- CUDA 11.2 (EDIT: I tried to downgrade to 11.0 to match the PyTorch build version, but the corresponding NVIDIA driver 450 seems too old to support a 3060 GPU)
- driver 460.27.04