Torch.cuda.amp inferencing slower than normal

I am trying to run inference with a standard resnet18 model from torchvision.models. The model was trained without any mixed precision, purely in FP32. However, I want faster inference, so I enabled torch.cuda.amp.autocast() only while running a test inference case.

The code for both cases is given below -

model = torchvision.models.resnet18()
model = model.to(device) # Pushing to GPU

# Train the model normally

Without amp -

tensor = torch.rand(1,3,32,32).to(device) # Random tensor for testing
with torch.no_grad():
  model.eval()
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  model(tensor) # warmup
  model(tensor) # warmup
  start.record()
  for i in range(20): # total time over 20 iterations 
    model(tensor)
  end.record()
  torch.cuda.synchronize()
    
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end)/20))

  execution time in milliseconds: 5.264944076538086

With amp -

tensor = torch.rand(1,3,32,32).to(device)
with torch.no_grad():
  model.eval()
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  model(tensor) # warmup (still in FP32, outside autocast)
  model(tensor) # warmup

  start.record()
  with torch.cuda.amp.autocast(): # autocast initialized
    for i in range(20):
      model(tensor)
  end.record()
  torch.cuda.synchronize()
  
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end)/20))

  execution time in milliseconds: 10.619884490966797

Clearly, the autocast()-enabled code is taking roughly double the time. Even with larger models like resnet50, the relative timing difference is approximately the same.
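
For reference, a variant of the amp timing loop in which the warm-up iterations also run inside the autocast context (so the initial FP16 weight casts and cuDNN algorithm selection are not part of the measured time) would look like this - just a sketch, reusing the same model, device and tensor as above:

tensor = torch.rand(1,3,32,32).to(device)
model.eval()
with torch.no_grad(), torch.cuda.amp.autocast():
  model(tensor) # warmup inside autocast
  model(tensor) # warmup inside autocast
  torch.cuda.synchronize() # make sure the warmup has finished before timing
  start = torch.cuda.Event(enable_timing=True)
  end = torch.cuda.Event(enable_timing=True)
  start.record()
  for i in range(20): # total time over 20 iterations
    model(tensor)
  end.record()
  torch.cuda.synchronize()
  print('execution time in milliseconds: {}'.format(start.elapsed_time(end)/20))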

Can someone help me out regarding this? I am running this example on Google Colab, and below are the specifications of the GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   43C    P0    28W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
torch.version.cuda == 10.1
torch.__version__  == 1.8.1+cu101

The P100 doesn’t have Tensor Cores, so while I wouldn’t expect a slowdown (this seems to be bad), I also wouldn’t expect to see a huge increase in performance.
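
One quick way to confirm this is to check the compute capability of the device - Tensor Cores need compute capability 7.0 or higher (Volta/Turing/Ampere), while the P100 is 6.0. A small sketch, assuming device 0 is the GPU in use:

import torch

print(torch.cuda.get_device_name(0))       # e.g. Tesla P100-PCIE-16GB
print(torch.cuda.get_device_capability(0)) # (6, 0) on a P100, (7, 5) on a T4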

UPDATE: I executed the same code mentioned above, but on a different GPU, a Tesla T4 (around 320 Tensor Cores). There is a noticeable improvement in the execution time, both with and without amp -

Without amp -

execution time in milliseconds: 3.9147518157958983

With amp -

execution time in milliseconds: 3.4673088073730467

The execution time with autocasting is slightly better than the one without it. However, the difference is not as great as expected (at least a 2x speedup would have been preferable).

What can be the reason for this?
Is there any bug in the code?
Is the GPU ineffective?
Is the resnet18 model too small and simple to show any significant execution time difference? (A quick way to test this is sketched below.)
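
Regarding that last question - one way to check it would be to rerun the exact same timing loops with a larger batch and the standard 224x224 input resolution, where the convolutions have far more work to do (the batch size of 64 below is just an arbitrary example):

tensor = torch.rand(64, 3, 224, 224).to(device) # larger batch and full-size input
# ...then repeat the with/without amp measurements above unchanged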

Reminder

I’d be grateful if someone could answer the questions asked in the previous response!