Inference time of fp32 and fp16 roughly the same on RTX3090

import torch
from torchvision import transforms
from efficientnet_pytorch import EfficientNet
class modelController:
    def __init__(self):
        self.model = self.get_model()
        self.data_transforms = transforms.Compose([
            transforms.Resize((224,112)),
            transforms.ToTensor(),
            transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
            ])
    def get_model(self):
        model = EfficientNet.from_pretrained('efficientnet-b0', num_classes = 2)
        model.cuda()
        model.eval()
        model.half()
        return model
   
    def get_model_inference(self, img):
        with torch.no_grad():       
            result = self.model(img)
            result = self.get_inference(result)
            return result
import cv2
import numpy as np
from classifier import modelController
import torch
import time
import copy
mc = modelController()

tensor = torch.rand((32,3,224,112))
tensor = tensor.cuda()
tensor = tensor.half()
print(tensor.size())
# print(final_img.size())
total_time = 0
for i in range(50):
  t1 = time.time()
  get_person_cat(tensor)
  t2 = time.time() - t1
  print("time taken: ", t2)
  total_time += t2
print("avg total time: ", total_time/50)

I have an efficientnet-b0 model. I am getting 0.0173 ms average inference time with model.half() and tensor.half() but when I comment out model.half() and tensor.half(), I get 0.0172 ms average inference time. Why is the fp16 model not taking less time to infer?

Hi,

There are two factors that I think are contributing to your results:

  1. You initialise a random tensor at the start of your program, and this same tensor is used to calculate the timing for each sample. I would recommend using real-world data if possible for better results, or passing a new tensor for each inference step (see example below).
  2. The performance difference between torch.float32 and torch.float16 is so small that you are not able to measure it at the current precision level. You could try making the task more difficult for your model; a quick method could be to pass more data for the model to compute.

I would be interested to see if any PyTorch developers have anything to add to this.

import cv2
import numpy as np
from classifier import modelController
import torch
import time
import copy
mc = modelController()

tensor = torch.rand((50,3,224,112))
tensor = tensor.cuda()
tensor = tensor.half()
print(tensor.size())
# print(final_img.size())
total_time = 0
for i in range(50):
  t1 = time.time()
  get_person_cat(tensor[i].unsqueeze(0))
  t2 = time.time() - t1
  print("time taken: ", t2)
  total_time += t2
print("avg total time: ", total_time/50)

I loaded a raw rgb image in the for loop first but no difference in inference time was seen.

With your code, on fp32 I am getting 0.016 ms time and on fp16 I am getting 0.017 ms

@ptrblck can you tell me what I can do to get a smaller inference time with the fp16 model compared with a similar fp32 model?

For this kind of timing you should add a torch.cuda.synchronize() before stopping the time to ensure that no kernels are still in-flight on the GPU when timing stops. You might also want to add a warmup run before timing to make sure the noise from the first run is not included.

Can you give some example code? I am confused about where to add torch.cuda.synchronize(): between or after the for loop?

Sure, the general pattern is like

torch.cuda.synchronize() # clear out everything that might have been going on before
t1 = time.time()
for i in range(iterations):
    do_stuff_on_gpu()
torch.cuda.synchronize() #make sure everything finished
t2 = time.time()
diff = t2 - t1
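
To combine this pattern with the warmup suggestion above, here is a minimal sketch. The names benchmark and compare_fp32_fp16 are made up for illustration and are not part of PyTorch; the driver assumes a CUDA GPU plus the efficientnet_pytorch package and the input shape from the original post. time.perf_counter() is used in place of time.time() only because it is a higher-resolution clock; both return seconds.

```python
import time


def benchmark(fn, sync, warmup=10, iters=50):
    """Return the mean wall-clock seconds per call of fn().

    sync is a zero-argument callable that blocks until all queued device
    work has finished, e.g. torch.cuda.synchronize.
    """
    for _ in range(warmup):  # warmup runs are executed but not timed
        fn()
    sync()  # drain anything still queued before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for the last kernels before stopping the clock
    return (time.perf_counter() - start) / iters


def compare_fp32_fp16(batch_size=32):
    """Hypothetical driver: needs torch, efficientnet_pytorch and a CUDA GPU."""
    import torch
    from efficientnet_pytorch import EfficientNet

    model = EfficientNet.from_pretrained('efficientnet-b0', num_classes=2)
    model = model.cuda().eval()
    x = torch.rand((batch_size, 3, 224, 112), device='cuda')

    results = {}
    with torch.no_grad():
        results['fp32'] = benchmark(lambda: model(x), torch.cuda.synchronize)
        model.half()  # convert the weights to fp16 in place
        x = x.half()
        results['fp16'] = benchmark(lambda: model(x), torch.cuda.synchronize)
    return results
```

Calling compare_fp32_fp16() on the GPU machine returns the average seconds per batch for each precision. Without the two sync() calls the loop mostly measures how long it takes to queue the kernels, which could explain why both precisions appeared to take the same time.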