Why is there no direct relation between CPU or GPU inference time and the number of parameters in a model?

I am wondering why CPU inference time varies between Vgg16 and ResNet18. I am using the following script to measure the inference time on CPU for three different models, which I trained from scratch on my custom dataset.

Inference time: ResNet18 = 12.88 milliseconds, Vgg16 = 66.85 milliseconds, and my proposed model = 11.72 milliseconds

Also, the number of parameters for each model is as follows:
ResNet18 : 11.1 M
Vgg16: 13.4 M
proposed model: 33 K

The question is: why does ResNet18, with 11.1 M parameters, take ~13 ms, while the proposed model, with only 33 K parameters, takes ~12 ms?

P.S. I measure inference time for one image, 100 times, and then I report the average.

Here is my snippet. Am I missing something?

from ResNet_model_ResNet18  import model
#from vgg_model_vgg16 import model

import torch
import torch.optim as optim
import torch.nn as nn
import time
from PIL import Image
import torchvision.transforms.functional as TF

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

def main():

    epoch = 800
    PATH = 'C:/test_inference_time/resnet18_epoch{}.pth'.format(epoch)
    checkpoint = torch.load(PATH, map_location = 'cpu')
    epoch = checkpoint['epoch']
    with torch.no_grad():

        for step in range(0, 100):
            t0 = time.time_ns()
            width = 32
            height = 32
            t_image = Image.open('C://test_inference_time//test//01a1d54e8b64e81d_b44.png')
            t_image = t_image.resize((width, height), Image.BILINEAR)
            t_image = TF.to_tensor(t_image)
            t_image = t_image.unsqueeze_(0)

            t_image = t_image     #.to(device)

            t1 = time.time_ns() - t0    # time to load and preprocess the image

            logits = model(t_image)
            _, predicted = torch.max(logits, 1)

            t2 = time.time_ns() - t0    # time to load the image, apply the model, and predict

            t3 = t2 - t1                # time to apply the model and predict only

            print('{:.00f} minutes'.format((t3) / 6e10), '{:.00f} second'.format((t3) / 1e9), '{:.00f} milisecond'.format((t3) / 1e+6), '{:.00f} microsecond'.format((t3) / 1000),'{:.00f} nanosec'.format((t3)% 1e9))

if __name__ == '__main__':
    main()

I also measured inference time on a GPU for the same models, and I was surprised by the results: ResNet18 = 10.21 ms, Vgg16 = 5.49 ms, and proposed model = 4.76 ms

Thanks in advance,

Your code looks generally fine for profiling the CPU.
The inference speed might not linearly depend on the number of parameters.
E.g. convolution layers have very few parameters (just the kernels, which are often small and the bias which is also usually small), while the actual operation might be quite expensive.
Also, some operations (e.g. 3x3 kernels) could have been highly optimized in the underlying library (e.g. MKL-DNN).
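
To make the point about convolutions concrete, here is a rough back-of-the-envelope sketch. The layer shapes are made up for illustration (they are not taken from the models in this thread), and the helper names conv2d_cost and linear_cost are hypothetical. It compares the parameter count and the multiply-accumulate (MAC) count of one 3x3 convolution and one fully connected layer:

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out):
    """Parameters and multiply-accumulates (MACs) of a k x k convolution with bias."""
    params = c_out * (c_in * k * k + 1)          # kernels + bias
    macs = c_out * c_in * k * k * h_out * w_out  # each kernel is reused at every output pixel
    return params, macs


def linear_cost(n_in, n_out):
    """Parameters and MACs of a fully connected layer with bias."""
    params = n_out * (n_in + 1)  # weights + bias
    macs = n_out * n_in          # each weight is used once per sample
    return params, macs


# Hypothetical shapes, chosen only for illustration.
conv_params, conv_macs = conv2d_cost(c_in=64, c_out=64, k=3, h_out=32, w_out=32)
fc_params, fc_macs = linear_cost(n_in=512, n_out=1000)

print(f"conv 3x3: {conv_params:,} params, {conv_macs:,} MACs")
print(f"linear:   {fc_params:,} params, {fc_macs:,} MACs")
```

With these shapes the convolution has roughly 14 times fewer parameters than the linear layer (36,928 vs 513,000) but more than 70 times as many MACs (37,748,736 vs 512,000), so a small model built mostly from convolutions on large feature maps can still be slow.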

If you are timing the models on the GPU, don’t forget to add torch.cuda.synchronize() calls before starting and stopping the timers.
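
A minimal sketch of what such a timing loop could look like. The helper time_inference and the toy model are placeholders of mine, not code from this thread; the warm-up count and number of runs are arbitrary. CUDA kernels are launched asynchronously, so without the synchronize calls the timer could stop before the GPU has finished:

```python
import time

import torch
import torch.nn as nn


def time_inference(model, x, n_warmup=10, n_runs=100):
    """Return the mean forward-pass time in milliseconds."""
    use_cuda = x.is_cuda
    model.eval()
    with torch.no_grad():
        # Warm-up runs so one-time costs (memory allocation, kernel selection) are not timed.
        for _ in range(n_warmup):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for pending GPU work before starting the timer
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
        if use_cuda:
            torch.cuda.synchronize()  # wait for the GPU to finish before stopping the timer
        elapsed = time.perf_counter() - start
    return elapsed / n_runs * 1e3


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in model; replace it with the network you want to time.
toy_model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
).to(device)
x = torch.randn(1, 3, 32, 32, device=device)
mean_ms = time_inference(toy_model, x)
print(f"{mean_ms:.3f} ms per forward pass on {device}")
```

The same helper runs on the CPU when no GPU is available; in that case the synchronize calls are simply skipped.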

I am posting my second question here because, as you mentioned, inference time does not depend only on the number of parameters; one of the main factors affecting it is the number of operations.
I counted the number of parameters in the model with the following script.

def count_parameters(model):
    # Sum the element counts of all trainable parameter tensors.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print("number of parameters in the model is: ", count_parameters(model))

@ptrblck I was wondering how I can count the number of operations in the model.
Is there any recommended way to do this?

I’m not sure if there is an easy way to calculate the number of operations.
You could compute the theoretical number of operations, e.g. as done here.
However, this would not necessarily give you the performance of this computation, as it depends highly on the optimization of the performed algorithm.
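
As a rough illustration of a theoretical count, here is a sketch that uses forward hooks to sum the multiply-accumulates (MACs) of the nn.Conv2d and nn.Linear layers during one forward pass. The function count_macs and the toy model are my own placeholders, and the count deliberately ignores bias additions, activations, pooling, normalization, and any other layer type, so treat the result as a lower bound rather than an exact figure:

```python
import torch
import torch.nn as nn


def count_macs(model, x):
    """Count multiply-accumulates in Conv2d and Linear layers for one forward pass."""
    total = 0
    handles = []

    def conv_hook(module, inputs, output):
        nonlocal total
        kh, kw = module.kernel_size
        # Each output element needs (in_channels / groups) * kh * kw multiply-accumulates.
        total += output.numel() * (module.in_channels // module.groups) * kh * kw

    def linear_hook(module, inputs, output):
        nonlocal total
        # Each output element needs in_features multiply-accumulates.
        total += output.numel() * module.in_features

    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            handles.append(m.register_forward_hook(conv_hook))
        elif isinstance(m, nn.Linear):
            handles.append(m.register_forward_hook(linear_hook))

    model.eval()
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()  # leave the model without hooks afterwards
    return total


# Stand-in model; replace it with the network you want to analyze.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)
total_params = sum(p.numel() for p in toy_model.parameters())
total_macs = count_macs(toy_model, torch.randn(1, 3, 32, 32))
print(f"{total_params} parameters, {total_macs} MACs")
```

For this toy model the script reports 314 parameters and 221,264 MACs, almost all of them from the single convolution.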

If you are interested in the performance of a specific method, I think your best bet would be to time it.

Thanks @ptrblck for sharing the link.

Here is an example of calculating the number of operations; hope this helps: test