How to measure time in GPU mode?

I run the code as follows:

```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = '1'

import torch
import torch.nn as nn
from torch.autograd import Variable

import time
from modules import *

def main():
    test_data64 = Variable(torch.randn(1, 64, 56, 56))
    m_baseline64 = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
    time1_list = []
    for i in range(10000):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        m_baseline64(test_data64)
        end.record()

        torch.cuda.synchronize()
        time1_list.append(start.elapsed_time(end))

    print('m_baseline64:%.8f' % (sum(time1_list[1000:]) / len(time1_list[1000:])))

    test_data36 = Variable(torch.randn(1, 36, 56, 56))
    m_baseline36 = nn.Conv2d(36, 36, kernel_size=3, stride=1, padding=1, bias=False)
    time2_list = []
    for i in range(10000):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)

        start.record()
        m_baseline36(test_data36)
        end.record()

        torch.cuda.synchronize()
        time2_list.append(start.elapsed_time(end))

    print('m_baseline36:%.8f' % (sum(time2_list[1000:]) / len(time2_list[1000:])))

if __name__ == '__main__':
    main()
```

The theoretical FLOPs of `m_baseline64` should be about 3.16 times those of `m_baseline36`.
However, when I run the code in GPU mode, the measured times of m_baseline64 and m_baseline36 are 0.100s and 0.089s respectively, which is very strange. When I run the code in CPU mode, the times are 4.536s and 1.684s, which matches the theoretical ratio.
Can anyone help me?
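For reference, that 3.16 ratio can be checked with the usual dense-convolution MAC count C_in · C_out · k² · H_out · W_out (a quick back-of-the-envelope sketch, not a profiler measurement):

```python
def conv2d_macs(c_in, c_out, k, h_out, w_out):
    # multiply-accumulate count of a dense 2D convolution, bias ignored
    return c_in * c_out * k * k * h_out * w_out

macs64 = conv2d_macs(64, 64, 3, 56, 56)  # Conv2d(64, 64, k=3, s=1, p=1) on 56x56
macs36 = conv2d_macs(36, 36, 3, 56, 56)  # Conv2d(36, 36, k=3, s=1, p=1) on 56x56
print(macs64 / macs36)  # (64*64)/(36*36) = 4096/1296 ≈ 3.16
```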

I think it happens because, no matter whether you use 36 or 64 filters, the GPU computes them in parallel. Think of each filter as independent of the others.

Why does this happen? I am only using a single GPU, not multiple GPUs.

Because GPUs are designed as many-core machines. If you check GPU specifications, GPUs have thousands of cores rather than 4-10 like a CPU, and they can run dozens of processes in parallel. That’s why they speed up machine learning, since ML is basically hundreds of independent operations. Each filter in a convolution is totally independent of the others.

It’s easier for you to google why GPUs speed up ML, since it’s a long explanation :slight_smile:


You mean the operation on the GPU is computed on multiple parallel cores (similar to multiple CPUs), while the operation on the CPU runs on just a single core?

@Tony_Lee In short, yes, but it’s not that simple.

Although the GPU specialises in parallel computation, it can only perform vectorized math, in this case on tensors, and not general-purpose work that needs to run sequentially; even callback mechanics might misbehave with GPUs. Also, each individual GPU core is significantly weaker than a CPU core.

In GPU programming we also tend to prefer high-level APIs and avoid dealing directly with threads, processes, and so on.

Here’s an article by NVIDIA that explains it nicely.

Oh, thanks so much!

@JuanFMontesinos I have found that the real-time speedup on the CPU also does not match the theoretical FLOPs reduction rate. What else could bring additional time cost?

Well, the CPU is a complex world. It depends a lot on which library you use, how well optimized it is, which CPU model you have… Keep in mind that not all CPUs can execute the same basic operations.
As a toy example, suppose you have a CPU that can add but not multiply. It will need a lot of cycles to perform a multiplication, whereas if the CPU has a specific module for it, it will be much faster.

In fact, there is one specific operation that is very important for deep learning whose name I forgot. It’s a multiplication + sum (a fused multiply-add).
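For illustration, PyTorch exposes this multiply-plus-add pattern as fused tensor operations (just a sketch of the pattern at the tensor level, not the hardware instruction itself):

```python
import torch

a = torch.randn(4, 4)
b = torch.randn(4, 4)
c = torch.randn(4, 4)

out1 = torch.addcmul(a, b, c)  # element-wise: a + b * c in one call
out2 = torch.addmm(a, b, c)    # matrix form: a + b @ c in one call
```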

What exactly did you find?

I have tested the real-time reduction rate of the convolution operation. The theoretical FLOPs reduction rate is always higher than the real-time reduction rate. For example, when the theoretical FLOPs reduction rate is 26.04%, the real-time reduction rate is about 17%; when the theoretical FLOPs reduction rate is 43.75%, the real-time reduction rate is about 35%. I don’t know what brings the additional time cost.

I have no real answer for that, as it’s hardware-dependent. Are you timing only the convolution itself, without measuring the time required to load the tensors onto the GPU and so on?

Maybe the time to load the tensor is also included. I tested on the CPU as follows:

```python
conv = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
x = torch.randn(1, 64, 56, 56)

time1 = time.time()
conv(x)
time2 = time.time()
print(time2 - time1)
```
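As a side note, timing a single call with `time.time()` is quite noisy; on the CPU, something along these lines (a sketch using `time.perf_counter`, a warm-up, and an average over many repeats) usually gives a steadier number:

```python
import time
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False)
x = torch.randn(1, 64, 56, 56)

# warm-up so one-time setup cost is not measured
for _ in range(10):
    conv(x)

n = 1000
start = time.perf_counter()
with torch.no_grad():
    for _ in range(n):
        conv(x)
elapsed = time.perf_counter() - start
print('avg per call: %.6f s' % (elapsed / n))
```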

Well, that’s okay.
Are you using the following scheme for the GPU?

```python
tensor = tensor.cuda()
convolution = convolution.cuda()
# start timing
convolution(tensor)
# end timing
```

Note that you have to instantiate the class (in both cases) before you start timing.
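Putting the scheme together, a GPU timing loop along these lines should isolate the convolution itself (a sketch: move everything to the GPU first, run warm-up iterations, put CUDA events around only the convolution, and synchronize before reading the events):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False).cuda()
x = torch.randn(1, 64, 56, 56).cuda()

# warm-up: the first calls include cuDNN algorithm selection and other setup
for _ in range(100):
    conv(x)
torch.cuda.synchronize()

times = []
for _ in range(1000):
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    conv(x)
    end.record()
    torch.cuda.synchronize()
    times.append(start.elapsed_time(end))  # milliseconds

print('avg: %.6f ms' % (sum(times) / len(times)))
```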

I just tested in CPU mode.