Different running time in nn.Conv2d

I have encountered a problem with the forward propagation; the full code is as follows.
I want to accurately record the running time of the nn.Conv2d module, where time1 is about
0.00009901s but time2 is about 0.00011235s. As can be seen in the code, the convolution FLOPs are identical, so I don’t know what causes the time difference between time1 and time2.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = '7'

import torch
import torch.nn as nn
from torch.autograd import Variable

import time
import numpy as np
import random

def main():
    
    test_input1 = Variable(torch.rand(1,36,56,56)).cuda()
    test_input2 = Variable(torch.rand(1,64,56,56)).cuda()
    m1 = torch.nn.Conv2d(36, 36, kernel_size=3, stride=1, padding=1).cuda()

    time1_list = []
    for i in range(10000):
        torch.cuda.synchronize()
        t1 = time.time()
        temp_out = m1(test_input1)
        torch.cuda.synchronize()
        t2 = time.time()
        time1_list.append(t2 - t1)
    print('time1:%.8f' %(sum(time1_list[50:])/len(time1_list[50:])))



    output = Variable(torch.zeros(1, 64, 56, 56)).cuda()
    k_in_mask = torch.from_numpy(np.array(random.sample(range(0,64), 36))).cuda()
    k_out_mask = torch.from_numpy(np.array(random.sample(range(0,64), 36))).cuda()
    
    time2_list = []
    for i in range(10000):
        temp_in = torch.index_select(test_input2, 1, Variable(k_in_mask))

        torch.cuda.synchronize()
        t1 = time.time()
        temp_out = m1(temp_in)
        torch.cuda.synchronize()
        t2 = time.time()
        time2_list.append(t2 - t1)

        # index_copy_ is the in-place variant; plain index_copy returns a
        # new tensor, so calling it without assigning would discard the result
        output.index_copy_(1, Variable(k_out_mask), temp_out)
    print('time2:%.8f' %(sum(time2_list[50:])/len(time2_list[50:])))

if __name__=='__main__':
    main()

Hi,

I’m not sure what the reason for this small difference is.
Why do you repeatedly perform the index select/copy in the loop? It might be adding some noise.

I run 10000 times just to reduce the randomness of the measurement. I found that when I run only a few times (about 3), the running time is several times longer than the normal running time, so I run 10000 times and discard the first 50 to obtain a more accurate average. But as shown in the code, the FLOPs of time1 and time2 are identical while their running times differ, and I don’t know what brings the additional time cost.
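(For reference, an alternative to time.time() plus synchronize for this kind of measurement is CUDA event timing. A minimal sketch of the warm-up-then-average pattern, with shapes matching the code above and arbitrary iteration counts:)

import torch

conv = torch.nn.Conv2d(36, 36, kernel_size=3, stride=1, padding=1).cuda()
x = torch.rand(1, 36, 56, 56).cuda()

# Warm-up: the first iterations pay one-off CUDA/cuDNN setup costs
for _ in range(50):
    conv(x)
torch.cuda.synchronize()

# CUDA events time the GPU work itself rather than host-side wall clock
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(1000):
    conv(x)
end.record()
torch.cuda.synchronize()  # wait until both events have been recorded

ms_total = start.elapsed_time(end)  # milliseconds between the two events
print('per-call time: %.8fs' % (ms_total / 1000 / 1000))  # / iters, ms -> s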

Hi,

The problem is that the following two lines:

temp_in = torch.index_select(test_input2, 1, Variable(k_in_mask))
# and
output.index_copy_(1, Variable(k_out_mask), temp_out)

add some noise to the runtime measurement.
If you move them outside of the loop, the runtime goes back to the same value.
If you replace them with other ops like temp_in = temp_in + 1, you will see the same noise.
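(In other words, a sketch of the measurement loop with the index ops hoisted out, reusing the definitions from the original code:)

# Index ops moved out of the timed loop, so only the convolution is measured
temp_in = torch.index_select(test_input2, 1, Variable(k_in_mask))

time2_list = []
for i in range(10000):
    torch.cuda.synchronize()
    t1 = time.time()
    temp_out = m1(temp_in)
    torch.cuda.synchronize()
    t2 = time.time()
    time2_list.append(t2 - t1)

output.index_copy_(1, Variable(k_out_mask), temp_out)
print('time2:%.8f' % (sum(time2_list[50:]) / len(time2_list[50:])))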

I also observe this phenomenon, but my project is structured as follows:

temp_in = torch.index_select(test_input2, 1, Variable(k_in_mask))
# ... other functions ...
output.index_copy_(1, Variable(k_out_mask), temp_out)

And I do need to record the real runtime. Is there any way to calculate the time?
Thanks so much for your reply!

I think that you hit a case where there is no “real runtime”: doing other ops will influence the runtime a little for some reason. The difference is fairly small and very dependent on the hardware you use: I tried on a Titan Black and both runtimes are almost exactly the same.

PS: You don’t need to use Variables anymore; you can just remove all of them, as every Tensor is a Variable now.
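(For example, the setup from the original code minus the wrappers; a sketch for PyTorch 0.4+, reusing the names defined above:)

# PyTorch 0.4+: every Tensor carries autograd state, no Variable needed
test_input2 = torch.rand(1, 64, 56, 56).cuda()
k_in_mask = torch.from_numpy(np.array(random.sample(range(0, 64), 36))).cuda()

temp_in = torch.index_select(test_input2, 1, k_in_mask)  # was Variable(k_in_mask)
output.index_copy_(1, k_out_mask, temp_out)              # was Variable(k_out_mask)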

albanD, thanks so much!
I tested the runtime on a Tesla M40 before, and I used Variable just because my code was written for PyTorch 0.3.1.
Do you mean that operating on Tensors instead of Variables, and perhaps trying another machine such as a Titan Black, can get the ‘real runtime’ (or just make the difference fairly small)?
I do need to calculate the ‘real runtime’ to verify the validity of my algorithm.

I mean that the difference is too small to be meaningful for your algorithm: this is most likely due to some hardware details.
The “real runtime” is whatever time it actually takes to run. But that might vary depending on your hardware/software versions and whether other things are running on your machine. So it’s a very subjective thing and there is no true “real runtime”.

Right! I know what you mean. I just want to make the convolution times of time1 and time2 identical, so that I can isolate the additional time cost brought by

temp_in = torch.index_select(test_input2, 1, Variable(k_in_mask))

and

output.index_copy_(1, Variable(k_out_mask), temp_out)

Maybe this is meaningless: I just want to use index_select, a small convolution, and index_copy_ to replace a large convolution, but I did not expect the small convolution to bring this additional noisy time cost.
Maybe the operation is not so meaningful after all.
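(For reference, a rough back-of-the-envelope of the FLOP saving that replacement targets, assuming 3×3 kernels at 56×56 resolution as in the code above:)

# Multiply-accumulates of a conv are roughly C_in * C_out * k*k * H * W
flops_small = 36 * 36 * 9 * 56 * 56  # 36->36 conv on the selected channels
flops_large = 64 * 64 * 9 * 56 * 56  # 64->64 conv on the full input
print(flops_small / flops_large)     # (36/64)**2 ~= 0.316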

I am sorry to bother you again!
As for index_select and index_copy_, which cost too much additional time, is there any other way to accelerate these two operations?

If you just do one of them, you can’t really do much better.
Unless you can remove them altogether, apply your function to the full input, and ignore some of the results.
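(A minimal sketch of that alternative, under the shapes used earlier; m_full is a hypothetical full-width layer, not from the original code:)

# Convolve the full 64-channel input, then slice out the channels of
# interest afterwards; no index_select/index_copy runs on the hot path
m_full = torch.nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1).cuda()

full_out = m_full(test_input2)
wanted = full_out[:, k_out_mask]  # keep only the output channels you need

This spends extra FLOPs on the full-width convolution in exchange for removing the indexing ops.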

It’s a very sad story; thank you for your answer anyway.
Regarding the time noise, is there no solution to that problem either?

Unfortunately, GPUs are very good at doing large, simple operations like element-wise ops or mm, but very bad at smart things like indexing.

For the time noise, I don’t see any way around it. As I said, this is most likely some hardware quirk, and I don’t know GPU internals well enough to have an idea why this happens :confused:


Thank you so much!!!