Weird image-to-GPU transfer timing

Hello!
Thank you for your amazing work!
I am using PyTorch to handle some image processing problems, following this code:

if __name__ == "__main__":
    img_path = './data/SIDD_Val_and_GT_Benchamrk/val/noisy/00017.PNG' # 256, 256, 3

    s1 = time.time()
    img = cv2.imread(img_path, -1) #1

    s2 = time.time()
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) #2

    s3 = time.time()
    img = img.astype(np.float32) #3

    s4 = time.time()
    img = trans.ToTensor()(img).unsqueeze(0) #4

    s5 = time.time()
    img = img.to('cuda:0') #5

    s6 = time.time()

    print('#1', s2 - s1)
    print('#2', s3 - s2)
    print('#3', s4 - s3)
    print('#4', s5 - s4)
    print('#5', s6 - s5)
    print('total time', s6 - s1)

Results:
     0.0008392333984375 s       #1
     3.719329833984375e-05 s    #2 
     0.00016021728515625 s      #3
     0.0006651878356933594 s    #4
     0.5688984394073486 s       #5 
     0.5705642700195312 s      total time

I found that most of the time was spent at #5, to('cuda:0'), which took about 0.57 s on my machine.
But if I move a random CNN to CUDA before running the code above, #5 becomes many times faster, taking only about 0.00017 s on my machine.

For example:

class InitNet(nn.Module):
    def __init__(self):
        super(InitNet, self).__init__()
        self.layer = nn.Sequential(nn.Conv2d(1, 1, 1))

    def forward(self, x):
        x = self.layer(x)
        return x

if __name__ == "__main__":
    _ = InitNet().eval().cuda() #0 init a random CNN

    img_path = './data/SIDD_Val_and_GT_Benchamrk/val/noisy/00017.PNG' # 256, 256, 3
    ...

Results:
     0.0007555484771728516 s      #1
     3.3855438232421875e-05 s     #2 
     0.00019979476928710938 s     #3
     0.000362396240234375 s       #4
     0.00017547607421875 s        #5 
     0.0015270709991455078 s      total time

My knowledge of PyTorch is not enough for me to understand why this happens, so I am very confused.
Is it a bug in PyTorch, or did I do something wrong?
And is there any way to accelerate #5?
Thank you in advance!
Full Code:

import time

import cv2
import numpy as np

import torch.nn as nn
import torchvision.transforms as trans


class InitNet(nn.Module):
    def __init__(self):
        super(InitNet, self).__init__()
        self.layer = nn.Sequential(nn.Conv2d(1, 1, 1))

    def forward(self, x):
        x = self.layer(x)
        return x


if __name__ == "__main__":
    _ = InitNet().eval().cuda() #0 init a random CNN

    img_path = './data/SIDD_Val_and_GT_Benchamrk/val/noisy/00017.PNG' # 256, 256, 3

    s1 = time.time()
    img = cv2.imread(img_path, -1) #1

    s2 = time.time()
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) #2

    s3 = time.time()
    img = img.astype(np.float32) #3

    s4 = time.time()
    img = trans.ToTensor()(img).unsqueeze(0) #4

    s5 = time.time()
    img = img.to('cuda:0') #5

    s6 = time.time()

    print('#1', s2 - s1)
    print('#2', s3 - s2)
    print('#3', s4 - s3)
    print('#4', s5 - s4)
    print('#5', s6 - s5)
    print('total time', s6 - s1)

CUDA operations are executed asynchronously, so you would need to synchronize the code via torch.cuda.synchronize() or use GPU timers (such as torch.cuda.Event) to profile it properly.
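
For example, here is a minimal sketch of both approaches (assuming a CUDA-capable GPU is available; the tensor shape just mirrors your 256x256 image):

import time

import torch

x = torch.randn(1, 3, 256, 256)

# Host-side timing: synchronize before and after reading the clock, so all
# queued CUDA work (including the copy) has actually finished.
torch.cuda.synchronize()
t0 = time.time()
y = x.to('cuda:0')
torch.cuda.synchronize()
print('host-timed transfer:', time.time() - t0)

# GPU-side timing via CUDA events.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
y = x.to('cuda:0')
end.record()
torch.cuda.synchronize()  # wait until the recorded events have completed
print('event-timed transfer: {:.3f} ms'.format(start.elapsed_time(end)))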
The very first call into any CUDA operation will initialize the CUDA context and will thus be slow; that one-time initialization is what your #5 measures in the first run, and what the dummy CNN pays for up front in the second run.
A proper profile thus also needs warmup iterations besides the missing synchronizations.
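
Concretely, you can pay the initialization cost once up front and then time the transfer you actually care about. A rough sketch (torch.cuda.init() is one way to trigger the initialization eagerly; a dummy transfer works just as well, as your InitNet example shows):

import time

import torch

# Warmup: the first CUDA call initializes the CUDA context, which is the
# expensive part measured by #5 above.
torch.cuda.init()
_ = torch.empty(1, device='cuda:0')
torch.cuda.synchronize()

img = torch.randn(1, 3, 256, 256)  # stand-in for the loaded image tensor

t0 = time.time()
img = img.to('cuda:0')
torch.cuda.synchronize()
print('transfer after warmup:', time.time() - t0)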
