PyTorch tensor inverse slower on GPU than CPU

Hi, I observe that the inverse operation on the GPU is slower than on the CPU.

I am not sure if this is the right way to profile, but here is what I have done:

>>> import time
>>> import torch
>>> gpu_tensor = torch.randn(3,3).cuda()
>>> cpu_tensor = torch.randn(3,3)

>>> def test1():
        s = time.time()
        for i in range(50):
            torch.inverse(cpu_tensor)
        e = time.time()
        print(e - s)

>>> def test2():
        s = time.time()
        for i in range(50):
            torch.inverse(gpu_tensor)
        e = time.time()
        print(e - s)

>>> test1()
0.000229120254517

>>> test2()
0.310909032822

Any idea why?

If you are timing CUDA ops, you should add a torch.cuda.synchronize() before starting and before stopping the timer.
The first CUDA call needs some time to initialize CUDA, so your timing might be measuring this init as well.
Also, CUDA ops are launched asynchronously, so your main thread can continue its execution while the GPU is busy; without a synchronize, the timer may stop before the kernels have actually finished.
Could you add it to your code and run it again?
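
A minimal sketch of that pattern (the `time_op` helper is hypothetical, just for illustration):

```python
import time

import torch

def time_op(fn, iters=50, sync=False):
    """Hypothetical helper: warm up once, then time `iters` calls."""
    fn()  # warm-up call: keeps CUDA context init out of the measurement
    if sync:
        torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
    start = time.time()
    for _ in range(iters):
        fn()
    if sync:
        torch.cuda.synchronize()  # wait until every queued kernel has finished
    return time.time() - start

cpu_tensor = torch.randn(3, 3)
print("cpu:", time_op(lambda: torch.inverse(cpu_tensor)))

if torch.cuda.is_available():
    gpu_tensor = cpu_tensor.cuda()
    print("gpu:", time_op(lambda: torch.inverse(gpu_tensor), sync=True))
```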

OK, so I added torch.cuda.synchronize() before the timer:

>>> import time
>>> import torch
>>> gpu_tensor = torch.randn(3,3).cuda()
>>> torch.cuda.synchronize()
>>> cpu_tensor = torch.randn(3,3)


>>> def test1():
        s = time.time()
        for i in range(50):
            torch.inverse(cpu_tensor)
        e = time.time()
        print(e - s)

>>> def test2():
        s = time.time()
        for i in range(50):
            torch.inverse(gpu_tensor)
        e = time.time()
        print(e - s)

>>> test1()
0.000277042388916

>>> test2()
0.325435876846

I still don’t see any improvement.

[3, 3] is too small for the GPU to be faster; at that size the per-call kernel launch overhead dominates the actual computation.
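
If you need to invert many small matrices, it usually helps to batch them into a single call, so the GPU launches one kernel instead of one per matrix. A sketch with illustrative sizes (torch.inverse accepts batched input):

```python
import torch

# Illustrative: invert 1,000 small matrices in one batched call.
a = torch.randn(1000, 3, 3)
# A symmetric positive-definite construction guarantees every matrix is invertible.
batch = a @ a.transpose(-1, -2) + torch.eye(3)

inv = torch.inverse(batch)  # one call, result shape (1000, 3, 3)

# Sanity check: batch @ inv should be close to the identity.
eye = torch.eye(3).expand(1000, 3, 3)
print(torch.allclose(batch @ inv, eye, atol=1e-3))
```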

I tested [1024, 1024] with range(5); the GPU is still slower than the CPU.

>>> test1()
0.08186817169189453

>>> test2()
0.9191479682922363

Any suggestions?
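
(One thing I have not tried yet: torch.utils.benchmark.Timer, which, as far as I understand, does the warm-up and CUDA synchronization for you. A minimal sketch on the CPU tensor:)

```python
import torch
import torch.utils.benchmark as benchmark

a = torch.randn(1024, 1024)

timer = benchmark.Timer(
    stmt="torch.inverse(a)",
    globals={"torch": torch, "a": a},
)
# timeit() warms up internally and synchronizes CUDA when the op runs on the GPU.
print(timer.timeit(5))
```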
