Same functional.conv2d call in a for loop becomes extremely slow

I use functional.conv2d to do cross-correlation operations, but I found that the same call becomes unacceptably slow after it has been called many times.

For testing and simplicity, I created zero tensors of the same size as my data and ran the exact same call in a loop.

The first hundred or so calls finished instantly. However, after about 500 calls, each call takes about 0.3 s.

I noticed that GPU utilization is still at 100% after the code has finished, and that GPU memory had not run out.

I tried running this function inside an nn.Module, as suggested in this post, but the problem persists.

Could anyone help me with this?

import torch
import datetime

xa = torch.zeros(1, 1, 1785, 1785, device=torch.device('cuda:0'))
kernel = torch.zeros(1, 1, 129, 129, device=torch.device('cuda:0'))

for i in range(1000):
    start = datetime.datetime.now()
    torch.nn.functional.conv2d(xa, kernel, padding='same')

    end = datetime.datetime.now()
    print(end - start)

Two quick comments:

  • The code does not account for the asynchronous nature of CUDA; this is also the cause of your observation about GPU utilization. You need to call torch.cuda.synchronize() before each time measurement so that the timestamp is taken only after all preceding GPU work has completed.
  • It is preferable to use a monotonic clock (time.perf_counter()) for performance measurements.
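
Applying both points to the loop from the question might look like the sketch below. The helper name time_conv2d, the CPU fallback, and the reduced sizes on CPU are illustrative assumptions added here so the snippet also runs without a GPU; they are not part of the original code.

```python
import time

import torch
import torch.nn.functional as F


def time_conv2d(x, kernel, n_iters=10):
    """Return the wall-clock duration in seconds of each F.conv2d call."""
    use_cuda = x.is_cuda
    durations = []
    with torch.no_grad():
        for _ in range(n_iters):
            if use_cuda:
                # Wait for all previously queued GPU work before starting the clock.
                torch.cuda.synchronize()
            start = time.perf_counter()  # monotonic, high-resolution clock
            F.conv2d(x, kernel, padding='same')
            if use_cuda:
                # Kernel launches are asynchronous: wait until this call has really finished.
                torch.cuda.synchronize()
            durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
        size, ksize = 1785, 129  # sizes from the question
    else:
        device = torch.device('cpu')
        size, ksize = 64, 9      # assumption: small sizes so the CPU run stays short

    xa = torch.zeros(1, 1, size, size, device=device)
    kernel = torch.zeros(1, 1, ksize, ksize, device=device)

    for i, seconds in enumerate(time_conv2d(xa, kernel, n_iters=5)):
        print(f"iteration {i}: {seconds:.6f} s")
```

With the synchronization in place, the per-iteration times should come out roughly constant rather than jumping after a few hundred iterations. Without it, the early iterations only measure how long it takes to queue the work, and later ones stall once the queue is full.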

Best regards

