Different usages of the "unfold" function lead to very different running times

Suppose X is an input of shape [B, C, H, W] and the convolutional kernel size is 3*3. We can use the function torch.nn.functional.unfold to compute a tensor Y. Specifically,

Y = torch.nn.functional.unfold(X, 3, stride=1, padding=1, dilation=1)

This is the natural usage of the 'unfold' function. However, we can compute the same Y with the following code:

Y = torch.nn.functional.unfold(X.view(1, B*C, H, W), (H, W), stride=1, padding=1, dilation=1)
Y = Y.view(B, C, H*W, 3*3).permute(0, 1, 3, 2).reshape(B, C*3*3, H*W)
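As a sanity check that the two formulations agree, here is a minimal sketch on a small CPU tensor. It assumes the function in question is torch.nn.functional.unfold (as in the benchmark further down); the sizes B, C, H, W are arbitrary values chosen for illustration.

```python
import torch
import torch.nn.functional as F

# Small illustrative sizes (assumed; any values with H, W >= 1 work).
B, C, H, W = 2, 3, 5, 6
X = torch.randn(B, C, H, W)

# Method 1: slide a 3x3 window over every position of the padded image.
Y1 = F.unfold(X, 3, stride=1, padding=1, dilation=1)

# Method 2: slide one H x W window over the padded image; it fits at 3*3 offsets.
Y2 = F.unfold(X.view(1, B * C, H, W), (H, W), stride=1, padding=1, dilation=1)
Y2 = Y2.view(B, C, H * W, 3 * 3).permute(0, 1, 3, 2).reshape(B, C * 3 * 3, H * W)

# Both methods only copy elements, so the results match exactly.
same = torch.equal(Y1, Y2)
print(Y1.shape, same)
```

In the second form the roles of "kernel position" and "output location" are swapped, which is why the permute is needed to restore the usual layout.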

Although both produce the same Y, the two versions differ greatly in efficiency. On a GPU, the second version is 3 times faster than the first one.

Can anyone explain this strange phenomenon? Also, I would really like to know some tips for writing programs that run fast on a GPU. For example, I find that sometimes torch.matmul(A, B.t()) is much slower than torch.sum(A.view(A.size(0), 1, -1) * B, dim=-1). This really confuses me.
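For reference, the two expressions in the matmul example compute the same matrix of pairwise dot products. A minimal sketch, assuming A has shape (n, d) and B has shape (m, d); the sizes below are illustrative only:

```python
import torch

# Assumed shapes: A is (n, d), B is (m, d).
A = torch.randn(4, 8)
B = torch.randn(5, 8)

# Matrix product: result[i, j] = dot(A[i], B[j]).
via_matmul = torch.matmul(A, B.t())

# Broadcast form: builds an (n, m, d) intermediate, then sums over d.
via_broadcast = torch.sum(A.view(A.size(0), 1, -1) * B, dim=-1)

# Summation order differs, so compare with a tolerance.
close = torch.allclose(via_matmul, via_broadcast, atol=1e-5)
print(via_matmul.shape, close)
```

The broadcast form materializes an intermediate of n*m*d elements, so its memory use grows much faster than that of matmul.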

Note: The difference in efficiency can be observed directly when implementing a large neural network. I have also written a program to test the efficiency of both versions. The code is listed below:

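import time
import torch
import torch.nn.functional

# Input of shape [B, C, H, W] = [64, 16, 30, 32], moved to the GPU.
input = torch.arange(64*16*30*32, dtype=torch.float)
input = input.view(64, 16, 30, 32).cuda()

# Method 1: 3x3 kernel on the original layout.
t1 = time.time()
# Zero-initialized accumulator (torch.FloatTensor(...) would be uninitialized memory).
output = torch.zeros(64, 16*3*3, 30*32, device="cuda")
for i in range(0, 200):
    output += torch.nn.functional.unfold(input, 3, padding=1, stride=1, dilation=1)
print(time.time() - t1)

# Method 2: (H, W) kernel on the [1, B*C, H, W] view, then rearranged.
t2 = time.time()
output2 = torch.zeros(64, 16*3*3, 30*32, device="cuda")
for i in range(0, 200):
    t = torch.nn.functional.unfold(input.view(1, 64 * 16, 30, 32), (30, 32), padding=1, stride=1, dilation=1)
    output2 += t.view(64, 16, 30*32, 3*3).permute(0, 1, 3, 2).reshape(64, 16*3*3, 30*32)
print(time.time() - t2)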

The outputs are:


If we change the code to run the second method first, the gap is more significant:


Note that CUDA kernels are executed asynchronously.
If you would like to time them, you should call torch.cuda.synchronize() before starting and stopping the timer.
Otherwise, you might e.g. just time the kernel launch time.
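A minimal sketch of what synchronized timing could look like for the two methods from the question. The helper name timed, the warm-up call, the iteration count, and the CPU fallback are assumptions added for illustration, not part of the original benchmark.

```python
import time
import torch
import torch.nn.functional as F

# Assumption: fall back to CPU so the sketch also runs without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.arange(64 * 16 * 30 * 32, dtype=torch.float, device=device).view(64, 16, 30, 32)

def method1():
    return F.unfold(x, 3, padding=1, stride=1, dilation=1)

def method2():
    t = F.unfold(x.view(1, 64 * 16, 30, 32), (30, 32), padding=1, stride=1, dilation=1)
    return t.view(64, 16, 30 * 32, 3 * 3).permute(0, 1, 3, 2).reshape(64, 16 * 3 * 3, 30 * 32)

def timed(fn, iters=10):
    """Hypothetical helper: average seconds per call, with GPU synchronization."""
    out = fn()  # warm-up call, excluded from the measurement
    if device == "cuda":
        torch.cuda.synchronize()  # wait for pending kernels before starting the timer
    start = time.perf_counter()
    for _ in range(iters):
        out = fn()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the timed kernels before stopping the timer
    return (time.perf_counter() - start) / iters, out

t1, out1 = timed(method1)
t2, out2 = timed(method2)
print(f"method 1: {t1:.6f} s/iter, method 2: {t2:.6f} s/iter")
```

Without the two synchronize calls, the timer on a GPU may stop while kernels are still queued, so the measurement mostly reflects launch overhead.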

But when I write a network using the first method, it takes 3 hours to run, whereas with the second method it takes less than 1 hour. I know CUDA is asynchronous, and if I use torch.cuda.synchronize(), the first method is faster. But in a real application we always run CUDA asynchronously.

Could there be another bottleneck in your training code?
Since the timings give apparently the opposite results, I would assume the slowdown comes from another part of the code.

I only changed the 'unfold' code; all the other code is the same as before. My guess is that the reason is the GPU cache hit rate, but I am not familiar with GPU storage architecture.

Yeah, this is an unfortunate thing. torch.unfold is a view op, while F.unfold always copies. Hence the latter is always slower (yet more potent).
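To illustrate the view-versus-copy distinction, here is a small sketch that assumes "torch.unfold" above refers to the Tensor.unfold(dimension, size, step) method. It shows that the sliding-window view shares memory with the input, while F.unfold writes into new memory. The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.arange(2 * 3 * 4 * 5, dtype=torch.float).view(2, 3, 4, 5)

# Tensor.unfold: 3x3 sliding windows over H and W as a strided view.
# Resulting layout: (B, C, out_h, out_w, kernel_h, kernel_w).
windows = x.unfold(2, 3, 1).unfold(3, 3, 1)
shares_memory = windows.data_ptr() == x.data_ptr()

# F.unfold: materializes the columns in a freshly allocated tensor (no padding here).
cols = F.unfold(x, 3)
is_copy = cols.data_ptr() != x.data_ptr()

# Rearranging the view into F.unfold's layout (this reshape does copy).
cols_from_view = windows.permute(0, 1, 4, 5, 2, 3).reshape(2, 3 * 3 * 3, 2 * 3)
match = torch.equal(cols, cols_from_view)
print(windows.shape, shares_memory, is_copy, match)
```

The view costs nothing until it is rearranged or consumed, whereas F.unfold pays for the full copy up front but also supports padding and dilation directly.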