Suppose X is an input tensor of shape [B, C, H, W] and the convolutional kernel size is 3x3. We can use torch.nn.functional.unfold to compute a tensor Y. Specifically,
Y = torch.nn.functional.unfold(X, 3, stride=1, padding=1, dilation=1)
This is the natural way to use the unfold function. However, we can compute the same Y with the following code:
Y = torch.nn.functional.unfold(X.view(1, B*C, H, W), (H, W), stride=1, padding=1, dilation=1)
Y = Y.view(B, C, H*W, 3*3).permute(0, 1, 3, 2).reshape(B, C*3*3, H*W)
Although the two versions produce the same Y, their efficiency is very different: on a GPU, the second one is about 3 times faster than the first.
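For reference, here is a minimal check that the two computations really agree (the sizes B=64, C=16, H=30, W=32 are arbitrary example values):

import torch
import torch.nn.functional as F

B, C, H, W = 64, 16, 30, 32   # arbitrary example sizes
X = torch.randn(B, C, H, W)

# method 1: 3x3 unfold
Y1 = F.unfold(X, 3, stride=1, padding=1, dilation=1)

# method 2: (H, W) unfold on the reshaped input, then rearrange
Y2 = F.unfold(X.view(1, B*C, H, W), (H, W), stride=1, padding=1, dilation=1)
Y2 = Y2.view(B, C, H*W, 3*3).permute(0, 1, 3, 2).reshape(B, C*3*3, H*W)

print(torch.allclose(Y1, Y2))   # expected: True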
Can anyone explain this strange phenomenon? I would also like some tips for writing programs that run fast on the GPU. For example, I sometimes find that torch.matmul(A, B.t()) is much slower than torch.sum(A.view(A.size(0), 1, -1) * B, dim=-1), which confuses me a lot.
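To make that example concrete, here is a sketch of the two expressions I am comparing (the shapes N, M, D are made up for illustration; A is (N, D), B is (M, D), and both expressions compute the same (N, M) matrix of row-wise dot products):

import torch

N, M, D = 1024, 1024, 64        # made-up shapes for illustration
A = torch.randn(N, D).cuda()
B = torch.randn(M, D).cuda()

out1 = torch.matmul(A, B.t())                              # plain matrix multiply
out2 = torch.sum(A.view(A.size(0), 1, -1) * B, dim=-1)     # broadcasted multiply + reduce

print(torch.allclose(out1, out2, atol=1e-4))               # same values up to floating-point error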
Note: the efficiency difference can also be observed directly when the operation is used inside a large neural network. In addition, I wrote a small program to measure it; the code is listed below:
import time
import torch

input = torch.arange(64*16*30*32, dtype=torch.float)
input = input.view(64, 16, 30, 32).cuda()

# Method 1: unfold with a 3x3 kernel
t1 = time.time()
output = torch.zeros(64, 16*3*3, 30*32).cuda()
for i in range(0, 200):
    output += torch.nn.functional.unfold(input, 3, padding=1, stride=1, dilation=1)
print(time.time() - t1)

# Method 2: unfold with a (30, 32) kernel on the reshaped input, then rearrange
t2 = time.time()
output2 = torch.zeros(64, 16*3*3, 30*32).cuda()
for i in range(0, 200):
    t = torch.nn.functional.unfold(input.view(1, 64 * 16, 30, 32), (30, 32), padding=1, stride=1, dilation=1)
    output2 += t.view(64, 16, 30*32, 3*3).permute(0, 1, 3, 2).reshape(64, 16*3*3, 30*32)
print(time.time() - t2)
The outputs are:
0.16900038719177246
0.04093480110168457
If we change the code so that the second method runs first, the gap is even larger:
0.04470324516296387
0.37698936462402344
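(Aside: CUDA kernels are launched asynchronously, so I know plain time.time() measurements like the ones above can be affected by the execution order. Below is a sketch of the same timing loops with torch.cuda.synchronize() added before reading the clock; the numbers reported above were produced without it.)

import time
import torch
import torch.nn.functional as F

input = torch.randn(64, 16, 30, 32, device='cuda')

torch.cuda.synchronize()        # make sure no earlier GPU work is still pending
t1 = time.time()
for _ in range(200):
    out = F.unfold(input, 3, padding=1, stride=1, dilation=1)
torch.cuda.synchronize()        # wait for all queued kernels to finish before stopping the clock
print('method 1:', time.time() - t1)

t2 = time.time()
for _ in range(200):
    t = F.unfold(input.view(1, 64 * 16, 30, 32), (30, 32), padding=1, stride=1, dilation=1)
    out2 = t.view(64, 16, 30*32, 3*3).permute(0, 1, 3, 2).reshape(64, 16*3*3, 30*32)
torch.cuda.synchronize()        # same here, so the two measurements are comparable
print('method 2:', time.time() - t2)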