Hi, I’m experimenting with different memory layouts based on these two documents:
Convolutional Layers User Guide (from NVIDIA)
Channels Last Memory Format in PyTorch (from the official PyTorch docs)
I tried to compare an NCHW model with an NHWC model using the following script:
from time import time
import torch
import torch.nn as nn
def time_layer(layer, feature, num_iter):
    """Print the total time spent on num_iter forward passes."""
    tic = time()
    for _ in range(num_iter):
        _ = layer(feature)
    print(time() - tic, "seconds")
N, C, H, W, K = 32, 1024, 7, 7, 1024 # params from the NVIDIA doc
# NCHW tensor & layer
a = torch.empty(N, C, H, W, device="cuda:0")
conv_nchw = nn.Conv2d(C, K, 3, 1, 1).to("cuda:0")
# NHWC tensor & layer
b = torch.empty(N, C, H, W, device="cuda:0", memory_format=torch.channels_last)
conv_nhwc = nn.Conv2d(C, K, 3, 1, 1).to("cuda:0", memory_format=torch.channels_last)
# NCHW kernel & NCHW tensor
time_layer(conv_nchw, a, 1000)
# NCHW kernel & NHWC tensor
time_layer(conv_nchw, b, 1000)
# NHWC kernel & NHWC tensor
time_layer(conv_nhwc, b, 1000)
# NHWC kernel & NCHW tensor
time_layer(conv_nhwc, a, 1000)
And I got the following output (results were similar across many repeated runs):
0.9735202789306641 seconds # NCHW kernel & NCHW tensor
2.213291645050049 seconds # NCHW kernel & NHWC tensor
2.3461294174194336 seconds # NHWC kernel & NHWC tensor
2.7654671669006348 seconds # NHWC kernel & NCHW tensor
I’m using a TITAN RTX GPU, which is supposed to have Tensor Cores, and PyTorch 1.7.0+cu101, which supports the channels_last format. So it’s surprising that the fastest timing comes from the NCHW kernel & NCHW tensor combination (this wouldn’t be surprising if my GPU had no Tensor Cores, since I believe NCHW is the format that was traditionally optimized). The slowness of the NCHW kernel & NHWC tensor and NHWC kernel & NCHW tensor combinations is less surprising, since mixing up layouts is certainly no good for the computation. However, why isn’t NHWC kernel & NHWC tensor the fastest combination, given that it’s supposed to be the one most optimized for Tensor Cores?
Am I doing the layout optimization correctly? Am I missing anything?
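For reference, here is the sanity check I used to convince myself the tensors and weights actually carry the layouts I intended (shown on small CPU tensors for brevity; the same calls apply to the CUDA tensors above):

```python
import torch
import torch.nn as nn

a = torch.empty(2, 8, 4, 4)                                        # NCHW
b = torch.empty(2, 8, 4, 4).to(memory_format=torch.channels_last)  # NHWC
conv_nhwc = nn.Conv2d(8, 8, 3, 1, 1).to(memory_format=torch.channels_last)

assert a.is_contiguous()                                       # default NCHW strides
assert b.is_contiguous(memory_format=torch.channels_last)      # NHWC strides
assert conv_nhwc.weight.is_contiguous(memory_format=torch.channels_last)
```

All three assertions pass for me, so I believe the layouts themselves are set up as intended.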
Follow-up question: instead of running all four benchmarks in one script, I executed the four calls interactively in the Python interpreter, line by line, and got (results were similar across many repeated runs):
>>> time_layer(conv_nchw, a, 1000) # NCHW kernel & NCHW tensor
0.9541912078857422 seconds
>>> time_layer(conv_nchw, b, 1000) # NCHW kernel & NHWC tensor
2.034724235534668 seconds
>>> time_layer(conv_nhwc, b, 1000) # NHWC kernel & NHWC tensor
1.7101032733917236 seconds
>>> time_layer(conv_nhwc, a, 1000) # NHWC kernel & NCHW tensor
1.9565918445587158 seconds
Why are the latter three timings shorter than those from the streamlined script? The only difference I can think of is that in the interactive interpreter there were noticeable time gaps between executions, while the script had none. Are there any nuances related to this?
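In case it matters for diagnosing this: here is a synchronized variant of my timer that waits for all queued CUDA kernels to finish before reading the clock (a sketch; time_layer_sync is just an illustrative name):

```python
from time import time
import torch
import torch.nn as nn

def time_layer_sync(layer, feature, num_iter):
    """Like time_layer above, but synchronizes with the GPU before and
    after the loop, so the wall time reflects completed GPU work rather
    than just kernel-launch overhead."""
    if feature.is_cuda:
        torch.cuda.synchronize()
    tic = time()
    for _ in range(num_iter):
        _ = layer(feature)
    if feature.is_cuda:
        torch.cuda.synchronize()
    elapsed = time() - tic
    print(elapsed, "seconds")
    return elapsed

# small CPU example just to show usage; on GPU, pass the CUDA layer/tensor
elapsed = time_layer_sync(nn.Conv2d(4, 4, 3, 1, 1), torch.empty(1, 4, 8, 8), 10)
```

My original time_layer does not synchronize, so I’m not sure whether the numbers above measure kernel completion or only kernel submission.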
If you could answer, I’d really appreciate the help!