Hi all,
I’m seeing a huge difference in execution time between the same convolutions run as individual nn.Conv2d modules and wrapped in an nn.Sequential() on aarch64 (NVIDIA Jetson AGX Xavier).
I suspect this is due to a lack of optimized kernels for the specific convolution layers I’m trying, since the issue is only reproducible on aarch64 processors. Still, that alone wouldn’t explain such a huge gap in execution times.
To reproduce this issue:
1. Run a depthwise convolution (groups = in_channels) and a pointwise convolution (kernel size = 1) individually as nn.Conv2d() modules and measure their runtimes.
2. Run the same pair of convolutions as a single nn.Sequential and measure its runtime.
import torch
import time

x = torch.rand([1, 78, 56, 56])

#### nn.Conv2d
dconv = torch.nn.Conv2d(78, 78, 3, stride=2, padding=1, groups=78)
pconv = torch.nn.Conv2d(78, 78, 1)

delay = []
for i in range(30):
    start = time.perf_counter()
    y = dconv(x)
    end = time.perf_counter()
    delay.append((end - start) * 1000)
print(sum(delay) / len(delay))

delay = []
for i in range(30):
    start = time.perf_counter()
    z = pconv(y)
    end = time.perf_counter()
    delay.append((end - start) * 1000)
print(sum(delay) / len(delay))
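One thing worth ruling out before comparing numbers: the first calls to a module include one-time setup costs (weight packing, memory allocation), and autograd bookkeeping adds per-call overhead. A minimal sketch of the same measurement with warm-up iterations and torch.no_grad(), using the layer shapes from the snippet above:

```python
import time
import torch

x = torch.rand([1, 78, 56, 56])
dconv = torch.nn.Conv2d(78, 78, 3, stride=2, padding=1, groups=78)
pconv = torch.nn.Conv2d(78, 78, 1)

with torch.no_grad():          # skip autograd bookkeeping
    for _ in range(10):        # warm-up: exclude one-time setup costs
        pconv(dconv(x))
    delay = []
    for _ in range(30):
        start = time.perf_counter()
        out = pconv(dconv(x))
        end = time.perf_counter()
        delay.append((end - start) * 1000)

print(sum(delay) / len(delay))  # mean latency in ms
```

If the gap survives the warm-up and no_grad, it is more likely a kernel-dispatch difference than measurement noise.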
#### nn.Sequential
model = torch.nn.Sequential(
    torch.nn.Conv2d(78, 78, 3, stride=2, padding=1, groups=78, device='cpu'),
    torch.nn.Conv2d(78, 78, 1, device='cpu'),
)

delay = []
for i in range(30):
    start = time.perf_counter()
    z = model(x)
    end = time.perf_counter()
    delay.append((end - start) * 1000)
print(sum(delay) / len(delay))
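For what it’s worth, torch.utils.benchmark.Timer performs a warm-up run before measuring and may give more stable numbers than a hand-rolled perf_counter loop. A sketch comparing both paths with the same module instances (so the weights are identical):

```python
import torch
import torch.utils.benchmark as benchmark

x = torch.rand([1, 78, 56, 56])
dconv = torch.nn.Conv2d(78, 78, 3, stride=2, padding=1, groups=78)
pconv = torch.nn.Conv2d(78, 78, 1)
model = torch.nn.Sequential(dconv, pconv)

# Timer performs at least one warm-up run internally before timing
t_separate = benchmark.Timer(stmt="pconv(dconv(x))",
                             globals={"dconv": dconv, "pconv": pconv, "x": x})
t_sequential = benchmark.Timer(stmt="model(x)",
                               globals={"model": model, "x": x})
print(t_separate.timeit(30))
print(t_sequential.timeit(30))
```

If the two measurements still differ this way, the gap is unlikely to be a timing artifact.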
There’s almost a 15x difference in runtime between the convolutions executed as individual nn.Conv2d modules and as an nn.Sequential. The expected result would be time(dconv) + time(pconv) roughly equal to time(model) in the example above.
Any idea why this is happening? Thanks in advance for your replies.
I have also opened an issue about this here
Your output looks more like what I’d expect. I’m also running on a Xavier. I tried both PyTorch 1.9.0 and 1.8.0 and see the same results (the nn.Sequential() time is always significantly higher).
This is the output I get for consecutive runs of the script above.
I also profiled the code (the individual nn.Conv2d calls and the nn.Sequential) with the PyTorch profiler and got the following outputs for the three convolutions.
with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    model(x)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=30))
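If it helps narrow things down, passing record_shapes=True and grouping the averages by input shape makes it easier to see which exact convolution shapes are slow. A sketch with the same model as above:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Conv2d(78, 78, 3, stride=2, padding=1, groups=78),
    torch.nn.Conv2d(78, 78, 1),
)
x = torch.rand([1, 78, 56, 56])

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    record_shapes=True,  # attach input shapes to each recorded op
) as prof:
    model(x)

# group_by_input_shape=True splits the rows per distinct input shape
print(prof.key_averages(group_by_input_shape=True)
          .table(sort_by="cpu_time_total", row_limit=30))
```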
Hi!
Could you let me know which version of PyTorch you tried this on?
Is there any other way to narrow down whether this is a bug in PyTorch or a performance drop specific to my Xavier board?
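In case it’s useful for comparing setups, a quick way to dump the details that usually matter for CPU convolution performance:

```python
import torch

print(torch.__version__)                      # PyTorch version
print(torch.get_num_threads())                # intra-op thread count
print(torch.backends.mkldnn.is_available())   # whether oneDNN (MKL-DNN) kernels are available
print(torch.__config__.show())                # build configuration (compilers, BLAS, enabled backends)
```

Differences in the thread count or in which backends were compiled in could explain why two boards behave differently.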