I believe that conv2d is implemented with some cuDNN tricks, but how does it scale so well with batch size?
When I run with a batch size of 1:
import time

import torch
import torch.nn.functional as F

bs = 1
h = 128
w = 256
in_channels = 512
out_channels = 1024
kernel_size = (3, 3)
stride = (1, 1)
padding = (1, 1)
dilation = (1, 1)
cuda = True

# `Test` is my custom conv module being compared (definition not shown)
test = Test(
    in_channels=in_channels,
    out_channels=out_channels,
    kernel_size=kernel_size,
    stride=stride,
    padding=padding,
    dilation=dilation,
)

x = torch.ones(bs, in_channels, h, w).float()
weights = torch.ones(out_channels, in_channels, *kernel_size)  # (1024, 512, 3, 3)
bias = 100 * torch.ones(out_channels)

print("Input:")
print(x.shape)

if cuda:
    test = test.cuda()
    x = x.cuda()
    weights = weights.cuda()
    bias = bias.cuda()

s = time.time()
pytorch = F.conv2d(x, weights, bias, stride=stride, padding=padding, dilation=dilation)
pytorch_time = time.time() - s

print('PyTorch:')
print(pytorch.shape)
print('PyTorch Time: ', pytorch_time)
>> Input:
>> torch.Size([1, 512, 128, 256])
>> PyTorch:
>> torch.Size([1, 1024, 128, 256])
>> PyTorch Time:  0.004613637924194336
Running the same code above with a batch size of 5 gives:
>> Input:
>> torch.Size([5, 512, 128, 256])
>> PyTorch:
>> torch.Size([5, 1024, 128, 256])
>> PyTorch Time:  0.0049724578857421875
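For reference, the whole comparison condenses into a small sweep over batch sizes (same shapes and the same naive wall-clock timing as above; nothing new, just the benchmark restated):

import time

import torch
import torch.nn.functional as F

weights = torch.ones(1024, 512, 3, 3).cuda()
bias = 100 * torch.ones(1024).cuda()

for bs in (1, 5):
    x = torch.ones(bs, 512, 128, 256).cuda()
    s = time.time()
    out = F.conv2d(x, weights, bias, stride=(1, 1), padding=(1, 1), dilation=(1, 1))
    print(bs, tuple(out.shape), time.time() - s)  # wall-clock around the call, as above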
The wall-clock time is negligibly different. Looking at the code in the THCUNN library (granted, it's not cuDNN), it appears that the batch elements are processed in a for-loop (see here). Caffe handles it the same way.
I can't find any code suggesting that multiple batch elements are pushed through the im2col + GEMM path together; everything I find suggests each sample is handled sequentially. Given that, how is there no linear scaling in the running time? A sketch of the per-sample scheme I have in mind follows below.
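This is roughly what I picture, sketched in PyTorch with F.unfold standing in for im2col (conv2d_per_sample is my own illustrative name, not actual THCUNN or Caffe code):

import torch
import torch.nn.functional as F

def conv2d_per_sample(x, weights, bias, stride=(1, 1), padding=(1, 1), dilation=(1, 1)):
    """Per-sample im2col + GEMM convolution: one unfold and one matmul per batch element."""
    bs, in_channels, h, w = x.shape
    out_channels, _, kh, kw = weights.shape
    out_h = (h + 2 * padding[0] - dilation[0] * (kh - 1) - 1) // stride[0] + 1
    out_w = (w + 2 * padding[1] - dilation[1] * (kw - 1) - 1) // stride[1] + 1
    w_mat = weights.view(out_channels, -1)  # (out_channels, in_channels*kh*kw)
    out = x.new_empty(bs, out_channels, out_h, out_w)
    for i in range(bs):  # the per-element loop I see in THCUNN/Caffe
        # im2col step: (1, in_channels*kh*kw, out_h*out_w)
        cols = F.unfold(x[i:i + 1], (kh, kw), dilation=dilation,
                        padding=padding, stride=stride)
        # GEMM step: one matrix multiply per batch element
        out[i] = (w_mat @ cols[0] + bias[:, None]).view(out_channels, out_h, out_w)
    return out

On the shapes above this should match F.conv2d up to floating-point error, and with bs iterations of an unfold plus a GEMM I would naively expect the cost to grow linearly with bs, which is exactly what the timings above don't show.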