I believe that conv2d is implemented with some cuDNN tricks, but how does it scale so well with batch size?
When I run with a batch size of 1:
import time

import torch
import torch.nn.functional as F

bs = 1
h = 128
w = 256
in_channels = 512
out_channels = 1024
kernel_size = (3, 3)
stride = (1, 1)
padding = (1, 1)
dilation = (1, 1)
cuda = True

# `Test` is my custom conv module being compared (definition not shown)
test = Test(
    in_channels=in_channels,
    out_channels=out_channels,
    kernel_size=kernel_size,
    stride=stride,
    padding=padding,
    dilation=dilation,
)

x = torch.ones(bs, in_channels, h, w).float()
weights = torch.ones(out_channels, in_channels, *kernel_size)  # (1024, 512, 3, 3)
bias = 100 * torch.ones(out_channels)

print("Input:")
print(x.shape)

if cuda:
    test = test.cuda()
    x = x.cuda()
    weights = weights.cuda()
    bias = bias.cuda()

s = time.time()
pytorch = F.conv2d(x, weights, bias, stride=stride, padding=padding, dilation=dilation)
pytorch_time = time.time() - s

print('PyTorch:')
print(pytorch.shape)
print('PyTorch Time: ', pytorch_time)
>> Input:
>> torch.Size([1, 512, 128, 256])
>> PyTorch:
>> torch.Size([1, 1024, 128, 256])
>> PyTorch Time:  0.004613637924194336
Running the same code above with a batch size of 5 gives:
>> Input:
>> torch.Size([5, 512, 128, 256])
>> PyTorch:
>> torch.Size([5, 1024, 128, 256])
>> PyTorch Time:  0.0049724578857421875
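For reference, the whole comparison condenses into a small sweep over batch sizes (same shapes and the same naive wall-clock timing as above; nothing new, just the benchmark restated):

import time

import torch
import torch.nn.functional as F

weights = torch.ones(1024, 512, 3, 3).cuda()
bias = 100 * torch.ones(1024).cuda()

for bs in (1, 5):
    x = torch.ones(bs, 512, 128, 256).cuda()
    s = time.time()
    out = F.conv2d(x, weights, bias, stride=(1, 1), padding=(1, 1), dilation=(1, 1))
    print(bs, tuple(out.shape), time.time() - s)  # wall-clock around the call, as above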
The wall-clock time is negligibly different. Looking at the code in the THCUNN library (granted, it's not cuDNN), it appears that the batch elements are processed in a for-loop (see here). Caffe handles it the same way.
I can't find any code suggesting that multiple batch elements are pushed through the im2col + GEMM path together; everything I find suggests each sample is handled sequentially. Given that, how is there no linear scaling in the running time? A sketch of the per-sample scheme I have in mind follows below.
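This is roughly what I picture, sketched in PyTorch with F.unfold standing in for im2col (conv2d_per_sample is my own illustrative name, not actual THCUNN or Caffe code):

import torch
import torch.nn.functional as F

def conv2d_per_sample(x, weights, bias, stride=(1, 1), padding=(1, 1), dilation=(1, 1)):
    """Per-sample im2col + GEMM convolution: one unfold and one matmul per batch element."""
    bs, in_channels, h, w = x.shape
    out_channels, _, kh, kw = weights.shape
    out_h = (h + 2 * padding[0] - dilation[0] * (kh - 1) - 1) // stride[0] + 1
    out_w = (w + 2 * padding[1] - dilation[1] * (kw - 1) - 1) // stride[1] + 1
    w_mat = weights.view(out_channels, -1)  # (out_channels, in_channels*kh*kw)
    out = x.new_empty(bs, out_channels, out_h, out_w)
    for i in range(bs):  # the per-element loop I see in THCUNN/Caffe
        # im2col step: (1, in_channels*kh*kw, out_h*out_w)
        cols = F.unfold(x[i:i + 1], (kh, kw), dilation=dilation,
                        padding=padding, stride=stride)
        # GEMM step: one matrix multiply per batch element
        out[i] = (w_mat @ cols[0] + bias[:, None]).view(out_channels, out_h, out_w)
    return out

On the shapes above this should match F.conv2d up to floating-point error, and with bs iterations of an unfold plus a GEMM I would naively expect the cost to grow linearly with bs, which is exactly what the timings above don't show.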