I am using a Faster R-CNN model with a ResNet-50 + FPN backbone. These are the parameters relevant to this question:
 Minibatch size = 6 RGB images
 Number of GPUs = 1
 Input minibatch tensor size = torch.Size([6, 3, 768, 1184])
According to the ResNet-50 architecture, the first convolution layer is defined as follows: .conv1 = Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False, norm=get_norm(norm, out_channels))
After the model runs this first convolution, the conv1 output tensor size is torch.Size([6, 64, 384, 592]).
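For reference, the shape arithmetic above can be reproduced with a minimal standalone layer (a plain torch.nn.Conv2d with the same hyperparameters; this is a sketch, not Detectron2's norm-wrapped Conv2d):

```python
import torch
import torch.nn as nn

# Stand-in for ResNet-50's stem conv (omits the norm layer from the Detectron2 version)
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

x = torch.randn(6, 3, 768, 1184)  # minibatch of 6 RGB images
y = conv1(x)

# Each of the 64 filters has shape [3, 7, 7]: one 7x7 kernel per input channel
print(conv1.weight.shape)  # torch.Size([64, 3, 7, 7])

# Output spatial size: floor((768 + 2*3 - 7) / 2) + 1 = 384,
#                      floor((1184 + 2*3 - 7) / 2) + 1 = 592
print(y.shape)             # torch.Size([6, 64, 384, 592])
```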
I have four questions:

Going by the conv1 tensor size, do we get 64 feature maps for each image, or are the feature maps averaged and shared across the 6 images?

How does the convolution take place between the input and conv1, given that we have 6 images?

Does each of the 64 filters apply across all three RGB channels of an image, or does each filter apply directly to the colored pixel values without splitting them into separate channels?

I understand that at the end of a forward pass, the loss and gradients are averaged over the minibatch size. Are the weights updated iteratively for each image, or once per minibatch?