Minibatch implementation in PyTorch for object detection

I am using a Faster R-CNN model with a ResNet-50 + FPN backbone. These are the parameters relevant to this question:

  1. Mini-batch size = 6 RGB images
  2. Number of GPUs = 1
  3. Input mini-batch tensor size = torch.Size([6, 3, 768, 1184])

According to the ResNet-50 architecture, the first convolution layer is defined as follows:

```python
self.conv1 = Conv2d(
    3,
    64,
    kernel_size=7,
    stride=2,
    padding=3,
    bias=False,
    norm=get_norm(norm, out_channels),
)
```

After the model runs this first convolution, the conv1 output tensor size is torch.Size([6, 64, 384, 592]).
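
For reference, the same output shape can be reproduced with a plain nn.Conv2d using the same hyperparameters (leaving out the norm argument):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
out = conv1(torch.randn(6, 3, 768, 1184))   # mini-batch of 6 RGB images
print(out.shape)                             # torch.Size([6, 64, 384, 592])
```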

I have four questions:

  1. As per the conv1 tensor size, do we have 64 feature maps for each image, or are they averaged and shared over the 6 images?

  2. How does the convolution take place between the input and conv1, given that we have 6 images?

  3. Does each of the 64 filters apply over the RGB channels of an image? Or do they apply directly to the colored pixel values without splitting them into their respective channels?

  4. I understand that at the end of a forward/backward pass, the loss and the resulting gradients are averaged over the mini-batch size. Are the weights updated iteratively for each image?

  1. For each image, 64 output channels will be created. The output activation of the conv layer will have the shape [batch_size, out_channels, height, width].

  2. The majority of PyTorch operations work on batched data and apply the desired operation to each sample in a vectorized way.

  3. If you are using the default setup with groups=1, each filter kernel will use all input channels. Have a look at CS231n - Convolutions for a good explanation (see also the conv sketch after this list). I’m not sure what the norm argument does; it seems to be a custom implementation.

  4. The backward() call only calculates the gradients. The parameter updates are performed by optimizer.step(), which uses the calculated .grad attributes of all parameters to update them (see the optimizer sketch after this list).
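
To make points 1-3 concrete, here is a small sketch with a plain nn.Conv2d using the same hyperparameters (the custom norm wrapper is left out): with the default groups=1 every filter spans all 3 input channels, and each image in the batch keeps its own 64 feature maps.

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

# groups=1 (the default): every one of the 64 filters spans all 3 input channels
print(conv1.weight.shape)   # torch.Size([64, 3, 7, 7])

x = torch.randn(6, 3, 768, 1184)
out = conv1(x)              # [6, 64, 384, 592]: 64 feature maps per image, nothing is averaged across the batch

# Each sample is convolved independently; a single image gives the same result
# as the corresponding slice of the batched output (up to float precision).
print(torch.allclose(conv1(x[0:1]), out[0:1], atol=1e-5))  # True
```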
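
And a minimal sketch for point 4, assuming a criterion with the default reduction='mean' (so the scalar loss, and therefore the gradients, are averaged over the mini-batch): backward() only fills the .grad attributes, and optimizer.step() then performs a single update per mini-batch, not one per image.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()                 # default reduction='mean'

x = torch.randn(6, 3, 64, 64)            # smaller spatial size, same batch layout
target = torch.randn(6, 64, 32, 32)      # hypothetical regression target for illustration

optimizer.zero_grad()                    # clear gradients from the previous iteration
loss = criterion(model(x), target)       # one scalar loss for the whole mini-batch
loss.backward()                          # computes .grad for every parameter
optimizer.step()                         # a single parameter update per mini-batch, not per image
```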