I am using a Faster R-CNN model with a ResNet-50 + FPN backbone. These are the parameters relevant to this question:
 Minibatch size = 6 RGB images
 Number of GPUs = 1
 Input minibatch tensor size = torch.Size([6, 3, 768, 1184])
According to the ResNet-50 architecture, the first convolution layer is defined as follows: .conv1 = Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False, norm=get_norm(norm, out_channels))
After the model runs this first convolution, the conv1 output tensor size is torch.Size([6, 64, 384, 592]).
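For reference, the shape arithmetic above can be reproduced with a minimal standalone layer (a plain torch.nn.Conv2d with the same hyperparameters; this is a sketch, not Detectron2's norm-wrapped Conv2d):

```python
import torch
import torch.nn as nn

# Stand-in for ResNet-50's stem conv (omits the norm layer from the Detectron2 version)
conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)

x = torch.randn(6, 3, 768, 1184)  # minibatch of 6 RGB images
y = conv1(x)

# Each of the 64 filters has shape [3, 7, 7]: one 7x7 kernel per input channel
print(conv1.weight.shape)  # torch.Size([64, 3, 7, 7])

# Output spatial size: floor((768 + 2*3 - 7) / 2) + 1 = 384,
#                      floor((1184 + 2*3 - 7) / 2) + 1 = 592
print(y.shape)             # torch.Size([6, 64, 384, 592])
```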
I have four questions:

Going by the conv1 tensor size, do we get 64 feature maps for each image, or are the feature maps averaged and shared across the 6 images?

How does the convolution take place between the input and conv1, given that we have 6 images?

Does each of the 64 filters apply across all three RGB channels of an image, or does each filter apply directly to the colored pixel values without splitting them into separate channels?

I understand that at the end of a forward pass, the loss and gradients are averaged over the minibatch size. Are the weights updated iteratively for each image, or once per minibatch?