RuntimeError: Given groups=1, weight of size [36, 36, 3, 3], expected input[16, 34, 25, 768] to have 36 channels, but got 34 channels instead

I have seen a number of posts with the above error, but I could not relate those problems to mine.

I have been trying to follow the object detection tutorial at the following URL: Object Detection Tutorial Pytorch. The MobileNet V2 model as shown in the tutorial runs properly; I have tried running it for a couple of epochs. Now I wanted to try a different backbone, so I set up the Swin-T model as follows:

 swin_backbone = torchvision.models.swin_t().features
 swin_backbone.out_channels = 36

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                    output_size=7,
                                                    sampling_ratio=2)

model = FasterRCNN(swin_backbone,
                       num_classes=num_classes,
                       rpn_anchor_generator=anchor_generator,
                       box_roi_pool=roi_pooler)

However, when I run the training loop as mentioned in the tutorial, I get the above-mentioned error for some of the images. I am not really sure how to get around this or find a solution for it. I am also not sure whether the problem lies in my dataset images or in the model itself. How should I set the value of out_channels in the backbone? Can someone help me with this?

I'm not entirely familiar with what modifications are needed to use Swin Transformer with the Faster-RCNN approach, but a simple test, e.g.,

>>> swin_backbone(torch.randn(1, 3, 224, 224)).shape
torch.Size([1, 7, 7, 768])

shows that the feature dimension in this case is actually 768, and that it sits in the last position of the shape rather than right after the batch dimension, where FasterRCNN expects it. I suspect that the 34 and 25 produced by the backbone are actually spatial dimensions rather than the feature or channel dimension.
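For reference, a quick check on a non-square input (assuming Swin-T's overall downsampling factor of 32, so a 1088x800 image should map to a 34x25 grid like the one in the error) shows the channels-last layout:

import torch
import torchvision

swin_features = torchvision.models.swin_t().features

# swin_t downsamples by 4 in the patch embedding and by 2 in each of the three
# patch-merging stages, i.e. 32x overall, and returns features channels-last (NHWC)
with torch.no_grad():
    out = swin_features(torch.randn(1, 3, 1088, 800))
print(out.shape)  # torch.Size([1, 34, 25, 768]) -> 34 and 25 are spatial, 768 is the channel dim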

At a minimum, you would need to add a permute to the backbone to change the ordering of the dimensions so that they appear in the order expected by FasterRCNN, and change swin_backbone.out_channels from 36 to 768.

Hello @eqy, thank you for your answer. I get your idea and it might just work. However, for the permutation part, I am not really sure where I can permute the output of the swin_transformer backbone before feeding it to FasterRCNN. Could you possibly point that out for me?

You could take a look at either creating a new model that combines the Swin backbone and the permute step as a new Sequential module, or modifying the model definition (given the torchvision example here: torchvision.models.swin_transformer - Torchvision main documentation) to add a permute operation before returning the features.
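As a rough sketch of the first option (the Permute wrapper below is a small helper defined here for illustration, not something taken from torchvision):

import torch
import torch.nn as nn
import torchvision

class Permute(nn.Module):
    # tiny helper module so the permute step can live inside an nn.Sequential
    def __init__(self, dims):
        super().__init__()
        self.dims = dims

    def forward(self, x):
        return torch.permute(x, self.dims)

swin_backbone = nn.Sequential(
    torchvision.models.swin_t().features,  # outputs NHWC features
    Permute((0, 3, 1, 2)),                 # reorder to NCHW as FasterRCNN expects
)
swin_backbone.out_channels = 768  # Swin-T's final embedding dimension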

Hey @eqy, I have been trying the approach of permuting the output of the Swin backbone, but it seems not to be so easy, with a lot of changes required (at least my first tries suggest so). Would there be an easier way to achieve the same? Say we decouple the code: take the output from Swin and pass the features to FasterRCNN. Is something like that possible? Could you possibly direct me to some examples/tutorials?

Could you check if something like the following works as a starting point?

import torch
import torch.nn as nn
import torchvision
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.models.detection.faster_rcnn import FasterRCNN

class MySwin(nn.Module):
  def __init__(self):
    super().__init__()
    self.backbone = torchvision.models.swin_t().features
    self.out_channels = 768

  def forward(self, x):
    return torch.permute(self.backbone(x), (0, 3, 1, 2))

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                       aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = torchvision.ops.MultiScaleRoIAlign(featmap_names=['0'],
                                                    output_size=7,
                                                    sampling_ratio=2)

swin_backbone = MySwin()
model = FasterRCNN(swin_backbone,
                       num_classes=1000, # made up a number here
                       rpn_anchor_generator=anchor_generator,
                       box_roi_pool=roi_pooler)

# test "inference" loop
for i in range(10):
  data = torch.randn(2, 3, 112 + i * 2, 112)
  model.eval()
  out = model(data)
  print("finished", i, out)

@eqy I am currently trying to run a training epoch and I am halfway through it without any problems. Your way of defining the model seems to be working. Thanks a lot.


Glad to hear it. I'm curious if the model reaches the accuracy you expect and if there are any additional tweaks needed (e.g., a different optimizer) compared to the more typical convolutional backbones.

@eqy Once I am done experimenting, I would be happy to share the results with you!