Why am I getting the error: Given groups=1, weight of size [8, 1024, 1, 1], expected input[8, 304, 9, 40] to have 1024 channels, but got 304 channels instead

I am working with YoloStereo3D for stereo 3D object detection (stereo camera only, no Velodyne LiDAR) on the KITTI dataset, with EdgeNeXt as the backbone instead of ResNet.

Before changing the backbone from ResNet to EdgeNeXt on the same KITTI dataset, everything was working fine. Afterwards, I started getting the error below:

RuntimeError: Given groups=1, weight of size [8, 1024, 1, 1], expected input[8, 304, 9, 40] to have 1024 channels, but got 304 channels instead
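As far as I can tell, the message itself just means that a convolution built for 1024 input channels is being fed a feature map that only has 304 channels. The same error can be reproduced in isolation with the shapes taken from the traceback above:

import torch
import torch.nn as nn

# 1x1 conv whose weight has shape [8, 1024, 1, 1], as in the error message
conv = nn.Conv2d(in_channels=1024, out_channels=8, kernel_size=1)

# feature map with only 304 channels, the input shape reported in the error
x = torch.randn(8, 304, 9, 40)

conv(x)  # RuntimeError: ... expected input[8, 304, 9, 40] to have 1024 channels, but got 304 channels instead

So somewhere between the new backbone and the rest of the network, the channel counts no longer match.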

Here is how I changed the backbone:

class YoloStereo3DCore(nn.Module):
    """
        Inference structure of YoloStereo3D.
        Similar to YoloMono3D.
        Left and right images are fed into the backbone in one batch, so they affect each other through BatchNorm2d.
    """
    def __init__(self, backbone_arguments):
        # debug trace of the call order
        with open("/home/zakaseb/Thesis/YoloStereo3D/Stereo3D/Sequence.txt", "a") as f:
            f.write("yolosterero3dCore_init \n")
        super(YoloStereo3DCore, self).__init__()
        self.backbone = edgenext_small(**backbone_arguments)  # was ResNet; the backbone is swapped here

        base_features = 256  # was: 256 if backbone_arguments['depth'] > 34 else 64, i.e. chosen from the ResNet depth
        self.neck = StereoMerging(base_features)  # StereoMerging outputs the merged features and the depth output
Here is edgenext_small():

@BACKBONE_DICT.register_module
def edgenext_small(pretrained=False, **kwargs):
    # FPS @ BS=1: 93.84 & @ BS=256: 1785.92 for MobileViT_S
    model = EdgeNeXt(depths=[3, 3, 9, 3], dims=[48, 96, 160, 304], expan_ratio=4,
                     global_block=[0, 1, 1, 1],
                     global_block_type=['None', 'SDTA', 'SDTA', 'SDTA'],
                     use_pos_embd_xca=[False, True, False, False],
                     kernel_sizes=[3, 5, 7, 9],
                     d2_scales=[2, 2, 3, 4],
                     classifier_dropout=0.0)

    return model

The dims parameter sets the channel width of each EdgeNeXt stage, so with dims=[48, 96, 160, 304] the last stage of your backbone outputs feature maps with 304 channels. Whatever sits downstream of the backbone (presumably the StereoMerging neck, judging by the weight of size [8, 1024, 1, 1] in the error) was built to take 1024 input channels, hence the mismatch. Perhaps you could simply change the final value of 304 to 1024, though I'm not sure what other implications that has for the efficiency of the model architecture.
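Alternatively, if you would rather keep the published EdgeNeXt widths, you could project the backbone output up to 1024 channels with a 1x1 conv before it reaches the neck. This is only a sketch (the ChannelAdapter class and its wiring are mine, not part of YoloStereo3D, and it assumes the neck really does expect 1024 channels):

import torch
import torch.nn as nn

class ChannelAdapter(nn.Module):
    """Hypothetical 1x1 projection from the EdgeNeXt output width (304)
    to the width the downstream layers were built for (1024)."""
    def __init__(self, in_channels=304, out_channels=1024):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

# quick shape check with the tensor size from the error message
adapter = ChannelAdapter()
x = torch.randn(8, 304, 9, 40)
print(adapter(x).shape)  # torch.Size([8, 1024, 9, 40])

You would apply it to the backbone output inside YoloStereo3DCore, wherever the features are handed to StereoMerging. Either way, the channel count leaving the backbone has to match what the first layer of the neck was constructed with.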

Thanks. I did that, and then ran into another, masking-related error. I backtracked, thinking the two errors were related, but I doubt they are. Thanks anyway!