Modify resnet network to accept input from encoder

I would like to ask if something I have in mind is possible. I have the CIFAR10 dataset, say (x: images, y: labels). I use an encoder for x, whose last layers are a Conv2d(1024, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False) and a batchnorm. Can I use the new x from the encoder, together with the original y, to train e.g. a resnet50 model for classification? How should I modify the first layers of the resnet? I would be very grateful if someone could explain the process to me, or even how to search for this, for example a keyword I should look for.
Thanks !

Since your encoder output will have 1024 channels, you would have to make sure that the first conv layer of the resnet accepts this number of input channels.
By default the first layer accepts 3 input channels (an RGB image tensor), and you could replace it via:

import torch.nn as nn
from torchvision import models

model = models.resnet50()
conv1 = model.conv1
# replace the first conv layer so that it accepts 1024 input channels
model.conv1 = nn.Conv2d(
    1024, conv1.out_channels, conv1.kernel_size, conv1.stride, conv1.padding,
    conv1.dilation, conv1.groups, bias=conv1.bias is not None)
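As a quick sanity check (the 7x7 spatial size here is just a hypothetical example), you could push a random tensor with 1024 channels through the modified model:

import torch

x = torch.randn(2, 1024, 7, 7)   # hypothetical encoder output: [batch, channels, H, W]
out = model(x)
print(out.shape)                 # torch.Size([2, 1000])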

I’m not sure how large the spatial size of your encoder output is, but resnets were trained on inputs with a resolution of 224x224.
That being said, since an adaptive pooling layer is used before the linear layer, you are flexible regarding the spatial size, as long as it doesn’t become too small for some of the layers.
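For illustration (the resolutions below are just hypothetical examples), the default resnet50 accepts different input sizes because of the adaptive pooling:

import torch
from torchvision import models

model = models.resnet50()
for size in (224, 128, 64):           # hypothetical input resolutions
    x = torch.randn(1, 3, size, size)
    out = model(x)                    # adaptive avg pooling fixes the size fed to the fc layer
    print(size, out.shape)            # torch.Size([1, 1000]) for each resolution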

Thanks for your answer!
Can you explain to me the “spatial size” of the encoder? As I understand it, you are saying that resnets are trained for a specific input size of 224x224, so in order to use a resnet I must have the same input size. But as far as I understand, once we pass an image through the encoder we don’t have pixels anymore. For example I have:
x.shape = ([128, 3, 32, 32]), where 128 is the batch size, 3 the number of channels and 32x32 the pixels of each image.
When I call rkhs_1 = encode(x), then rkhs_1.shape = ([128, 1024, 1, 1]), and I want to use this as input to a resnet to train a classifier for CIFAR10.
If I had pixels as input, what I would do for the resnet is change the first layers:

model.conv1 = nn.Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)  # smaller kernel and stride for 32x32 inputs
model.maxpool = nn.Identity()  # remove the early downsampling

but now I don’t have pixels, I only have a 1024-dimensional vector for each image.

An input of [batch_size, 1024, 1, 1] won’t work for a standard resnet, as the conv layers use kernels larger than a single pixel, and pooling layers are used as well, which won’t be able to decrease the spatial dimensions any further.

You could try to

  • reshape the encoder output and create “fake” spatial dims (not sure how well this would work),
  • expand the encoder output (and basically repeat the pixels),
  • use another model with e.g. linear layers,
  • add transposed convolutions after the encoder to upsample the activation and then pass it to the resnet (a rough sketch of this and the expand option follows below).

I’m sure there are more options. :wink:
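
Very roughly, and with all layer choices and sizes being hypothetical, the second and last options could look something like this:

import torch
import torch.nn as nn
from torchvision import models

rkhs_1 = torch.randn(128, 1024, 1, 1)           # hypothetical encoder output

# option: expand the single pixel to a "fake" spatial size by repeating it
x_exp = rkhs_1.expand(-1, -1, 8, 8)             # [128, 1024, 8, 8]

# option: learn the upsampling with transposed convolutions
upsample = nn.Sequential(
    nn.ConvTranspose2d(1024, 1024, kernel_size=4, stride=2, padding=1),  # 1x1 -> 2x2
    nn.ReLU(inplace=True),
    nn.ConvTranspose2d(1024, 1024, kernel_size=4, stride=2, padding=1),  # 2x2 -> 4x4
)
x_up = upsample(rkhs_1)                         # [128, 1024, 4, 4]

# resnet50 with conv1 replaced to accept 1024 channels (as above) and 10 output classes
model = models.resnet50()
model.conv1 = nn.Conv2d(1024, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 10)
out = model(x_up)                               # [128, 10] class logits for CIFAR10

Whether any of these work well in practice will depend on how much information survived the 1x1 bottleneck, so it might also be worth comparing against a simple linear classifier on top of the 1024 features.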