Convolutional autoencoder, how to precisely decode (ConvTranspose2d)

I’m trying to code a simple convolutional autoencoder for the MNIST digit dataset. My plan is to use it as a denoising autoencoder.

I’m trying to replicate an architecture proposed in a paper. The network looks like this:

| Network | Layer       | Activation |
|---------|-------------|------------|
| Encoder | Convolution | ReLU       |
| Encoder | Max Pooling | -          |
| Encoder | Convolution | ReLU       |
| Encoder | Max Pooling | -          |
| Decoder | Convolution | ReLU       |
| Decoder | Upsampling  | -          |
| Decoder | Convolution | ReLU       |
| Decoder | Upsampling  | -          |
| Decoder | Convolution | Sigmoid    |

Here is the code I have so far; I’ve never performed “deconvolution”, so I’m a bit lost.

# Conv network
self.convEncoder = nn.Sequential(
	# Spatial output size of each conv layer = floor((input_size + 2 * padding - kernel_size) / stride) + 1
	# Here: floor((28 + 2 * 1 - 5) / 1) + 1 = 26
	nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5, padding=1, stride=1),
	nn.ReLU(),
	nn.MaxPool2d(kernel_size=2),    # End up with 10 channels of size 13 x 13

	# Here: floor((13 + 2 * 1 - 5) / 1) + 1 = 11
	nn.Conv2d(in_channels=10, out_channels=24, kernel_size=5, padding=1, stride=1),
	nn.ReLU(),
	nn.MaxPool2d(kernel_size=2),  # End up with 24 channels of size 5 x 5
)

# Not sure what to do here
self.convDecoder = nn.Sequential(
	# ???? 
	nn.ConvTranspose2d(??, ??, ??),	# 24, 10, 5 ???
	nn.ReLU(),
	nn.Upsample(scale_factor=??),	# ???

	nn.ConvTranspose2d(??, ??, ??),	# 10, 1, 5 ???
	nn.ReLU(),
	nn.Upsample(scale_factor=??),	# ???

	nn.ConvTranspose2d(??, ??, ??),	# ???
	nn.Sigmoid()
)
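
For what it’s worth, the encoder half seems to check out on its own; a quick shape test with a random 28 x 28 input (mimicking an MNIST digit) gives the expected 24 x 5 x 5 code:

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(10, 24, kernel_size=5, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)
print(encoder(torch.randn(1, 1, 28, 28)).shape)    # torch.Size([1, 24, 5, 5])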

@ptrblck have you ever faced this situation?

I haven’t written an autoencoder with this exact structure, but I assume you are wondering which setup to use for the transposed convolutions?
If so, you could start by “inverting” the encoder path and using the inverse channel dimensions. The kernel size, stride, etc. should most likely be set in a way that reproduces the input spatial size.
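
For example, here is one possible decoder sketch that mirrors your encoder and restores the 28 x 28 input size. Note that the hidden channel count of 10 and the explicit nn.Upsample sizes are my choices to make the shapes work out, not something taken from the paper:

self.convDecoder = nn.Sequential(
    # Transposed conv output size = (input_size - 1) * stride - 2 * padding + kernel_size
    nn.ConvTranspose2d(24, 10, kernel_size=5, padding=1),  # 5 -> 7
    nn.ReLU(),
    nn.Upsample(size=(13, 13)),                            # 7 -> 13
    nn.ConvTranspose2d(10, 10, kernel_size=5, padding=1),  # 13 -> 15
    nn.ReLU(),
    nn.Upsample(size=(26, 26)),                            # 15 -> 26
    nn.ConvTranspose2d(10, 1, kernel_size=5, padding=1),   # 26 -> 28
    nn.Sigmoid()
)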
If you don’t want to calculate it manually, add Print layers to the model to check the output activation shapes and adapt the setup:

import torch.nn as nn

class Print(nn.Module):
    """Debug helper: prints the activation shape and passes the input through unchanged."""
    def __init__(self):
        super(Print, self).__init__()

    def forward(self, x):
        print(x.shape)
        return x
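
For example, using the first conv/pool stage from your encoder:

import torch

debug_model = nn.Sequential(
    nn.Conv2d(1, 10, kernel_size=5, padding=1),
    Print(),            # prints torch.Size([1, 10, 26, 26])
    nn.MaxPool2d(2),
    Print(),            # prints torch.Size([1, 10, 13, 13])
)
out = debug_model(torch.randn(1, 1, 28, 28))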

@ptrblck thank you for your reply.

In the paper I’m reading, they show the following architecture
*(screenshot of the architecture diagram from the paper)*

Do you think that by “upsampling” they mean an actual upsampling function such as nn.Upsample, or are they just using ConvTranspose2d and playing with the stride?

I mean, we can achieve the same downsampling without max pooling by playing with the stride too, right? But if they mention it in the architecture, it means they’re actually applying it, right? What would you say, given your experience?
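
For instance, both of these produce the same 13 x 13 output from a 28 x 28 input (hypothetical layers, just to illustrate the point):

import torch
import torch.nn as nn

pooled = nn.Sequential(nn.Conv2d(1, 10, kernel_size=5, padding=1), nn.MaxPool2d(2))
strided = nn.Conv2d(1, 10, kernel_size=5, padding=1, stride=2)

x = torch.randn(1, 1, 28, 28)
print(pooled(x).shape)     # torch.Size([1, 10, 13, 13])
print(strided(x).shape)    # torch.Size([1, 10, 13, 13])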

I guess the best option would be to contact the authors to get more information.

I would guess that “Upsampling” refers to an nn.Upsample layer, but a transposed conv could also be used. However, I think the former is more likely.
Do you see any mention of the number of parameters? If the authors state that the upsampling layers don’t use any trainable parameters, they should be interpolation layers (i.e. nn.Upsample); on the other hand, if these layers do use parameters, that would point towards a transposed conv.
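
A quick way to check the difference in PyTorch (made-up layer sizes, just to show the parameter counts):

import torch.nn as nn

up = nn.Upsample(scale_factor=2)
tconv = nn.ConvTranspose2d(10, 1, kernel_size=5)

print(sum(p.numel() for p in up.parameters()))       # 0 -> pure interpolation
print(sum(p.numel() for p in tconv.parameters()))    # 251 = 10 * 1 * 5 * 5 weights + 1 bias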