It’s the first time I’m trying my hand at a convolutional autoencoder model. More specifically, I would like to replicate the model proposed in the paper “Deconvolutional Paragraph Representation Learning”. While this is about text, my current issues are with the Conv/Deconv setup in general. In principle, I have the model ready and it seems to be training. However, I’ve noticed a stumbling block that I’m not sure how to handle best.
To stay as close as possible to the paper, all the `Conv` layers use `stride=2` as a replacement for pooling/unpooling to improve training time (I’m not arguing about whether this is a valid alternative here; just sticking to the paper). However, with `stride=2`, and depending on the kernel sizes and the length `seq_len` of the input sentences, the `ConvTranspose` layers – initialized with parameters matching the `Conv` layers – might not yield matching output sizes `L_out`. Given that the formula for `L_out` of a `Conv` layer includes a `floor`, it’s easy to see why this happens.
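To make the mismatch concrete, here are the length formulas from the PyTorch docs for `Conv1d` and `ConvTranspose1d`, written out as plain Python (the helper names are mine):

```python
def conv1d_l_out(l_in, kernel_size, stride=1, padding=0, dilation=1):
    # Conv1d: L_out = floor((L_in + 2*padding - dilation*(kernel_size-1) - 1) / stride + 1)
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def conv_transpose1d_l_out(l_in, kernel_size, stride=1, padding=0,
                           output_padding=0, dilation=1):
    # ConvTranspose1d: L_out = (L_in-1)*stride - 2*padding
    #                          + dilation*(kernel_size-1) + output_padding + 1
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# kernel_size=5, stride=2, seq_len=30 as in the example further below:
encoded = conv1d_l_out(30, kernel_size=5, stride=2)                 # -> 13
decoded = conv_transpose1d_l_out(encoded, kernel_size=5, stride=2)  # -> 29, not 30
```

Because the `floor` in the `Conv1d` formula discards information about `L_in`, the transposed layer cannot recover the original length in general; that is exactly what `output_padding` exists to compensate for.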
Currently I handle this by setting `output_padding` in the `ConvTranspose` layers to 0 or 1 to make up for the “mistakes”. While this works, I have to make this adjustment manually every time I change the kernel sizes or `seq_len`. I guess I could calculate the values for `output_padding` automatically.
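Calculating it automatically could look something like this – a sketch using the `ConvTranspose1d` length formula from the PyTorch docs (the function name is mine):

```python
def required_output_padding(l_in, l_target, kernel_size, stride,
                            padding=0, dilation=1):
    """output_padding needed so a ConvTranspose1d maps length l_in to l_target."""
    # ConvTranspose1d output length with output_padding=0
    l_out = (l_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + 1
    pad = l_target - l_out
    # PyTorch requires output_padding to be smaller than stride or dilation
    if not 0 <= pad < max(stride, dilation):
        raise ValueError(f"cannot reach length {l_target}: would need output_padding={pad}")
    return pad
```

For the example further below, `required_output_padding(13, 30, kernel_size=5, stride=2)` returns 1, which could then be passed straight to `nn.ConvTranspose1d(..., output_padding=1)` when building the decoder.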
Is there a more straightforward way to ensure that the output sizes match up, or is it simply always up to me to get all the numbers right manually?
If my problem is not quite clear, here’s a minimal example:
```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=5, kernel_size=5, stride=2)
deconv = nn.ConvTranspose1d(in_channels=5, out_channels=1, kernel_size=5, stride=2)

inputs = [[[1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0]]]
inputs = torch.tensor(inputs, dtype=torch.float32)

encoded = conv(inputs)
decoded = deconv(encoded)

print("encoded.shape =", encoded.shape)
print("decoded.shape =", decoded.shape)
print("target.shape =", inputs.shape)
```

This prints:

```
encoded.shape = torch.Size([1, 5, 13])
decoded.shape = torch.Size([1, 1, 29])
target.shape = torch.Size([1, 1, 30])
```
Here, `L_out` of `deconv` is 29 and does not match the target size of 30. In case I want to keep `kernel_size=5` as in the paper, I have two alternatives to fix that:

- add `output_padding=1` to `ConvTranspose1d`, so the output and target sizes are both 30
- increase the input size by 1, so both the output and target sizes are 31
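The first alternative generalizes to a stack of Conv layers like the one in the paper: walk the encoder lengths forward, then derive one `output_padding` per mirrored decoder layer. A sketch assuming `padding=0` and `dilation=1` throughout (the function name is mine):

```python
def plan_output_paddings(seq_len, layers):
    """layers: list of (kernel_size, stride) for the encoder, in order.
    Returns the output_padding for each mirrored ConvTranspose1d layer,
    in decoder order (i.e. last encoder layer first)."""
    # Lengths after each Conv1d layer (padding=0, dilation=1)
    lengths = [seq_len]
    for k, s in layers:
        lengths.append((lengths[-1] - k) // s + 1)

    pads = []
    # Each decoder layer must restore the length its encoder counterpart consumed
    for (k, s), l_in, l_target in zip(reversed(layers),
                                      reversed(lengths[1:]),
                                      reversed(lengths[:-1])):
        l_out = (l_in - 1) * s + k  # ConvTranspose1d length with output_padding=0
        pads.append(l_target - l_out)
    return pads

plan_output_paddings(30, [(5, 2)])          # -> [1]
plan_output_paddings(30, [(5, 2), (5, 2)])  # -> [0, 1]
```

The returned values could then be fed to the `output_padding` argument of each `nn.ConvTranspose1d` when the decoder is constructed, so changing the kernel sizes or `seq_len` no longer requires manual adjustment.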