It’s the first time I’m trying my hand at a convolutional autoencoder model. More specifically, I would like to replicate the model proposed in the paper “Deconvolutional Paragraph Representation Learning”. While this is about text, my current issues are with the Conv/Deconv setup in general. In principle, I have the model ready and it seems to be training. However, I’ve noticed a stumbling block that I’m not sure how to handle best.
To stay as close as possible to the paper, all the `Conv` layers use `stride=2` as a replacement for pooling/unpooling to improve training time (I’m not arguing about whether this is a valid alternative here; just sticking to the paper). However, with `stride=2`, and depending on the kernel sizes and the length `seq_len` of the input sentences, the `ConvTranspose` layers – initialized with parameters matching the `Conv` layers – might not yield matching output sizes `L_out`. Given that the formula for `L_out` of a `Conv` layer includes a `floor`, it’s easy to see why this happens.
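To make the mismatch concrete, here are the length formulas from the PyTorch docs for `Conv1d` and `ConvTranspose1d`, written out as plain Python (the helper names are mine):

```python
def conv1d_l_out(l_in, kernel_size, stride=1, padding=0, dilation=1):
    # Conv1d: L_out = floor((L_in + 2*padding - dilation*(kernel_size-1) - 1) / stride + 1)
    return (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

def conv_transpose1d_l_out(l_in, kernel_size, stride=1, padding=0,
                           output_padding=0, dilation=1):
    # ConvTranspose1d: L_out = (L_in-1)*stride - 2*padding
    #                          + dilation*(kernel_size-1) + output_padding + 1
    return ((l_in - 1) * stride - 2 * padding
            + dilation * (kernel_size - 1) + output_padding + 1)

# kernel_size=5, stride=2, seq_len=30 as in the example further below:
encoded = conv1d_l_out(30, kernel_size=5, stride=2)                 # -> 13
decoded = conv_transpose1d_l_out(encoded, kernel_size=5, stride=2)  # -> 29, not 30
```

Because the `floor` in the `Conv1d` formula discards information about `L_in`, the transposed layer cannot recover the original length in general; that is exactly what `output_padding` exists to compensate for.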
Currently I handle this by setting `output_padding` in the `ConvTranspose` layers to 0 or 1 to make up for the “mistakes”. While this works, I have to make this adjustment manually every time I change the kernel sizes or `seq_len`. I guess I could calculate the values for `output_padding` automatically.
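Calculating it automatically could look something like this – a sketch using the `ConvTranspose1d` length formula from the PyTorch docs (the function name is mine):

```python
def required_output_padding(l_in, l_target, kernel_size, stride,
                            padding=0, dilation=1):
    """output_padding needed so a ConvTranspose1d maps length l_in to l_target."""
    # ConvTranspose1d output length with output_padding=0
    l_out = (l_in - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + 1
    pad = l_target - l_out
    # PyTorch requires output_padding to be smaller than stride or dilation
    if not 0 <= pad < max(stride, dilation):
        raise ValueError(f"cannot reach length {l_target}: would need output_padding={pad}")
    return pad
```

For the example further below, `required_output_padding(13, 30, kernel_size=5, stride=2)` returns 1, which could then be passed straight to `nn.ConvTranspose1d(..., output_padding=1)` when building the decoder.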
Is there a more straightforward way to ensure that the output sizes match up, or is it simply always up to me to get all the numbers right manually?
If my problem is not quite clear, here’s a minimal example:
```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=5, kernel_size=5, stride=2)
deconv = nn.ConvTranspose1d(in_channels=5, out_channels=1, kernel_size=5, stride=2)

inputs = [[[1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9,0]]]
inputs = torch.tensor(inputs, dtype=torch.float32)

encoded = conv(inputs)
decoded = deconv(encoded)

print("encoded.shape =", encoded.shape)
print("decoded.shape =", decoded.shape)
print("target.shape =", inputs.shape)
```

This prints:

```
encoded.shape = torch.Size([1, 5, 13])
decoded.shape = torch.Size([1, 1, 29])
target.shape = torch.Size([1, 1, 30])
```
Here, `L_out` of `deconv` is 29 and does not match the target size of 30. In case I want to keep `kernel_size=5` as in the paper, I have two alternatives to fix that:

- add `output_padding=1` to `ConvTranspose1d`, so the output and target sizes are both 30
- increase the input size by 1, so both the output and target sizes are 31
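The first alternative generalizes to a stack of Conv layers like the one in the paper: walk the encoder lengths forward, then derive one `output_padding` per mirrored decoder layer. A sketch assuming `padding=0` and `dilation=1` throughout (the function name is mine):

```python
def plan_output_paddings(seq_len, layers):
    """layers: list of (kernel_size, stride) for the encoder, in order.
    Returns the output_padding for each mirrored ConvTranspose1d layer,
    in decoder order (i.e. last encoder layer first)."""
    # Lengths after each Conv1d layer (padding=0, dilation=1)
    lengths = [seq_len]
    for k, s in layers:
        lengths.append((lengths[-1] - k) // s + 1)

    pads = []
    # Each decoder layer must restore the length its encoder counterpart consumed
    for (k, s), l_in, l_target in zip(reversed(layers),
                                      reversed(lengths[1:]),
                                      reversed(lengths[:-1])):
        l_out = (l_in - 1) * s + k  # ConvTranspose1d length with output_padding=0
        pads.append(l_target - l_out)
    return pads

plan_output_paddings(30, [(5, 2)])          # -> [1]
plan_output_paddings(30, [(5, 2), (5, 2)])  # -> [0, 1]
```

The returned values could then be fed to the `output_padding` argument of each `nn.ConvTranspose1d` when the decoder is constructed, so changing the kernel sizes or `seq_len` no longer requires manual adjustment.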