Pooling vs. downsampling in autoencoder and how to upsample

Hi everyone,

I am building a simple 1-D autoencoder with fully connected networks. Can someone explain to me the pros and cons of (A) using the fully-connected layers themselves to downsample (i.e., set the inputs to 512 and the outputs to 256) versus (B) having the fully connected layer stay the same size (i.e., 512 to 512) and then using a pooling layer to downsample? I feel like choice A would be better and more general because it wouldn’t have the oddities of the pooling layers. So I am not sure when/why the pooling layers are used.
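To make the two choices concrete, here is a minimal sketch of both, using the 512 → 256 sizes from the question (the layer names and batch size are just illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 512)  # batch of 8 feature vectors

# Option A: the fully connected layer itself downsamples (512 -> 256)
down_a = nn.Linear(512, 256)
y_a = down_a(x)

# Option B: a same-size fully connected layer (512 -> 512),
# then average pooling halves the length (512 -> 256)
down_b = nn.Linear(512, 512)
y_b = nn.functional.avg_pool1d(down_b(x).unsqueeze(1), kernel_size=2).squeeze(1)

print(y_a.shape, y_b.shape)  # both torch.Size([8, 256])
```

Both paths end at the same output size; the difference is that option A learns the downsampling weights, while option B fixes the downsampling to a local average.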

Also, if I do use a pooling operator, I need one that doesn’t store indices, because I plan to use the encoder and decoder separately. Should I use an nn.AvgPool1d? If so, how do I upsample in a simple way?
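For context on the indices concern, this is a small sketch of the difference (shapes and batch size are just for illustration): max pooling only produces indices if you request them, and only nn.MaxUnpool1d needs them; average pooling never produces any.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 1, 512)  # (batch, channels, length)

# MaxPool1d only returns indices when asked (return_indices=True);
# nn.MaxUnpool1d in a decoder would need those indices from the encoder.
pooled_max, idx = nn.MaxPool1d(kernel_size=2, return_indices=True)(x)

# AvgPool1d has no indices at all, so the decoder needs nothing from the encoder.
pooled_avg = nn.AvgPool1d(kernel_size=2)(x)

print(pooled_avg.shape)  # torch.Size([8, 1, 256])
```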

Thanks in advance!

It’s a matter of learning a downsampling method versus using a fixed, existing one (average pooling). It’s hard to say which is better in general, which is why you see a bit of both in the literature.

I think using average pooling makes a lot of sense in, say, image domains. This allows the model to capture fine details in the early layers and coarse details in the average downsampled features.

I think using a learned downsampling makes sense when the data doesn’t have spatial structure, that is, when values “next” to each other aren’t related. Average pooling implicitly assumes that neighboring values are related (it averages them), so in that case I would probably go with a learned downsampling.

For upsampling, it’s essentially the same choice: a learned upsampling or an existing technique. You can use torch.nn.functional.interpolate to upsample with an existing method (nearest, linear, etc.), or use a linear layer or a transposed convolution to get a learned upsampling.
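The three upsampling options above can be sketched as follows; the 256 → 512 sizes and single channel are just illustrative assumptions, not anything fixed by the thread:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

z = torch.randn(8, 1, 256)  # (batch, channels, length) latent code

# Fixed upsampling: no parameters, nothing needs to be stored by the encoder
up_fixed = F.interpolate(z, scale_factor=2, mode="linear", align_corners=False)

# Learned upsampling, option 1: a linear layer on the flattened features
up_linear = nn.Linear(256, 512)(z.squeeze(1))

# Learned upsampling, option 2: a transposed convolution doubling the length
up_tconv = nn.ConvTranspose1d(1, 1, kernel_size=2, stride=2)(z)

print(up_fixed.shape, up_linear.shape, up_tconv.shape)
# torch.Size([8, 1, 512]) torch.Size([8, 512]) torch.Size([8, 1, 512])
```

None of these require indices from the encoder, so all three work when the encoder and decoder are used separately.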

Hope this helps!

Thank you for your reply. Let me think about this and I’ll post again if I have any questions.