Convolutional Autoencoder

Hi, I'm trying to train a convolutional autoencoder on a dataset of 20k samples. Each sample is an array of 65536 elements, each a float value. I want to train the autoencoder to reduce the dimension of the dataset from 65536 → 1024 elements and then use the reduced dataset to train a DNN. The following is the autoencoder model:

import torch
import torch.nn as nn

class Conv1dAE(nn.Module):
    def __init__(self):
        super(Conv1dAE, self).__init__()
        # Encoder
        self.encoder = nn.Sequential(
            # Conv1d Layer 1: Input Channels = 64, Output Channels = 16, Kernel Size = 2, Stride = 2
            nn.Conv1d(64, 16, kernel_size=2, stride=2),
            nn.MaxPool1d(2),  # pooling step assumed from the stated output shapes
            # Output: [Batch, 16, 16384]
            # Conv1d Layer 2: Input Channels = 16, Output Channels = 32, Kernel Size = 2, Stride = 2
            nn.Conv1d(16, 32, kernel_size=2, stride=2),
            nn.MaxPool1d(2),
            # Output: [Batch, 32, 4096]
            # Conv1d Layer 3: Input Channels = 32, Output Channels = 64, Kernel Size = 2, Stride = 2
            nn.Conv1d(32, 64, kernel_size=2, stride=2),
            nn.MaxPool1d(2),
            # Output: [Batch, 64, 1024]
        )
        # Decoder
        self.decoder = nn.Sequential(
            # ConvTranspose1d Layer 1: Input Channels = 64, Output Channels = 32, Kernel Size = 4, Stride = 4
            nn.ConvTranspose1d(64, 32, kernel_size=4, stride=4),
            # Output: [Batch, 32, 4096]
            # ConvTranspose1d Layer 2: Input Channels = 32, Output Channels = 16, Kernel Size = 4, Stride = 4
            nn.ConvTranspose1d(32, 16, kernel_size=4, stride=4),
            # Output: [Batch, 16, 16384]
            # ConvTranspose1d Layer 3: Input Channels = 16, Output Channels = 64, Kernel Size = 4, Stride = 4
            nn.ConvTranspose1d(16, 64, kernel_size=4, stride=4),
            # Output: [Batch, 64, 65536]
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

And the following is the code that I use to reduce the original dataset:

encoder = Conv1dAE()
encoder.load_state_dict(torch.load('./results/autoencoder_conv_maxpool_1024_2024_06_19_00_07_28/autoencoder_conv_512512_2024_06_19_00_07_28.pth', map_location=torch.device('cpu')))

encoded_features_train = []
with torch.no_grad():
    for batch in train_dataloader:
        inputs = batch[0]
        outputs = encoder.encoder(inputs)  # Apply only the encoder
        encoded_features_train.append(outputs)
# Concatenate the low-dimensional representations into a single tensor
encoded_features_train = torch.cat(encoded_features_train, dim=0)

print(f"Encoded train features shape: {encoded_features_train.shape}")

Actually, I'm not sure about the operation that I use to get the dataset back into one-dimensional form, because the output of the autoencoder's encoder has more than one dimension.
If it is wrong, how can I add a fully connected (FC) linear layer between the encoder and decoder in order to obtain a one-dimensional output of 1024 elements?

Hi Michele!

You describe each sample as an array of 65536 floats, but the code you
posted seems to expect 64 channels for each of your 65536 elements.

First, by having your early layers be convolutions, you are saying that your
input data has some spatial (albeit one-dimensional) structure – perhaps
it’s a time series or audio sample. Is this true?

But then, do you want your reduced dataset to also have similar spatial
structure? When you say that the reduced dataset will be used to train
a DNN (dense neural network), I imagine the input to your DNN to be a
vector of 1024 “features” that don’t really have any particular spatial
structure (and that the first layer of your DNN is a fully-connected Linear).

So which is it? Does the first layer of your autoencoder (and your input data)
have one channel or 64?

If you want the output of the encoder section of your autoencoder to
consist of 1024 features (without spatial structure), keep max-pooling
(and convolving) down to a spatial extent of 1, but with 1024 channels.
(You can then squeeze() away the singleton spatial dimension to get
input of shape [Batch, 1024] for your DNN.)

(On the other hand, if you want the output of your encoder to have spatial
structure, then max-pool down to a spatial dimension of 1024, as you are
doing, but have your Conv1ds reduce their out_channels to a final value
of one.)
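
The two options above can be sketched on toy sizes. This is only an illustration: the length of 256, the 16 output features, and the layer widths are assumptions chosen to keep the example small (the sizes in this thread would be 65536 and 1024), and the ReLU activations are not taken from the posted model.

```python
import torch
import torch.nn as nn

# Toy batch: [Batch = 8, Channels = 1, Length = 256] (assumed sizes).
x = torch.randn(8, 1, 256)

# Option 1: shrink the spatial extent to 1 while growing the channels to 16.
features_encoder = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=4, stride=4),    # -> [8, 4, 64]
    nn.ReLU(),
    nn.Conv1d(4, 8, kernel_size=4, stride=4),    # -> [8, 8, 16]
    nn.ReLU(),
    nn.Conv1d(8, 16, kernel_size=4, stride=4),   # -> [8, 16, 4]
    nn.MaxPool1d(kernel_size=4),                 # -> [8, 16, 1]
)
features = features_encoder(x).squeeze(-1)       # drop the spatial axis -> [8, 16]

# Option 2: keep a spatial axis of length 16 and end with a single channel.
spatial_encoder = nn.Sequential(
    nn.Conv1d(1, 4, kernel_size=4, stride=4),    # -> [8, 4, 64]
    nn.ReLU(),
    nn.Conv1d(4, 1, kernel_size=4, stride=4),    # -> [8, 1, 16]
)
spatial = spatial_encoder(x).squeeze(1)          # drop the channel axis -> [8, 16]

print(features.shape, spatial.shape)
```

Both results have shape [Batch, 16], but only the second one keeps an ordering along its last axis that mirrors the ordering of the input.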


K. Frank

Hi, the dataset is extracted from the byte frequencies of different file formats, in this particular case two formats. So it is a “linear” dataset. Is it wrong to use 64 input channels in the first conv layer? I tried to use a single channel, but PyTorch expected 64 instead (maybe due to the batch_size, I guess). I just want to achieve a sort of smart dimensionality reduction from 65536 elements to 1024, nothing more. Could you show me a snippet with the last conv layer of the encoder?

Hi Michele!

When you say “linear” dataset, do you mean the following?

Consider a data sample that is a single vector, v, of length 65536, so of
shape [65536] (no batch nor channel dimension).

Do you mean that some single element, say v[100], is more closely related
to v[101] (or v[95] or v[110]) than it is to, say, v[10000], because it is
closer to v[101] in “index space?” (Or are the indices of the elements of
v essentially arbitrary, so that v[100] is no more closely related to v[101]
than it is to v[10000]?)

If your input data doesn’t have 64 channels, then your first Conv1d layer
should not have 64 channels.

Having a batch_size, but no explicit channels dimension could be your
problem.

A Conv1d expects either a two-dimensional input ([nChannels, length])
or a three-dimensional input ([nBatch, nChannels, length]), but the
input has to have an nChannels dimension, even if the number of channels
is one. If you pass in input of shape [nBatch = 64, length = 65536],
the Conv1d will treat your batch dimension as a channels dimension (and
will raise an error unless the Conv1d has in_channels = 64).

To be clear, in such a case you should be passing in a three-dimensional
input of shape [nBatch = 64, nChannels = 1, length = 65536],
with an explicit channels dimension (of size one). Note, you can use
unsqueeze (1) to add such a singleton channels dimension if your
input doesn’t already have a channels dimension.
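
A small sketch of the shape issue described above. The length of 4096 is a toy value chosen for speed, and the variable names are only for illustration.

```python
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=2, stride=2)

batch = torch.randn(64, 4096)        # [nBatch = 64, length], no channels dimension

# Without a channels dimension the 64 samples are read as 64 channels,
# which does not match in_channels = 1, so the call is rejected.
try:
    conv(batch)
    failed = False
except RuntimeError:
    failed = True

with_channels = batch.unsqueeze(1)   # [64, 1, 4096], explicit singleton channel
out = conv(with_channels)            # [64, 16, 2048]

print(failed, with_channels.shape, out.shape)
```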

To repeat the question I asked in my previous post, do you want the
reduced-dimensionality output of your encoder (of length 1024) to be
“linear” in the sense discussed above? If you intend to pass that output
into a fully-connected Linear layer of a DNN, then any “linear” structure
it might have would be irrelevant because the ordering of the columns of
the weight matrix of a Linear is essentially arbitrary.
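
To illustrate why the ordering does not matter to a Linear, here is a minimal sketch with toy sizes (8 inputs, 3 outputs, both assumed). Shuffling the input features and shuffling the columns of the weight matrix in the same way gives the same output.

```python
import torch

torch.manual_seed(0)

linear = torch.nn.Linear(8, 3)
x = torch.randn(5, 8)
perm = torch.randperm(8)             # an arbitrary reordering of the 8 features

# A second Linear whose weight columns are reordered like the inputs.
permuted = torch.nn.Linear(8, 3)
with torch.no_grad():
    permuted.weight.copy_(linear.weight[:, perm])
    permuted.bias.copy_(linear.bias)

# Same outputs (up to floating-point rounding) for the reordered inputs.
same = torch.allclose(linear(x), permuted(x[:, perm]), atol=1e-6)
print(same)
```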

That is, should the length-1024 output of the encoder reflect the “linear”
structure of its length-65536 input, but just “reduced in dimensionality”
by a factor of 64? Or should it just consist of 1024 “non-spatial” features
that have no particular “linear” structure?


K. Frank

Ok, let me explain. Each array of 65536 elements contains the bigram frequencies of a chunk of a file. 65536 comes from all the possible pairs of bytes, from 00-00 to FF-FF. I compute the frequency of each pair, then convert the pair from hex to decimal, and that value is the index at which I store the frequency in the array. Given that, there is no relation between the values of the array. The encoder should output a one-dimensional array of 1024 elements. That's it.
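
The indexing described above can be sketched in plain Python. This is only an illustration: the function name bigram_frequencies, the use of overlapping byte pairs, and the normalization by the number of pairs are assumptions, since the post does not say how a chunk is scanned or whether the counts are normalized.

```python
def bigram_frequencies(chunk):
    """Return a list of 65536 relative frequencies, one per byte pair."""
    counts = [0] * 65536
    # Overlapping pairs (assumption): bytes 0-1, 1-2, 2-3, ...
    for first, second in zip(chunk, chunk[1:]):
        # The pair 0xAB 0xCD read as one hex number 0xABCD is first * 256 + second.
        counts[first * 256 + second] += 1
    total = max(len(chunk) - 1, 1)   # number of pairs, guarded against empty input
    return [count / total for count in counts]


vector = bigram_frequencies(bytes([0x00, 0x00, 0xFF, 0xFF]))
# The three pairs 00-00, 00-FF and FF-FF land at indices 0, 255 and 65535.
print(len(vector), vector[0], vector[255], vector[65535])
```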

Hi Michele!

Okay, in that case you do not want to use convolutional layers. A
convolution assumes that neighboring elements are related to one another,
which is not the case for your data.

I assume that your goal is to train your encoder somehow to get the
length-1024 output and that you’re using an autoencoder so that you
can train the encoder by using some sort of unsupervised training to
train the whole autoencoder.

You can certainly work with a non-convolutional autoencoder.

Your encoder could look something like:

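torch.nn.Sequential (
    torch.nn.Linear (65536, 32768),
    torch.nn.ReLU(),   # could also be something like Sigmoid or Tanh
    torch.nn.Linear (32768, 16384), torch.nn.ReLU(),
    torch.nn.Linear (16384, 8192), torch.nn.ReLU(),
    torch.nn.Linear (8192, 4096), torch.nn.ReLU(),
    torch.nn.Linear (4096, 2048), torch.nn.ReLU(),
    torch.nn.Linear (2048, 1024)
)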

It will take in a single sample of shape [65536] and output a single result
of shape [1024] or take in a batch of shape [nBatch, 65536] and output
a batch of results of shape [nBatch, 1024].

Your decoder would basically be the same thing in reverse order with a
sequence of Linears that increase the number of features from 1024
to 65536. (There’s nothing magic about changing the number features
by a factor of two with each Linear – you just need to get from 65536
down to 1024 and then back up to 65536 with enough layers that you
can perform an adequate encoding-decoding computation.)
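
As a rough sketch of how the encoder and its mirrored decoder could be trained together, here is a toy version. The sizes (64 down to 16 instead of 65536 down to 1024), the MSELoss reconstruction loss, the Adam optimizer, and the random stand-in batch are all assumptions made for illustration, not something fixed by this thread.

```python
import torch

# Toy sizes (assumed): 64 input features reduced to 16.
encoder = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 16),
)
# The decoder mirrors the encoder, growing the features back to 64.
decoder = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 64),
)
autoencoder = torch.nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()

batch = torch.rand(8, 64)            # stand-in for a batch of frequency vectors
for _ in range(5):                   # a few unsupervised reconstruction steps
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(batch), batch)   # the target is the input itself
    loss.backward()
    optimizer.step()

# After training, only the encoder is needed to reduce the dataset.
with torch.no_grad():
    codes = encoder(batch)           # [8, 16]
print(codes.shape, loss.item())
```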


K. Frank

Ok, very clear! The only reason that moved me from a linear autoencoder to a convolutional one is memory consumption. A linear autoencoder with such layers (actually doubled, for encoder and decoder) takes quite a lot of memory (GPU or RAM) to train.

Hi Michele!

Because the 65536 elements of an input data sample are independent
of one another, you pretty much need your first Linear layer to have
in_features = 65536 so that you can train an independent set of weights
for each independent input element.

But beyond that you have a lot of flexibility.

It would be logically reasonable to do something like:

torch.nn.Sequential (
    torch.nn.Linear (65536, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear (2048, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear (2048, 1024)
)

(or any such variations on this general theme).

By having your first Linear drop the number of features to a much
smaller number, you will, obviously, have a much smaller model.
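
To put rough numbers on this, the parameter counts can be worked out without building the models. The sketch below assumes the earlier encoder halves the feature count at every layer from 65536 down to 1024; the helper name linear_params is made up for this example.

```python
def linear_params(sizes):
    """Total parameters of a stack of Linear layers with the given feature sizes."""
    # A Linear(n_in, n_out) holds n_in * n_out weights plus n_out biases.
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


halving = [65536, 32768, 16384, 8192, 4096, 2048, 1024]   # halve at every layer
narrow = [65536, 2048, 2048, 1024]                         # drop to 2048 right away

halving_params = linear_params(halving)
narrow_params = linear_params(narrow)

# 4 bytes per float32 parameter, weights only (no gradients or optimizer state).
print(halving_params, halving_params * 4 / 1e9, "GB")
print(narrow_params, narrow_params * 4 / 1e9, "GB")
```

Under these assumptions the halving encoder has about 2.86 billion parameters (roughly 11.5 GB in float32), while the narrow one has about 0.14 billion (roughly 0.56 GB). The decoder doubles both figures, and training needs additional memory for gradients and optimizer state.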

I don’t have any intuition about the trade-off between the size of your
model and its trainability and how well your encoder will work, but I
don’t have any argument that such an approach couldn’t work well.
For example, for a similar total number of parameters, you could have
fewer larger intermediate layers or more smaller intermediate layers.
I don’t have an opinion about which would be likely to work better for you.

(I view a lot of these kinds of questions as the “black magic” of neural
networks – sometimes you just have to try a handful of such variations
and see which trade-offs work best for your use case.)

Good luck!

K. Frank

Yeah, I know. I just tried convolution and hoped it would work, but this time it didn't. Thanks!