Why is my deeper network not training as well?

Hi everyone,

I’m new to deep learning and started by implementing an autoencoder for time-series data, which seemed simple enough, or so I thought. However, the model’s performance gets worse (even on the training data) as I make the model deeper, which doesn’t make any sense to me. Here’s my first autoencoder (model 1), implemented in PyTorch:

# model 1
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        # encoder
        self.enc1 = nn.Linear(in_features=512, out_features=256)
        self.enc2 = nn.Linear(in_features=256, out_features=128)
        self.enc3 = nn.Linear(in_features=128, out_features=64)
 
        # decoder 
        self.dec1 = nn.Linear(in_features=64, out_features=128)
        self.dec2 = nn.Linear(in_features=128, out_features=256)
        self.dec3 = nn.Linear(in_features=256, out_features=512)

    def forward(self, x):
        # encoder
        x = F.relu(self.enc1(x))
        x = F.relu(self.enc2(x))
        x = F.relu(self.enc3(x))
        
        # decoder
        x = F.relu(self.dec1(x))
        x = F.relu(self.dec2(x))
        x = self.dec3(x) # no ReLU on the last layer
        return x

Pretty straightforward… just a vanilla autoencoder built from fully-connected layers. I am able to train this model on a training set of over 200k examples using MSELoss() and the Adam optimizer (lr = 1e-3); my training loop is roughly the sketch below. But the loss (even on the training set) doesn’t go down as low as I want it to, so I made the model one layer deeper at each stage to see if it would train better; that version (model 2) follows the sketch:
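Simplified sketch of my training setup (the dummy data, batch size, and epoch count below are placeholders, not my real pipeline):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data; my real training set has over 200k time-series examples of length 512.
train_tensor = torch.randn(1000, 512)
loader = DataLoader(TensorDataset(train_tensor), batch_size=256, shuffle=True)

model = Autoencoder()                   # the model 1 class defined above
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(10):                 # epoch count is arbitrary here
    for (batch,) in loader:
        optimizer.zero_grad()
        recon = model(batch)
        loss = criterion(recon, batch)  # reconstruct the input itself
        loss.backward()
        optimizer.step()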

# model 2
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        # encoder
        self.enc0 = nn.Linear(in_features=512, out_features=512)
        self.enc1 = nn.Linear(in_features=512, out_features=256)
        self.enc2 = nn.Linear(in_features=256, out_features=256)
        self.enc3 = nn.Linear(in_features=256, out_features=128)
        self.enc4 = nn.Linear(in_features=128, out_features=128)
        self.enc5 = nn.Linear(in_features=128, out_features=64)
        self.enc6 = nn.Linear(in_features=64, out_features=64)

        # decoder 
        self.dec0 = nn.Linear(in_features=64, out_features=64)
        self.dec1 = nn.Linear(in_features=64, out_features=128)
        self.dec2 = nn.Linear(in_features=128, out_features=128)
        self.dec3 = nn.Linear(in_features=128, out_features=256)
        self.dec4 = nn.Linear(in_features=256, out_features=256)
        self.dec5 = nn.Linear(in_features=256, out_features=512)
        self.dec6 = nn.Linear(in_features=512, out_features=512)

    def forward(self, x):
        # encoder
        x = F.relu(self.enc0(x))
        x = F.relu(self.enc1(x))
        x = F.relu(self.enc2(x))
        x = F.relu(self.enc3(x))
        x = F.relu(self.enc4(x))
        x = F.relu(self.enc5(x))
        x = F.relu(self.enc6(x))
        
        # decoder
        x = F.relu(self.dec0(x))
        x = F.relu(self.dec1(x))
        x = F.relu(self.dec2(x))
        x = F.relu(self.dec3(x))
        x = F.relu(self.dec4(x))
        x = F.relu(self.dec5(x))
        x = self.dec6(x) # no ReLU on the last layer

        return x

As you can see, all I am doing is adding an extra fully-connected layer to each stage of the autoencoder. This is a very simple change that I thought would improve performance. After all, the whole point of deeper networks is that they can learn useful features that encode the signal better. So I was very surprised when I trained it and it ended up performing much worse:

Note that this convergence plot shows the loss on the TRAINING SET, so overfitting is not the issue; we are not even talking about the test set here. I tried increasing the learning rate and letting it run longer, but it doesn’t seem to get out of this local minimum. When I repeat the experiment, it keeps converging to that minimum, almost as if this is the best it can do. This is very surprising because I thought that, at the very least, model 2 would be NO WORSE than model 1: the additional layers could simply learn the identity function, so the performance of model 2 should be better than (or at least equal to) that of model 1.

So I’m almost embarrassed to ask: what am I doing wrong here? I’ve been reading about the problems with deep networks, and I’m not sure whether those problems apply here. For example, vanishing gradients can be a problem when networks get too deep, but I would think that using ReLUs, and the fact that my network is not THAT deep, should ameliorate that issue.

Any suggestions? Sorry about the total n00b question, but I feel that if I can’t even get a simple autoencoder working I’m in trouble. Thanks in advance for your help!

Hi,
I strongly suggest you use normalization. The layers aren’t being fed normalized data, and that slows down their training. Try adding BatchNorm1d(channels) between the linear layers.
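Something along these lines, just as a sketch (the 512/256 widths are placeholders; one common placement is Linear -> BatchNorm1d -> ReLU for each hidden block):

import torch.nn as nn

# Sketch only: the sizes are placeholders, not your exact model.
block = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),  # normalizes the 256 features over the batch
    nn.ReLU(),
)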

Thank you for the suggestion! I did try BatchNorm after every layer, and while it got slightly better, it still doesn’t match the results of the shallower network. Here are the new convergence curves:

You can see that the new BatchNorm curve (model 3) is better than model 2, but still not as good as the shallower model 1. I am not sure why! Here’s my code, just to confirm that I am doing the BatchNorm correctly:

# model 3 - BatchNorm
class Autoencoder(nn.Module):
    def __init__(self):
        super(Autoencoder, self).__init__()
        # encoder
        self.enc0 = nn.Linear(in_features=512, out_features=512)
        self.e0 = nn.BatchNorm1d(512, affine=True)
        self.enc1 = nn.Linear(in_features=512, out_features=256)
        self.e1 = nn.BatchNorm1d(256, affine=True)
        self.enc2 = nn.Linear(in_features=256, out_features=256)
        self.e2 = nn.BatchNorm1d(256, affine=True)
        self.enc3 = nn.Linear(in_features=256, out_features=128)
        self.e3 = nn.BatchNorm1d(128, affine=True)
        self.enc4 = nn.Linear(in_features=128, out_features=128)
        self.e4 = nn.BatchNorm1d(128, affine=True)
        self.enc5 = nn.Linear(in_features=128, out_features=64)
        self.e5 = nn.BatchNorm1d(64, affine=True)
        self.enc6 = nn.Linear(in_features=64, out_features=64)
        self.e6 = nn.BatchNorm1d(64, affine=True)
 
        # decoder 
        self.dec0 = nn.Linear(in_features=64, out_features=64)
        self.d0 = nn.BatchNorm1d(64, affine=True)
        self.dec1 = nn.Linear(in_features=64, out_features=128)
        self.d1 = nn.BatchNorm1d(128, affine=True)
        self.dec2 = nn.Linear(in_features=128, out_features=128)
        self.d2 = nn.BatchNorm1d(128, affine=True)
        self.dec3 = nn.Linear(in_features=128, out_features=256)
        self.d3 = nn.BatchNorm1d(256, affine=True)
        self.dec4 = nn.Linear(in_features=256, out_features=256)
        self.d4 = nn.BatchNorm1d(256, affine=True)
        self.dec5 = nn.Linear(in_features=256, out_features=512)
        self.d5 = nn.BatchNorm1d(512, affine=True)
        self.dec6 = nn.Linear(in_features=512, out_features=512)

    def forward(self, x):
        # encoder
        x = F.relu(self.enc0(x))
        x = self.e0(x)
        x = F.relu(self.enc1(x))
        x = self.e1(x)
        x = F.relu(self.enc2(x))
        x = self.e2(x)
        x = F.relu(self.enc3(x))
        x = self.e3(x)
        x = F.relu(self.enc4(x))
        x = self.e4(x)
        x = F.relu(self.enc5(x))
        x = self.e5(x)
        x = F.relu(self.enc6(x))
        x = self.e6(x)
        
        # decoder
        x = F.relu(self.dec0(x))
        x = self.d0(x)
        x = F.relu(self.dec1(x))
        x = self.d1(x)
        x = F.relu(self.dec2(x))
        x = self.d2(x)
        x = F.relu(self.dec3(x))
        x = self.d3(x)
        x = F.relu(self.dec4(x))
        x = self.d4(x)
        x = F.relu(self.dec5(x))
        x = self.d5(x)
        x = self.dec6(x) # no ReLU on the last layer

        return x

It’s very strange. I thought it would work better! What am I doing wrong?

Use maxpooling or add a skip connection. That should do the trick.

Hmm… So instead of using a fully-connected layer to go from, say, 256 to 128, I would do it using max-pooling? So keep two layers that go from 256 -> 256 and then maxpool down to 128? I will try that. Thanks for the tip!
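Just to make sure I understand, something like this? (Rough sketch; PooledEncoderBlock and the width are made-up names, not code from my model.)

import torch.nn as nn
import torch.nn.functional as F

class PooledEncoderBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.fc = nn.Linear(width, width)    # keep the width with a linear layer

    def forward(self, x):                    # x: (batch, width)
        x = F.relu(self.fc(x))
        x = x.unsqueeze(1)                   # (batch, 1, width) so max_pool1d pools over the feature axis
        x = F.max_pool1d(x, kernel_size=2)   # halve the feature dimension
        return x.squeeze(1)                  # (batch, width // 2)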

OK, so I started implementing max-pooling but ran into a problem with how to upsample back up in the decoder. I tried using MaxUnpool1d, but then realized that it needs the indices from the max-pooling steps in the encoder, which doesn’t work for my application.

You see, once the network is trained I want to use the encoder and decoder separately, so that I can do things like explore other points in the latent space (kind of like a VAE), build another network that outputs new, meaningful latent vectors (like a GAN), or build other back ends that do something with a latent vector that has been “compressed” from an input somewhere else. So I can’t transfer indices between the encoder and decoder, because I don’t plan to keep them together once the system is trained (skip connections like in U-Net don’t work either, for the same reason).
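To make it concrete, here is roughly the kind of split I have in mind once training is done (the encoder/decoder sub-modules below are hypothetical, not my actual class):

import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64),
)
decoder = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 512),
)

x = torch.randn(8, 512)  # dummy batch
z = encoder(x)           # the latent vector is ALL the decoder ever gets
x_hat = decoder(z)       # so no pooling indices or skip tensors can be passed along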

So I’m not sure what to do. Is there a way to do this WITHOUT needing additional information (other than the latent vector) to go from the encoder to the decoder? Thanks in advance for your help!

Don’t you think this is going a bit too far for plain linear layers? If you do want to continue with linear layers, you can use residual connections as in ResNets or DenseNets: try connecting blocks near the output with the deeper blocks. That helps with the vanishing-gradient problem and also increases the model complexity in a nice way.
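For example, a residual block over linear layers could look something like this (a rough sketch; ResidualLinearBlock and its width are made up, not your code):

import torch.nn as nn
import torch.nn.functional as F

class ResidualLinearBlock(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)

    def forward(self, x):
        out = F.relu(self.fc1(x))
        out = self.fc2(out)
        return F.relu(out + x)  # the skip lets gradients flow straight through the addition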

PS: never think you are too much of a n00b to ask a simple question in deep learning. Many people mindlessly add layers; this just shows you do not :slight_smile:

Thanks for the words of encouragement! :slightly_smiling_face: I will experiment a bit with residual connections, etc.