PyTorch nn.Module: Surprising model improvement by changing init

Kath_Pra · June 22, 2022, 4:54pm

Hello everyone,
I am currently running experiments with a model, trying to improve its performance. I noticed that the model's accuracy increases when I add one new layer (convf) in the __init__ function of my class. This happens even though I never use this layer in my forward function, so the data should never come into contact with it.

Can someone explain to me why this is the case?

Many thanks in advance,
Katharina

My code:

class four_unit_tcn(nn.Module):

    def __init__(self, in_channels, out_channels, kernel_size=5, stride=1):
        super(four_unit_tcn, self).__init__()
        pad = int((kernel_size - 1) / 2)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=(kernel_size, 1),
                              padding=(pad, 0), stride=(stride, 1))
        # The added layer: defined here but never used in forward().
        self.convf = nn.Conv2d(in_channels, out_channels, kernel_size=1, padding=0, stride=(stride, 1))
        self.bn = nn.BatchNorm2d(out_channels)
        self.bnf = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        conv_init(self.conv)
        bn_init(self.bn, 1)

    def forward(self, x):
        fourier_input = torch.clone(x)
        fourier_2 = torch.fft.fft(fourier_input, dim=2)
        fourier_2 = fourier_2.abs()
        fourier_2 = self.bnf(fourier_input)
        x = self.bn(self.conv(x))
        x = torch.cat([x, fourier_2], dim=2)
        return x

Andrei_Cristea · June 22, 2022, 5:02pm

Can you show some data on performance with and without self.convf in __init__? Maybe train 10 times with and without for a set amount of epochs and record the distribution of validation losses in both cases so we can compare.

It would be very surprising for this to make a difference, and I’m guessing what you saw was just a coincidence due to the randomness inherent in training.

Kath_Pra · June 22, 2022, 6:19pm

@Andrei_Cristea thank you for your quick response! It should not affect the model, I agree with you!

def conv_init(conv):
    if conv.weight is not None:
        nn.init.kaiming_normal_(conv.weight, mode='fan_out')
    if conv.bias is not None:
        nn.init.constant_(conv.bias, 0)

I am initializing the model weights to reduce the amount of randomness during training. Below are the results after 10 epochs, which I got repeatedly. I am training for 65 epochs in total, where the model including convf achieves an accuracy of 0.841, compared to 0.837 for the model without convf.

Without the convf layer, the results of epoch 10 are:
Training epoch: 10
Mean training loss: 0.8802. Mean training acc: 73.49%.
Time consumption: [Data]04%, [Network]95%
Eval epoch: 10
Mean test loss of 796 batches: 1.0756102995806602.
Top1: 68.88%
Top5: 91.26%

With the convf layer in __init__ but not in forward, the results of epoch 10 are:
Training epoch: 10
Mean training loss: 0.8689. Mean training acc: 74.03%.
Time consumption: [Data]02%, [Network]96%
Eval epoch: 10
Mean test loss of 796 batches: 1.1364069590017425.
Top1: 67.30%
Top5: 91.09%

This is (part of) my environment:

Andrei_Cristea · June 22, 2022, 6:37pm

Could you repeat this experiment a number of times (in a loop, storing just the final losses and accuracies for comparison)? I don’t know your exact data, but it seems plausible to me that the performance varies randomly from run to run by an amount on the order of this difference. In other words, can we rule out that this difference is just random, i.e. statistically insignificant?
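One way to make that comparison concrete is sketched below, using only the Python standard library. It assumes the final accuracies of several runs per variant have already been collected with a different seed for each run (with one fixed seed every run is identical and the spread is zero). The function names `summarize` and `welch_t` and the numbers in the usage part are illustrative placeholders, not results from this thread.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(name, values):
    """Return a one-line summary (mean, sample std, count) of one variant's runs."""
    return f"{name}: mean={mean(values):.4f}, std={stdev(values):.4f}, n={len(values)}"


def welch_t(a, b):
    """Welch's t-statistic for two independent samples with unequal variances.

    Needs at least two values per sample and some spread in the data;
    if both samples have zero variance the denominator is zero.
    """
    var_a = stdev(a) ** 2
    var_b = stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(var_a / len(a) + var_b / len(b))


if __name__ == "__main__":
    # Made-up placeholder accuracies, only to show how the helpers are called.
    with_convf = [0.84, 0.83, 0.85, 0.84, 0.82]
    without_convf = [0.83, 0.84, 0.84, 0.82, 0.85]
    print(summarize("with convf", with_convf))
    print(summarize("without convf", without_convf))
    print(f"Welch t = {welch_t(with_convf, without_convf):.3f}")
```

A t-statistic close to zero means the gap between the two means is small compared to the run-to-run spread, which is what one would expect if the extra layer has no real effect.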

Also, it looks to me like the training loss is better with the convf layer, but the test loss (and test accuracy) is worse:

training loss goes from 0.8802 to 0.8689, so it improves
test loss goes from 1.0756 to 1.1364, so it worsens

Kath_Pra · June 23, 2022, 8:34am

I repeated the experiments multiple times, and each model produced exactly the same results because of the fixed seed during random initialization.

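My thesis supervisor helped me understand the matter: even though the seed is fixed, the initialization of the parameters differs when you increase the number of parameters. Every layer draws its initial weights from the same random number generator when it is constructed, so an extra layer changes the random values that any later initialization receives. This explains the difference in the training behaviour and final result.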
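The effect can be reproduced in isolation with a minimal sketch like the one below. It is not the model from this thread: the helper `first_conv_weights` and the channel sizes are made up for illustration. It only mirrors the order of operations in `__init__`, where `self.conv` is created, then the unused layer, and only afterwards `kaiming_normal_` is applied to `self.conv`.

```python
import torch
import torch.nn as nn


def first_conv_weights(add_unused_layer: bool) -> torch.Tensor:
    """Build a conv layer under a fixed seed and return its re-initialized weights."""
    torch.manual_seed(0)  # same seed in every call
    conv = nn.Conv2d(3, 8, kernel_size=(5, 1), padding=(2, 0))
    if add_unused_layer:
        # Constructing a layer runs its default initialization, which
        # draws random numbers and advances the global generator.
        unused = nn.Conv2d(3, 8, kernel_size=1)
    # Same call as in conv_init; it now starts from a different generator
    # state when the unused layer was constructed before it.
    nn.init.kaiming_normal_(conv.weight, mode="fan_out")
    return conv.weight.detach().clone()


# Same seed, same construction order: identical weights.
same = torch.equal(first_conv_weights(False), first_conv_weights(False))
# Same seed, one extra unused layer: different weights.
differs = not torch.equal(first_conv_weights(False), first_conv_weights(True))
print(same, differs)
```

So the two variants start training from different initial weights, and the gap between them reflects sensitivity to initialization rather than any contribution from convf itself.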

Thank you a lot for your support!