Model only returns "untrained" predictions that tend to 0 when in eval mode

Hi,

So I trained a GAN-style model to generate images and it works well (imho). In the convolution blocks I use dropout (p=0.2).

When I run
pred = generator(input)
I get what I want.

However when I put the generator in eval mode first, i.e.
generator.eval()
pred = generator(input)

pred is basically just noise, and the output values are very small (< 0.3).

I am a little confused; dropout shouldn't affect it, since .eval() just turns it off.
Other than that I only have upsampling, batch normalization, max pooling, and activation (ReLU and sigmoid) components.
(I tried removing batch normalization, same thing.)

Could it be the Upsample function?

I use it in a U-Net-like fashion to slowly bring the size back up. Should I use transposed convolutions instead?
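The kind of change I have in mind would roughly look like this (just a sketch; the channel counts are placeholders, not my actual values):

import torch.nn as nn

# current approach: fixed bilinear upsampling followed by a convolution
up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
conv = nn.Conv2d(64, 32, kernel_size=2)

# alternative I'm wondering about: a learned, strided transposed convolution
up_conv = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)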

I think the longer the model trains, the closer the predictions in eval mode get to 0.
I.e. as n -> infinity, pred -> 0, but it's absolutely fine in train mode.

Many thanks!

Most likely the batchnorm layers are causing this effect, and you could double-check it by calling .train() on only these layers after calling model.eval().
Sometimes changing the momentum of the batchnorm layers or turning the running stats off might also help.
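As a quick check, something along these lines (a minimal sketch, assuming model is your generator and it uses nn.BatchNorm2d):

import torch.nn as nn

# model is assumed to be the already created generator
model.eval()  # put everything into eval mode first

# switch only the batchnorm layers back to train mode
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        module.train()

# alternatively, create the batchnorm layers with a smaller momentum or
# without running stats (they will then use the batch statistics in eval mode as well)
bn = nn.BatchNorm2d(num_features=64, momentum=0.01, track_running_stats=False)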


Thanks a lot, I thought that too, but I removed them and am still seeing this effect.
It starts out being random between 0 and 1, but the more steps the model is trained, the closer the prediction gets to 0 in eval.

Could you post your model definition, please?


It's rather big; the main components are contracting and expanding blocks:
I have a few contractions, and then a few expansions at the end, leading to an output that is fed through a sigmoid.

import torch
import torch.nn as nn

class ContractingBlock(nn.Module):

    def __init__(self, input_channels, use_dropout=False, use_bn=True):
        super(ContractingBlock, self).__init__()
        self.conv1 = nn.Conv2d(input_channels, input_channels * 2, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(input_channels * 2, input_channels * 2, kernel_size=3, padding=1)
        self.activation = nn.LeakyReLU(0.2)
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
        if use_bn:
            self.batchnorm = nn.BatchNorm2d(input_channels * 2)
        self.use_bn = use_bn
        if use_dropout:
            self.dropout = nn.Dropout()
        self.use_dropout = use_dropout

    def forward(self, x):
        x = self.conv1(x)
        if self.use_bn:
            x = self.batchnorm(x)
        if self.use_dropout:
            x = self.dropout(x)
        x = self.activation(x)
        x = self.conv2(x)
        if self.use_bn:
            x = self.batchnorm(x)
        if self.use_dropout:
            x = self.dropout(x)
        x = self.activation(x)
        x = self.maxpool(x)
        return x

class ExpandingBlock(nn.Module):
    def __init__(self, input_channels, use_dropout=False, use_bn=True):
        super(ExpandingBlock, self).__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True)
        self.conv1 = nn.Conv2d(input_channels, input_channels // 2, kernel_size=2)
        self.conv2 = nn.Conv2d(input_channels, input_channels // 2, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(input_channels // 2, input_channels // 2, kernel_size=2, padding=1)
        if use_bn:
            self.batchnorm = nn.BatchNorm2d(input_channels // 2)
        self.use_bn = use_bn
        self.activation = nn.ReLU()
        if use_dropout:
            self.dropout = nn.Dropout()
        self.use_dropout = use_dropout

    def forward(self, x, skip_con_x):
        x = self.upsample(x)
        x = self.conv1(x)
        skip_con_x = crop(skip_con_x, x.shape)  # crop() is a custom helper defined elsewhere that crops skip_con_x to x's spatial size
        x = torch.cat([x, skip_con_x], axis=1)
        x = self.conv2(x)
        if self.use_bn:
            x = self.batchnorm(x)
        if self.use_dropout:
            x = self.dropout(x)
        x = self.activation(x)
        x = self.conv3(x)
        if self.use_bn:
            x = self.batchnorm(x)
        if self.use_dropout:
            x = self.dropout(x)
        x = self.activation(x)
        return x

So when I set use_bn=False I still get the same behaviour.

I assume you are retraining the model with use_bn=False and once it converged you are still seeing static outputs after calling model.eval()?


Yes, though if I put it back into model.train() and just run pred = model(image) it's fine.

I also tried it in Lightning and I see the same behaviour in validation_step.

Thanks a lot for reviewing this with me!

That's interesting, as it seems the dropout layers are causing this effect then, which I haven't seen before.
Does calling model.eval() and then .train() on all nn.Dropout layers also reproduce the effect?


I've actually tried turning off all dropout layers too and I still get the same behaviour.
Hence I thought it might be the upsampling.
Or there is some layering that causes model.eval() to not trickle down to the individual blocks below.

When you say .train() on all nn.Dropout layers, do you mean I should manually call it on each of the individual dropout modules? I haven't tried that. Nor did I try that on the batch normalization layers.

All I did was call model.eval() or model.train()

I'm not sure why you think this behavior would be caused by an upsampling layer, as it won't change its behavior between train and eval mode. What's your reasoning?

Yes. Call model.eval() first, then iterate model.named_modules(), calling .train() on each dropout layer, as another test.
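I.e. something like this (a sketch, again assuming model is your generator):

import torch.nn as nn

# model is assumed to be the generator
model.eval()

# switch only the dropout layers back to train mode as a test
for name, module in model.named_modules():
    if isinstance(module, nn.Dropout):
        module.train()
        print(f'{name} set back to train mode')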

ok, I’ll try that. thanks!

I was thinking that (it might be the upsampling) because I have nothing else left to test :slight_smile:, as I already tried completely without batchnorm and completely without dropout, each separately.

Just to clarify: did you also train your model completely without batchnorm and dropout layers and were still seeing a change in predictions after model.eval() was called?
If so, we should try to narrow down which other layer is changing its behavior during the validation run.


Yes, though I didn't try turning both off. I tried once without any dropout and got the same (near all zeros after .eval()) and once without any batchnorm. I haven't tried without either yet. Maybe I'll try that too :slight_smile:.

Hi, I tried it without both, and also the test of iterating over the modules and calling .train().

I still see the behaviour, but I think I may have an idea now:
do you think it is possible that this happens because I have different sets of images, and I am first dividing them by 255.0 and then normalizing them with mean = (0.5, 0.5, 0.5) and std = (0.5, 0.5, 0.5) in the Dataset?
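In case it matters, the preprocessing is roughly equivalent to this (a simplified sketch using torchvision transforms for illustration; my actual Dataset does these steps itself):

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.ToTensor(),                      # scales uint8 images to [0, 1], i.e. divides by 255
    transforms.Normalize(mean=(0.5, 0.5, 0.5),
                         std=(0.5, 0.5, 0.5)),  # maps [0, 1] to [-1, 1]
])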

I very much appreciate the help, I apologize for not having been able to try sooner.

So you removed all batchnorm and dropout layers, and are still seeing different outputs for model.train() and model.eval()?
If so, this would indicate another layer’s execution depends on the batch size, which we haven’t narrowed down yet.

Could you describe which image sets are used? I assumed you are using the same input data to compare the output of the model after calling .train() and .eval().


Thank you very much.

I am sorry, I don't understand this part: how does a layer other than batchnorm depend on the batch size?
It is possible that, because I don't know this, I am missing something obvious.

For the different sets:
I mix them all together and split them into train and evaluation, though they come from different datasets.
It's a style transfer project (GAN).
So they are the same input data, but not exactly, since I split them at the beginning.
Do you think it is possible that it could overfit? The dataset is quite large, about 30k images.

I don't know which layers you are using, but since you have removed all batchnorm and dropout layers and are still seeing differences between train and eval mode, I guess other (maybe custom) layers are using the internal self.training flag to change their behavior. You would have to narrow down which layer is creating the mismatches, e.g. by using forward hooks, which would allow you to compare each intermediate activation separately.
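A rough sketch of that hook-based comparison (model and x are placeholders for your generator and a fixed input batch):

import torch

activations = {}

def save_activation(name):
    def hook(module, inp, out):
        activations[name] = out.detach().clone()
    return hook

# register a forward hook on every leaf module
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        module.register_forward_hook(save_activation(name))

with torch.no_grad():
    model.train()
    _ = model(x)
    train_acts = dict(activations)

    model.eval()
    _ = model(x)
    eval_acts = dict(activations)

# compare each intermediate activation between the two modes
for name in train_acts:
    max_diff = (train_acts[name] - eval_acts[name]).abs().max().item()
    print(f'{name}: max abs diff {max_diff}')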

I'm not concerned about overfitting and the general training at the moment, but more interested in the unexpected difference in the output of your model, given that a) exactly the same input data is used, b) no batchnorm or dropout layers are used, and c) the difference is apparently caused by switching between train and eval mode.


Thank you very much.

Aside from the ones above, I only use a sigmoid activation, and only through the functional API in the forward pass.

After your last comment I was wondering if it is possible that it overfits, i.e. learns the training data to the point where it memorizes it and just gives blanks for all other inputs. That would explain how the output slowly goes from around 0.3/0.4 to 0.01 the longer it trains.

I will try one more thing: I'll completely remove the batchnorms and dropouts, instead of only setting the flags to False, and see how that goes.

Thanks for your patient help!

Hmm, ok, so an update:

I deleted the batchnorm and dropout lines completely and retrained, and I see similar behaviour, but not quite the same, so I think the issue is likely there.
The new behaviour is that the output is near zero to start with, but in both eval and train mode. Then it goes to 0 completely, i.e. it only outputs images with mean == 0.

This behaviour is so strange that I am now quite certain I've introduced a bug somewhere.

So, other than being a bit slower and nondeterministic (dropout), is there any other downside to just predicting and using the model only in .train() mode? It works quite well there.

I assumed you already did this.

I’m not aware of more drawbacks than the ones you already mentioned:

  • predictions for exactly the same input sample will have randomness
  • you are wasting compute
  • dropout often creates worse-performing models, as the effective model capacity is smaller.