Does an exploding loss mean overfitting?

Christian_Ses · June 15, 2019, 3:28pm

For semantic segmentation, I trained a network with a learning rate of 5*10**-4, a mini-batch size of 2, and a momentum of 0.8 on images of size 256x384. The only custom transformation applied is a center crop. The loss after 20 epochs looks like the following:

Question:

Why is there a peak in the loss function when no custom transformation is applied? Is the network too small?

LeviViana · June 15, 2019, 7:57pm

It doesn’t look like overfitting to me. It’s hard to tell without more information. Does it happen every time you train the model?

Christian_Ses · June 19, 2019, 8:56am

Yes, it happens every time I train the model.

When the mini-batch size is greater than 2, it happens even more.

LeviViana · June 19, 2019, 9:08am

Indeed, it is quite odd. My guess is that it will require some deep debugging. BTW, did you check that the training and validation datasets have the same distribution?

Otherwise, I’d pick two consecutive checkpoints: a good one and a bad one. Then I’d pick one sample where both give good predictions and one sample where only the good checkpoint does. Then I’d look at the last activations and compare them. It may take a lot of time, and the solution may not be straightforward.
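
One way the checkpoint comparison described above could be set up is with a forward hook on the last layer. This is only a sketch: the tiny `make_model` network and the random input are hypothetical stand-ins for the real segmentation network and a real sample, and in practice each copy would be loaded from its own checkpoint with `load_state_dict`.

```python
import torch
import torch.nn as nn

def capture_activation(model, layer, x):
    """Run x through model in eval mode and return the output of `layer`."""
    captured = {}

    def hook(module, inputs, output):
        captured["out"] = output.detach()

    handle = layer.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        model(x)
    handle.remove()  # remove the hook so it does not fire on later forward passes
    return captured["out"]

# Hypothetical stand-in for the real network: 3 output classes.
def make_model():
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(8, 3, kernel_size=1),
    )

torch.manual_seed(0)
good, bad = make_model(), make_model()  # would be the "good" and "bad" checkpoints
x = torch.randn(1, 3, 32, 48)           # would be one validation sample

act_good = capture_activation(good, good[-1], x)
act_bad = capture_activation(bad, bad[-1], x)
diff = (act_good - act_bad).abs()
print("mean abs difference per class channel:", diff.mean(dim=(0, 2, 3)))
```

Hooking an earlier layer instead of `good[-1]` would show how far back in the network the two checkpoints start to disagree.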

Oli · June 19, 2019, 9:22am

What kind of network are you using? Does the network behave differently in train() and eval() mode? How big is your validation dataset?

barthelemymp · June 19, 2019, 10:06am

Looks like the loss evolution of an Elman network… What is your model?
Back then, the problem was solved by gates (GRU).

Christian_Ses · June 20, 2019, 8:18am

I'm using a semantic segmentation network that also incorporates the motion state to differentiate between moving and static cars.

I used the Cityscapes motion dataset [deepmotion], which provides 2975 training images and 500 validation images. Because I train only on small images, I split each image into 3 overlapping frames to get 3 times as many training images.

Yes, the network behaves totally differently in train() and eval() mode. Is there a reason for it? I used some dropout layers.
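
For readers who want to reproduce the splitting into overlapping frames, the crop offsets could be computed as in the sketch below. The image and crop widths are assumed numbers for illustration only; the post does not say which sizes were actually used.

```python
def overlapping_starts(full_width, crop_width, n_crops):
    """Return the left x-offsets of n_crops evenly spaced crops covering full_width."""
    if n_crops < 2 or crop_width > full_width:
        raise ValueError("need at least 2 crops that fit inside the image")
    step = (full_width - crop_width) / (n_crops - 1)
    return [round(i * step) for i in range(n_crops)]

# Assumed sizes: a 768-pixel-wide image cut into three 384-pixel-wide crops.
starts = overlapping_starts(768, 384, 3)
windows = [(s, s + 384) for s in starts]
print(windows)  # [(0, 384), (192, 576), (384, 768)]
```

Crops taken from the same image share pixels, so they are strongly correlated and should all stay in the same split (training or validation).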

Christian_Ses · June 20, 2019, 8:37am

The model looks like the following:

I precompute the optical flow with a FlowGenerationNetwork (PWCNet) and the segmentation features with DeepLab, then feed both into the final network, whose output is used to calculate the Lovasz loss.

…

Christian_Ses · June 20, 2019, 8:46am

I also swapped some training images with validation images and trained again. It gave me more or less the same results.

That's a good idea. I have already plotted some results with high loss. Usually the prediction is biased towards the biggest class, which is the background. (I have just 3 classes: moving car, static car, and background; see the illustration from [DeepMotion].)

Oli · June 20, 2019, 9:29am

OK, 500 images is a large enough validation set that it shouldn’t produce such big fluctuations. People sometimes leave the model in train() mode during evaluation with a batch size of 1, and the batch norm layers aren’t happy with that. I don’t think I can offer any more insightful help, sorry. Good luck!

LeviViana · June 20, 2019, 9:39am

I imagine you are using cross-entropy loss somewhere. You could try to balance the class importance in the loss by setting different class weights. Maybe this thread could help a bit.
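
As a rough illustration of the class-weight suggestion, the sketch below derives inverse-frequency weights from pixel counts and passes them to `nn.CrossEntropyLoss` through its `weight` argument. The label layout and the class indices (0 = background, 1 = static car, 2 = moving car) are assumptions made up for the example, and inverse frequency is only one common heuristic.

```python
import torch
import torch.nn as nn

# Hypothetical label maps with 3 classes: 0 = background, 1 = static car, 2 = moving car.
labels = torch.zeros(4, 32, 48, dtype=torch.long)
labels[:, 10:14, 5:15] = 1
labels[:, 20:22, 30:36] = 2

num_classes = 3
counts = torch.bincount(labels.flatten(), minlength=num_classes).float()
counts = counts.clamp(min=1)            # avoid division by zero for absent classes
freq = counts / counts.sum()

# Inverse-frequency weights, normalized so that they average to 1.
weights = 1.0 / freq
weights = weights / weights.mean()
criterion = nn.CrossEntropyLoss(weight=weights)

torch.manual_seed(0)
logits = torch.randn(4, num_classes, 32, 48)  # stand-in for the network output
loss = criterion(logits, labels)
print(weights, loss.item())
```

With very rare classes the raw inverse frequency can become large, so a softer variant such as the square root of the inverse frequency may be worth trying.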

Christian_Ses · June 20, 2019, 11:24am

When I was using the cross-entropy loss, it fluctuated even more. This is why I'm using the Lovasz loss, which is based on the IoU (L = 1 - IoU_c). [Lovasz Softmax Paper] [Lovasz Overview Slide]

Should I also balance the classes for the Lovasz loss function?
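
For reference, the per-class quantity behind "L = 1 - IoUc" can be written out as follows, using the Jaccard index as defined in the Lovasz-Softmax paper, where y* denotes the ground-truth labels, ỹ the predicted labels, and c a class:

```latex
J_c(y^*, \tilde{y}) = \frac{\left|\{y^* = c\} \cap \{\tilde{y} = c\}\right|}
                           {\left|\{y^* = c\} \cup \{\tilde{y} = c\}\right|},
\qquad
\Delta_{J_c}(y^*, \tilde{y}) = 1 - J_c(y^*, \tilde{y})
```

The Lovasz-Softmax loss is a differentiable surrogate for this discrete loss, averaged over classes.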

Christian_Ses · June 20, 2019, 11:29am

In training mode I use a batch size of 12, and in evaluation mode I use a batch size of 1. Is this a problem? I always calculate the mean over the whole validation epoch for the per-epoch plot, though.

Oli · June 20, 2019, 11:42am

No, it shouldn’t be a problem as long as you call model.train() for training and model.eval() for evaluation.
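
A minimal sketch of a validation pass that follows this advice is shown below. The small network, the random batches, and the cross-entropy criterion are stand-ins, not the poster's model or Lovasz loss. The loop switches to eval mode, disables gradients, and averages the loss over every sample before switching back.

```python
import torch
import torch.nn as nn

def validate(model, batches, criterion):
    """Return the mean per-sample loss over all validation batches."""
    model.eval()                      # dropout off, batch norm uses running statistics
    total, n = 0.0, 0
    with torch.no_grad():
        for x, y in batches:
            loss = criterion(model(x), y)
            total += loss.item() * x.size(0)   # weight by batch size
            n += x.size(0)
    model.train()                     # switch back before the next training epoch
    return total / n

torch.manual_seed(0)
# Stand-in network containing both batch norm and dropout.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.Dropout2d(0.5), nn.Conv2d(8, 3, 1),
)
# Five validation "images" with batch size 1, as in the setup described above.
batches = [(torch.randn(1, 3, 16, 24), torch.randint(0, 3, (1, 16, 24))) for _ in range(5)]
val_loss = validate(model, batches, nn.CrossEntropyLoss())
print(val_loss)
```

In eval mode the batch norm layers neither use nor update batch statistics, so a batch size of 1 is fine and repeated calls give the same value.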

Christian_Ses · June 22, 2019, 3:17pm

One more question: I want to further train the pretrained model on an even smaller dataset, with 700 images in the training set and 500 images in the validation set. Is it better to choose a much smaller learning rate than in the first training?

I guess the weight decay should also be much higher.

jmlb · June 22, 2019, 5:38pm

Are you using any data augmentation? If so, turn off all augmentation schemes and train exclusively on natural images. The goal is to make sure there is nothing wrong with the data augmentation script.
Also, visualize the images in your batch (input and output) to make sure that the images fed to the model are consistent with what you expect.
I have experienced similar “noisy” validation loss, and it originated either from a bug in the data augmentation, or in the batch generator.
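
Alongside visual inspection, a few automated checks on each batch can catch generator bugs early. The sketch below is one possible set of checks; the function name `check_batch`, the tensor layout, and the 3-class assumption are illustrative choices, not part of anyone's actual pipeline.

```python
import torch

def check_batch(images, masks, num_classes=3):
    """Raise AssertionError if a batch looks inconsistent; return per-class pixel counts."""
    assert images.dim() == 4 and masks.dim() == 3, "expected NCHW images and NHW masks"
    assert images.shape[0] == masks.shape[0], "batch sizes differ"
    assert images.shape[-2:] == masks.shape[-2:], "image and mask sizes differ"
    assert torch.isfinite(images).all(), "images contain NaN or inf"
    assert masks.min() >= 0 and masks.max() < num_classes, "unexpected label value"
    return torch.bincount(masks.flatten(), minlength=num_classes)

torch.manual_seed(0)
images = torch.randn(2, 3, 16, 24)            # stand-in for a real batch
masks = torch.randint(0, 3, (2, 16, 24))
print(check_batch(images, masks))             # pixel count per class in this batch
```

Logging the returned counts per batch also shows whether some batches contain almost only background.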

Christian_Ses · June 25, 2019, 7:42am

Thanks for the information. I already ran a training without data augmentation.

But it's not looking much better.

I used SGD and CosineWithRestarts. Is it better to use a different value for t_max or a different scheduler?


optimizer = optim.SGD(params, lr=0.001, momentum=0.95, weight_decay=0.001)
scheduler = CosineWithRestarts(optimizer, t_max=30)
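
One thing that may be worth checking is whether the loss peaks line up with the restarts of the schedule. The sketch below traces the learning rate per epoch using PyTorch's built-in `CosineAnnealingWarmRestarts` as a stand-in; the `CosineWithRestarts` class in the post comes from elsewhere and may behave differently. The dummy parameter and the 65-epoch range are assumptions for the example.

```python
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

params = [torch.nn.Parameter(torch.zeros(1))]   # dummy parameter for illustration
optimizer = optim.SGD(params, lr=0.001, momentum=0.95, weight_decay=0.001)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=30)

lrs = []
for epoch in range(65):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()      # a real loop would compute a loss and call backward() first
    scheduler.step()

# Epochs at which the learning rate jumps back up, i.e. the restarts.
restarts = [e for e in range(1, len(lrs)) if lrs[e] > lrs[e - 1]]
print(restarts)
```

If the peaks in the loss curve fall on the restart epochs, a plain cosine or step schedule without restarts would be a simple comparison to run.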

henrique · July 2, 2019, 4:59pm

Sorry for the stupid question, and maybe you have already solved it, but are you really sure you are averaging the validation loss over all 500 images every epoch?
Your training loss should be way noisier than a mean over all validation images.

Unless there is a problem in your Lovasz loss…
Have you tried optimizing a smoothed IoU or Dice loss instead?
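
For anyone wanting to try the Dice alternative, a soft Dice loss could look roughly like the sketch below. This is one common formulation, not a reference implementation; the smoothing constant `eps` and the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1.0):
    """Soft Dice loss averaged over classes.

    logits: (N, C, H, W) raw scores, target: (N, H, W) int64 class indices.
    """
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                   # sum over batch and pixels
    intersection = (probs * onehot).sum(dims)
    cardinality = probs.sum(dims) + onehot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()

torch.manual_seed(0)
target = torch.randint(0, 3, (2, 16, 24))
logits = torch.randn(2, 3, 16, 24, requires_grad=True)  # stand-in for the network output
loss = soft_dice_loss(logits, target)
loss.backward()
print(loss.item())
```

Because the sums run over the whole batch, the value depends on batch composition, which is one reason such losses can look noisy with very small batches.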