UNET trained on DAVIS_2017 dataset struggles after 10 epochs

Hello all,
I am trying to train a UNET on the DAVIS 2017 dataset.
I am running into issues because, after 10 epochs, the model is still struggling to output correct masks.
In particular, these are some of the outputs:

I am truly struggling to identify the issue. Could the problem be that, even though the dataset is composed of many classes, I am treating them as just one class, and hence expecting the model to output “1” for each target mask without considering classes? Could that be the root cause, or should I investigate further in the training loop, etc.? I am trying to train it for another 10 epochs, but I have the feeling that under-fitting is not the problem here.
I know this question may be quite vague, but I am really struggling to identify the issue.
Below is the link to my project on GitHub.
Thanks a lot!

Hi Alessandro!

10 epochs of training does not sound like a lot to me. (As always, this
will depend on the details of your use case.)

Networks for reasonably complicated image-processing tasks are often
trained for hundreds or thousands (or more!) of epochs.

I would suggest training on a training dataset and tracking the
performance on a separate validation dataset. You would certainly
want to track the loss on both datasets, and you could also look at
other performance metrics (such as accuracy).

People most commonly compute the loss (and other metrics) for both
datasets after each epoch.
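As a concrete illustration, here is a minimal sketch of that per-epoch bookkeeping. It assumes hypothetical `model`, `train_loader`, `val_loader`, `criterion`, and `optimizer` objects (none of these names come from your repository), and uses a single-channel `BCEWithLogitsLoss` target, matching the one-foreground-class setup you describe:

```python
import torch

# Sketch: run one pass over a DataLoader; pass an optimizer to train,
# omit it to only evaluate (e.g., on the validation set).
def run_epoch(model, loader, criterion, optimizer=None, device="cpu"):
    training = optimizer is not None
    model.train(training)
    total_loss = 0.0
    with torch.set_grad_enabled(training):
        for images, masks in loader:
            images, masks = images.to(device), masks.to(device)
            logits = model(images)
            loss = criterion(logits, masks)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            # weight by batch size so the mean is over samples, not batches
            total_loss += loss.item() * images.size(0)
    return total_loss / len(loader.dataset)

# for epoch in range(num_epochs):
#     train_loss = run_epoch(model, train_loader, criterion, optimizer)
#     val_loss = run_epoch(model, val_loader, criterion)  # no optimizer -> eval
#     print(f"epoch {epoch}: train {train_loss:.4f}  val {val_loss:.4f}")
```

Computing both losses with the same function keeps the two numbers directly comparable.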

As long as your validation-dataset loss is still going down, your network
is still learning features, etc., relevant to your real problem. This is true
even if your validation-dataset loss is significantly higher than your
training-dataset loss – what matters is that the validation-dataset loss
is still going down.

(If your validation-dataset loss starts going up, even though your
training-dataset loss is still going down, you have likely started to
“overfit” and further training, without changing something else about
what you are doing, won’t help – and will likely degrade – your
real-world performance.)
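That “validation loss starts going up” signal is what early stopping automates. A minimal sketch (the `patience` parameter and function name are my own, not from any particular library):

```python
def should_stop(val_losses, patience=5):
    """Stop when the validation loss has not improved for `patience` epochs.

    `val_losses` is the list of per-epoch validation losses collected so far.
    """
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    best_epoch = val_losses.index(best)  # epoch of the lowest loss so far
    # number of epochs since the best validation loss
    return len(val_losses) - 1 - best_epoch >= patience
```

You would typically also save a checkpoint of the model whenever the validation loss hits a new minimum, so you can restore the best-performing weights afterwards.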

You could argue this either way:

My intuition tells me that if all you want to do is distinguish “foreground”
from “background” (i.e., all non-background classes are collapsed into
a single foreground class), then it will be easier to train with one single
foreground class, and you will get better performance (on this simplified
problem). The idea is that this problem is easier and that your network
won’t be “wasting effort” on details that aren’t relevant (to your simplified
performance metric).
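For reference, collapsing the classes is a one-liner on the annotation. DAVIS 2017 annotations are indexed label maps where 0 is background and each object instance has its own positive id, so (assuming the mask has already been loaded as an integer array):

```python
import numpy as np

def collapse_to_foreground(label_map: np.ndarray) -> np.ndarray:
    """Map every non-zero object id to 1 (foreground); keep 0 as background."""
    return (label_map > 0).astype(np.float32)
```

The `float32` output is suitable as a target for a binary loss such as `BCEWithLogitsLoss`.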

On the other hand, if you collapse all of your non-background classes
into a single foreground class, you would be “hiding” information from
the network. Perhaps the information in differences between those classes
helps the network better learn the most important features, even for
merely distinguishing foreground from background.
So maybe the network will train better if you keep the multiple foreground
classes separate.

Well, yes, a bug could always be the cause of your issues, so it’s always
worthwhile double-checking your code. Having said that, although I can’t
rule out a bug, I don’t see obvious symptoms of a bug in the results you’ve
posted.

K. Frank