Loss not Decreasing while training UNET

Jaskeerat · April 16, 2025, 5:26pm

I have built Unet for image segmentation from scratch and the dataset i am training it on is Cityscapes that is available on their original website. Hyper parameters are:
Batch_size = 16
Learning_rate = 1e-2
Optimizer = Adam
Loss_functions = DiceLoss + CrossEntropyLoss
Original Image and mask were of 1024 x 2048 i transformed them to 128 x 256.

But the main problem is that the model have decreasing loss for first few epochs through dataset but after 2 epochs or in the 2nd epochs it keeps on fluctuating in the range 0.8-1.0 and from that never decreases. I have use torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer , mode=‘min’ , factor=0.1, patience=2). I am thinking that the issue might be with the size of image i am currently using the transform function for image and mask are:

transform_fn = transform.Compose([
transform.Resize((128, 256)),
transform.ToTensor()
# transform.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])

target_transform_fn = transform.Compose([
transform.Resize((128, 256), interpolation=Image.NEAREST),
transform.PILToTensor()
])

It will be helpful if anyone can tell me how can i overcome this problem.

KFrank · April 21, 2025, 1:30am

Hi Jaskeerat!

Start by seeing if you can overfit. Try training on a single batch or a small number of
batches (or even on a single image). You should be able to train to the point that you
get nearly perfect (but useless) predictions. If you can’t overfit, you have some bug in
your model or training.

You haven’t said how many optimizer steps / epochs you have run. You ought to be able
to overfit in a reasonable amount of time, but that may still be many – perhaps thousands
or more – of epochs.

If you can overfit, start increasing the size of your training set. At some point you should
transition from overfitting to “real” training. But be prepared to train for a potentially large
number of epochs.

I would also recommend that you start with plain-vanilla SGD as your optimizer and with
pure CrossEntropyLoss (no Dice loss) as your loss criterion. These are simpler and easier
to reason about.

Best.

K. Frank

Jaskeerat · April 27, 2025, 4:54pm

Hi KFrank,
The optimizer i’m using is Adam on learning rate of 0.001 and right now i’m training at 1 Epoch and then analyzing the loss i will adjust learning rate manually. The trend of loss right now is :
Loss at 1th batch is 2.2865967750549316
Loss at 11th batch is 1.6623344421386719
Loss at 21th batch is 1.4776408672332764
Loss at 31th batch is 1.3324888944625854
Loss at 41th batch is 1.3063474893569946
Loss at 51th batch is 1.3136686086654663
Loss at 61th batch is 1.1328212022781372
Loss at 71th batch is 1.2527915239334106
Loss at 81th batch is 1.24680757522583
Loss at 91th batch is 1.1866313219070435
Loss at 101th batch is 1.0283044576644897
Loss at 111th batch is 1.082072138786316
Loss at 121th batch is 1.1612353324890137
Loss at 131th batch is 1.1107628345489502
Loss at 141th batch is 1.1223273277282715
Loss at 151th batch is 1.1529223918914795
Loss at 161th batch is 1.0620512962341309
Loss at 171th batch is 1.0563969612121582
Loss at 181th batch is 1.0337893962860107
Loss at 191th batch is 1.149621844291687
Loss at 201th batch is 1.0832343101501465
Loss at 211th batch is 1.0245585441589355
Loss at 221th batch is 1.0651346445083618
Loss at 231th batch is 0.8998237252235413
Loss at 241th batch is 1.0268874168395996

ajayrkumar · May 26, 2025, 7:08pm

Adding on to the debugging steps suggested by @KFrank,
I would also suggest you to re-check your training loop. This is a common issue encountered when the lr.step() is called after completion of each batch, instead of each epoch. If that was the case, before the model could learn something useful, the learning rate is reduced way too much.