Training loss plateau: how to fix?

My training loss looks like this:

Step Training Loss
50 16.211700
100 14.648300
150 11.676700
200 8.995300
250 8.248700
300 7.802700
350 7.363100
400 7.255300
450 7.258900
500 7.198600
550 7.236600
600 7.211900
650 7.203300
700 7.225800
750 7.187400
800 7.212800
850 7.229400
900 7.201000
950 7.176700
1000 7.199600
1050 7.169000
1100 7.237900
1150 7.224100
1200 7.103900
1250 7.191000
1300 7.173000
1350 7.203800
1400 7.213800
1450 7.243000

Learning rate = 1e-5, optimizer = AdamW, batch size = 2, weight decay = 0.01.
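In code, that setup looks roughly like this (a minimal sketch assuming the Hugging Face Trainer; `model` and `train_ds` stand in for the actual model and dataset, which are not shown here):

```python
# Sketch of the configuration described above (Hugging Face Trainer assumed).
# `model` and `train_ds` are placeholders for the real model and dataset.
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    output_dir="out",
    learning_rate=1e-5,
    per_device_train_batch_size=2,
    weight_decay=0.01,   # AdamW is the Trainer's default optimizer
    logging_steps=50,    # matches the 50-step logging interval in the table above
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()
```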

@Sourabh_Yadav unfortunately there is no single solution that will fix your problem. You have to get creative and experiment with a few things:

  1. Start with lr=0.01 and then decrease it by an order of magnitude at a time to see whether a minimum is being found (see the sweep sketch after this list)
  2. batch_size=2 is a big red flag. Your model is not seeing enough data points per weight update in backpropagation. Start with a higher batch size and slowly decrease it once you have smoothed out the convergence

There are many more factors, and you can find plenty of articles and discussion threads on this topic.
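For point 1, the sweep can be as simple as the sketch below. Note that `train_once` is a hypothetical helper, not a real API: it is assumed to run a short training job at the given learning rate and return the final training loss.

```python
# LR sweep from point 1: start at 1e-2 and step down by an order of
# magnitude, comparing where the loss actually keeps decreasing.
# `train_once` is a hypothetical helper (not a library function) assumed
# to train briefly at one learning rate and return the final loss.
for lr in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6]:
    final_loss = train_once(lr=lr, steps=500)
    print(f"lr={lr:.0e}  final loss={final_loss:.4f}")
```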

I increased the batch size to 4, lowered the lr to 2e-6, and set gradient accumulation steps = 2 and epochs = 3.
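Gradient accumulation makes the effective batch 4 × 2 = 8. A plain PyTorch sketch of the mechanism (assuming `model`, `optimizer`, and a `loader` with batch_size=4 are already defined; the model is assumed to return an output with a `.loss`, as Hugging Face models do):

```python
# Gradient accumulation: step the optimizer every `accum_steps` batches,
# giving an effective batch of 4 * 2 = 8.
accum_steps = 2
optimizer.zero_grad()
for step, batch in enumerate(loader):
    loss = model(**batch).loss / accum_steps  # scale so accumulated grads average out
    loss.backward()                           # gradients accumulate across mini-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

With that setup, training looked like: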

Step Training Loss
50 16.398500
100 15.975300
150 15.603200
200 14.988700
250 14.447100
300 13.770100
350 13.009500
400 12.087300
450 11.153500
500 10.203200
550 9.561700
600 9.059600
650 8.952400
700 8.840600
750 8.856600
800 8.853000
850 8.733600
900 8.694100
950 8.581500
1000 8.483000
1050 8.385400
1100 8.373000
1150 8.319600
1200 8.273000
1250 8.174100
1300 8.151100
1350 8.147100
1400 8.004400
1450 7.921600
1500 7.688300
1550 7.429600
1600 7.525600
1650 7.478100
1700 7.479400
1750 7.489100
1800 7.443700
1850 7.376700
1900 7.426600
1950 7.405100
2000 7.407100
2050 7.402200

Should I increase it further? I'm also restricted on computation power, and it's a multimodal model (image-text-to-text).
→ Also, what would be a good loss at which to stop?
→ And will increasing the training data be helpful?

  • A good loss at which to stop would be determined by your accuracy metrics on separate test data.
    It is quite possible that, with the loss at 7, you are actually getting good predictions on your test data (see the sketch below)
  • Increasing the training data is always useful, but make sure that you are not overfitting
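Here is a sketch of that evaluation-driven stopping with the Hugging Face Trainer: instead of targeting a fixed training-loss value, evaluate on a held-out split and stop when the eval metric stops improving (`model`, `train_ds`, and `eval_ds` are assumed to be defined; argument names may differ slightly across transformers versions):

```python
# Stop based on held-out evaluation rather than a fixed training-loss target.
from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",       # "evaluation_strategy" in older transformers releases
    eval_steps=200,
    save_strategy="steps",       # must match the eval schedule for best-model loading
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,        # the separate test/validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```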