@Sourabh_Yadav unfortunately there is no single solution that will solve your problem. You have to get creative and experiment with a few things:
Start with lr = 0.01 and then decrease it gradually by factors of 10 to understand whether a minimum is being found (see the sketch at the end of this reply).
batch_size = 2 is a big red flag. Your model is not seeing enough data points per weight update in backprop. Start with a larger batch size and only reduce it once convergence has smoothed out.
There are many more factors, and you can find plenty of articles and discussion threads on this.
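To make the learning-rate suggestion concrete, here is a minimal sketch of a coarse LR sweep in plain PyTorch. The model, data, and step count below are placeholders rather than your actual setup; the point is only to compare how quickly the training loss drops for each candidate rate.

```python
import torch
import torch.nn as nn

# Placeholder model and data -- swap in your own model and DataLoader.
def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.randn(512, 32)
y = torch.randn(512, 1)
loss_fn = nn.MSELoss()

# Coarse sweep: start at 1e-2 and go down by factors of 10.
for lr in [1e-2, 1e-3, 1e-4, 1e-5]:
    model = make_model()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(200):  # short runs are enough for a first comparison
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    print(f"lr={lr:.0e}  final training loss={loss.item():.4f}")
```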
I increased the batch size to 4, with lr = 2e-6, gradient accumulation steps = 2, and epochs = 3.
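That configuration corresponds roughly to the following `TrainingArguments` (a sketch, assuming the Hugging Face `Trainer`; `output_dir` and `logging_steps=50` are illustrative, chosen to match the logging interval in the table below):

```python
from transformers import TrainingArguments

# Sketch of the configuration described above, assuming the Hugging Face Trainer.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,   # effective batch size of 8
    learning_rate=2e-6,
    num_train_epochs=3,
    logging_steps=50,                # matches the step/loss table below
)
```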
| Step | Training Loss |
|------|---------------|
| 50   | 16.398500 |
| 100  | 15.975300 |
| 150  | 15.603200 |
| 200  | 14.988700 |
| 250  | 14.447100 |
| 300  | 13.770100 |
| 350  | 13.009500 |
| 400  | 12.087300 |
| 450  | 11.153500 |
| 500  | 10.203200 |
| 550  | 9.561700 |
| 600  | 9.059600 |
| 650  | 8.952400 |
| 700  | 8.840600 |
| 750  | 8.856600 |
| 800  | 8.853000 |
| 850  | 8.733600 |
| 900  | 8.694100 |
| 950  | 8.581500 |
| 1000 | 8.483000 |
| 1050 | 8.385400 |
| 1100 | 8.373000 |
| 1150 | 8.319600 |
| 1200 | 8.273000 |
| 1250 | 8.174100 |
| 1300 | 8.151100 |
| 1350 | 8.147100 |
| 1400 | 8.004400 |
| 1450 | 7.921600 |
| 1500 | 7.688300 |
| 1550 | 7.429600 |
| 1600 | 7.525600 |
| 1650 | 7.478100 |
| 1700 | 7.479400 |
| 1750 | 7.489100 |
| 1800 | 7.443700 |
| 1850 | 7.376700 |
| 1900 | 7.426600 |
| 1950 | 7.405100 |
| 2000 | 7.407100 |
| 2050 | 7.402200 |
Should I increase it further? I am also restricted on compute, and it's a multimodal model (image-text to text).
→ Also, what would be a good loss at which to stop?
→ And would increasing the training data be helpful?
A good loss to stop at would be determined by your accuracy metrics on a separate test set.
It is quite possible that with the loss at 7 you are actually getting good predictions on your test data.
Increasing training data is always useful, but make sure that you are not overfitting.
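If you are using the Hugging Face `Trainer` (an assumption on my part, based on the step/loss log above), one way to let a held-out set decide the stopping point is an early-stopping callback keyed to the eval loss. A minimal sketch; `model`, `train_dataset`, and `eval_dataset` are placeholders assumed to exist in your script:

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

# Sketch only: model, train_dataset, and eval_dataset are assumed to be defined elsewhere.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=2e-6,
    num_train_epochs=3,
    eval_strategy="steps",           # "evaluation_strategy" in older transformers versions
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,     # keep the checkpoint with the best eval loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
```

With this, training stops once the eval loss has not improved for three consecutive evaluations, so you do not need to guess a target training-loss value in advance.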