Hey @ptrblck, I am fine-tuning a pretrained model on my custom dataset. It takes a pair of sentences as input and classifies the pair as 1 if the first sentence is the answer to the second sentence, and as 0 otherwise.
I am not creating mini-batches; I train on only one pair at a time (the stochastic way). The loss keeps fluctuating in the range of 0.4 to 1.1, and I have tried almost all the loss functions. Can you help with this?
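To make the setup concrete, here is a minimal sketch of training on one pair at a time. It is not my actual training code: it uses plain Python instead of PyTorch so that it stays self-contained, and the one-feature logistic model, the synthetic data, and the names `bce` and `train_one_pair_at_a_time` are all hypothetical stand-ins. It records the raw per-sample loss next to an exponential moving average of it, since a per-sample loss is noisy by construction and a smoothed curve may show more clearly whether there is any trend.

```python
import math
import random


def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and a label y in {0, 1}."""
    eps = 1e-12  # guards against log(0)
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def train_one_pair_at_a_time(data, lr=0.01, epochs=5, beta=0.98, seed=0):
    """SGD with batch size 1 on a toy logistic model (stand-in for the real model).

    Returns (raw_losses, smoothed_losses), where the smoothed values are a
    bias-corrected exponential moving average of the raw per-sample loss.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    raw, smoothed = [], []
    ema, step = 0.0, 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # new sample order every epoch
        for i in order:
            x, y = data[i]
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))  # sigmoid of the logit
            loss = bce(p, y)
            grad_logit = p - y  # derivative of BCE with respect to the logit
            w -= lr * grad_logit * x
            b -= lr * grad_logit
            step += 1
            ema = beta * ema + (1 - beta) * loss
            raw.append(loss)
            smoothed.append(ema / (1 - beta ** step))  # bias correction
    return raw, smoothed


# Synthetic, deliberately noisy data: the two classes overlap.
data_rng = random.Random(1)
toy = []
for _ in range(200):
    label = data_rng.randint(0, 1)
    feature = data_rng.gauss(1.0 if label == 1 else -1.0, 1.0)
    toy.append((feature, label))

raw, smoothed = train_one_pair_at_a_time(toy)
```

Comparing the last few hundred entries of `raw` and `smoothed` gives a rough idea of how much of the fluctuation is just per-sample noise.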
I have tried learning rates from 0.01 down to 0.0001 and have also tried learning rate decay. It seems to be stuck in some local optimum, but the number of hyperparameter changes I have already tried tells me that the problem is something else and not a local-optimum trap.
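For reference, this is roughly what I mean by learning rate decay, as a minimal sketch in plain Python. The exponential form, the number of epochs, and the helper name `exponential_decay_schedule` are assumptions for illustration, not the exact schedule I used; only the 0.01 and 0.0001 endpoints come from my experiments.

```python
def exponential_decay_schedule(lr_start=0.01, lr_end=0.0001, num_epochs=10):
    """Return one learning rate per epoch, decaying geometrically from lr_start to lr_end."""
    if num_epochs < 2:
        return [lr_start]
    # Constant per-epoch factor so the last epoch lands on lr_end.
    gamma = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return [lr_start * gamma ** epoch for epoch in range(num_epochs)]


schedule = exponential_decay_schedule()
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```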