Hi, I am finetuning RoBERTa-base on the CoLA dataset, and the settings are similar to the fairseq guidance.
I changed the weight decay from 0.1 to 1e-4; otherwise the model could not be tuned to an acceptable result.
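For context, this is roughly how the optimizer is set up. The learning rate, betas, and eps below are placeholders based on my recollection of the fairseq GLUE recipe, not values I am certain of; the weight_decay change is the only part I want to highlight:

```python
import torch
from transformers import AutoModelForSequenceClassification

# RoBERTa-base with a 2-class head for CoLA (acceptable / unacceptable)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # placeholder: small LR in the spirit of the fairseq GLUE recipe
    betas=(0.9, 0.98),  # placeholder values
    eps=1e-6,           # placeholder value
    weight_decay=1e-4,  # changed from 0.1, which I could not get to converge well
)
```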
However, the training curve shows that the model overfits during the last 5 of the 10 epochs in total.
The log is shown below:
step:0 loss_train:0.5412032071794017 acc1_train:0.7495318352059925 loss_val:0.4549379467820892 acc1_val:0.8009615384615385 time_per_batch:0.31866262914536153 matthew:{'matthews_correlation': 0.5053519875589046}
step:1 loss_train:0.3795412516514404 acc1_train:0.8403558052434457 loss_val:0.4143909378048892 acc1_val:0.8355769230769231 time_per_batch:0.3227173107840149 matthew:{'matthews_correlation': 0.5968609585806229}
step:2 loss_train:0.24929316284148936 acc1_train:0.901685393258427 loss_val:0.4335532639408484 acc1_val:0.8413461538461539 time_per_batch:0.3201296655426311 matthew:{'matthews_correlation': 0.6119568895078864}
step:3 loss_train:0.19517714867877603 acc1_train:0.9250936329588015 loss_val:0.4370046470475455 acc1_val:0.8509615384615384 time_per_batch:0.3200738952400979 matthew:{'matthews_correlation': 0.6416536225426047}
step:4 loss_train:0.13300421790870373 acc1_train:0.9536516853932584 loss_val:0.6774607508492548 acc1_val:0.8173076923076923 time_per_batch:0.3177395130364636 matthew:{'matthews_correlation': 0.5480627318306406}
step:5 loss_train:0.09997520692213108 acc1_train:0.9681647940074907 loss_val:0.6605808303078028 acc1_val:0.8221153846153846 time_per_batch:0.3189116533329424 matthew:{'matthews_correlation': 0.5611387478013941}
step:6 loss_train:0.08848711905485605 acc1_train:0.9719101123595506 loss_val:0.6310205688335163 acc1_val:0.8336538461538462 time_per_batch:0.31638000073950834 matthew:{'matthews_correlation': 0.5930335375794208}
step:7 loss_train:0.04765812121168187 acc1_train:0.9836142322097379 loss_val:0.6625456401776039 acc1_val:0.8317307692307693 time_per_batch:0.33055903402607095 matthew:{'matthews_correlation': 0.5877175973073943}
You can see the best validation accuracy, 85.09, is reached at the 4th epoch (step:3 in the log). After that the validation loss rises and both the validation accuracy and the Matthews correlation drop, while the training accuracy keeps growing. Is that normal when finetuning NLP models, and is there a way to avoid it?
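For reference, here is a minimal sketch of the kind of early stopping I am considering, i.e. keeping the checkpoint with the best validation Matthews correlation and stopping once it stops improving. `train_one_epoch` and `evaluate` are placeholders for my own training/eval loop, and the patience value is just an example:

```python
import copy

best_mcc = float("-inf")
best_state = None
patience, bad_epochs = 3, 0  # stop after 3 epochs without improvement (example value)

for epoch in range(10):
    train_one_epoch(model, optimizer)                   # placeholder for my training loop
    mcc = evaluate(model)["matthews_correlation"]       # placeholder for my validation step

    if mcc > best_mcc:
        # new best epoch: remember the weights and reset the counter
        best_mcc, bad_epochs = mcc, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

# restore the best checkpoint instead of the overfit final weights
model.load_state_dict(best_state)
```

Would this be the usual way to handle it, or is there a better regularization-side fix (dropout, more weight decay, fewer epochs)?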