Hi, I am finetuning RoBERTa-base on the CoLA dataset, and the settings are similar to the fairseq guidance.
I changed the weight decay from 0.1 to 1e-4; otherwise the model could not be tuned to an acceptable result.
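For context, this is roughly how the optimizer is set up. The learning rate, betas, and eps below are placeholders based on my recollection of the fairseq GLUE recipe, not values I am certain of; the weight_decay change is the only part I want to highlight:

```python
import torch
from transformers import AutoModelForSequenceClassification

# RoBERTa-base with a 2-class head for CoLA (acceptable / unacceptable)
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,            # placeholder: small LR in the spirit of the fairseq GLUE recipe
    betas=(0.9, 0.98),  # placeholder values
    eps=1e-6,           # placeholder value
    weight_decay=1e-4,  # changed from 0.1, which I could not get to converge well
)
```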
However, the training curve shows that the model overfits during the last 5 of the 10 epochs in total.
The log is shown below:
step:0 loss_train:0.5412032071794017 acc1_train:0.7495318352059925 loss_val:0.4549379467820892 acc1_val:0.8009615384615385 time_per_batch:0.31866262914536153 matthew:{'matthews_correlation': 0.5053519875589046}
step:1 loss_train:0.3795412516514404 acc1_train:0.8403558052434457 loss_val:0.4143909378048892 acc1_val:0.8355769230769231 time_per_batch:0.3227173107840149 matthew:{'matthews_correlation': 0.5968609585806229}
step:2 loss_train:0.24929316284148936 acc1_train:0.901685393258427 loss_val:0.4335532639408484 acc1_val:0.8413461538461539 time_per_batch:0.3201296655426311 matthew:{'matthews_correlation': 0.6119568895078864}
step:3 loss_train:0.19517714867877603 acc1_train:0.9250936329588015 loss_val:0.4370046470475455 acc1_val:0.8509615384615384 time_per_batch:0.3200738952400979 matthew:{'matthews_correlation': 0.6416536225426047}
step:4 loss_train:0.13300421790870373 acc1_train:0.9536516853932584 loss_val:0.6774607508492548 acc1_val:0.8173076923076923 time_per_batch:0.3177395130364636 matthew:{'matthews_correlation': 0.5480627318306406}
step:5 loss_train:0.09997520692213108 acc1_train:0.9681647940074907 loss_val:0.6605808303078028 acc1_val:0.8221153846153846 time_per_batch:0.3189116533329424 matthew:{'matthews_correlation': 0.5611387478013941}
step:6 loss_train:0.08848711905485605 acc1_train:0.9719101123595506 loss_val:0.6310205688335163 acc1_val:0.8336538461538462 time_per_batch:0.31638000073950834 matthew:{'matthews_correlation': 0.5930335375794208}
step:7 loss_train:0.04765812121168187 acc1_train:0.9836142322097379 loss_val:0.6625456401776039 acc1_val:0.8317307692307693 time_per_batch:0.33055903402607095 matthew:{'matthews_correlation': 0.5877175973073943}
You can see the best validation accuracy, 85.09, is reached at the 4th epoch (step:3 in the log). After that the validation loss rises and both the validation accuracy and the Matthews correlation drop, while the training accuracy keeps growing. Is that normal when finetuning NLP models, and is there a way to avoid it?
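For reference, here is a minimal sketch of the kind of early stopping I am considering, i.e. keeping the checkpoint with the best validation Matthews correlation and stopping once it stops improving. `train_one_epoch` and `evaluate` are placeholders for my own training/eval loop, and the patience value is just an example:

```python
import copy

best_mcc = float("-inf")
best_state = None
patience, bad_epochs = 3, 0  # stop after 3 epochs without improvement (example value)

for epoch in range(10):
    train_one_epoch(model, optimizer)                   # placeholder for my training loop
    mcc = evaluate(model)["matthews_correlation"]       # placeholder for my validation step

    if mcc > best_mcc:
        # new best epoch: remember the weights and reset the counter
        best_mcc, bad_epochs = mcc, 0
        best_state = copy.deepcopy(model.state_dict())
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break

# restore the best checkpoint instead of the overfit final weights
model.load_state_dict(best_state)
```

Would this be the usual way to handle it, or is there a better regularization-side fix (dropout, more weight decay, fewer epochs)?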