Unexpected results with pre-trained model - discussion

Hi all :slight_smile:

My project is grammar correction based on a multi-head attention model. I have two datasets: the first is synthetic data (about 3 GB) and the second is the original data (about 200 MB). The training strategy is as follows:

  1. Load the pre-training (synthetic) data, train the model, and save the best checkpoint.

  2. Change the learning rate and the number of epochs, load the pre-trained model, and continue training on the original dataset (see the sketch after this list).
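
To make the setup concrete, here is a minimal sketch of the two-stage strategy, assuming PyTorch. The model, data loaders, learning rates, and file names below are placeholders for illustration, not my actual project code:

```python
import torch
import torch.nn as nn

def make_model():
    # Stand-in for the multi-head attention grammar-correction model.
    layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=2)

def train(model, loader, lr, epochs):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for src, tgt in loader:
            optimizer.zero_grad()
            loss = criterion(model(src), tgt)
            loss.backward()
            optimizer.step()

# Dummy batches standing in for the 3 GB synthetic and 200 MB original datasets.
dummy_loader = [(torch.randn(8, 16, 128), torch.randn(8, 16, 128)) for _ in range(4)]

# Stage 1: pre-train on the synthetic data and save the best checkpoint.
model = make_model()
train(model, dummy_loader, lr=1e-4, epochs=3)
torch.save(model.state_dict(), "pretrained_best.pt")

# Stage 2: reload the checkpoint, lower the learning rate, continue on the original data.
model = make_model()
model.load_state_dict(torch.load("pretrained_best.pt"))
train(model, dummy_loader, lr=1e-5, epochs=2)
```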

I have a confusing problem with the results: when I fine-tune from the pre-trained model, I get unexpectedly lower scores compared with the same model trained from scratch without pre-training. I have no explanation for this, especially considering how different the two datasets are in size.

Any suggestions on how to overcome this issue?

Kind regards,
Aiman Solyman