BERT Pre-training

Agatha · July 2, 2021, 2:08am

Hi,
In the BERT paper, the authors say:

To speed up pre-training in our experiments, we pre-train the model with sequence length of 128 for 90% (900k) of the steps. Then, we train the rest 10% (100k) of the steps of sequence of 512 to learn the positional embeddings.

Does this mean the model was first trained with a sequence length of 128, and the second phase (with a sequence length of 512) then continued from the checkpoint of the first?
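
To make the question concrete, here is a minimal sketch of the reading I have in mind. It is not the authors' code: `seq_length_for_step` and `run_phase` are hypothetical names, and `run_phase` is only a stand-in for a real training loop. The step counts come from the quoted passage. I am also assuming the position-embedding table is sized for 512 positions from the start, so the phase-1 checkpoint can be loaded unchanged in phase 2.

```python
SHORT_STEPS = 900_000   # 90% of the steps, from the quoted passage
LONG_STEPS = 100_000    # remaining 10% of the steps
SHORT_LEN = 128
LONG_LEN = 512
MAX_POSITIONS = 512     # assumption: embedding table sized for 512 from the start


def seq_length_for_step(step):
    """Sequence length used at a 0-based global training step."""
    return SHORT_LEN if step < SHORT_STEPS else LONG_LEN


def run_phase(state, num_steps, seq_len):
    """Hypothetical stand-in for a training loop.

    It only records what a real loop would track: the global step count
    and which sequence lengths the (shared) weights were trained on.
    """
    if seq_len > state["max_positions"]:
        raise ValueError("sequence longer than the position-embedding table")
    return {
        "global_step": state["global_step"] + num_steps,
        "max_positions": state["max_positions"],
        "seq_lens_seen": state["seq_lens_seen"] + [seq_len],
    }


# Phase 1: start from scratch with short sequences.
initial = {"global_step": 0, "max_positions": MAX_POSITIONS, "seq_lens_seen": []}
checkpoint = run_phase(initial, SHORT_STEPS, SHORT_LEN)

# Phase 2: resume from the phase-1 checkpoint with long sequences.
final = run_phase(checkpoint, LONG_STEPS, LONG_LEN)
```

Under this reading, phase 2 is a continuation of the same run rather than a fresh start, so only positions 128 to 511 of the embedding table are seeing gradient updates for the first time.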