The BERT paper says:
"To speed up pre-training in our experiments, we pre-train the model with sequence length of 128 for 90% (900k) of the steps. Then, we train the rest 10% (100k) of the steps of sequence of 512 to learn the positional embeddings."
Does this mean that the model was first trained with a sequence length of 128, and that the second phase (with a sequence length of 512) then continued from the checkpoint produced by the first phase?
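
To make the question concrete, here is a toy sketch of the interpretation I have in mind. It is not BERT's actual training code: `run_phase`, `state`, and `MAX_POSITIONS` are hypothetical names, and the sketch assumes the position-embedding table is sized for 512 positions from the start, so that the first phase only ever touches the first 128 of them.

```python
# Toy sketch of the two-phase schedule; no real model is trained here.
# All names (run_phase, state, MAX_POSITIONS) are hypothetical.

MAX_POSITIONS = 512  # assumed size of the position-embedding table


def run_phase(state, num_steps, seq_length):
    """Pretend to train for num_steps on inputs of seq_length tokens."""
    assert seq_length <= MAX_POSITIONS
    state = dict(state)  # copy, standing in for "load the previous checkpoint"
    state["global_step"] += num_steps
    # Only positions 0..seq_length-1 can receive updates in this phase.
    state["positions_trained"] = max(state["positions_trained"], seq_length)
    return state


init = {"global_step": 0, "positions_trained": 0}

# Phase 1: 90% of the steps (900k) at sequence length 128.
phase1 = run_phase(init, num_steps=900_000, seq_length=128)

# Phase 2: the remaining 10% (100k) at sequence length 512,
# starting from the phase-1 state rather than from scratch.
phase2 = run_phase(phase1, num_steps=100_000, seq_length=512)

print(phase1)
print(phase2)
```

In this reading, the second call reuses everything learned in the first, and the only parameters that are new to training in phase 2 are the position embeddings for positions 128 to 511.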