I have recently completed the first round of pre-training with HuBERT (the torchaudio example), using the provided learning sample, and have finished preprocessing for the second iteration. I have a question about the pre-training for this second iteration.
In the pre-training phase of HuBERT's second iteration, it seems that the checkpoint from the first iteration is not used. Does this mean the second iteration trains from scratch, without initializing from the first iteration's checkpoint?
From the HuBERT paper:
We train the BASE model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250k steps, while the second iteration is trained for 400k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. Training for 100k steps takes about 9.5 hours.
Not sure if this is still relevant, or just too late. From my (admittedly shallow) understanding, the checkpoint cannot be reused here if you follow the paper, because the number of clusters increases from 100 to 500 after the first iteration. The paper also says the second iteration uses labels generated by clustering the 6th transformer layer output of the first-iteration model, rather than the MFCC-based labels used in the first iteration. So my reading is that the second iteration uses the first iteration's label output but not its checkpoint. (Yes, training from scratch in the second iteration…)
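To make the cluster-count point concrete, here is a minimal sketch (not the actual torchaudio model; HuBERT's real prediction head also involves label embeddings) showing why a head trained against 100 clusters cannot be loaded strictly into a 500-cluster model: the final projection's weight shape is tied to the cluster count.

```python
import torch.nn as nn

HIDDEN = 768  # BASE model hidden size

# Final projection from hidden states to cluster logits, per iteration.
head_iter1 = nn.Linear(HIDDEN, 100)  # iteration 1: 100 MFCC k-means clusters
head_iter2 = nn.Linear(HIDDEN, 500)  # iteration 2: 500 clusters

try:
    # Strict loading fails: weight shapes (100, 768) vs (500, 768) differ.
    head_iter2.load_state_dict(head_iter1.state_dict())
except RuntimeError as e:
    print("size mismatch" in str(e))  # → True
```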
Recently came across this. It seems to be the case for WavLM as well ([2110.13900] WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing): they take the iteration-1 HuBERT checkpoint to generate targets for training WavLM Base for 400k steps, and similarly the iteration-2 HuBERT checkpoint for WavLM Large, which they note can be seen as a third iteration. So I would assume that for both models the iterations are 'independent', in the sense that checkpoints from previous iterations are not loaded. Code-wise, we could still load the checkpoint by setting strict=False, in which case the last projection layer would be randomly initialized; but that raises the question of whether the same optimizer and scheduler should be used for the frontend and this freshly initialized layer. All together, I would suppose checkpoints from previous iterations are not loaded.
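For reference, the strict=False loading mentioned above would look roughly like this. It's a toy sketch, not torchaudio's module layout: the "backbone" weights transfer, while the cluster-count-dependent head is skipped and left randomly initialized, as reported in missing_keys.

```python
import torch.nn as nn

def make_model(num_clusters):
    # Stand-in for the real model: a frontend plus the final projection,
    # whose output size is tied to the cluster count.
    return nn.Sequential(
        nn.Linear(39, 768),           # index "0": feature frontend stand-in
        nn.Linear(768, num_clusters), # index "1": final projection
    )

iter1 = make_model(100)  # pretend this is the iteration-1 checkpoint
iter2 = make_model(500)

# Drop the shape-incompatible head, then load non-strictly.
ckpt = {k: v for k, v in iter1.state_dict().items() if not k.startswith("1.")}
result = iter2.load_state_dict(ckpt, strict=False)
print(result.missing_keys)  # → ['1.weight', '1.bias'] (head stays random)
```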