HuBERT Pre-training for the Second Iteration without Previous Checkpoints?

jef · September 4, 2023, 6:08am

I recently completed the first iteration of HuBERT pre-training (torchaudio example), using the provided training example, and have finished preprocessing for the second iteration. I have a question about pre-training in this second iteration.

In the pre-training phase of HuBERT’s second iteration, it seems like the checkpoint from the first iteration isn’t being used. Does this suggest that the second iteration trains from scratch, without initializing from the checkpoint of the first iteration?

From the HuBERT paper:

We train the BASE model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250k steps, while the second iteration is trained for 400k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. Training for 100k steps takes about 9.5 hours.
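To make the label-generation step in that quote concrete, here is a rough sketch of how I picture it: k-means over frame-level features, with each frame's cluster index becoming its training target. This is not the torchaudio recipe. `kmeans_labels` is a hypothetical helper, the random array only stands in for real 6th-layer outputs, and the sizes are assumptions chosen so it runs quickly.

```python
import numpy as np

def kmeans_labels(features, num_clusters, num_iters=10, seed=0):
    """Run a few Lloyd iterations and return one cluster index per frame."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames (fancy indexing copies).
    init_idx = rng.choice(len(features), size=num_clusters, replace=False)
    centroids = features[init_idx]

    def assign(cents):
        # Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2.
        x2 = (features ** 2).sum(axis=1, keepdims=True)
        c2 = (cents ** 2).sum(axis=1)
        dists = x2 - 2.0 * features @ cents.T + c2
        return dists.argmin(axis=1)

    for _ in range(num_iters):
        labels = assign(centroids)
        for k in range(num_clusters):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return assign(centroids)

# Placeholder for features taken from the 6th transformer layer of the
# first-iteration model (assumption: 2000 frames, 16 dimensions).
layer6_features = np.random.default_rng(1).standard_normal((2000, 16))

# Cluster count for the second iteration is an assumed value here.
labels = kmeans_labels(layer6_features, num_clusters=500)
print(labels.shape, labels.dtype)
```

In this picture, the only thing the second iteration needs from the first model is `labels`, which is what prompted my question about the checkpoint.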

bmkor · January 1, 2025, 3:57am

Not sure if this is still relevant or if it's just too late. From my limited understanding, the first-iteration checkpoint cannot be used here if you follow the paper, because the number of clusters increases from 100 to 500 after the 1st iteration. The paper also says that the 2nd iteration uses labels obtained by clustering the 6-th transformer layer output, instead of the MFCC-based labels used in the 1st iteration. So my interpretation is that the 2nd iteration uses the first model's label output, not its checkpoint. (Yes, training from scratch in the 2nd iteration…)
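A toy illustration of the cluster-count point, assuming the model keeps one embedding row per cluster label. The parameter names and sizes below are made up for the example and are not taken from any real checkpoint.

```python
import numpy as np

def make_params(num_clusters, embed_dim=8, final_dim=4, seed=0):
    """Build a toy parameter dict; names and sizes are hypothetical."""
    rng = np.random.default_rng(seed)
    return {
        "encoder.weight": rng.standard_normal((embed_dim, embed_dim)),
        # One row per cluster label, so the shape depends on the cluster count.
        "label_embeddings": rng.standard_normal((num_clusters, final_dim)),
    }

def shape_mismatches(checkpoint, model):
    """List parameters present in both dicts whose shapes differ."""
    return [
        name
        for name in model
        if name in checkpoint and checkpoint[name].shape != model[name].shape
    ]

first_iter = make_params(num_clusters=100)   # 1st iteration: 100 clusters
second_iter = make_params(num_clusters=500)  # 2nd iteration: 500 clusters

mismatched = shape_mismatches(first_iter, second_iter)
print(mismatched)  # ['label_embeddings']
```

So in this toy setup the 1st-iteration parameters do not fit a 500-cluster model as they are, which matches my reading above.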