I recently completed the first iteration of HuBERT pre-training (torchaudio example) using the provided training sample, and have finished preprocessing for the second iteration. I have a question about pre-training in this second iteration.
In the pre-training phase of HuBERT’s second iteration, it seems like the checkpoint from the first iteration isn’t being used. Does this suggest that the second iteration trains from scratch, without initializing from the checkpoint of the first iteration?
From the HuBERT paper:
We train the BASE model for two iterations on the 960 hours of LibriSpeech audio on 32 GPUs, with a batch size of at most 87.5 seconds of audio per GPU. The first iteration is trained for 250k steps, while the second iteration is trained for 400k steps using labels generated by clustering the 6-th transformer layer output of the first iteration model. Training for 100k steps takes about 9.5 hours.
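For reference, here is a quick back-of-the-envelope calculation of what the quoted schedule implies in wall-clock time. It assumes the 9.5 hours per 100k steps scales linearly with step count, which is my own assumption (the paper only gives that single figure), and the function name is just for illustration.

```python
# Rough wall-clock estimate from the figures quoted from the paper.
# Assumption: training time scales linearly at 9.5 hours per 100k steps.
HOURS_PER_100K_STEPS = 9.5


def estimated_hours(steps: int) -> float:
    """Estimate training hours for a given number of steps."""
    return steps / 100_000 * HOURS_PER_100K_STEPS


first_iter_hours = estimated_hours(250_000)   # first iteration: 250k steps
second_iter_hours = estimated_hours(400_000)  # second iteration: 400k steps
total_hours = first_iter_hours + second_iter_hours

print(f"first iteration:  {first_iter_hours} h")
print(f"second iteration: {second_iter_hours} h")
print(f"total:            {total_hours} h")
```

Under that assumption, on the paper's 32-GPU setup the first iteration comes to about 23.75 hours and the second to about 38 hours.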