The example waveform in the forced alignment tutorial has about 54k frames (samples) and is sampled at 16 kHz. It yields an emission tensor of size [1, 169, 28], if I remember correctly.
That is 28 labels and apparently 169 binned (?) frames, matching the Time label on the x axis of the plots.
I need to convert these emission frames back to “actual” sample indices in my pipeline. Asking Copilot returns suggestions involving window size and stride. Does anybody know whether these frame parameters are fixed, and what the actual values are?
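
For reference, here is a minimal sketch of the conversion I have in mind. It assumes the emission frames are evenly spaced over the waveform, so the scale is just the ratio of sample count to frame count. The function name `frame_to_sample_span` is my own, and 54_000 is only the rounded "54k" from above, not the exact length. With these numbers the ratio comes out close to 320 samples per frame, which would be consistent with a fixed 20 ms hop at 16 kHz, but that is a guess I have not confirmed.

```python
def frame_to_sample_span(start_frame, end_frame, num_samples, num_frames):
    """Map a [start, end) span of emission frames to waveform sample indices.

    Assumption: frames are evenly spaced, so one frame covers
    num_samples / num_frames samples. Integer arithmetic avoids
    floating-point rounding at the boundaries.
    """
    start_sample = start_frame * num_samples // num_frames
    end_sample = end_frame * num_samples // num_frames
    return start_sample, end_sample


# Values from the post; 54_000 is an approximation of "54k".
num_samples = 54_000
num_frames = 169

# Samples covered by emission frames 10 through 19.
start, end = frame_to_sample_span(10, 20, num_samples, num_frames)
print(start, end, (end - start) / 16_000)  # span in samples and duration in seconds
```

In a real pipeline I would take `num_samples` from the waveform's last dimension and `num_frames` from the emission's time dimension rather than hard-coding them.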