Wav2Vec2FABundle emission frame size

The example waveform in the forced alignment tutorial has 54k frames and is sampled at 16 kHz. If I remember correctly, it produces an emission tensor of size [1, 169, 28]. There are 28 labels and apparently 169 binned (?) frames, which matches the Time label on the x axis of the plots.

I need to convert the emission frames back to the “actual” sampled frames in my pipeline. Asking Copilot returns suggestions about window size and stride. Does anybody know whether these frame parameters are fixed, and what the actual values are?

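Answering my own question. After a closer look at the tutorial, in particular the display_segments function in the Audio samples section, I realized that the frame ratio is simply the size of the waveform (number of samples) divided by the number of rows in the emission matrix. Multiplying an emission frame index by that ratio gives the corresponding sample index in the original waveform.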
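
For anyone who wants this in code, here is a minimal sketch of the conversion. The function names are my own (hypothetical, not part of torchaudio), and the sizes are the approximate ones from my post: 54000 stands in for the "54k" waveform length, with 169 emission frames. It uses integer arithmetic instead of a float ratio to avoid rounding surprises at the boundaries. As far as I know, the standard wav2vec 2.0 feature extractor has a total stride of 320 samples (20 ms at 16 kHz), which would be consistent with a ratio close to 320, but treat that as an assumption to verify for your own bundle.

```python
def frame_to_sample(frame_index: int, num_samples: int, num_frames: int) -> int:
    """Map an emission frame index to a waveform sample index.

    Equivalent to int(frame_index * num_samples / num_frames),
    but computed with integers only.
    """
    return (frame_index * num_samples) // num_frames


def frame_to_seconds(frame_index: int, num_samples: int, num_frames: int,
                     sample_rate: int) -> float:
    """Map an emission frame index to a time offset in seconds."""
    return frame_to_sample(frame_index, num_samples, num_frames) / sample_rate


# Assumed example sizes (approximate values from the post, not exact).
NUM_SAMPLES = 54000   # waveform.size(1) in the tutorial's terms
NUM_FRAMES = 169      # emission.size(1), i.e. rows of the emission matrix
SAMPLE_RATE = 16000   # Hz

# A hypothetical aligned segment spanning emission frames 10 to 20.
start = frame_to_sample(10, NUM_SAMPLES, NUM_FRAMES)
end = frame_to_sample(20, NUM_SAMPLES, NUM_FRAMES)
print(f"samples {start}..{end}")
print(f"starts at {frame_to_seconds(10, NUM_SAMPLES, NUM_FRAMES, SAMPLE_RATE):.3f} s")
```

With real data you would pass the actual waveform length and emission frame count rather than the constants above, and then slice the waveform with the resulting start and end sample indices.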