Hey,
So I’m trying to implement the DeepSpeech2 architecture in PyTorch, but I’m currently having trouble with CTCLoss. The documentation says that the inputs to CTCLoss should be the log probabilities for each timestep, a tensor of the target token sequences, the lengths of the spectrograms (i.e. the inputs to the model), and the lengths of the token sequences.
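If I read the docs example correctly, the call should look roughly like this (example shapes, not my actual model):

```python
import torch
import torch.nn as nn

T, N, C, S = 633, 8, 29, 50   # timesteps, batch size, classes, max target length (example values)

log_probs = torch.randn(T, N, C).log_softmax(2)           # (T, N, C) log probabilities
targets = torch.randint(1, C, (N, S), dtype=torch.long)   # token sequences
input_lengths = torch.full((N,), T, dtype=torch.long)     # lengths of the spectrograms
target_lengths = torch.randint(10, S, (N,), dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```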
The DeepSpeech2 architecture has two convolutional layers with a stride greater than 1, so the number of timesteps in my spectrograms gets reduced by the convolutional layers. I’m pretty sure this is the reason I’m getting the following error:
```
RuntimeError: Expected input_lengths to have value at most 317, but got value 633 (while checking arguments for ctc_loss_allocate_outputs)
```
So my spectrogram originally has 633 timesteps, but after the forward pass only 317 timesteps are left. As far as I understand the documentation, I have to pass the lengths of the input tensors and not the lengths of my model’s output, and since my model reduces the number of timesteps, I get the error above. Do you know how I can fix this issue?
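For reference, here is a minimal snippet that reproduces the same error. It’s not my exact model, just a single stride-2 conv as a stand-in for the DeepSpeech2 front end, with made-up sizes:

```python
import torch
import torch.nn as nn

N, C, T, F = 4, 29, 633, 161   # batch, classes, timesteps, feature bins (made-up sizes)

# Stand-in for the DeepSpeech2 front end: a single conv with stride 2 along time
conv = nn.Conv1d(in_channels=F, out_channels=32, kernel_size=11, stride=2, padding=5)
to_classes = nn.Linear(32, C)

spec = torch.randn(N, F, T)                      # (batch, features, time)
features = conv(spec)                            # time dimension shrinks: 633 -> 317
log_probs = to_classes(features.transpose(1, 2)).log_softmax(2)   # (N, 317, C)
log_probs = log_probs.transpose(0, 1)            # CTCLoss expects (T, N, C)

targets = torch.randint(1, C, (N, 50), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)    # original spectrogram lengths
target_lengths = torch.full((N,), 50, dtype=torch.long)

# RuntimeError: Expected input_lengths to have value at most 317, but got value 633
loss = nn.CTCLoss()(log_probs, targets, input_lengths, target_lengths)
```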