I have a question about nn.CTCLoss. I am experimenting with the library to implement speech recognition, using a network of CNNs feeding into GRUs and so on, with CTCLoss as the loss because the characters in my labels are unlikely to be aligned with the audio content.
The issue is that while the spectrogram of every audio track can be padded with black along the x axis (time), I am not sure what to do with the labels, since they also vary in size from track to track.
If I am not mistaken, the labels input can have a varying size S (not specified in the docs) for each audio source. But if so, how would I create a TensorDataset and the corresponding DataLoader when I can’t pad the sequences of characters mapped to ints, given that the blank label is not permitted in the targets?
targets: Tensor of size (N, S)
Or should I pad with some character other than blank? Because this below won’t work at all:
train_dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))
where targets is something like this:
targets = np.array([[0, 1, 5, 3, 9], [21, 3, 6]])
It contains normalized ints mapped from the characters’ ASCII codes. A ragged array like this doesn’t make sense from NumPy’s perspective, and it also doesn’t make sense to pad the targets with blank only to filter the padding out again for every item in each batch.
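For reference, here is a minimal sketch of how the padded (N, S) form might be wired up. It relies on my reading that nn.CTCLoss takes a separate target_lengths argument and only reads the first target_lengths[i] entries of row i, so the pad value itself should not matter. Everything else is an assumption made for illustration: the sizes T and C, the random stand-in features and log_probs in place of a real CNN+GRU, and the choice to shift the example labels by +1 so that index 0 stays reserved for blank.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical sizes, chosen only for illustration.
T, C = 50, 30  # output time steps of the network, number of classes (index 0 = blank)

# The example labels shifted by +1 (an assumption) so that 0 is free for blank.
labels = [[1, 2, 6, 4, 10], [22, 4, 7]]

# True length of each label sequence, stored alongside the padded targets.
target_lengths = torch.tensor([len(l) for l in labels])
S = int(target_lengths.max())

# Pad every row to length S. The pad value (0 here) is arbitrary as long as
# target_lengths tells the loss how many entries of each row are real.
targets = torch.zeros(len(labels), S, dtype=torch.long)
for i, l in enumerate(labels):
    targets[i, :len(l)] = torch.tensor(l)

# Stand-in for the padded spectrograms: (N, channels, freq, time).
features = torch.randn(len(labels), 1, 64, 200)

# All three tensors are rectangular now, so TensorDataset accepts them.
dataset = TensorDataset(features, targets, target_lengths)
loader = DataLoader(dataset, batch_size=2)

ctc = nn.CTCLoss(blank=0)
for x, y, y_len in loader:
    # Stand-in for the network output: log-probabilities of shape (T, N, C).
    log_probs = torch.randn(T, x.size(0), C).log_softmax(2)
    # Every sample uses all T output steps in this sketch.
    input_lengths = torch.full((x.size(0),), T, dtype=torch.long)
    loss = ctc(log_probs, y, input_lengths, y_len)
```

If the padding feels wasteful, the docs also describe a 1D form where all targets are concatenated into a tensor of size sum(target_lengths). That form would probably need a custom collate_fn for the DataLoader rather than a plain TensorDataset.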