Varying size for labels while creating TensorDatasets for network implementation with CTCLoss

I have a question about nn.CTCLoss. I am experimenting with the library to implement speech recognition, using a network of CNNs feeding into GRUs. I chose CTCLoss as the loss because the characters in my labels are unlikely to be aligned with the audio content.

The issue is that, while the spectrogram of every audio track can be padded with black along the x axis (time), I am not sure what I should do with the labels, since they also vary in size from track to track.

If I am not mistaken, the labels input can have a different size S for each audio source (this is not specified in the docs). But if so, how would I create TensorDatasets and the corresponding DataLoaders, given that I can’t pad the sequences of characters mapped to ints, because the blank label is not permitted in the targets?

From docs: targets: Tensor of size (N, S)

Or should I pad with a character other than blank? The line below won’t work at all:

train_dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(targets))

Here targets is something like targets = np.array([[0, 1, 5, 3, 9], [21, 3, 6]]), which contains normalized ints mapped from the characters’ ASCII codes. A ragged array like this doesn’t make sense from numpy’s perspective, and it also doesn’t make sense to pad the targets with blank just to filter the padding out again for every batch in the loop.

I finally understood, after going over some projects that implement something similar to what I am trying to do, and with help from the docs:

There is a 4th input tensor to be supplied:

target_lengths: Tuple or tensor of size (N).

Lengths of the targets

Thus, I can pad the end of each target with whatever I want, as long as I specify the original length of each target/label before padding.
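For anyone landing here later, a minimal sketch of that idea. The label values, the dummy feature shape, and the choice of 0 as padding value are assumptions for illustration only; pad_sequence is just one convenient way to do the padding.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical label sequences of different lengths (ints mapped from characters).
raw_targets = [[1, 5, 3, 9], [21, 3, 6]]

# Remember the original length of every label sequence BEFORE padding.
target_lengths = torch.tensor([len(t) for t in raw_targets], dtype=torch.long)

# Pad every sequence to the longest one, giving a regular (N, S) tensor.
# The padding value is arbitrary here, since positions past target_lengths
# are not meant to be read by the loss.
padded_targets = pad_sequence(
    [torch.tensor(t, dtype=torch.long) for t in raw_targets],
    batch_first=True,
    padding_value=0,
)

# Dummy spectrograms, assumed shape (N, channels, freq, time).
features = torch.randn(len(raw_targets), 1, 8, 20)

# The lengths travel with the samples, so each batch yields all three tensors.
train_dataset = TensorDataset(features, padded_targets, target_lengths)
train_loader = DataLoader(train_dataset, batch_size=2)

for batch_features, batch_targets, batch_target_lengths in train_loader:
    print(batch_targets.shape, batch_target_lengths.tolist())
```

Storing target_lengths as a third tensor in the TensorDataset keeps each length attached to its sample even when the DataLoader shuffles.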

For variable target lengths, you are spot-on. Note that variable input lengths had a bug in PyTorch 1.0 that will be solved in PyTorch 1.0.1.
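As a rough sketch of how both length arguments are passed to the loss: the sizes, the random log-probabilities standing in for network output, and the label values below are made up for illustration, and blank is assumed to sit at index 0 (the default).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed sizes: T time steps, N samples, C classes including blank at index 0.
T, N, C = 20, 2, 28

# Stand-in for the network output: log-probabilities of shape (T, N, C).
log_probs = torch.randn(T, N, C).log_softmax(dim=2)

# Padded targets of shape (N, S); the trailing 0 in the second row is padding.
targets = torch.tensor([[1, 5, 3, 9], [21, 3, 6, 0]], dtype=torch.long)

# Unpadded lengths: valid time steps per input and valid labels per target.
input_lengths = torch.tensor([20, 15], dtype=torch.long)
target_lengths = torch.tensor([4, 3], dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```

Here input_lengths plays the same role for the padded time axis of the network output that target_lengths plays for the padded labels.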

Best regards

