Training an ASR Model with NLLLoss or CrossEntropyLoss

I’m new to ASR models. Could someone kindly help me understand how to train a simple ASR model?

Background:
I have created an attention-based ASR model. The encoder takes mel-spectrograms as input, with shape (B, C, T), e.g. (16, 80, 873). The decoder is passed a sequence of IDs corresponding to the symbols in the text, with shape (B, T), e.g. (16, 168), which is embedded with nn.Embedding(n_symbols, 256) -> (16, 168, 256).

After some RNN layers the model outputs a tensor of shape (16, 168, 256).


My lack of knowledge is in the loss calculation for ASR models. I have tried both NLLLoss and CrossEntropyLoss with little success and several different errors:

CrossEntropyLoss & NLLLoss -> RuntimeError: multi-target not supported at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:15

After following the suggestion in another thread, I got a different error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
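
For anyone hitting the same two errors, here is a minimal sketch of the shapes CrossEntropyLoss expects for sequence targets. It is not the poster's model: `n_symbols = 40` and the `proj` output layer are assumptions, and random tensors stand in for the decoder output. The "multi-target" error typically means the target has more dimensions than the loss expects for the given input shape, and the "does not require grad" error often shows up when a non-differentiable step such as argmax is applied before the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, H = 16, 168, 256   # batch, target length, decoder hidden size (from the post)
n_symbols = 40           # assumed vocabulary size, not given in the post

# Random stand-ins for the decoder output and the symbol IDs
decoder_out = torch.randn(B, T, H, requires_grad=True)
targets = torch.randint(0, n_symbols, (B, T))   # class indices, dtype int64

# Hypothetical output layer mapping hidden states to one logit per symbol
proj = nn.Linear(H, n_symbols)
logits = proj(decoder_out)                      # (B, T, n_symbols)

criterion = nn.CrossEntropyLoss()
# The class dimension must come second: input (B, n_symbols, T), target (B, T).
# Pass raw logits, not argmax-ed predictions, so gradients can flow.
loss = criterion(logits.permute(0, 2, 1), targets)
loss.backward()
```

The same layout works with NLLLoss if `log_softmax` is applied over the class dimension first.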

Hi,

for ASR models, I would usually go for the connectionist temporal classification loss (CTC loss).
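
A minimal sketch of how `nn.CTCLoss` could be called, assuming the encoder emits one score vector per frame. The vocabulary size of 40, the choice of index 0 as the blank, and the random tensors are all assumptions for illustration, not details of the model in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

T_in, B, S = 873, 16, 168   # encoder frames, batch, max target length (from the post)
n_symbols = 40              # assumed vocabulary size; index 0 is reserved for blank

# Random stand-in for per-frame encoder logits, shape (T_in, B, n_symbols)
logits = torch.randn(T_in, B, n_symbols, requires_grad=True)
log_probs = logits.log_softmax(dim=2)   # CTCLoss expects log-probabilities

# Targets are padded to length S and must not contain the blank index
targets = torch.randint(1, n_symbols, (B, S))
input_lengths = torch.full((B,), T_in, dtype=torch.long)
target_lengths = torch.randint(100, S + 1, (B,))   # true length of each transcript

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Note that, unlike the attention decoder setup, the targets are not fed into the model here; CTC only needs the per-frame outputs and the lengths.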

Regards,
Unity05

Thank you, I’ll look into it.

Could you elaborate on why you use CTCLoss?

It is really useful when, for instance, the same character is predicted over several consecutive time steps, because CTC uses blank characters to tell repeated frames apart from genuinely doubled letters (i.e. heelll[blank]llllloooo and hellll[blank]lo would both get decoded to hello, but the raw output hello, with no blank between the two l's, would not; it would collapse to helo).
The loss (CTCLoss) is computed as illustrated here (the blank labels are not removed for loss calculation):
(https://miro.medium.com/max/1200/1*1_5KnLvaTkGUFoyat2jHcQ.png)
During inference, you get the prediction by best path decoding: take the most likely symbol at each time step, merge consecutive repeats, then remove the blanks.
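
Best path decoding can be sketched in a few lines of plain Python. The function names, the `_` stand-in for the blank symbol, and the toy probabilities are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # stand-in for the [blank] symbol


def collapse(path, blank=BLANK):
    """Merge consecutive repeats, then drop blanks."""
    merged = [sym for sym, _ in groupby(path)]
    return "".join(sym for sym in merged if sym != blank)


def best_path_decode(probs, alphabet, blank=BLANK):
    """Pick the most likely symbol in each frame, then collapse the path."""
    path = [
        alphabet[max(range(len(frame)), key=frame.__getitem__)]
        for frame in probs
    ]
    return collapse(path, blank)


print(collapse("heelll_llllloooo"))  # hello
print(collapse("hellll_lo"))         # hello
print(collapse("hello"))             # helo (no blank between the two l's)

alphabet = ["_", "a", "b"]
probs = [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1], [0.9, 0.05, 0.05], [0.1, 0.2, 0.7]]
print(best_path_decode(probs, alphabet))  # ab
```

This greedy approach only follows the single most likely path; beam search over paths is a common alternative when accuracy matters more than speed.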

Thank you very much @Unity05!

That’s super helpful.

Thanks, and yeah, CTC is quite nice. :grinning_face_with_smiling_eyes: