Training an ASR Model with NLLLoss or CrossEntropyLoss

knoriy · May 18, 2021, 3:10pm

I’m new to ASR models. Could someone kindly help me understand how to train a simple ASR model?

Background:
I have created an attention-based ASR model. The encoder takes mel spectrograms as input, with shape (B, C, T), e.g. (16, 80, 873). The decoder is passed a sequence of IDs corresponding to the symbols in the text, with shape (B, T), e.g. (16, 168); this is embedded with nn.Embedding(n_symbols, 256) -> (16, 168, 256).

After some RNN layers the model outputs a tensor of shape (16, 168, 256).

My lack of knowledge is in the loss calculation for ASR models. I have tried both NLLLoss and CrossEntropyLoss with little success and many different errors:

CrossEntropyLoss & NLLLoss -> RuntimeError: multi-target not supported at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:15

Following another thread HERE

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
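
For reference, here is a minimal sketch of the shapes CrossEntropyLoss expects for a sequence output like the one described above. It assumes the (16, 168, 256) decoder output is first projected to the vocabulary size; the value V = 40 and the random tensors are placeholders, not part of the original model. The "multi-target not supported" error is typically raised when the target has an extra dimension relative to what the loss expects, and the "does not require grad" error often appears when the loss is computed on something detached from the graph (for example the result of argmax), but neither cause is confirmed for this particular model.

```python
import torch
import torch.nn as nn

B, T, V = 16, 168, 40  # batch, target length, vocabulary size (V is an assumed value)

# Stand-in for the decoder output after a final projection to vocabulary size.
logits = torch.randn(B, T, V, requires_grad=True)
# Targets are class indices, not one-hot vectors: shape (B, T), dtype long.
targets = torch.randint(0, V, (B, T))

criterion = nn.CrossEntropyLoss()
# CrossEntropyLoss expects the class dimension second: (B, V, T) against (B, T).
loss = criterion(logits.permute(0, 2, 1), targets)
loss.backward()
```

The loss is computed directly on the raw logits, so the graph stays intact and backward() works.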

Unity05 · May 18, 2021, 3:41pm

Hi,

for ASR models, I would usually go for the connectionist temporal classification loss (CTC loss).

Regards,
Unity05

knoriy · May 19, 2021, 2:32pm

Thank you, I’ll look into it.

Could you elaborate on why you use CTCLoss?

Unity05 · May 19, 2021, 3:45pm

It is really useful when, for instance, you get duplicate characters across consecutive time steps, because CTC uses blank characters (e.g. heelll[blank]llllloooo and hellll[blank]lo would both be decoded to hello, but a frame sequence hello with no blank between the two l's would collapse to helo).
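
The collapsing rule above can be sketched in a few lines of plain Python: merge runs of identical symbols first, then drop the blanks. The function name ctc_collapse and the use of "_" as the blank symbol are choices made for this illustration only.

```python
from itertools import groupby

BLANK = "_"  # stands in for the [blank] symbol (illustrative choice)

def ctc_collapse(frames: str, blank: str = BLANK) -> str:
    # Step 1: merge each run of identical consecutive symbols into one symbol.
    merged = [symbol for symbol, _ in groupby(frames)]
    # Step 2: remove the blank symbols.
    return "".join(symbol for symbol in merged if symbol != blank)

print(ctc_collapse("heelll_llllloooo"))  # hello
print(ctc_collapse("hellll_lo"))         # hello
print(ctc_collapse("hello"))             # helo
```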
The loss (CTCLoss) is computed as illustrated here (the blank labels are not removed for loss calculation):
(https://miro.medium.com/max/1200/1*1_5KnLvaTkGUFoyat2jHcQ.png)
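
In PyTorch this corresponds to nn.CTCLoss. Below is a minimal sketch of a call, assuming the encoder emits one score vector per frame and that index 0 is reserved for the blank; all sizes and the random tensors are placeholder values, not taken from the model in this thread.

```python
import torch
import torch.nn as nn

T, B, V = 200, 16, 40  # encoder frames, batch size, vocab size incl. blank (assumed)
S = 30                 # longest transcript in the batch (assumed)

# Stand-in for the encoder output; CTCLoss wants log-probabilities shaped (T, B, V).
logits = torch.randn(T, B, V, requires_grad=True)
log_probs = logits.log_softmax(dim=2)

# Label IDs in 1..V-1, because index 0 is used as the blank here.
targets = torch.randint(1, V, (B, S))
input_lengths = torch.full((B,), T, dtype=torch.long)  # frames per utterance
target_lengths = torch.randint(10, S + 1, (B,))        # labels per utterance

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Note that the targets are passed without blanks; the loss sums over all frame-level alignments (with blanks) that collapse to each target.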
During inference, you get the prediction by best path decoding.
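
Best path (greedy) decoding can be sketched as follows: take the most likely symbol in each frame, merge repeats, then drop blanks. The function name best_path_decode and the toy three-symbol vocabulary are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def best_path_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    """log_probs: (T, V) scores for one utterance; returns a list of label IDs."""
    path = log_probs.argmax(dim=1)           # most likely symbol per frame
    merged = torch.unique_consecutive(path)  # merge runs of repeated symbols
    return merged[merged != blank].tolist()  # drop the blanks

# Toy vocabulary: 0 = blank, 1 = 'a', 2 = 'b'
frames = torch.tensor([1, 1, 0, 1, 2, 2, 0])
log_probs = F.one_hot(frames, num_classes=3).float().log_softmax(dim=1)
decoded = best_path_decode(log_probs)
print(decoded)  # [1, 1, 2]
```

The blank between the two runs of 1 is what keeps them as two separate labels in the output.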

knoriy · May 19, 2021, 3:54pm

Thank you very much @Unity05!

That’s super helpful.

Unity05 · May 19, 2021, 3:56pm

Thanks, and yeah, CTC is quite nice.