I’m new to ASR models. Could someone kindly help me understand how to train a simple ASR model?
Background:
I have created an attention-based ASR model. The encoder takes mel spectrograms as input, with shape (B, C, T), here (16, 80, 873). The decoder is passed a sequence of IDs corresponding to the symbols in the text, with shape (B, T), here (16, 168), which is embedded with nn.Embedding(n_symbols, 256) to give (16, 168, 256).
After some RNN layers the model outputs a tensor of shape (16, 168, 256).
The part I don't understand is the loss calculation for ASR models. I have tried both NLLLoss and CrossEntropyLoss with little success and many different errors:
CrossEntropyLoss & NLLLoss -> RuntimeError: multi-target not supported at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:15
CTC is really useful when, for instance, you get duplicate symbols across consecutive time steps along the horizontal (time) axis, because CTC uses blank characters (e.g. heelll[blank]llllloooo and hellll[blank]lo both get decoded to hello, but hellllo, with no blank between the l's, would not, since the repeated l's merge into one).
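To make the collapsing rule concrete, here is a minimal sketch in plain Python. The function name `ctc_collapse` and the use of `_` as a stand-in for the blank symbol are my own choices for illustration, not part of any library.

```python
from itertools import groupby

BLANK = "_"  # stand-in for [blank]; hypothetical choice for this sketch

def ctc_collapse(path):
    """Collapse a frame-level CTC path into the final text."""
    # Step 1: merge runs of identical consecutive symbols.
    merged = [symbol for symbol, _ in groupby(path)]
    # Step 2: drop the blanks that separated genuine repeats.
    return "".join(symbol for symbol in merged if symbol != BLANK)

with_blank_a = ctc_collapse("heelll_llllloooo")
with_blank_b = ctc_collapse("hellll_lo")
without_blank = ctc_collapse("hellllo")
```

The order of the two steps matters: the blank has to survive the merge so that the l's on either side of it are kept as two separate letters.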
The loss (CTCLoss) is computed as illustrated here (the blank labels are not removed for loss calculation):
(https://miro.medium.com/max/1200/1*1_5KnLvaTkGUFoyat2jHcQ.png)
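In PyTorch this could look roughly like the sketch below, using `torch.nn.CTCLoss`. The sizes, the random stand-in for the network output, and the choice of index 0 as the blank are assumptions for illustration; in your case the log-probabilities would come from a final linear layer that maps the features to `n_symbols` classes per frame.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes: frames, batch size, vocabulary size (index 0 = blank).
T, B, n_symbols = 50, 4, 30
S = 10  # hypothetical maximum target length

# Stand-in for the network output: one score per symbol for every frame.
logits = torch.randn(T, B, n_symbols, requires_grad=True)
# CTCLoss expects log-probabilities of shape (T, B, C).
log_probs = logits.log_softmax(dim=2)

# Targets are plain symbol IDs without blanks, so they are drawn from 1..n_symbols-1.
targets = torch.randint(1, n_symbols, (B, S), dtype=torch.long)
# Number of valid frames and valid target symbols for each batch element.
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.tensor([10, 8, 7, 5], dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```

Note that the number of frames and the target length do not have to match, which is what makes CTC a better fit here than CrossEntropyLoss on per-position labels. The input just has to be at least as long as the target.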
During inference, you get the prediction by best path decoding.
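Best path decoding just takes the most likely symbol at every frame and then applies the CTC collapsing rule (merge repeats, then drop blanks). A minimal sketch in plain Python, where `best_path_decode`, the toy symbol table, and the per-frame scores are all made up for illustration:

```python
from itertools import groupby

def best_path_decode(frame_scores, symbols, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    # Index of the highest-scoring symbol in each frame.
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # Merge consecutive duplicates, then remove the blank index.
    merged = [idx for idx, _ in groupby(best_ids)]
    return "".join(symbols[idx] for idx in merged if idx != blank)

symbols = ["_", "h", "e", "l", "o"]  # index 0 is the blank
frame_scores = [
    [0.1, 0.6, 0.1, 0.1, 0.1],  # h
    [0.1, 0.1, 0.6, 0.1, 0.1],  # e
    [0.1, 0.1, 0.6, 0.1, 0.1],  # e (repeat, merged)
    [0.1, 0.1, 0.1, 0.6, 0.1],  # l
    [0.6, 0.1, 0.1, 0.1, 0.1],  # blank
    [0.1, 0.1, 0.1, 0.6, 0.1],  # l (kept, separated by the blank)
    [0.1, 0.1, 0.1, 0.1, 0.6],  # o
]
decoded = best_path_decode(frame_scores, symbols)
```

With a real model you would take the argmax over the class dimension of the output tensor instead of over Python lists, but the collapsing step is the same.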