I’m new to ASR models. Could someone kindly help me understand how to train a simple ASR model?
I have created an attention-based ASR model with an encoder that takes mel-spectrograms as input, with shape
(B, C, T) or
(16, 80, 873). The decoder is passed a sequence of IDs corresponding to the symbols in the text, of shape
(B, T) or
(16, 168). This is embedded with
nn.Embedding(n_symbols, 256) -> (16, 168, 256).
After some RNN layers the model outputs a tensor of shape
(16, 168, 256).
My lack of knowledge is in the loss calculation for ASR models. I have tried both NLLLoss and CrossEntropyLoss with little success and many different errors:
CrossEntropyLoss & NLLLoss -> RuntimeError: multi-target not supported at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:15
After following another thread (HERE), I got:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
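For reference, here is a minimal sketch of the shapes described above, not the actual model. It assumes the 256-dimensional decoder output is a hidden state that still has to be projected to `n_symbols` logits; the `proj` layer, the value `n_symbols = 40`, and `pad_id = 0` are hypothetical placeholders. `CrossEntropyLoss` expects logits of shape (N, C) and integer targets of shape (N,), so the batch and time dimensions are flattened before the loss is computed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T, H = 16, 168, 256   # batch, target length, decoder hidden size (from the post)
n_symbols = 40           # hypothetical vocabulary size
pad_id = 0               # hypothetical padding ID, ignored by the loss

# Stand-in for the decoder output described above: (B, T, H)
decoder_out = torch.randn(B, T, H, requires_grad=True)
# Target symbol IDs: (B, T), integer class indices (dtype int64)
targets = torch.randint(1, n_symbols, (B, T))

# Assumed projection from hidden size to per-symbol logits
proj = nn.Linear(H, n_symbols)
logits = proj(decoder_out)                       # (B, T, n_symbols)

criterion = nn.CrossEntropyLoss(ignore_index=pad_id)
# Flatten batch and time: logits -> (B*T, n_symbols), targets -> (B*T,)
loss = criterion(logits.reshape(-1, n_symbols), targets.reshape(-1))
loss.backward()                                  # gradients flow back to decoder_out
```

In this sketch the loss is computed on the raw logits. Taking `argmax` of the output before the loss, or detaching it from the graph, is one common way to end up with a tensor that has no `grad_fn`.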