I’m new to ASR models. Could someone kindly help me understand how to train a simple ASR model?
Background:
I have created an attention-based ASR model. The encoder takes mel-spectrograms as input, with shape (B, C, T), e.g. (16, 80, 873). The decoder is passed a sequence of IDs corresponding to the symbols in the text, with shape (B, T), e.g. (16, 168); this is embedded with nn.Embedding(n_symbols, 256), giving (16, 168, 256).
After some RNN layers the model outputs a tensor of shape (16, 168, 256).
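A minimal sketch of the decoder-side shapes described above. The vocabulary size (n_symbols = 40) and the single nn.GRU layer are placeholder assumptions, since the post gives neither; the real model may differ.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T_text, hidden = 16, 168, 256
n_symbols = 40  # hypothetical vocabulary size, not given in the post

embedding = nn.Embedding(n_symbols, hidden)
rnn = nn.GRU(hidden, hidden, batch_first=True)  # stand-in for "some RNN layers"

# Symbol IDs for the text, one row per utterance: (16, 168), dtype int64
token_ids = torch.randint(0, n_symbols, (B, T_text))

embedded = embedding(token_ids)   # (16, 168, 256)
decoder_out, _ = rnn(embedded)    # (16, 168, 256)

print(embedded.shape, decoder_out.shape)
```

Note that the last dimension here is the hidden size (256), not the number of symbols.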
My lack of knowledge is in the loss calculation for ASR models. I have tried both NLLLoss and CrossEntropyLoss with little success and many different errors:
CrossEntropyLoss & NLLLoss -> RuntimeError: multi-target not supported at /pytorch/aten/src/THCUNN/generic/ClassNLLCriterion.cu:15
After following another thread (HERE), I got:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
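For reference, a hedged sketch of one way the loss can be wired up for an output like this. It assumes the (16, 168, 256) tensor is a hidden state that still needs projecting to n_symbols classes (the post does not say), and it uses a placeholder n_symbols = 40 and a hypothetical nn.Linear layer named proj. nn.CrossEntropyLoss accepts logits of shape (N, C) with integer targets of shape (N), or logits of shape (N, C, T) with targets of shape (N, T); shape mismatches against these layouts are one common source of target-related errors.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, T_text, hidden = 16, 168, 256
n_symbols = 40  # hypothetical vocabulary size, not given in the post

# Stand-ins for the decoder output and the target symbol IDs.
decoder_out = torch.randn(B, T_text, hidden, requires_grad=True)
targets = torch.randint(0, n_symbols, (B, T_text))  # (16, 168), int64 class IDs

# Assumed extra layer: project hidden states to one score per symbol.
proj = nn.Linear(hidden, n_symbols)
logits = proj(decoder_out)  # (16, 168, n_symbols)

criterion = nn.CrossEntropyLoss()

# Option 1: flatten batch and time so logits are (N, C) and targets are (N).
loss = criterion(logits.reshape(-1, n_symbols), targets.reshape(-1))

# Option 2: move the class dimension to position 1, i.e. (B, C, T) vs (B, T).
loss_alt = criterion(logits.transpose(1, 2), targets)

# The loss must be computed from the raw logits. Predictions obtained with
# argmax are integer tensors with no grad_fn, so a loss built on them
# cannot be backpropagated.
predicted_ids = logits.argmax(dim=-1)

loss.backward()
print(loss.item(), predicted_ids.requires_grad)
```

If a "does not require grad" error appears, it may be worth checking that nothing between the model output and the loss detaches the graph (argmax, .detach(), .data, re-wrapping with torch.tensor, or a torch.no_grad() block). With NLLLoss, the same shapes apply, but the input should be log-probabilities, for example from log_softmax over the class dimension.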