Hello!
Thank you for your implementation of Emformer.
However, it is not clear to me what the variable ‘lengths’ should be equal to in the implementation.
Should it be lengths = utterance.length + right_context.length, or just lengths = utterance.length?
And if the latter, where should the right context be placed when I feed in a batch of waveforms: before the padding or after it?
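
To make the question concrete, here is a minimal sketch of the two batch layouts and the two candidate values of lengths I have in mind. It uses plain Python lists instead of tensors, and every name in it (the helper functions, the toy frame values, PAD) is hypothetical and chosen only for illustration. It does not reflect how Emformer actually expects its input.

```python
# Toy illustration only: plain lists stand in for feature frames.
# All names and values are hypothetical, not part of the Emformer API.
PAD = 0  # padding value used to equalize sequence lengths in the batch


def context_before_padding(utterance, right_context, total_len):
    """Layout A: [utterance | right context | padding]."""
    frames = utterance + right_context
    return frames + [PAD] * (total_len - len(frames))


def context_after_padding(utterance, right_context, max_utt_len):
    """Layout B: [utterance | padding | right context]."""
    padded = utterance + [PAD] * (max_utt_len - len(utterance))
    return padded + right_context


# Two utterances of different lengths, each with 2 right-context frames.
utts = [[1, 2, 3], [4, 5, 6, 7, 8]]
ctxs = [[91, 92], [93, 94]]

max_utt_len = max(len(u) for u in utts)   # longest utterance in the batch
total_len = max_utt_len + len(ctxs[0])    # utterance plus right context

batch_a = [context_before_padding(u, c, total_len) for u, c in zip(utts, ctxs)]
batch_b = [context_after_padding(u, c, max_utt_len) for u, c in zip(utts, ctxs)]

# The two candidate definitions of lengths from the question.
lengths_utt_only = [len(u) for u in utts]
lengths_with_ctx = [len(u) + len(c) for u, c in zip(utts, ctxs)]
```

For the shorter utterance, layout A gives [1, 2, 3, 91, 92, 0, 0] and layout B gives [1, 2, 3, 0, 0, 91, 92]. I would like to know which of these layouts, and which of the two lengths lists, the implementation expects.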