Unclear meaning of the 'lengths' variable in Emformer


Thank you for your implementation of Emformer.
However, it is unclear to me what the variable 'lengths' should be equal to in the implementation.

Should it be lengths = utterance.length + right_context.length or just lengths = utterance.length?
And if the latter, where should the right context be located when I feed in a batch of waveforms: after the padding or before it?
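
To make the question concrete, here is a toy sketch of the two batch layouts and the two candidate 'lengths' values I have in mind. It is only an illustration and does not call Emformer. The helper names, the right-context length of 2, and the single-character frame labels ("u" = utterance frame, "r" = right-context frame, "." = padding) are all my own assumptions.

```python
# Toy illustration only: frames are single-character labels, not real features.
# "u" = utterance frame, "r" = right-context frame, "." = padding.
RIGHT_CONTEXT = 2  # assumed right-context length, in frames


def layout_context_before_padding(utt_lens, right_context=RIGHT_CONTEXT):
    """Option A: right context directly follows each utterance; padding comes last."""
    total = max(utt_lens) + right_context
    rows = []
    for n in utt_lens:
        row = "u" * n + "r" * right_context
        rows.append(row + "." * (total - len(row)))
    return rows


def layout_context_after_padding(utt_lens, right_context=RIGHT_CONTEXT):
    """Option B: utterances are padded to a common length; right context fills the last columns."""
    t_max = max(utt_lens)
    return ["u" * n + "." * (t_max - n) + "r" * right_context for n in utt_lens]


utt_lens = [5, 3]  # hypothetical utterance lengths for a batch of two

# The two candidate definitions of 'lengths' from the question.
lengths_utterance_only = list(utt_lens)
lengths_with_context = [n + RIGHT_CONTEXT for n in utt_lens]

print(layout_context_before_padding(utt_lens))  # ['uuuuurr', 'uuurr..']
print(layout_context_after_padding(utt_lens))   # ['uuuuurr', 'uuu..rr']
print(lengths_utterance_only, lengths_with_context)  # [5, 3] [7, 5]
```

The two layouts differ only for the shorter utterance in the batch, which is exactly the case I am unsure about.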