I am working on an ASR project that uses a HuggingFace model (wav2vec2). My goal for now is to move the training process to plain PyTorch, so I am trying to reimplement the utilities that HuggingFace’s Trainer class provides.
One of these utilities is the ability to group samples into batches by length and combine this with dynamic padding (via a data collator). However, I am not sure how to even begin. Would I need to write a custom DataLoader class so that every batch it returns contains samples whose lengths are as close to each other as possible?
The inputs are 1-D arrays that represent the raw waveform of a .wav file. One idea I had was to sort the data by length, from longest to shortest (or the opposite), and then repeatedly take the next batch_size samples from the sorted list. With a longest-first sort, the first batch would consist of the longest samples, the second batch of the next longest, and so on.
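To make the idea above concrete, here is a minimal sketch of it as a batch sampler. It assumes the length of every sample is known up front as a list; the class name LengthGroupedBatchSampler and its parameters are my own placeholders, and this is not necessarily how Trainer groups by length internally.

```python
import random


class LengthGroupedBatchSampler:
    """Yields lists of dataset indices whose samples have similar lengths.

    Hypothetical helper: `lengths` is assumed to hold the number of audio
    samples of each dataset item, in dataset order.
    """

    def __init__(self, lengths, batch_size, shuffle=True, seed=0):
        self.lengths = list(lengths)
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = random.Random(seed)

    def __iter__(self):
        # Sort dataset indices from the longest sample to the shortest.
        order = sorted(
            range(len(self.lengths)),
            key=lambda i: self.lengths[i],
            reverse=True,
        )
        # Cut the sorted order into consecutive chunks of batch_size.
        batches = [
            order[i:i + self.batch_size]
            for i in range(0, len(order), self.batch_size)
        ]
        if self.shuffle:
            # Shuffle the order of the batches, not their contents, so the
            # model does not always see the longest samples first.
            self.rng.shuffle(batches)
        return iter(batches)

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return (len(self.lengths) + self.batch_size - 1) // self.batch_size
```

As far as I understand, an object like this could be passed to torch.utils.data.DataLoader through its batch_sampler argument (in that case batch_size, shuffle, sampler and drop_last must be left at their defaults), so a custom DataLoader class might not be needed.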
Nevertheless, I am not sure how to approach the implementation. I also searched online but could not find an existing implementation. Any advice will be greatly appreciated.
Thanks in advance.
EDIT: Now that I think about it, could I possibly do it somehow in the data collator function?
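For reference, this is roughly what I imagine the collator part would look like: a minimal sketch that pads each batch to its own longest waveform. It assumes every dataset item is just a 1-D waveform with no labels; the function name collate_waveforms is a placeholder, and the output keys input_values and attention_mask are chosen to match the argument names of wav2vec2's forward method. Note that a collate function only receives the samples already selected for a batch, so on its own it can pad dynamically but cannot decide which samples get grouped together.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def collate_waveforms(batch):
    """Pad a list of 1-D waveforms to the longest one in this batch.

    Hypothetical helper: assumes each item in `batch` is a 1-D sequence
    of floats (list, NumPy array, or tensor) and carries no labels.
    """
    waveforms = [torch.as_tensor(w, dtype=torch.float32) for w in batch]
    lengths = torch.tensor([w.shape[0] for w in waveforms])

    # pad_sequence right-pads every waveform with padding_value up to the
    # maximum length found in this batch only (dynamic padding).
    input_values = pad_sequence(waveforms, batch_first=True, padding_value=0.0)

    # 1 marks real audio samples, 0 marks padded positions.
    positions = torch.arange(input_values.shape[1])
    attention_mask = (positions[None, :] < lengths[:, None]).long()

    return {"input_values": input_values, "attention_mask": attention_mask}
```

This would be passed to the DataLoader as collate_fn, while the grouping by length would still have to happen in whatever chooses the indices for each batch.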