Preparing multi-track audio data for Dataset and DataLoader

I’m currently creating a Wave-U-Net model for automatic mixing of audio and I’m stuck on building the dataset. The data I’m using comprises 8 audio tracks (corresponding to the separate pieces of a drum kit) and a mixdown of those tracks. (Some recordings only have 7 audio tracks, so in those instances I’ve created a tensor of zeros to represent silence – is this the right way to go about it?)
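For reference, this is roughly how I’m substituting silence for the missing track (a minimal sketch; the stand-in `tracks` list is just a placeholder for my actual loading code):

```python
import torch

# Stand-in for the 7 loaded stems of one recording (mono waveforms)
tracks = [torch.randn(44100 * 20) for _ in range(7)]

# Append digital silence of the same shape so every recording ends up with 8 stems
while len(tracks) < 8:
    tracks.append(torch.zeros_like(tracks[0]))
```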

I’ve written the tensors to HDF5 format, leaving me with test.hdf5, val.hdf5 and train.hdf5 files. The audio tracks range from ~10 s to 60 s long, so my main question is: do I need to pad all of the data so that every single audio track is as long as the longest one (60 s), or do I just need to verify that the tracks and mix that make up each recording are the same length as each other? Are there any additional steps needed to prepare my data (e.g. do I need to stack the tensors from the multi-tracks, or something like that)?

Any help is appreciated

The default collate function (the function that creates batches out of the individual samples returned by each DataLoader worker) stacks tensors with torch.stack.
This means all of your tensors within the same category need to be the same length (i.e., you need to pad).
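Concretely, torch.stack raises a size-mismatch error if the lengths differ, so you pad (or trim) before batching. A minimal sketch of what that looks like (`pad_or_trim` and `target_len` are illustrative names, not torch API):

```python
import torch
import torch.nn.functional as F

def pad_or_trim(track: torch.Tensor, target_len: int) -> torch.Tensor:
    """Zero-pad (or trim) along the time axis so every sample in a
    category has the same length and torch.stack can batch them."""
    cur_len = track.shape[-1]
    if cur_len < target_len:
        return F.pad(track, (0, target_len - cur_len))
    return track[..., :target_len]

# torch.stack fails on mixed lengths, but works after padding:
samples = [torch.randn(44100 * 10), torch.randn(44100 * 60)]
batch = torch.stack([pad_or_trim(s, 44100 * 60) for s in samples])  # shape (2, 2646000)
```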

Regarding the additional steps, it depends a lot on the problem. If you want to mix automatically, I guess you may want to have control over the loudness of each track?
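For example, one simple way to take that control is to RMS-normalize each stem before training (a rough sketch of my own, not anything specific to Wave-U-Net; `rms_normalize` and the target level are illustrative):

```python
import torch

def rms_normalize(track: torch.Tensor, target_rms: float = 0.1) -> torch.Tensor:
    """Scale a waveform so its RMS level matches target_rms: one simple
    way to control per-track loudness."""
    rms = track.pow(2).mean().sqrt()
    return track * (target_rms / rms.clamp(min=1e-8))
```

Whether you want this depends on the task: normalizing removes the stems’ original gain information, so the network would have to learn the mix gains itself.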

Hi @JuanFMontesinos, I’ve padded shorter tracks and trimmed longer ones so that all tracks in the dataset are 20 s long, to avoid long data-preparation and training times. I’m not sure if I’m approaching the task the wrong way and was wondering if you could clarify some things. Currently I have converted all tracks to mel spectrograms and stacked them together into one input tensor of size (32, 8, 64, 1536), with a target tensor of size (32, 2, 64, 1536). Is combining everything into one input tensor the correct way to format a mixing task, or would separate tensors for each track as input be better suited? (I don’t know how multiple inputs work with torch networks or how to implement them.)
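In case it helps, this is roughly how I’m building that input tensor (a sketch; the torchaudio settings are placeholders matching my (batch, track, mel, frame) shapes):

```python
import torch
import torchaudio

# Placeholder settings; 64 mel bins to match my tensor shapes
mel = torchaudio.transforms.MelSpectrogram(sample_rate=44100, n_mels=64)

tracks = [torch.randn(44100 * 20) for _ in range(8)]  # stand-in for 8 stems

specs = [mel(t) for t in tracks]   # each: (64, n_frames)
x = torch.stack(specs, dim=0)      # (8, 64, n_frames) for one recording
# the DataLoader's default collate then batches these into (32, 8, 64, n_frames)
```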

Well, first of all, if you are using Wave-U-Net you should be using raw waveforms.

Also, I find 20 s of audio too long (does it fit in the GPU?). There are very few papers working with such durations (though maybe yours is one of those cases). I don’t understand what your target is, sorry.

Lastly, for multiple inputs there are many approaches. You can pass the tracks as channels (which sounds like a good fit here, since it shows the network all the information at once).
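A rough sketch of the channels idea with raw waveforms (the layer sizes are arbitrary, just to illustrate the shapes):

```python
import torch

tracks = [torch.randn(44100 * 20) for _ in range(8)]  # 8 mono stems

x = torch.stack(tracks, dim=0).unsqueeze(0)  # (1, 8, num_samples): batch, channels, time

# A Conv1d front end then sees all 8 stems at once; out_channels is arbitrary here.
# A Wave-U-Net-style model would take in_channels=8 and predict a stereo mix.
conv = torch.nn.Conv1d(in_channels=8, out_channels=24, kernel_size=15, padding=7)
y = conv(x)  # (1, 24, num_samples)
```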