Audio Dataset - Load large file into memory in background

Hi everyone,

I have a specific thing I want to achieve, and was wondering if it’s already possible using the current Dataset/DataLoader implementation. I saw a PR for a ChunkDataset API that may serve my needs, but it isn’t there yet.

Data: audio, with lots of small (1-10 s) sound files.
I want to process this audio in terms of frames, but also incorporate a hop parameter (i.e. take the first 1024 samples as one frame, then start the next frame at sample 256 instead of 1024, so consecutive frames overlap).
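
For concreteness, here is a minimal sketch of that kind of frame indexing (the frame size and hop values match the numbers above; the helper name and the 16 kHz example are just for illustration):

```python
import numpy as np

def frame_indices(num_samples, frame_size=1024, hop=256):
    """Yield (start, end) pairs for overlapping frames.

    With frame_size=1024 and hop=256, frames start at samples
    0, 256, 512, ... so consecutive frames overlap by 768 samples.
    """
    for start in range(0, num_samples - frame_size + 1, hop):
        yield start, start + frame_size

# Example: frame one second of audio at 16 kHz.
signal = np.random.randn(16000)
frames = [signal[s:e] for s, e in frame_indices(len(signal))]
```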

What I want to do is concatenate all the short audio examples into long .wav files, of which two can fit into memory. I wrote code to index individual sounds and their respective frames and it works well.
The idea is to serve frames from one (long, e.g. 1GB) .wav file, and have another one loaded in the background. When all frames from the first file have been served, I replace the “current” file with the one that was loaded in the background, and load a new file.

Everything works, except that the I/O for loading a new file blocks the __getitem__ call, interrupting training. I was thinking of some async I/O structure, but I lack experience in getting that to interoperate with the Dataset/DataLoader classes.
How can I make a non-blocking I/O call that replaces the current “buffered” file while continuing to serve frames?

One way to manage this is to use a background iterator that prefetches ahead of time. But the exact setup depends on what you are trying to do, of course :slight_smile:

If you start with lots of small files (for a total of 1 GB), you could create a dataset that reads them on __getitem__ and caches the data in memory after the first read. And/or you could use a background iterator to prefetch files ahead of time. The downside is lots of random disk seeks, but only on the first read.
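
For illustration, a rough sketch of such a caching dataset; the class name is made up, and it assumes torchaudio.load for reading:

```python
import torchaudio
from torch.utils.data import Dataset

class CachedAudioDataset(Dataset):
    """Reads small audio files on __getitem__ and keeps each waveform
    in memory, so the random disk seeks only happen on the first pass."""

    def __init__(self, file_paths):
        self.file_paths = file_paths
        self._cache = {}  # index -> waveform tensor

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        if idx not in self._cache:
            waveform, sample_rate = torchaudio.load(self.file_paths[idx])
            self._cache[idx] = waveform
        return self._cache[idx]
```

Note that if you wrap this in a DataLoader with num_workers > 0, each worker process keeps its own copy of the cache.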

If you have a single large (1 GB) data file with offsets, you could pay the price of loading it into memory once, so that on __getitem__ the dataset simply looks up the offsets. To load the 1 GB asynchronously, you would read it block by block as an iterator, and you could use bg_iterator to push that into the background. You could still use a cache to keep the data in memory. The benefit is a faster startup, since you don't wait for the whole file to be loaded.
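
A minimal sketch of that block-wise background loading, assuming bg_iterator from torchaudio.datasets.utils (its location at the time of this thread); the file name and block size are placeholders:

```python
from torchaudio.datasets.utils import bg_iterator

def read_blocks(path, block_size=64 * 1024 * 1024):
    # Read the file in fixed-size blocks instead of one big read.
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            yield block

# bg_iterator runs the generator in a background thread and keeps up to
# `maxsize` blocks queued, so the next block is already being read from
# disk while the current one is consumed.
data = b"".join(bg_iterator(read_blocks("big_file.wav"), maxsize=2))
```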

Side note: you could also create a virtual RAM disk and copy the file(s) there once. Everything after that is served from RAM, whether you use lots of small files or one big one, so reads are fast.


Thanks for the reply!

I want to pay the price of loading the file into memory once and using the offsets (which I have already done), but the use case requires many of these large files, e.g. a dataset consisting of 100 files of 1 GB each. So what I imagined was always loading the next 1 GB while the current 1 GB is being served.
I don't mind the penalty of waiting for the first file to load, as long as subsequent loads do not affect training time - I want to make sure the GPU stays fully utilised.

I will have a look at bg_iterator, which may work!

edit: How would I have found audio/utils without going through the source/getting this recommendation?
Is there perhaps some documentation I missed?

edit 2: bg_iterator did it! Fairly simple too: I just had to create a generator over all the big files and wrap it in bg_iterator with maxsize=1 so it buffers the next item. All my other logic still works, generating the correct frames/batches. Thanks!
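
Roughly, the setup looks like this (big_file_paths and serve_frames are placeholders for my existing file list and frame/offset logic):

```python
import torchaudio
from torchaudio.datasets.utils import bg_iterator

def load_big_files(paths):
    # Yield one concatenated .wav file at a time as a waveform tensor.
    for path in paths:
        waveform, sample_rate = torchaudio.load(path)
        yield waveform

# maxsize=1 buffers exactly one item ahead: while frames from the current
# waveform are being served, the next file is already loading in a
# background thread.
for waveform in bg_iterator(load_big_files(big_file_paths), maxsize=1):
    serve_frames(waveform)  # placeholder for the existing frame/offset logic
```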


Thanks for pointing out that the torchaudio documentation needs to be updated to highlight bg_iterator :slight_smile:

Created an issue to track that