Training a model with a large dataset on a GPU with insufficient memory (efficiently)

Hi.
I am working with audio data. My code takes the X and label tensors from the dataloader, sends them to the GPU with X.cuda() and label.cuda(), and then passes them through the forward graph. The dataset is around 8 GB of .npy files, and my GPU is an RTX 2060 with 6 GB of memory. When I run the code locally it processes a few batches and then runs out of memory, although the same code runs fine on Colab, which has a Tesla T4 with 15 GB of memory. My guess is that by the first iteration the code tries to force all the tensors onto the GPU, and since their total size is greater than the GPU RAM, CUDA runs out of memory.
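In rough pseudocode, the loop looks like this (simplified sketch; it assumes model, criterion, optimizer, and dataloader are already defined, and the variable names are illustrative):

```python
# Simplified sketch of the loop described above
for X, label in dataloader:
    X = X.cuda()              # move the batch to the GPU
    label = label.cuda()

    optimizer.zero_grad()
    output = model(X)         # forward pass
    loss = criterion(output, label)
    loss.backward()           # backward pass
    optimizer.step()
```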

Here is what I want to do: load a tensor onto the GPU, compute the forward and backward graphs, and then send it back to the CPU again. But I know that operations like data.cuda() and data.cpu() are expensive, so I want them to run in parallel with the forward and backward passes (model(input), to be specific).

Is it possible? If so, how?
Thanks!

Write a custom data loader that does lazy loading: in __init__, load only a list describing the data (this could be path information, or row/column/line information for each sample); the actual .npy data is not loaded yet. Then, with the help of that list, load the data batch-wise in __getitem__.
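A rough sketch of such a dataset, assuming each sample is stored in its own .npy file (the paths and labels here are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class LazyAudioDataset(Dataset):
    def __init__(self, file_paths, labels):
        # Keep only the list of file paths and labels in memory;
        # the actual .npy arrays are not loaded here.
        self.file_paths = file_paths
        self.labels = labels

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        # Load a single sample from disk only when it is requested.
        x = np.load(self.file_paths[idx])
        return torch.from_numpy(x).float(), self.labels[idx]

# dataset = LazyAudioDataset(paths, labels)
# loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
```

The DataLoader then assembles batches from these per-sample loads, so only the current batches need to live in RAM and GPU memory at any time.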

Check the link below; they do something very similar!


Yes, but lazy loading is for cases where the dataset is huge. My dataset is not exactly huge; it just doesn't fit on the GPU. So what I need is asynchronous CPU-to-GPU data transfer and vice versa. I don't think that is what the link you mentioned covers.
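What I picture is overlapping the copies with the compute, e.g. something like pinned memory plus non_blocking transfers (sketch only, again assuming dataset, model, criterion, and optimizer are already defined):

```python
from torch.utils.data import DataLoader

# pin_memory=True gives page-locked host memory, and non_blocking=True lets
# the host-to-device copy run asynchronously with respect to the CPU.
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=4, pin_memory=True)

for X, label in loader:
    X = X.cuda(non_blocking=True)
    label = label.cuda(non_blocking=True)

    optimizer.zero_grad()
    output = model(X)               # queued on the default CUDA stream
    loss = criterion(output, label)
    loss.backward()
    optimizer.step()
```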

Even if you don't have a large dataset but still can't fit it on the GPU, you can use lazy loading to mitigate this issue.

One thing I didn't notice properly in your question earlier is this line:

…it processes some batches and runs out of memory…

This won’t happen unless you have a variable batch size or data points that have different memory sizes.

One plausible reason could be that variables you might not be using are accumulating on the GPU over steps/epochs. See if the following helps:

https://pytorch.org/docs/stable/notes/faq.html
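For example, the common pitfall described there is accumulating tensors that are still attached to the computation graph, which keeps every step's graph alive on the GPU (hypothetical snippet):

```python
total_loss = 0.0
for X, label in loader:
    X, label = X.cuda(), label.cuda()

    optimizer.zero_grad()
    loss = criterion(model(X), label)
    loss.backward()
    optimizer.step()

    # total_loss += loss        # bad: keeps each step's graph in GPU memory
    total_loss += loss.item()   # good: detaches to a plain Python float
```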