I am working with audio data. My training code takes the X and label tensors from the dataloader, sends them to the GPU with X.cuda() and label.cuda(), and then passes them through the forward graph. The dataset is around 8 GB of .npy files. My machine has an RTX 2060 with 6 GB of memory, so when I run on my GPU it processes a few batches and then runs out of memory, although the same code runs fine on Colab, which has a Tesla T4 with 15 GB. My guess is that within the first iterations the code ends up forcing all the tensors onto the GPU, and since their total size is larger than the GPU's RAM, CUDA runs out of memory.
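For context, my training loop looks roughly like this (the model and data here are simplified stand-ins for my real audio setup, not my actual code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins for my real .npy audio features and labels
X = torch.randn(64, 128)
labels = torch.randint(0, 10, (64,))
loader = DataLoader(TensorDataset(X, labels), batch_size=16)

# Falls back to CPU here so the sketch runs anywhere
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for xb, yb in loader:
    # This is the transfer that (I think) accumulates memory on the GPU
    xb, yb = xb.to(device), yb.to(device)
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
```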
Here is what I want to do: load one batch of tensors onto the GPU, compute the forward and backward pass, and then send the results back to the CPU. But I know that operations like data.cpu() are expensive, so I want those transfers to run in parallel with the forward and backward pass (model(input), to be specific).
Is it possible? If so, how?