RuntimeError: CUDA out of memory [Unable to use]

My batch size is 1, but the amount of data is huge.

In that case you would have to reduce the amount of data (the number of images, if I understand the use case correctly).

But I need to use the complete data in my program.

The ~11GB are apparently not enough for all 240 images.
You could try torch.utils.checkpoint to trade compute for memory, reduce the model size, or possibly use model sharding on multiple GPUs.
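As a minimal sketch of the checkpointing idea, assuming a plain nn.Sequential model (the layers and shapes here are placeholders, not your actual architecture):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Placeholder CNN just for illustration
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
).cuda()

# The input must require grad so that gradients flow through
# the recomputed segments.
x = torch.randn(1, 3, 224, 224, device='cuda', requires_grad=True)

# checkpoint_sequential splits the model into segments and stores only
# each segment's input; the intermediate activations are recomputed
# during the backward pass, trading compute for memory.
out = checkpoint_sequential(model, 2, x)
out.mean().backward()
```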

Can DataParallel help? I get an error there as well.

nn.DataParallel could be used if your batch size is >1.
However, as far as I understand your use case, your batch size is 1 and the number of images is in another dimension, which is apparently not the batch dimension?

Using DataParallel, I'm getting this error:

```
File "/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 149, in forward
    "them on device: {}".format(self.src_device_obj, t.device))
RuntimeError: module must have its parameters and buffers on device cuda:0 (device_ids[0]) but found one of them on device: cuda:1
```
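This error usually means part of the model was moved to a different device before wrapping it. A minimal sketch of the expected setup, assuming two GPUs (the model here is a placeholder):

```python
import torch
import torch.nn as nn

# Placeholder model just for illustration
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, 3, padding=1),
)

# All parameters and buffers must live on device_ids[0] *before*
# wrapping; don't move submodules to other devices yourself, as
# DataParallel replicates the module onto the remaining GPUs itself.
model = model.to('cuda:0')
model = nn.DataParallel(model, device_ids=[0, 1])

# The input is scattered along dim 0 (the batch dimension), which is
# why a batch size of 1 cannot be split across multiple GPUs.
x = torch.randn(2, 3, 224, 224, device='cuda:0')
out = model(x)
```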

I guess you are using some sort of time-distributed CNN-LSTM architecture. When I was using the same, I faced a lot of memory issues. One thing you can do is avoid loading the entire dataset into memory: divide the data into chunks and load the chunks one by one to train your model. For example, the code below divides the data into two chunks and proceeds with training.
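The original snippet wasn't preserved here, but a minimal sketch of that idea might look like this (the tensor shapes, dataset, and training step are placeholders):

```python
import torch

# Placeholder tensors standing in for the full dataset on the CPU
data = torch.randn(240, 3, 224, 224)
targets = torch.randint(0, 10, (240,))

num_chunks = 2
for data_chunk, target_chunk in zip(data.chunk(num_chunks),
                                    targets.chunk(num_chunks)):
    # Move only the current chunk to the GPU, so at most half of the
    # data occupies device memory at any time.
    data_chunk = data_chunk.to('cuda')
    target_chunk = target_chunk.to('cuda')
    # ... forward pass, loss, backward, and optimizer step here ...
```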