PyTorch GPU Memory Usage

Hi guys, I’m not really sure why this is happening, but if I measure my data object, it’s about 265 MB on the GPU. If I measure the model, it’s also about 300 MB. But once I start training, PyTorch uses up almost all my GPU memory, and I can’t really understand why. I’ve already set pin_memory=False in my DataLoader and it still shows this behavior. Is there a way to properly trace the memory usage for each object? I’m only using a single sample at a time, since I can’t move to larger batches with all of my GPU memory being used up.
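For reference, this is roughly how I’m measuring the object sizes (just a sketch; the model and data below are placeholders, not my actual setup):

```python
import torch
import torch.nn as nn

# Placeholder model and data, only to show how the sizes are measured
model = nn.Linear(4096, 4096).cuda()
data = torch.randn(8000, 4096, device='cuda')

data_mb = data.numel() * data.element_size() / 1024**2
param_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2

print(f"data: {data_mb:.1f} MB, model params: {param_mb:.1f} MB")
```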


Besides the data and model parameters, the CUDA context will use some memory, as will the intermediate activations, which are needed to compute the backward pass.
Also note that PyTorch uses a caching allocator, which reuses memory instead of returning it to the device.
nvidia-smi will thus show the complete memory usage, while torch.cuda.memory_allocated() will only give you the memory currently occupied by tensors.
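You can compare the two counters with something like this (a minimal sketch; the model and input are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model and input, just to illustrate the two counters
model = nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device='cuda')
out = model(x)

# Memory currently occupied by tensors
print(torch.cuda.memory_allocated() / 1024**2, "MB allocated")
# Memory held by the caching allocator; nvidia-smi roughly shows this
# plus the CUDA context overhead
print(torch.cuda.memory_reserved() / 1024**2, "MB reserved")
```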

Hmm, if that’s the case, it’d mean that the majority of the GPU allocation is for the intermediate stages of the forward/backward passes? Hence, is it likely that I can reduce the GPU memory usage if I shrink the data types of the inputs, e.g. converting from float64 to float32?

That might be the case.
E.g. for a single conv layer, the output might contain more elements than the input, if out_channels > in_channels and you don’t reduce the spatial size.
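As a rough illustration (a minimal sketch with made-up shapes):

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1).cuda()
x = torch.randn(1, 3, 224, 224, device='cuda')

out = conv(x)

# The activation has far more elements than the input, since
# out_channels (64) > in_channels (3) and the spatial size is unchanged
print(x.numel())    # 1 * 3 * 224 * 224   =   150,528
print(out.numel())  # 1 * 64 * 224 * 224  = 3,211,264
```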

Yes. PyTorch uses float32 by default, so if you’ve called double() on your model (and don’t strictly need the precision), I would recommend using FP32.
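Something along these lines shows the difference (a small sketch; the layer size is arbitrary):

```python
import torch
import torch.nn as nn

model = nn.Linear(1000, 1000)

model.double()  # float64 parameters
fp64_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

model.float()   # back to the default float32
fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())

print(fp64_bytes, fp32_bytes)  # float32 uses half the memory

# The inputs should match the parameter dtype as well
x = torch.randn(8, 1000)
out = model(x)
```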


Okay, got it. Here’s something that’s really puzzling to me, though. One sample from my DataLoader appears to take ~250 MB (when I first load a single sample onto the GPU). But oddly, if I increase my batch size to 4, I get this error:

RuntimeError: CUDA out of memory. Tried to allocate 7.41 GiB (GPU 0; 11.17 GiB total capacity; 8.34 GiB already allocated; 2.28 GiB free; 8.60 GiB reserved in total by PyTorch)

Which is really odd. Does this mean each sample is actually larger, in this instance 7.41 / 4 ≈ 1.85 GiB?

Yes, it means that not only the data samples themselves use memory, but also all the intermediate activations, which grow with the batch size.
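One way to see this is to check torch.cuda.memory_allocated() before and after the forward pass (a minimal sketch; the model and shapes are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1),
    nn.ReLU(),
).cuda()

x = torch.randn(4, 3, 512, 512, device='cuda')

before = torch.cuda.memory_allocated()
out = model(x)  # intermediate activations are kept for the backward pass
after = torch.cuda.memory_allocated()

print((after - before) / 1024**2, "MB taken by the forward pass (roughly)")
```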

I see, ok. So one way is to reduce the size of the tensors that are in the object. I keep a tensor that is used to index another tensor in the object (e.g. A[b:]). In this case, does b have to be of type LongTensor, or can I reduce it to a smaller integer type?