I’m trying to train a GNN using pytorch_geometric. I already know my data is rather large, but CUDA tries to allocate 107.80 GiB during training!
Currently I only have one 16 GiB GPU available, and my batches are around 8 GiB each (measured with sys.getsizeof(tensor.storage())). I assumed that PyTorch would simply iterate over the batches, somehow recognise that there is only space for one batch, and therefore load only one batch at a time.
Now, I could trim down my graph and reduce the memory footprint of the data, but I would prefer not to.
Also, I’m getting a ValueError, but it occurs while handling the CUDA exception, so it can probably be ignored: ValueError: Encountered a CUDA error. Please ensure that all indices in 'edge_index' point to valid indices in the interval [0, 951740) in your node feature matrix and try again.
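As a side note, sys.getsizeof on a storage object reports the size of the Python wrapper, not necessarily the underlying buffer, so it can be misleading. A common way to measure a tensor’s actual memory is element_size() times nelement(); a minimal sketch (the helper name is my own):

```python
import torch

def tensor_mem_bytes(t: torch.Tensor) -> int:
    # bytes per element times number of elements = size of the data buffer
    return t.element_size() * t.nelement()

x = torch.zeros(1024, 1024, dtype=torch.float32)
print(tensor_mem_bytes(x) / 1024**2, "MiB")  # → 4.0 MiB
```

Summing this over all tensors in a batch should give a more trustworthy per-batch figure than getsizeof.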
If I understand correctly, in your case the input (data) tensor alone already costs 8 GiB of memory.
But intermediate tensor values produced during the forward and backward passes are stored as well.
If you haven’t already, you should also account for the memory of these intermediate tensors (kept alive by the computation graph) to get a more accurate measure of the memory being used.
Of course, there are other factors too, due to the CUDA context, optimizer states, etc.
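A rough way to tally those intermediate values is to sum the output size of every leaf module during one forward pass with forward hooks. This is a sketch, not an exact accounting (it ignores gradients, workspace buffers, and tensors autograd saves beyond module outputs), and the model here is a toy stand-in — swap in your own GNN and batch:

```python
import torch
import torch.nn as nn

def activation_bytes(model: nn.Module, *inputs) -> int:
    """Roughly sum the memory of every leaf module's output in one forward pass."""
    total = 0
    hooks = []

    def hook(module, inp, out):
        nonlocal total
        if torch.is_tensor(out):
            total += out.element_size() * out.nelement()

    # hook only leaf modules to avoid double-counting container outputs
    for m in model.modules():
        if len(list(m.children())) == 0:
            hooks.append(m.register_forward_hook(hook))
    model(*inputs)
    for h in hooks:
        h.remove()
    return total

# toy stand-in model; replace with your GNN
model = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 8))
x = torch.randn(32, 64)
print(activation_bytes(model, x))  # → 66560 bytes for this toy model
```

Even as an underestimate, this usually makes it obvious when activations dwarf the input tensor.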
@ptrblck and @ConvolutionalAtom I thank you both for your answers.
Firstly, @ptrblck I’m sorry, I was tired yesterday and it seems I did a bad job articulating my thoughts. I know that the batch size is a fixed value that has to be assigned before training. I already reduced the batch size to 10, which is already too low for my standards. The issue is that I assumed a batch of approximately 8 GiB would fit into the 16 GiB of GPU memory and leave enough space for calculations. Unfortunately that is not the case, since CUDA tries to allocate 107.80 GiB. I don’t understand why PyTorch needs that much GPU memory, tbh.
I understand, and no, I did not consider that the tensors used for computation could be that big. I will figure out how to measure the memory footprint of those tensors, and I hope that will clear up where the 107 GiB come from.
Depending on the model architecture the intermediate activations would need a huge amount of memory as @ConvolutionalAtom explained.
A simple conv layer is a good example, as it often does not reduce the spatial size significantly while increasing the number of channels in the output activation.
An 8 GiB input could thus create an even larger output.
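To make that concrete, here is a quick back-of-the-envelope calculation with a hypothetical same-padding conv layer (the shapes are made up for illustration):

```python
# Hypothetical conv layer: 3 -> 64 channels, stride 1, "same" padding,
# so the spatial size is unchanged while the channel count grows ~21x.
bytes_per_float = 4          # float32
n, h, w = 16, 1024, 1024     # batch size and spatial dims

in_channels, out_channels = 3, 64
input_bytes = n * in_channels * h * w * bytes_per_float
output_bytes = n * out_channels * h * w * bytes_per_float

print(input_bytes / 1024**3, "GiB in")    # → 0.1875 GiB in
print(output_bytes / 1024**3, "GiB out")  # → 4.0 GiB out
```

The output activation alone is 64/3 ≈ 21 times the input, and every such layer in the network adds its own activation that is kept around for the backward pass.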
Take a look at this post which estimates the memory usage of a resnet.
Unfortunately I couldn’t get the code in the post to run properly. The forward act is always 0, but I think I get what you intended to show with the post → the memory usage of the calculations should not be underestimated.
I will trim down my graphs and look into multi-GPU training; I hope that will solve my issue.