A huge difference in memory usage on different GPUs

Hi all,

I have a question about memory consumption on different GPUs. I implemented a model containing convolution layers and an LSTM, and I train it both on the GPU in my workstation and on the GPU in a server. However, it consumes different amounts of memory on the two GPUs, which confuses me.

The GPU in my workstation is a GeForce GTX. When the model runs on this GPU, it takes 5286MB in total. However, when it runs on the TITAN X on the server, it takes up to 7539MB. That is a huge difference.

In particular, when it runs on the TITAN X, it consumes only 5315MB after the first backward pass, but 7539MB from the second backward pass onward. I do not understand why the second backward pass consumes so much more memory. Since the memory usage stays constant after the second backward pass, I cannot conclude that there is a memory leak.

Can anyone share any thoughts about this situation?


Do you have identical CUDA, cuDNN and PyTorch versions?

Yes, they are the same.

If the numbers you mentioned were observed via nvidia-smi, they are not an accurate depiction of the actual memory usage, since PyTorch uses a caching allocator (http://pytorch.org/docs/0.3.0/notes/cuda.html#memory-management). Moreover, cuDNN may choose different algorithms on different architectures. Since your model contains conv layers and an LSTM, it uses cuDNN heavily.
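To illustrate the point about the caching allocator, here is a minimal sketch of how to compare PyTorch's own counters with what nvidia-smi reports. It assumes a CUDA-enabled PyTorch build and is guarded so it also runs without one; the helper names (`to_mib`, `report_gpu_memory`) are illustrative, not from the thread.

```python
def to_mib(num_bytes):
    """Convert a byte count to MiB, the unit nvidia-smi displays."""
    return num_bytes / (1024 ** 2)

def report_gpu_memory():
    """Return (allocated_MiB, reserved_MiB), or None if CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated()  # bytes actually held by live tensors
    # memory_reserved() is the caching allocator's total; on older PyTorch
    # versions this counter was called memory_cached() instead.
    reserved = torch.cuda.memory_reserved()
    # nvidia-smi shows reserved plus the CUDA context overhead, so its number
    # is always at least as large as `reserved`.
    return to_mib(allocated), to_mib(reserved)

print(to_mib(5286 * 1024 ** 2))  # → 5286.0
```

The takeaway: only `memory_allocated()` reflects tensors your model actually holds; the gap up to the nvidia-smi figure is cache plus context, and both can differ between GPU architectures.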

Thank you for the explanation.

I got all the numbers from nvidia-smi. I have tried torch.cuda.empty_cache() to free all unused cached memory, and I get the same result as I stated before.

I know that memory allocation differs across architectures, but can that account for such a huge difference in memory usage, i.e. 5286MB vs. 7539MB, between the two GPUs?

Also, I do not understand why the model consumes much more memory from the second backward pass onward on the TITAN X, while on the GeForce GTX it consumes the same amount of memory on every backward pass.

I have the same problem.

I have the same problem. When I load the same transformer model (RoBERTa-Large) on a Titan X and a V100, I get vastly different results:

  • Gradient checkpointing on, fp16 off, Titan X: 8034MB
  • Gradient checkpointing on, fp16 off, V100: 10861MB
  • Gradient checkpointing on, fp16 on, V100: 10613MB

I tried with the same CUDA/cuDNN versions on both machines (CUDA 10.1/10.2, and cuDNN 7605/7603). I also tried both PyTorch 1.6 and 1.3. It seems there are some Volta-specific kernels that consume more memory on the V100? Does anyone know how to reduce GPU memory consumption on the Volta cards? I don’t mind sacrificing some speed in my use case.

You could try disabling cuDNN via torch.backends.cudnn.enabled = False at the beginning of your script in order to use the PyTorch-native implementations.
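A minimal sketch of that suggestion: the flag must be set before any cuDNN-backed op runs, so it belongs right after the import. Guarded here so the snippet also runs on a CPU-only install (the `cudnn_disabled` variable is just for illustration).

```python
try:
    import torch

    # Disable cuDNN globally; subsequent conv/LSTM ops fall back to
    # PyTorch-native kernels. Must run before the model executes anything.
    torch.backends.cudnn.enabled = False
    cudnn_disabled = not torch.backends.cudnn.enabled
except ImportError:
    cudnn_disabled = True  # torch not installed here; nothing to disable

print(cudnn_disabled)  # → True
```

Expect native kernels to be slower for convolutions and LSTMs, but their memory behavior is the same across GPU architectures, which makes this a useful diagnostic.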

I tried it right after import torch; the GPU memory consumption is exactly the same (digit for digit). Any other advice?

In that case your initial code doesn’t seem to have used architecture-specific kernels.
You could trade compute for memory by using e.g. torch.utils.checkpoint or by lowering the batch size, if this fits your use case.
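As a concrete sketch of the torch.utils.checkpoint suggestion: the checkpointed segment does not store its intermediate activations during forward; they are recomputed during backward, trading compute for memory. The small `head` module and tensor shapes below are illustrative, not from the thread.

```python
try:
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    # A toy module standing in for any memory-hungry segment of a model.
    head = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
    x = torch.randn(4, 8, requires_grad=True)

    # Recent PyTorch versions recommend use_reentrant=False; fall back for
    # older versions that predate the keyword.
    try:
        out = checkpoint(head, x, use_reentrant=False)
    except TypeError:
        out = checkpoint(head, x)

    out.sum().backward()  # activations inside `head` are recomputed here
    grads_ok = x.grad is not None and x.grad.shape == x.shape
except ImportError:
    grads_ok = True  # torch unavailable in this environment; skip the demo

print(grads_ok)
```

Note that at least one input to the checkpointed function must require grad, otherwise the backward recomputation has nothing to attach to.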

Thanks for the quick reply!

My batch size is already 1, so no luck there. torch.utils.checkpoint is interesting! My model consists of transformers and classification heads on top. Transformers are already using gradient_checkpointing, but I’ll try to checkpoint the classification heads now!

Thanks for the tip :smiley:

A quick question about torch.utils.checkpoint.

My classification head is:

class ClassificationHead(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, inputs):
        ...  # body omitted in the original post

I use it in another bigger model:

sf_logits = checkpoint(self.sf_classifier, sequence_output)

But I’m getting an error: “TypeError: ‘module’ object is not callable”.

The documentation doesn’t say what type the function to be passed should be.

Nevermind, a stupid mistake on my part ^^’

I was doing:

from torch.utils import checkpoint

instead of:

from torch.utils.checkpoint import checkpoint

It was a good try, but it only saved 5MB in my case.

I still appreciate it, thanks!