I have a question about memory consumption on different GPUs. I implemented a model containing convolution layers and an LSTM, and trained it on both the GPU in my workstation and the GPU on the server. However, it consumes different amounts of memory on the two GPUs, which confuses me.
The GPU in my workstation is a GeForce GTX. When the model runs on this GPU, it takes 5286MB in total. However, when it runs on the TITAN X on the server, it takes up to 7539MB. That is a huge difference.
In particular, on the TITAN X it consumes only 5315MB after the first backward pass, but 7539MB from the second backward pass onward. I do not understand why the second backward pass consumes so much more memory. Since the usage stays constant after the second backward pass, I do not think there is a memory leak.
Can anyone share any thoughts on this situation?
If the numbers you mentioned are observed via nvidia-smi, they are not an accurate depiction of the actual memory usage, since PyTorch uses a caching allocator: http://pytorch.org/docs/0.3.0/notes/cuda.html#memory-management. Moreover, cuDNN may choose different algorithms based on different architectures. Since your model contains conv layers and an LSTM, it uses cuDNN heavily.
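To see how far the nvidia-smi number is from what your tensors actually occupy, you can compare PyTorch's own counters against it. A minimal sketch (note: `memory_reserved` is the name in recent PyTorch; older releases called it `memory_cached`):

```python
import torch

def report_memory(tag):
    # Memory occupied by live tensors.
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    # Memory held by PyTorch's caching allocator, including cached but
    # currently unused blocks. nvidia-smi reports this plus the CUDA
    # context and any cuDNN workspaces, so it always overstates what
    # your tensors actually need.
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: allocated {allocated:.0f}MB, reserved {reserved:.0f}MB")

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")
    report_memory("after allocation")
    del x
    torch.cuda.empty_cache()  # return cached blocks to the driver
    report_memory("after empty_cache")
```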
I get all the numbers from nvidia-smi. I have tried using torch.cuda.empty_cache() to free all unused cached memory, and I get the same result as stated before.
I know that memory allocation differs across architectures, but can that account for such a large difference in memory usage, i.e. 5286MB vs. 7539MB, between the two GPUs?
Also, I do not understand why it consumes much more memory from the second backward pass onward on the TITAN X, while on the GeForce GTX it consumes the same memory regardless of which backward pass it is.
Gradient checkpointing on, fp16 on, V100: 10613MB
I tried with the same CUDA/cuDNN on both machines (tried CUDA 10.1/10.2 and cuDNN 7605/7603). I also tried both PyTorch 1.6 and 1.3. It seems there are some Volta-specific kernels that consume more memory on the V100? Does anyone know how to reduce GPU memory consumption on the Volta cards? I don’t mind sacrificing some speed in my use case.
In that case your initial code doesn’t seem to have used architecture-specific kernels.
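Since cuDNN selects its convolution algorithms per architecture, and some algorithms need large workspaces, one experiment worth trying is to constrain that choice. This is only a sketch with no guarantee it helps in your particular case, and it may cost speed:

```python
import torch

# Don't benchmark multiple conv algorithms at runtime; benchmark mode
# can settle on faster algorithms that use larger cuDNN workspaces.
torch.backends.cudnn.benchmark = False

# Force deterministic algorithms, further narrowing the candidate set.
torch.backends.cudnn.deterministic = True
```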
You could trade compute for memory by using e.g. torch.utils.checkpoint, or by lowering the batch size if that fits your use case.
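As an illustration, here is a minimal torch.utils.checkpoint sketch. The model is a made-up stand-in, and the `use_reentrant=False` argument is the recommended mode in recent PyTorch (it did not exist yet in 1.3/1.6):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Hypothetical stand-in for a classification head.
head = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 2),
)

x = torch.randn(4, 128, requires_grad=True)

# checkpoint() frees the intermediate activations during the forward
# pass and recomputes them during backward, trading compute for memory.
out = checkpoint(head, x, use_reentrant=False)
out.sum().backward()
```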
My batch size is already 1, so no luck there. torch.utils.checkpoint is interesting! My model consists of transformers with classification heads on top. The transformers already use gradient_checkpointing, but I’ll try checkpointing the classification heads now!