Gradient_as_bucket_view does not seem to do anything

I am training a large model with DDP, using a single batch on 4 GPUs on a single node. I am trying to reduce GPU memory usage as much as possible and have been setting gradient_as_bucket_view=True to cut the CUDA memory overhead. However, whether I set gradient_as_bucket_view to True or False, I see about 2 GB of additional overhead per card.
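For reference, this is roughly how I am wrapping the model (the nn.Sequential module and the rank handling here are just placeholders for my actual training script):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_ddp_model(rank: int) -> DDP:
    # Assumes torch.distributed.init_process_group(...) has already been called
    # for this rank. The nn.Sequential below is a stand-in for my actual large model.
    model = nn.Sequential(
        nn.Linear(1024, 1024),
        nn.ReLU(),
        nn.Linear(1024, 1024),
    ).to(rank)
    # gradient_as_bucket_view is a keyword argument of DistributedDataParallel,
    # but PyCharm still marks it as unexpected in my environment.
    return DDP(model, device_ids=[rank], gradient_as_bucket_view=True)
```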

I am not sure if this is related, but PyCharm flags gradient_as_bucket_view as an unexpected argument, and I cannot tell why.


I am running torch==1.10.1+cu113 installed through pip (I cannot use conda since this is on an HPC).

What is the 2GB additional overhead in comparison to? Are you comparing to when you don’t use DDP at all?

cc @Yanli_Zhao

It's a lot higher compared to another HPC server I use (which usually sits at around 980 MB of overhead), but that server uses a different CUDA version and, more importantly, an unoptimized build of PyTorch, which is the likely culprit for the difference.

Shouldn't the memory be reduced with gradient_as_bucket_view? Or am I just confused about what the parameter does?

Memory should be reduced when you specify gradient_as_bucket_view=True. After initializing DDP, can you run a few iterations and then look at the memory consumption? Also, nvidia-smi is typically not a good indicator of memory usage because of the caching allocator (see the CUDA semantics page in the PyTorch documentation). I'd suggest tracking this with torch.cuda.memory_allocated or torch.cuda.max_memory_allocated instead.
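Something like this sketch should give a more accurate picture than nvidia-smi (here ddp_model, optimizer, criterion, loader, and rank are placeholders for whatever objects your training script already has):

```python
import torch

def run_and_report_memory(ddp_model, optimizer, criterion, loader, rank, num_steps=5):
    # Run a handful of iterations first, then read the allocator stats.
    ddp_model.train()
    for step, (inputs, targets) in enumerate(loader):
        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs.to(rank)), targets.to(rank))
        loss.backward()
        optimizer.step()
        if step + 1 >= num_steps:
            break
    # memory_allocated only counts tensors that are currently live; nvidia-smi
    # also includes the CUDA context and blocks cached by PyTorch's allocator.
    allocated = torch.cuda.memory_allocated(rank) / 2**20
    peak = torch.cuda.max_memory_allocated(rank) / 2**20
    print(f"[rank {rank}] allocated={allocated:.0f} MiB, peak={peak:.0f} MiB")
```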

OK, I'll try that with 2 GPUs then. Is there a way to reduce the overhead from DDP?

Setting gradient_as_bucket_view=True is the primary way to avoid some of the overhead. Also note that you need to run a few training iterations for the memory savings to be applied.