I am training a large model using DDP with a single batch on 4 GPUs on a single node. I am trying to reduce GPU memory usage as much as possible and have been setting gradient_as_bucket_view=True to reduce CUDA memory overhead. However, whether I set gradient_as_bucket_view to True or False, I see about 2 GB of additional overhead per card.
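For reference, this is roughly how I'm wrapping the model (the toy torch.nn.Linear here is just a stand-in for my actual large model, and I launch with torch.distributed.run):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # env vars come from torch.distributed.run
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Toy stand-in for the actual large model
model = torch.nn.Linear(1024, 1024).cuda(local_rank)
model = DDP(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,  # grads should become views into the reducer buckets
)
```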
I am also not sure if this is related, but PyCharm flags the gradient_as_bucket_view argument as an unexpected argument, and I am not sure why.
I am running torch==1.10.1+cu113 installed through pip (I cannot use conda, as this is on an HPC system).
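A quick sanity check of what pip actually installed, in case the build matters here (the values in the comments are what I expect to see):

```python
import torch

print(torch.__version__)          # expecting '1.10.1+cu113'
print(torch.version.cuda)         # expecting '11.3'
print(torch.cuda.is_available())  # should be True on the compute node
```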
It's a lot higher than on another HPC server I use (which usually sits at around 980 MB of overhead), but that server runs a different CUDA version and, more importantly, an unoptimized build of PyTorch, which is the likely culprit for the difference.
Should the overhead not be reduced with gradient_as_bucket_view? Or am I misunderstanding the parameter?
Setting gradient_as_bucket_view=True is the primary way to avoid some of that overhead. Also note that you need to run a few training iterations before the memory savings take effect, since the gradients only become views into the communication buckets after the first iteration.
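As a rough illustration (reusing the model and local_rank from your snippet; the batch shape and number of warm-up steps here are arbitrary), I would only compare memory numbers after a handful of steps:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(5):  # DDP rebuilds its buckets / installs grad views after the first iterations
    optimizer.zero_grad(set_to_none=False)  # keep the bucket-view grads in place
    out = model(torch.randn(8, 1024, device=local_rank))
    out.sum().backward()  # grads are now written directly into the buckets
    optimizer.step()

torch.cuda.synchronize()
print(f"allocated: {torch.cuda.memory_allocated(local_rank) / 2**20:.0f} MiB")
print(f"peak:      {torch.cuda.max_memory_allocated(local_rank) / 2**20:.0f} MiB")
```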