I am training a large model using DDP with a single batch on 4 GPUs on a single node. I am trying to reduce GPU memory usage as much as possible and have been setting
gradient_as_bucket_view=True to reduce CUDA memory overhead. However, with
gradient_as_bucket_view set to either True or
False, I am seeing about 2 GB of additional overhead per card.
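For reference, this is roughly how I am constructing the DDP wrapper (a minimal single-process sketch using the gloo backend on CPU so it runs standalone; my actual run uses nccl with one process per GPU, and the small Linear stands in for the real model):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group just to exercise the argument;
# the real job launches one process per GPU with the nccl backend.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(16, 4)  # placeholder for the large model
# With gradient_as_bucket_view=True, DDP is supposed to make param.grad
# a view into the communication bucket instead of a separate allocation.
ddp_model = DDP(model, gradient_as_bucket_view=True)

loss = ddp_model(torch.randn(2, 16)).sum()
loss.backward()
print(all(p.grad is not None for p in ddp_model.parameters()))

dist.destroy_process_group()
```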
I am also not sure if this is related, but PyCharm flags the
gradient_as_bucket_view argument as unexpected, and I am not sure why that is.
I am running torch==1.10.1+cu113 installed through pip (I cannot use conda since this is on an HPC cluster).