Hello
Among the parameters available in DDP, what does “gradient_as_bucket_view” do?
When I set this parameter to True, GPU memory usage goes down. What is the reason?
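Setting gradient_as_bucket_view=True lets DDP's internal implementation avoid a separate copy of each parameter gradient: the gradients become views into DDP's allreduce communication buckets instead of standalone tensors, thereby reducing memory.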
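To make the copy-versus-view difference concrete, here is a toy model in plain Python. It is only an analogy and not DDP's actual implementation: `array` and `memoryview` stand in for tensors, and the helper names `copy_mode` and `view_mode` are hypothetical.

```python
from array import array


def copy_mode(grad_sizes):
    """Toy model of the default: gradients and bucket are separate buffers."""
    grads = [array("f", [1.0] * n) for n in grad_sizes]
    bucket = array("f", [0.0] * sum(grad_sizes))
    # Every gradient is copied into the flat bucket before communication.
    offset = 0
    for g in grads:
        bucket[offset:offset + len(g)] = g
        offset += len(g)
    # Storage is held twice: once for the gradients, once for the bucket.
    elements_allocated = sum(len(g) for g in grads) + len(bucket)
    return grads, bucket, elements_allocated


def view_mode(grad_sizes):
    """Toy model of the flag set to True: gradients are slices of the bucket."""
    bucket = array("f", [0.0] * sum(grad_sizes))
    flat = memoryview(bucket)
    grads = []
    offset = 0
    for n in grad_sizes:
        grads.append(flat[offset:offset + n])  # a view, no new storage
        offset += n
    # "Backward pass": writing a gradient writes straight into the bucket.
    for g in grads:
        for i in range(len(g)):
            g[i] = 1.0
    elements_allocated = len(bucket)
    return grads, bucket, elements_allocated


_, _, copy_elems = copy_mode([3, 2])
_, _, view_elems = view_mode([3, 2])
print(copy_elems, view_elems)
```

In this toy model the view variant allocates half as many elements, because the only storage is the bucket itself. In real code the flag is passed to the constructor, for example `DistributedDataParallel(model, gradient_as_bucket_view=True)`.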
Oh, thanks.
As a follow-up question, why does DDP need to copy gradients in the first place?