My training code produces a lot of GPU memory fragmentation. As a result, in DDP mode it usually fails with an error that it cannot allocate 15.00 GB of memory during the backward pass.
Setting gradient_as_bucket_view=True works well. However, when I use gradient accumulation, this approach no longer helps.
So I would like to allocate one contiguous chunk of GPU memory for the gradients used in the backward pass. Is there any way to achieve this?
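For context, this is roughly the setup I am describing (a minimal sketch, not my actual code: the model, optimizer, and rank handling are placeholders, and the process group is assumed to be initialized by the launcher):

```python
import os

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes the process group was already initialized by the launcher
# (torch.distributed.init_process_group) and LOCAL_RANK is set.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Placeholder for my real network.
model = nn.Linear(1024, 1024).cuda(local_rank)

# With gradient_as_bucket_view=True, each parameter's .grad tensor is a
# view into DDP's flat communication buckets, so the gradients live in a
# few large contiguous allocations instead of many small ones.
ddp_model = DDP(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,
)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
```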
My guess is that, with gradient_as_bucket_view=True, the backward pass conflicts with the bucket-view mechanism, because gradient accumulation requires multiple backward passes.
Sorry for the spelling error.
What I mean is that, with gradient_as_bucket_view=True, running multiple backward passes may cause gradient_as_bucket_view to stop being effective.
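To be concrete, my accumulation loop looks roughly like the sketch below (it continues from the snippet above; accum_steps, loss_fn, and loader are placeholders for my real values, and no_sync() is used to skip the all-reduce on intermediate micro-batches):

```python
import torch.nn.functional as F

accum_steps = 4        # placeholder value
loss_fn = F.mse_loss   # placeholder for my real loss

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):  # loader: placeholder DataLoader
    inputs = inputs.cuda(local_rank, non_blocking=True)
    targets = targets.cuda(local_rank, non_blocking=True)

    if (step + 1) % accum_steps == 0:
        # Last micro-batch: this backward triggers the all-reduce into the buckets.
        loss = loss_fn(ddp_model(inputs), targets) / accum_steps
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    else:
        # Intermediate micro-batches: skip communication; gradients just
        # accumulate locally in .grad (which are bucket views here).
        with ddp_model.no_sync():
            loss = loss_fn(ddp_model(inputs), targets) / accum_steps
            loss.backward()
```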
I ran three experiments:
1. With gradient_as_bucket_view=False, GPU memory usage is unstable.
2. With gradient_as_bucket_view=True and no gradient accumulation, GPU memory usage is stable.
3. With gradient_as_bucket_view=True and gradient accumulation, GPU memory usage is unstable.
With gradient accumulation, there are at least two copies of the gradients. Maybe that is why gradient_as_bucket_view does not help here.
Thank you all for your help. I have solved the problem. Gradient accumulation occupies some extra GPU memory; after slightly reducing the memory consumption, training works.
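In case it helps anyone hitting the same error: one way to see where the extra memory goes is to check the caching allocator's statistics around the accumulation steps. A rough sketch (not exactly what I ran, just the standard torch.cuda inspection calls):

```python
import torch

# Detailed breakdown of allocated vs. reserved memory from the caching
# allocator, which shows how much reserved memory is fragmented.
print(torch.cuda.memory_summary(device=0))

# Coarser counters, handy for logging before and after each backward pass.
allocated = torch.cuda.memory_allocated(0) / 2**30
reserved = torch.cuda.memory_reserved(0) / 2**30
print(f"allocated: {allocated:.2f} GiB, reserved: {reserved:.2f} GiB")
```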