Is it possible to keep a chunk of contiguous GPU memory (e.g., 20 GB) in DDP mode for gradient synchronization?

My training code produces a lot of memory fragmentation. As a result, in DDP mode it usually reports that it cannot allocate 15.00 GB of memory during the backward pass.

I set gradient_as_bucket_view=True and it works well. However, when I use gradient accumulation, this approach no longer works.
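
For context, I enable the flag when wrapping the model, roughly like this (simplified; `model` and `local_rank` stand in for my actual module and GPU index):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# model and local_rank are placeholders for the actual module and device index.
ddp_model = DDP(
    model.to(local_rank),
    device_ids=[local_rank],
    gradient_as_bucket_view=True,  # param.grad becomes a view into the allreduce buckets
)
```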

So I want to create a chunk of contiguous GPU memory for the gradients in the backward pass. Are there any methods that can achieve this?
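
What I have in mind is roughly the sketch below (untested; `flatten_grads` is just an illustrative name): pre-allocate one contiguous buffer and make every `param.grad` a view into it, so that backward accumulates into that buffer instead of allocating new tensors. I am not sure how this would interact with DDP's own bucketing, which is why I am asking.

```python
import torch

def flatten_grads(params):
    # Allocate one contiguous buffer and point each param.grad at a slice of it,
    # so the backward pass accumulates into pre-allocated memory.
    total = sum(p.numel() for p in params)
    buffer = torch.zeros(total, dtype=params[0].dtype, device=params[0].device)
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = buffer[offset:offset + n].view_as(p)
        offset += n
    return buffer
```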

Could you describe why gradient accumulation fails?

My guess is that, with gradient_as_bucket_view=True, the backward operation conflicts with what gradient_as_bucket_view does, because gradient accumulation requires multiple backward passes.

IIUC, gradient_as_bucket_view is meant to address the fragmentation. Is there any reason why you don't want to use gradient_as_bucket_view?

cc: @fegin

Sorry for the spelling error.
I mean that when I set gradient_as_bucket_view=True, the multiple backward passes may make gradient_as_bucket_view lose its effect.

There are three experiments (see the memory-logging sketch after this list):

  1. GPU memory usage is unstable when gradient_as_bucket_view=False.
  2. If I do not use gradient accumulation, GPU memory usage is stable when gradient_as_bucket_view=True.
  3. But if I use gradient accumulation, GPU memory usage is unstable even when gradient_as_bucket_view=True.
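
(For reference, per-step memory stability can be checked with a small log like the sketch below; `log_gpu_memory` is just an illustrative name.)

```python
import torch

def log_gpu_memory(step: int) -> None:
    # Allocator statistics for the current device; useful for spotting whether
    # usage stays flat across iterations or keeps growing.
    allocated = torch.cuda.memory_allocated() / 2**30  # GiB held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # GiB held by the caching allocator
    print(f"step {step}: allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")
```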

With gradient accumulation, there will be at least two copies of the gradients. Maybe that is why gradient_as_bucket_view does not help.
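
For concreteness, a typical DDP gradient-accumulation loop looks roughly like the sketch below (`dataloader`, `loss_fn`, `optimizer`, `ddp_model`, and `accum_steps` are placeholders). The gradients accumulated across the `no_sync()` iterations have to stay alive in addition to whatever DDP allocates for communication, which could account for the extra copy.

```python
import contextlib

# Minimal sketch of gradient accumulation with DDP; dataloader, loss_fn,
# optimizer, ddp_model, and accum_steps are placeholders.
for step, batch in enumerate(dataloader):
    sync_now = (step + 1) % accum_steps == 0
    # Skip the allreduce on non-sync steps; gradients keep accumulating in param.grad.
    ctx = contextlib.nullcontext() if sync_now else ddp_model.no_sync()
    with ctx:
        loss = loss_fn(ddp_model(batch)) / accum_steps
        loss.backward()
    if sync_now:
        optimizer.step()
        optimizer.zero_grad()
```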

Thank you all for your help. I have solved the problem. The gradient accumulation step occupies some extra GPU memory; after slightly reducing memory consumption, it works.