Is it possible to keep a chunk of contiguous GPU memory (e.g. 20 GB) in DDP mode for gradient synchronization?

My training code produces a lot of memory fragmentation. As a result, in DDP mode it usually fails to allocate 15.00 GB of memory in the backward stage.

Setting gradient_as_bucket_view=True works well. However, when I use gradient accumulation, this approach no longer works.
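For reference, my DDP setup is roughly the following (a minimal sketch with a toy model; the model, data, and hyperparameters are placeholders, and the script is launched with torchrun):

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched with torchrun, which sets RANK / WORLD_SIZE / LOCAL_RANK / MASTER_* env vars.
dist.init_process_group(backend="nccl")
device = torch.device("cuda", int(os.environ["LOCAL_RANK"]))
torch.cuda.set_device(device)

model = nn.Linear(1024, 1024).to(device)  # placeholder for the real model
# gradient_as_bucket_view=True lets param.grad point into the reducer's bucket
# buffers instead of keeping a second full copy of the gradients.
ddp_model = DDP(model, device_ids=[device.index], gradient_as_bucket_view=True)
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 1024, device=device)  # stand-in for a real batch
    loss = ddp_model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```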

So I want to create a chunk of contiguous GPU memory for the gradients used in backward. Is there any way to achieve this?
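One idea I am considering is to preallocate a single flat buffer and make every param.grad a view into it, as in the sketch below. This is on a plain single-GPU module, not through DDP's reducer; whether it composes with DDP's bucketing is exactly my question, so treat it only as an illustration of the idea:

```python
import torch
import torch.nn as nn


def flatten_grads(model: nn.Module) -> torch.Tensor:
    """Point every param.grad at a slice of one contiguous buffer.

    Illustrative only: DDP's reducer manages its own bucket buffers, so this
    shows the idea on a plain module rather than being a drop-in DDP fix.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    total = sum(p.numel() for p in params)
    flat = torch.zeros(total, device=params[0].device, dtype=params[0].dtype)
    offset = 0
    for p in params:
        n = p.numel()
        # In the common dense case, autograd accumulates into an existing
        # .grad in place, so backward() writes into this contiguous region.
        p.grad = flat[offset:offset + n].view_as(p)
        offset += n
    return flat


model = nn.Linear(1024, 1024).cuda()
flat_grads = flatten_grads(model)
loss = model(torch.randn(32, 1024, device="cuda")).sum()
loss.backward()  # gradients land inside flat_grads
print(flat_grads.is_contiguous(), flat_grads.abs().sum())
```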

Could you describe why gradient accumulation fails?

My guess is that with gradient_as_bucket_view=True, the backward pass conflicts with how the bucket views work, because gradient accumulation runs backward multiple times before each optimizer step.
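To be concrete, the accumulation loop I am describing follows the usual no_sync() pattern, so only the last backward in each window runs the allreduce. This is a simplified sketch; the function name, batches iterable, and accum_steps are placeholders, and ddp_model / optimizer are built as in the earlier snippet:

```python
from contextlib import nullcontext

from torch.nn.parallel import DistributedDataParallel as DDP


def train_with_accumulation(ddp_model: DDP, optimizer, batches, accum_steps: int = 4):
    """Gradient accumulation with DDP: several backwards per optimizer step."""
    for step, x in enumerate(batches):
        is_last = (step + 1) % accum_steps == 0
        # Skip the allreduce on intermediate backwards; only the last backward
        # in each window synchronizes gradients across ranks.
        ctx = nullcontext() if is_last else ddp_model.no_sync()
        with ctx:
            loss = ddp_model(x).sum() / accum_steps
            loss.backward()
        if is_last:
            optimizer.step()
            optimizer.zero_grad()
```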

IIUC, gradient_as_bucket_view is meant to address exactly this kind of fragmentation. Is there any reason why you don't want to use gradient_as_bucket_view?

cc: @fegin