My training code produces a lot of GPU memory fragmentation. As a result, in DDP mode it usually fails in the backward stage with an error saying it cannot allocate 15.00 GB of memory.
Setting gradient_as_bucket_view=True fixed this and training works well. However, when I use gradient accumulation, this workaround no longer helps.
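For reference, my accumulation loop looks roughly like the sketch below (simplified; `model`, `dataloader`, `optimizer`, `loss_fn`, and `local_rank` are placeholders for my actual objects):

```python
import contextlib
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# DDP wrapper with gradients aliasing the reducer buckets
ddp_model = DDP(
    model,                          # placeholder for my actual model
    device_ids=[local_rank],
    gradient_as_bucket_view=True,
)

accum_steps = 4
optimizer.zero_grad(set_to_none=True)

for step, (inputs, targets) in enumerate(dataloader):
    is_sync_step = (step + 1) % accum_steps == 0
    # skip the all-reduce on intermediate accumulation steps
    ctx = contextlib.nullcontext() if is_sync_step else ddp_model.no_sync()
    with ctx:
        loss = loss_fn(ddp_model(inputs), targets) / accum_steps
        loss.backward()
    if is_sync_step:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```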
So I would like to allocate one contiguous chunk of GPU memory for the gradients used in backward. Is there any way to achieve this?
My guess is that, with gradient_as_bucket_view=True, the backward pass conflicts with the bucket-view mechanism, because gradient accumulation requires multiple backward passes.