How to use `torch.nn.parallel.DistributedDataParallel` and `torch.utils.checkpoint` together

Hi, I have some problems using torch.nn.parallel.DistributedDataParallel (DDP) and torch.utils.checkpoint together.

It is OK if I set find_unused_parameters=False in DDP. The dilemma is that my network is a dynamic CNN that does not forward through the whole model during training, which means I have to set find_unused_parameters=True… And if I don't use torch.utils.checkpoint, my network is too large to run, which leads to an OOM error…
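To make the setup concrete, here is a rough, hypothetical sketch of the kind of dynamic model I mean (all names are illustrative; `use_reentrant` assumes a recent PyTorch):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class DynamicCNN(nn.Module):
    """Illustrative only: two heavy branches, but each forward uses just one."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Conv2d(3, 32, 3, padding=1)
        self.branch_a = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.branch_b = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())

    def forward(self, x):
        x = self.stem(x)
        # Only one branch runs per iteration, so the other branch's parameters
        # are unused -> DDP would need find_unused_parameters=True.
        branch = self.branch_a if x.mean() > 0 else self.branch_b
        # The heavy branch is checkpointed to avoid OOM: its activations are
        # recomputed during backward instead of being stored.
        return checkpoint(branch, x, use_reentrant=True)

model = DynamicCNN()
out = model(torch.randn(2, 3, 8, 8))
out.sum().backward()   # works locally; the trouble starts once DDP is added
```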

What should I do to meet both requirements?

There are some links related to this question, but they do not solve my problem:

  1. How to use torch.utils.checkpoint and DistributedDataParallel together · Issue #43135 · pytorch/pytorch · GitHub
  2. Using `torch.utils.checkpoint.checkpoint_sequential` and `torch.autograd.grad` breaks when used in combination with `DistributedDataParallel` · Issue #24005 · pytorch/pytorch · GitHub

Part of the error report:

Thanks in advance for any suggestions!

DDP does not work with torch.utils.checkpoint yet. One workaround is to run the forward and backward passes on the local model, and then manually run all_reduce to synchronize gradients after the backward pass.
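A minimal sketch of that workaround, assuming the process group has already been initialized and `model` is the plain local module (not wrapped in DDP); `allreduce_gradients` is just an illustrative helper name:

```python
import torch
import torch.distributed as dist

def allreduce_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all ranks; call this after loss.backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)
```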


It is OK when find_unused_parameters=False, so I guess a manually defined checkpoint might solve my problem?

Thus, I modified checkpoint to save a variable indicating which part of the model to forward, and to recompute the output in the backward pass using this variable.

But this creates a new problem (`not enough values to unpack` in backward)…

The reason find_unused_parameters=True does not work is that, when it is set, DDP tries to traverse the autograd graph from the output at the end of the forward pass. However, with checkpoint, the autograd graph of the checkpointed segment is only reconstructed during the backward pass, so it is not available when DDP traverses the graph. DDP therefore concludes that those unreachable parameters were not used in the forward pass (although they are merely hidden by checkpoint). When find_unused_parameters=False, DDP skips the traversal and expects that all parameters are used and that the autograd engine computes the gradient for each parameter exactly once.
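To illustrate the mechanism, here is a minimal, hypothetical repro sketch (names are made up; it assumes launching with torchrun and a PyTorch version where checkpoint accepts `use_reentrant`):

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(10, 10)
        self.block2 = nn.Linear(10, 10)

    def forward(self, x):
        # block1's autograd graph is discarded after forward and only rebuilt
        # during backward, so DDP's end-of-forward traversal cannot reach
        # block1's parameters.
        x = checkpoint(self.block1, x, use_reentrant=True)
        return self.block2(x)

def main():
    dist.init_process_group("gloo")  # e.g. launched via torchrun; CPU is enough
    model = DDP(Net(), find_unused_parameters=True)
    out = model(torch.randn(4, 10, requires_grad=True))
    # DDP has already marked block1's parameters as unused (i.e. "ready");
    # the recomputation during backward fires their hooks a second time,
    # raising an error such as "Expected to mark a variable ready only once".
    out.sum().backward()

if __name__ == "__main__":
    main()
```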

But this creates a new problem (`not enough values to unpack` in backward)…

Could you please elaborate more on this problem?


If so, then the new problem is caused by my own code. I will try to figure it out.

I understand the effect of find_unused_parameters now. Even though my network has some unused parameters, every GPU leaves the same parameters unused in each forward pass. In such a situation, can I still set find_unused_parameters=False?


I fixed my problem in the same way as the FP16Optimizer in mmdetection, which is similar to Apex's delayed all-reduce.

And I realize this solution is just what you said before… Thank you very much!
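For reference, a rough sketch of how the delayed all-reduce fits into a training step (names are illustrative, the process group is assumed to be initialized, and mmcv's implementation additionally coalesces gradients into buckets before reducing, which is omitted here):

```python
import torch
import torch.distributed as dist

def train_step(model, batch, target, criterion, optimizer):
    optimizer.zero_grad()
    loss = criterion(model(batch), target)   # forward on the plain local model
    loss.backward()                          # checkpoint recomputation happens here
    # Delayed all-reduce: synchronize gradients only after backward has finished.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(world_size)
    optimizer.step()
    return loss.item()
```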

But it seems to work no matter what value find_unused_parameters takes. I wonder why it works when find_unused_parameters=True; logically, the traversal should fail.


But it seems to work no matter what value find_unused_parameters takes. I wonder why it works when find_unused_parameters=True; logically, the traversal should fail.

I don't know; I would assume this would cause an autograd hook to fire on an already-ready parameter, which should trigger an error in DDP. By the way, when you manually run all_reduce, DDP is no longer necessary. Is there any reason you are still wrapping the model with DDP?

You are right. Actually, there is no need to use DDP anymore. Thanks!

For me, simply setting find_unused_parameters to False in DistributedDataParallel (DDP) solves the problem. There is no error anymore.
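For completeness, a minimal sketch of that setting (the linear layer is just a stand-in for the real model):

```python
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("gloo")      # e.g. launched via torchrun
model = nn.Linear(10, 10)            # stand-in for the real network
# find_unused_parameters defaults to False; shown explicitly for clarity.
ddp_model = DDP(model, find_unused_parameters=False)
```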

Ref: https://stackoverflow.com/questions/68000761/pytorch-ddp-finding-the-cause-of-expected-to-mark-a-variable-ready-only-once