Hi, I have some problems using torch.nn.parallel.DistributedDataParallel (DDP) together with torch.utils.checkpoint.
Everything is OK if I set find_unused_parameters=False in DDP. The dilemma is that my network is a dynamic CNN that does not forward through the whole model on every training step, so I have to set find_unused_parameters=True. And if I don't use torch.utils.checkpoint, my network is too large to fit in GPU memory and I hit an OOM error.
So what should I do to satisfy both requirements at once?
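For reference, here is a minimal sketch of the kind of setup I mean. The toy network below is just a placeholder for my much larger dynamic CNN, and I assume a single node launched with torchrun; the names are made up for illustration:

```python
# Launch with: torchrun --nproc_per_node=2 repro.py
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint


class DynamicCNN(nn.Module):
    """Toy dynamic network: only one branch runs per forward pass,
    so the other branch's parameters get no gradient that step."""

    def __init__(self):
        super().__init__()
        self.branch_a = nn.Conv2d(3, 16, 3, padding=1)
        self.branch_b = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 1, 1)

    def forward(self, x, use_a):
        branch = self.branch_a if use_a else self.branch_b
        # Checkpoint the branch to save activation memory; this is the
        # part that seems to conflict with find_unused_parameters=True.
        out = checkpoint(branch, x)
        return self.head(out)


def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()  # single node assumed, so rank == local rank
    torch.cuda.set_device(rank)

    model = DynamicCNN().cuda(rank)
    # find_unused_parameters=True is required because one branch
    # is skipped on every iteration.
    model = DDP(model, device_ids=[rank], find_unused_parameters=True)

    # The input must require grad so the (reentrant) checkpoint
    # actually builds an autograd graph through the branch.
    x = torch.randn(2, 3, 32, 32, device=f"cuda:{rank}", requires_grad=True)
    for step in range(4):
        model.zero_grad()
        out = model(x, use_a=(step % 2 == 0))
        out.mean().backward()  # the error is raised here during backward


if __name__ == "__main__":
    main()
```

With find_unused_parameters=False (and a static network) the same kind of script runs fine for me; it is only the combination of the unused-parameter search and checkpointing that fails during backward.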
There are some links related to this question, but they do not solve my problem.
Part of the error report:
Thanks in advance for any suggestions!