Distributed Data Parallel with Multiple Losses

Hi,

I am using DistributedDataParallel with the NCCL backend. I have two losses that are averaged before calling backward, but backward fails. Here is the problematic part of the code:

self.inputs['qa_in'][i] = Variable(self.inputs['qa_in'][i].data, requires_grad=True)

self.outputs['qa_outputs'][i] = self.qa_outputs(self.inputs['qa_in'][i])

start_logits, end_logits = self.outputs['qa_outputs'][i].split(1, dim=-1)
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)

ignored_index = start_logits.size(1)

start_positions_ubatches[i].clamp_(0, ignored_index)
end_positions_ubatches[i].clamp_(0, ignored_index)

loss_fct = CrossEntropyLoss(ignore_index=ignored_index)

start_loss = loss_fct(start_logits, start_positions_ubatches[i])
end_loss = loss_fct(end_logits, end_positions_ubatches[i])

self.outputs['loss_out'][i] = (start_loss + end_loss) / 2
self.outputs['loss_out'][i].backward(retain_graph=True)

and I get the following error:

File "/home/suncast/venv3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by use of a module parameter outside the forward function. The return value of the forward function is inspected by the distributed data parallel wrapper to figure out if any of the module’s parameters went unused. If this is the case, it knows they won’t receive gradients in a backward pass. If any of those parameters are then used outside forward, this error condition is triggered. You can disable unused parameter detection by passing the keyword argument find_unused_parameters=False to torch.nn.parallel.DistributedDataParallel. (mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:342)

I think the problem is the self.qa_outputs parameters are used twice in backward but I don’t know how to solve this. I don’t have any problem without distributed.

Have you tried disabling unused parameter detection by passing find_unused_parameters=False to torch.nn.parallel.DistributedDataParallel?
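For reference, that flag is passed when wrapping the model in DDP. Below is a minimal single-process sketch (gloo backend on CPU so it runs without multiple GPUs; the Linear layer just stands in for your real model):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so the DDP wrapper can be constructed (gloo, CPU)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group("gloo", rank=0, world_size=1)

local_model = torch.nn.Linear(10, 2)  # stands in for your full model
ddp_model = DDP(local_model, find_unused_parameters=False)
```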

Yes. I get the following error when I set it to False:

File "/home/suncast/venv3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: has_marked_unused_parameters_ INTERNAL ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:290, please report a bug to PyTorch.

Does anyone know how to solve this?

Hey @maralm

From your post, it is unclear which part is the DDP model. My assumption is that:

  1. self.inputs['qa_in'][i]: this is input to DDP forward
  2. self.qa_outputs: this is your DDP model
  3. self.outputs['qa_outputs'][i]: this is your DDP outputs

I think the problem is the self.qa_outputs parameters are used twice in backward but I don’t know how to solve this. I don’t have any problem without distributed.

This should be fine; the autograd engine should be able to manage the backward inputs and dependencies from start_loss and end_loss properly.
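To illustrate (outside DDP): a parameter can feed two losses, and a single backward on their average accumulates gradients from both paths. A sketch with assumed shapes, where qa_outputs plays the role of self.qa_outputs:

```python
import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
qa_outputs = torch.nn.Linear(8, 2)       # stands in for self.qa_outputs
hidden = torch.randn(4, 5, 8)            # (batch, seq_len, hidden) -- assumed shapes
logits = qa_outputs(hidden)              # (4, 5, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)  # (4, 5): one score per sequence position
end_logits = end_logits.squeeze(-1)

loss_fct = CrossEntropyLoss()
start_positions = torch.randint(0, 5, (4,))
end_positions = torch.randint(0, 5, (4,))

# Both losses share qa_outputs' parameters; one backward covers both paths.
loss = (loss_fct(start_logits, start_positions)
        + loss_fct(end_logits, end_positions)) / 2
loss.backward()
```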

Two questions:

  1. Does it work if you directly call self.outputs['qa_outputs'][i].sum().backward() after line 3?
  2. Do any of the model parameters or outputs participate in other forward/backward passes?

It would be very helpful for debugging if you could share a minimal repro example. Since we don't know what happens outside of the posted code snippet, we can only make assumptions.

Hi @mrshenli,

Thanks for your reply.
Your assumption is correct and self.qa_outputs is just a linear layer.

Regarding your questions:

  1. No, it doesn't work with that.
  2. No. I am trying to run just the forward for one layer and compute the backward with autograd.backward on that layer, instead of running loss.backward().

Basically, I have a large model, and when I run forward and backward the conventional way (loss.backward()), it works fine. But I have a new implementation that runs backward layer by layer using autograd.backward. With that, the algorithm works fine on a single GPU, but I face this error in distributed mode. I tried it on a different model that doesn't have multiple losses, and it is fine. The error only appears when I add multiple losses.
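For readers unfamiliar with this pattern: below is a rough sketch of layer-by-layer backward on a plain (non-DDP) model, which is essentially what the detached-input trick in the snippet above is doing. The two Linear layers are placeholders, not the actual model:

```python
import torch

# Two "layers" whose backward passes are driven separately
layer1 = torch.nn.Linear(8, 8)
layer2 = torch.nn.Linear(8, 2)

x = torch.randn(4, 8)
h = layer1(x)
h_detached = h.detach().requires_grad_(True)  # cut the graph between layers
out = layer2(h_detached)
loss = out.sum()

# Backward through the last layer only; gradient stops at h_detached
torch.autograd.backward(loss)

# Then propagate that gradient into the earlier layer manually
torch.autograd.backward(h, grad_tensors=h_detached.grad)
```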

I see. DDP does not support this case yet. Currently, all outputs you get from DistributedDataParallel.forward() must participate in the same backward pass; otherwise, it would corrupt DDP's internal communication state. I hope this note can help explain it: https://pytorch.org/docs/master/notes/ddp.html#internal-design
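So the supported pattern is to fold everything derived from one DDP forward into a single loss and call backward exactly once per iteration. A single-process sketch for illustration (gloo on CPU; the Linear layer is a stand-in):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
if not dist.is_initialized():
    dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 2))
out = model(torch.randn(4, 8))                 # one DDP forward
start_logits, end_logits = out.split(1, dim=-1)

# Everything derived from this forward feeds ONE loss and ONE backward,
# so all of DDP's gradient buckets are marked ready exactly once.
loss = start_logits.sum() + end_logits.sum()
loss.backward()
```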

I tried it on a different model that doesn't have multiple losses, and it is fine. The error only appears when I add multiple losses.

I might have misunderstood the use case. Adding up multiple losses should work, and that is different from running layer-by-layer backward, right? Would I be correct to assume the code snippet you shared above adds two losses together, rather than doing layer-by-layer backward?

It would be helpful if you could share a minimal repro for this error. Thanks!

No, this is the same issue.
To simplify: assume I want to find the gradients for only the last layer of the network, which consists of a linear classifier and a loss (using autograd.backward()). If I use a linear layer with a single loss, DDP works with autograd, but when I add two losses, it gives that error.

Did you solve this problem? I have run into the same problem.


Same problem here. Any solutions?