Distributed Data Parallel with Multiple Losses

Hi,

I am using DistributedDataParallel with the NCCL backend. I have two losses that are averaged before calling backward, but backward fails. Here is the problematic part of the code:

self.inputs['qa_in'][i] = Variable(self.inputs['qa_in'][i].data, requires_grad=True)

self.outputs['qa_outputs'][i] = self.qa_outputs(self.inputs['qa_in'][i])

start_logits, end_logits = self.outputs['qa_outputs'][i].split(1, dim=-1)
start_logits = start_logits.squeeze(-1)
end_logits = end_logits.squeeze(-1)

ignored_index = start_logits.size(1)

start_positions_ubatches[i].clamp_(0, ignored_index)
end_positions_ubatches[i].clamp_(0, ignored_index)

loss_fct = CrossEntropyLoss(ignore_index=ignored_index)

start_loss = loss_fct(start_logits, start_positions_ubatches[i])
end_loss = loss_fct(end_logits, end_positions_ubatches[i])

self.outputs['loss_out'][i] = (start_loss + end_loss) / 2
self.outputs['loss_out'][i].backward(retain_graph=True)

and I get the following error:

File "/home/suncast/venv3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Expected to mark a variable ready only once. This error is caused by use of a module parameter outside the forward function. The return value of the forward function is inspected by the distributed data parallel wrapper to figure out if any of the module’s parameters went unused. If this is the case, it knows they won’t receive gradients in a backward pass. If any of those parameters are then used outside forward, this error condition is triggered. You can disable unused parameter detection by passing the keyword argument find_unused_parameters=False to torch.nn.parallel.DistributedDataParallel. (mark_variable_ready at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:342)

I think the problem is the self.qa_outputs parameters are used twice in backward but I don’t know how to solve this. I don’t have any problem without distributed.

Have you tried disabling unused parameter detection by passing find_unused_parameters=False to torch.nn.parallel.DistributedDataParallel?
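For reference, that flag is passed when wrapping the model in DDP. Below is a minimal single-process sketch (gloo backend on CPU so it runs without multiple GPUs; the Linear layer just stands in for your real model):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so the DDP wrapper can be constructed (gloo, CPU)
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
if not dist.is_initialized():
    dist.init_process_group("gloo", rank=0, world_size=1)

local_model = torch.nn.Linear(10, 2)  # stands in for your full model
ddp_model = DDP(local_model, find_unused_parameters=False)
```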

Yes. I get the following error when I set it to False:

File "/home/suncast/venv3/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: has_marked_unused_parameters_ INTERNAL ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:290, please report a bug to PyTorch.

Does anyone know how to solve this?

Hey @maralm

From your post, it is unclear which part is the DDP model. My assumption is that:

  1. self.inputs['qa_in'][i]: this is input to DDP forward
  2. self.qa_outputs: this is your DDP model
  3. self.outputs['qa_outputs'][i]: this is your DDP outputs

I think the problem is the self.qa_outputs parameters are used twice in backward but I don’t know how to solve this. I don’t have any problem without distributed.

This should be fine; the autograd engine should be able to manage the backward inputs and dependencies from start_loss and end_loss properly.
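To illustrate (outside DDP): a parameter can feed two losses, and a single backward on their average accumulates gradients from both paths. A sketch with assumed shapes, where qa_outputs plays the role of self.qa_outputs:

```python
import torch
from torch.nn import CrossEntropyLoss

torch.manual_seed(0)
qa_outputs = torch.nn.Linear(8, 2)       # stands in for self.qa_outputs
hidden = torch.randn(4, 5, 8)            # (batch, seq_len, hidden) -- assumed shapes
logits = qa_outputs(hidden)              # (4, 5, 2)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1)  # (4, 5): one score per sequence position
end_logits = end_logits.squeeze(-1)

loss_fct = CrossEntropyLoss()
start_positions = torch.randint(0, 5, (4,))
end_positions = torch.randint(0, 5, (4,))

# Both losses share qa_outputs' parameters; one backward covers both paths.
loss = (loss_fct(start_logits, start_positions)
        + loss_fct(end_logits, end_positions)) / 2
loss.backward()
```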

Two questions:

  1. Does it work if you directly call self.outputs['qa_outputs'][i].sum().backward() after line 3?
  2. Do any of the model parameters or outputs participate in other forward/backward passes?

It would be very helpful for debugging if you could share a minimal repro example. Since we don't know what happens outside of the posted code snippet, we can only make assumptions.

Hi @mrshenli,

Thanks for your reply.
Your assumption is correct and self.qa_outputs is just a linear layer.

Regarding your questions:

  1. No, it doesn't work with that.
  2. No. I am trying to run just the forward for one layer and compute the backward with autograd.backward on that layer, instead of running loss.backward().

Basically, I have a large model, and when I run forward and backward the conventional way (loss.backward()), it works fine. But I have a new implementation that runs backward layer by layer using autograd.backward. With that, the algorithm works fine on a single GPU, but I face this error in distributed mode. I tried it on a different model that doesn't have multiple losses, and it is fine. The error only appears when I add multiple losses.
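For readers unfamiliar with this pattern: below is a rough sketch of layer-by-layer backward on a plain (non-DDP) model, which is essentially what the detached-input trick in the snippet above is doing. The two Linear layers are placeholders, not the actual model:

```python
import torch

# Two "layers" whose backward passes are driven separately
layer1 = torch.nn.Linear(8, 8)
layer2 = torch.nn.Linear(8, 2)

x = torch.randn(4, 8)
h = layer1(x)
h_detached = h.detach().requires_grad_(True)  # cut the graph between layers
out = layer2(h_detached)
loss = out.sum()

# Backward through the last layer only; gradient stops at h_detached
torch.autograd.backward(loss)

# Then propagate that gradient into the earlier layer manually
torch.autograd.backward(h, grad_tensors=h_detached.grad)
```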

I see. DDP does not support this case yet. Currently, all outputs you get from DistributedDataParallel.forward() must participate in the same backward pass; otherwise, it would corrupt DDP's internal communication state. I hope this note can help explain it: https://pytorch.org/docs/master/notes/ddp.html#internal-design
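So the supported pattern is to fold everything derived from one DDP forward into a single loss and call backward exactly once per iteration. A single-process sketch for illustration (gloo on CPU; the Linear layer is a stand-in):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
if not dist.is_initialized():
    dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(8, 2))
out = model(torch.randn(4, 8))                 # one DDP forward
start_logits, end_logits = out.split(1, dim=-1)

# Everything derived from this forward feeds ONE loss and ONE backward,
# so all of DDP's gradient buckets are marked ready exactly once.
loss = start_logits.sum() + end_logits.sum()
loss.backward()
```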

I tried it on a different model that doesn't have multiple losses, and it is fine. The error only appears when I add multiple losses.

I might have misunderstood the use case. Adding up multiple losses should work, and that is different from running layer-by-layer backward, right? Would I be correct to assume the code snippet you shared above adds two losses together, rather than doing layer-by-layer backward?

It would be helpful if you could share a minimal repro for this error. Thanks!

No, this is the same issue.
To simplify: assume I want to find the gradients for only the last layer of the network, which consists of a linear classifier and a loss (using autograd.backward()). If I use a linear layer with a single loss, DDP works with autograd, but when I add two losses, it gives that error.

Did you solve this problem? I have run into the same problem.


Same problem here. Any solutions?