DistributedDataParallel does not support CUDA?

There was no proper synchronization with the CUDA events that recorded copies into this contents tensor before the bucket contents tensor allreduce. Does this mean CUDA is not supported?

web link:

DistributedDataParallel (DDP) does support CUDA. The comment suggests extra care might be necessary when backward runs on a non-default stream. In practice, even if backward occurs on non-default streams, it should be fine for most use cases; the reasoning follows the quick example below.
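For context, here is a minimal sketch of the usual DDP-on-CUDA setup. The toy model, NCCL backend, and single-node rendezvous settings are my assumptions for illustration, not something taken from this thread:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank: int, world_size: int):
    # Hypothetical single-node rendezvous settings.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = nn.Linear(10, 10).cuda(rank)        # toy model on this rank's GPU
    ddp_model = DDP(model, device_ids=[rank])   # DDP wraps a CUDA module directly

    inputs = torch.randn(20, 10, device=rank)
    loss = ddp_model(inputs).sum()
    loss.backward()                             # gradients are bucketed and allreduced
    dist.destroy_process_group()

# Launch with e.g. torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)
```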

Background: I learned from @albanD that the autograd engine will use the same stream as the forward pass.
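A minimal sketch of that behavior (my own illustration, not from the thread): an op recorded on a side stream during forward has its backward executed on that same side stream, so other streams must synchronize before reading the resulting gradients.

```python
import torch

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
side = torch.cuda.Stream()

side.wait_stream(torch.cuda.current_stream())   # make sure x is fully initialized first
with torch.cuda.stream(side):
    loss = (x * 2).sum()                        # forward op runs on the side stream

loss.backward()                                 # its backward also runs on the side stream

torch.cuda.current_stream().wait_stream(side)   # sync before reading x.grad elsewhere
print(x.grad.norm().item())
```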

Let’s take a look at what could go wrong with the code you quoted.

1: The tensor is not ready when the allreduce operation is launched (see the sketch after this list).
2: The tensor is destroyed too early, before the allreduce finishes.
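A generic illustration of failure mode 1 (not DDP-specific; the names are made up): a consumer on one stream must not start until the producer's copy on another stream has finished. Recording a CUDA event after the copy and waiting on it from the consumer stream provides that ordering.

```python
import torch

producer = torch.cuda.Stream()
consumer = torch.cuda.Stream()

src = torch.randn(1 << 20, device="cuda")
dst = torch.empty_like(src)

producer.wait_stream(torch.cuda.current_stream())  # src must be fully written first
with torch.cuda.stream(producer):
    dst.copy_(src, non_blocking=True)              # async copy on the producer stream
    copy_done = torch.cuda.Event()
    copy_done.record(producer)

with torch.cuda.stream(consumer):
    consumer.wait_event(copy_done)                 # without this wait, dst may not be ready
    result = dst.sum()                             # stand-in for the collective consuming dst
```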

We can rule out 2 for now, as all_reduce does call recordStream() properly to prevent the CUDA memory blocks from being freed too early.
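Here is a hedged sketch of the same idea at the Python level: Tensor.record_stream() tells the caching allocator that a tensor is still in use on another stream, so its memory is not reused until that stream's pending work has finished. The tensor names are illustrative, not taken from the Reducer source.

```python
import torch

comm_stream = torch.cuda.Stream()
bucket = torch.empty(1 << 20, device="cuda")    # stand-in for a bucket's contents tensor

comm_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(comm_stream):
    bucket.mul_(2)                              # stand-in for work launched on the comm stream
    bucket.record_stream(comm_stream)           # mark the tensor as in use on comm_stream

# Even if bucket goes out of scope on the default stream now, the allocator will
# not hand its blocks to another allocation until comm_stream's work completes.
del bucket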

Then the only thing left is 1. The operation performed on that tensor before the allreduce is bucket_view.copy_(grad.view({-1}), /* non_blocking */ true); in mark_variable_ready_dense. The copy happens on the same device (replica.contents and grad), and the Reducer itself does not switch streams in between. So the only case that could hit a race condition is when the application uses different streams for different operators during the forward pass, and the grads associated with those operators fall into the same bucket in the Reducer.
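A sketch of that pattern (my own, with a hypothetical two-branch model): two operators in the forward pass run on different streams. During backward, the autograd engine replays each op on the stream it used in forward, so the gradients for lin_a and lin_b are produced on different streams; if DDP put both grads in the same bucket, the copies into that bucket would come from different streams, which is exactly the situation where a race could appear.

```python
import torch
import torch.nn as nn

lin_a = nn.Linear(512, 512).cuda()
lin_b = nn.Linear(512, 512).cuda()
x = torch.randn(64, 512, device="cuda")

side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())   # x and the weights are ready before use

out_a = lin_a(x)                                # runs on the current (default) stream
with torch.cuda.stream(side):
    out_b = lin_b(x)                            # runs on a different stream

torch.cuda.current_stream().wait_stream(side)   # needed just to combine the outputs safely
loss = (out_a + out_b).sum()
loss.backward()                                 # lin_b's backward still runs on the side stream
```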