Please see the third paragraph of Section 3.2.3 of the paper. The relevant explanation is quoted below:
"
Second, it is possible that one training iteration only involves a sub-graph in the model, and the sub-graph can be different from iteration to iteration, meaning that some gradients might be skipped in some iterations. However, as the gradient-to-bucket mapping is determined at construction time, those absent gradients would leave some buckets never seeing the final autograd hook, failing to mark the bucket as ready. As a result, the backward pass could hang. Fig. 3 (b) shows an example, where the parameter corresponding to gradient g3 is skipped in one iteration, leading to the absence of the ready signal for g3. To address this problem, DDP traverses the autograd graph from the output tensors of the forward pass to find all participating parameters. The readiness of those participating tensors is a sufficient signal to conclude the completion of the backward pass. Therefore, DDP can avoid waiting for the rest of the parameter gradients by proactively marking them ready at the end of the forward pass. Note that this change does not prevent us from developing non-intrusive APIs, because the application directly invokes the forward function on DDP, and hence DDP can easily insert this step in its member function.
"