Help: DDP hangs when using multi-threaded CUDA stream parallel loss computation

Body:

Hi everyone,

I’m training a multi-task model with PyTorch DistributedDataParallel (DDP), and for performance I want to compute the per-task losses (pvb, lane3d, occ3d, etc.) in parallel using multiple threads and custom CUDA streams.

I’ve implemented a ParallelLossExecutor class that works roughly like this (a simplified sketch follows the list):

  • Uses ThreadPoolExecutor with multiple threads
  • Each thread sets a specific torch.cuda.Stream
  • Inside each thread, it runs the loss computation with:
with torch.cuda.stream(stream):
    result = loss_fn(...)
stream.synchronize()
  • After all losses are computed, I merge the loss dicts and do:
total_loss = loss1 + loss2 + ...
total_loss.backward()
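
For concreteness, here is a stripped-down sketch of the executor and of how I call it in the training step. The task names are the real ones, but the loss functions, predictions, and targets below are placeholders, and the wait_stream handshake against the producing stream is just how I keep the streams ordered; the real class is larger, but the structure is the same:

import torch
from concurrent.futures import ThreadPoolExecutor

class ParallelLossExecutor:
    """Run each task's loss function in its own thread on its own CUDA stream."""

    def __init__(self, task_names, device):
        self.device = device
        # One dedicated CUDA stream per task (pvb, lane3d, occ3d, ...).
        self.streams = {name: torch.cuda.Stream(device=device) for name in task_names}
        self.pool = ThreadPoolExecutor(max_workers=len(task_names))

    def _run_one(self, name, loss_fn, args):
        stream = self.streams[name]
        # Make the task stream wait for the stream that produced the forward
        # outputs before the loss consumes them.
        stream.wait_stream(torch.cuda.current_stream(self.device))
        with torch.cuda.stream(stream):
            loss_dict = loss_fn(*args)  # each loss_fn returns a dict of loss tensors
        stream.synchronize()
        return name, loss_dict

    def compute(self, loss_jobs):
        # loss_jobs: {task_name: (loss_fn, args_tuple)}
        futures = [self.pool.submit(self._run_one, name, fn, args)
                   for name, (fn, args) in loss_jobs.items()]
        merged = {}
        for fut in futures:
            name, loss_dict = fut.result()
            merged.update({f"{name}/{k}": v for k, v in loss_dict.items()})
        return merged

# Per training step (placeholder loss functions and tensors); this is the part
# that works on a single GPU but hangs in backward() under DDP:
device = torch.device("cuda")
executor = ParallelLossExecutor(["pvb", "lane3d", "occ3d"], device)
losses = executor.compute({
    "pvb":    (pvb_loss_fn,    (pvb_pred,    pvb_target)),
    "lane3d": (lane3d_loss_fn, (lane3d_pred, lane3d_target)),
    "occ3d":  (occ3d_loss_fn,  (occ3d_pred,  occ3d_target)),
})
total_loss = sum(losses.values())
total_loss.backward()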

This works perfectly in single-GPU mode.

But in DDP multi-GPU mode, the program hangs during backward(), most likely in torch.distributed.all_reduce().


Questions:

  1. Is this kind of parallel loss computation (multi-thread + per-task CUDA streams) safe to use under DDP?
  2. Does DDP assume that all autograd operations (including loss computation) happen on the default stream and in the main thread?
  3. Are there any official recommendations on how to structure multi-task loss computations under DDP?

Other Info:

  • PyTorch version: [your version, e.g. 2.2.1]
  • NCCL backend
  • Mixed precision is off
  • All losses are computed from cloned model heads (to avoid parameter sharing issues)

Thanks in advance! This is a pretty common pattern in multi-task learning, so I’d really appreciate any insight or recommendations.