Body:
Hi everyone,
I’m training a multi-task model using PyTorch DistributedDataParallel (DDP), and I want to parallelize the different loss computations (e.g., `pvb`, `lane3d`, `occ3d`, etc.) using multiple threads + custom CUDA streams for performance reasons.
I’ve implemented a `ParallelLossExecutor` class that looks roughly like this (a simplified sketch follows the list):
- Uses `ThreadPoolExecutor` with multiple threads
- Each thread sets a specific `torch.cuda.Stream`
- Inside each thread, it runs the loss computation with:

      with torch.cuda.stream(stream):
          result = loss_fn(...)
      stream.synchronize()

- After all losses are computed, I merge the loss dicts and do:

      total_loss = loss1 + loss2 + ...
      total_loss.backward()
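For reference, here is a stripped-down sketch of the class. Method names like `compute_losses` and the exact task/argument bookkeeping are simplified for this post; the real class is more involved, but the threading and stream usage are the same:

```python
import torch
from concurrent.futures import ThreadPoolExecutor


class ParallelLossExecutor:
    """Runs each task's loss_fn on its own thread and dedicated CUDA stream (simplified)."""

    def __init__(self, num_tasks, device):
        self.pool = ThreadPoolExecutor(max_workers=num_tasks)
        # One non-default stream per task
        self.streams = [torch.cuda.Stream(device=device) for _ in range(num_tasks)]

    def _run_one(self, stream, loss_fn, args, kwargs):
        # Every kernel launched by loss_fn goes onto this task's stream
        with torch.cuda.stream(stream):
            result = loss_fn(*args, **kwargs)  # returns a dict of named losses
        stream.synchronize()
        return result

    def compute_losses(self, tasks):
        # tasks: list of (loss_fn, args, kwargs) tuples, one per head
        futures = [
            self.pool.submit(self._run_one, stream, loss_fn, args, kwargs)
            for stream, (loss_fn, args, kwargs) in zip(self.streams, tasks)
        ]
        merged = {}
        for fut in futures:
            merged.update(fut.result())  # merge per-task loss dicts
        return merged


# Usage: one backward over the summed losses
# loss_dict = executor.compute_losses([(pvb_loss_fn, (preds, targets), {}), ...])
# total_loss = sum(loss_dict.values())
# total_loss.backward()
```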
This works perfectly in single-GPU mode.
But in DDP multi-GPU mode, the program hangs during `backward()`, most likely in `torch.distributed.all_reduce()`.
Questions:
- Is this kind of parallel loss computation (multi-thread + per-task CUDA streams) safe to use under DDP?
- Does DDP assume that all autograd operations (including loss computation) happen on the default stream and in the main thread?
- Are there any official recommendations on how to structure multi-task loss computations under DDP?
Other Info:
- PyTorch version: [your version, e.g. 2.2.1]
- NCCL backend
- Mixed precision is off
- All losses are computed from cloned model heads (to avoid parameter sharing issues)
Thanks in advance! This is a pretty common pattern in multi-task learning, so I’d really appreciate any insight or recommendations.