In NVIDIA container environments, PyTorch's NCCL allreduce operation exhibits extremely poor performance

Hello PyTorch team, have you noticed the issue described in "Torch allreduce with low performance on cuda12.8 compatibility" (GPU-Accelerated Libraries, NVIDIA Developer Forums), where nccl-test works fine but torch allreduce does not?

cc @ptrblck, could you help with this issue? It sounds like a CUDA toolkit + PyTorch issue. Thank you.

I see this was cross-posted a few times already, and based on the additional information, it currently sounds like a setup issue to me.

I would recommend keeping the discussion in one place to avoid cross-posting different debug steps and outputs. In this case, let’s follow up in the bug you have created.