In NVIDIA container environments, PyTorch's NCCL allreduce operation exhibits extremely poor performance

ginjia_lee · August 20, 2025, 9:38am

Hello PyTorch team, have you noticed the issue described in Torch allreduce with low performance on cuda12.8 compatibility - GPU-Accelerated Libraries - NVIDIA Developer Forums, where nccl-tests works fine but torch allreduce does not?

ginjia_lee · August 21, 2025, 2:35am

cc @ptrblck, could you help with this issue? It sounds like a CUDA toolkit + PyTorch issue. Thank you.

ptrblck · August 22, 2025, 2:40pm

I see this was cross-posted a few times already and based on the additional information it currently sounds like a setup issue to me.

I would recommend keeping the discussion in one place to avoid cross-posting different debug steps and outputs. In this case, let’s follow up in the bug you have created.

ginjia_lee · August 23, 2025, 9:13am

Thank you for your reply. Let’s follow up in the bug report.