I was training a large model with multi-machine distributed training, and it worked fine with micro-batch=4 and gradient accumulation=16. But when I changed the configuration to micro-batch=2 and gradient accumulation=32, I got the following error: RuntimeError: NCCL error in: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:47, unhandled cuda error, NCCL version 2.10.3
My environment is:
torch 1.10.0+cu111
CUDA: 11.1
Driver Version: 470.94
CUDA Version (reported by the driver): 11.4
My guess is that the CUDA kernels originally invoked happened to be compatible despite the version mismatch, but the larger gradient accumulation setting caused different CUDA kernels to run, and those happened to be incompatible.
Is that right? Can anyone familiar with this confirm my guess?
I don’t completely understand where exactly you are seeing a mismatch and what kind of compatibility seems to be affected.
In any case, I would recommend updating PyTorch to the latest stable or nightly release and checking whether you still run into the same issue.
I think torch 1.10.0+cu111 and CUDA 11.4 are incompatible.
But surprisingly, it doesn’t raise any errors with micro-batch=4 and gradient accumulation=16; my training ran normally. After I adjusted it to micro-batch=2 and gradient accumulation=32, it gave me a CUDA error.
I’m not trying to report a bug, because the problem was resolved when I upgraded to torch 1.11.0+cu113. I just want to understand why this happened.
No, that’s not the case, as the locally installed CUDA toolkit won’t be used unless you build PyTorch from source or build custom CUDA extensions, since the PyTorch binaries ship with their own CUDA runtime dependencies.
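To illustrate the distinction, here is a minimal sketch (the helper `cuda_major_compatible` is hypothetical, not a PyTorch API): the wheel bundles its own CUDA runtime, so the toolkit installed on the machine generally only needs to be major-version compatible when compiling extensions against it.

```python
def cuda_major_compatible(wheel_cuda: str, toolkit_cuda: str) -> bool:
    """Hypothetical helper: the PyTorch wheel ships its own CUDA runtime,
    so the local toolkit only matters when building extensions, where the
    usual rule of thumb is that the major versions should match."""
    return wheel_cuda.split(".")[0] == toolkit_cuda.split(".")[0]

# The versions from this thread: the cu111 wheel vs. an 11.4 environment.
print(cuda_major_compatible("11.1", "11.4"))   # same major version: True
print(cuda_major_compatible("10.2", "11.4"))   # different major: False
```

In a real environment you can compare `torch.version.cuda` (the runtime the wheel was built with) against the output of `nvcc --version` (the toolkit your extensions are compiled with); note the driver version adds a separate constraint on which runtimes can run at all.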
My code does involve building a custom CUDA extension, so do you think the incompatibility triggered by changing the batch configuration mainly comes from the CUDA extension code? PyTorch’s built-in operations wouldn’t have this problem, right?