I'm trying to use a custom CUDA kernel wrapped in a torch.autograd.Function for both the forward and backward passes (a naive implementation, nothing to do with CUDA streams). Once it is plugged into a DDP pipeline, it crashes, and all I can see is "NCCL Error 1: unhandled cuda error". Is there any sample for this pipeline, or any idea how to debug these crashes?
Did you verify whether it's your kernel that is causing those errors?
One common mistake is not calling cudaGetLastError() after launching a kernel with the <<<...>>> launch syntax; a launch failure then goes unnoticed and surfaces later at an unrelated call, such as a NCCL collective.
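A minimal sketch of that check, where my_kernel, grid, block, stream, and args are placeholders for whatever your autograd.Function launches:

```cpp
#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Hypothetical launch; kernel launches are asynchronous and return no
// error code themselves, so query the launch status explicitly.
my_kernel<<<grid, block, 0, stream>>>(args);

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
  throw std::runtime_error(std::string("CUDA kernel launch failed: ") +
                           cudaGetErrorString(err));
}
```

If you are writing against the PyTorch C++ API, the C10_CUDA_KERNEL_LAUNCH_CHECK() macro wraps this same pattern.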
Another option is to force a stream sync prior to launching an NCCL collective to isolate the issue.
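For isolation only (this costs performance, so remove it once the bug is found), a sketch of such a sync, where stream is the stream your kernel was launched on:

```cpp
#include <cuda_runtime.h>

// Block the host until all work queued on this stream has finished.
// If an earlier async kernel faulted, the error is reported here
// instead of inside the NCCL collective.
cudaError_t err = cudaStreamSynchronize(stream);
// err != cudaSuccess => the failure came from your own kernels,
// not from NCCL.
```

From the Python side, calling torch.cuda.synchronize() right before the collective serves the same purpose.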
I found out why: I simply forgot to add a CUDAGuard before each kernel launch. Silly mistake.
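For anyone hitting the same thing, a sketch of the fix (launch_my_kernel and input are placeholder names for the op's launcher and a tensor argument):

```cpp
#include <ATen/ATen.h>
#include <c10/cuda/CUDAGuard.h>

void launch_my_kernel(const at::Tensor& input) {
  // Under DDP each process owns a different GPU; without a guard the
  // kernel is launched on whatever the current device happens to be,
  // which may not be the device that holds input's memory.
  c10::cuda::CUDAGuard guard(input.device());
  // ... kernel launch on input.device() goes here ...
}
```

The guard restores the previous current device when it goes out of scope, so each launch site only needs the one line.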