Distributed training hangs with CUDA 12.0

I am using torch 2.0 with CUDA 12.0. The env was installed with pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118. However, I found that distributed training hangs with no output. GPU usage looks normal: as I increase the batch size, GPU utilization rises accordingly. Still, the model produces no output for even one batch. The training script is here: transformer_mlm. It hangs at outputs = self.parallel_apply(replicas, inputs, kwargs) in data_parallel.py, and thus there are no outputs.
I did a simple verification with this data parallel example and likewise got no output from output = model(input).
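For reference, the check was essentially the tutorial's DataParallel pattern; a minimal sketch of what I ran (the toy module and tensor sizes here are made up for illustration):

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Placeholder module standing in for the tutorial's model."""
    def __init__(self, in_features=10, out_features=5):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)

    def forward(self, x):
        return self.fc(x)

model = ToyModel().cuda()
if torch.cuda.device_count() > 1:
    # Replicates the module across all visible GPUs on each forward pass
    model = nn.DataParallel(model)

input = torch.randn(32, 10, device="cuda")
output = model(input)  # this is the call that never returns on my machine
print(output.shape)
```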

I am just curious: does PyTorch 2.0 not support DataParallel with CUDA 12.0? Do I need to downgrade CUDA or PyTorch?

Thanks a lot in advance.

The posted install command will use the CUDA 11.8 runtime with its corresponding math libs, not 12.0.
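You can confirm which runtime the installed binaries actually ship directly from Python:

```python
import torch

print(torch.__version__)          # e.g. 2.0.x+cu118 for the cu118 wheels
print(torch.version.cuda)         # CUDA runtime the wheel was built with, e.g. 11.8
print(torch.cuda.is_available())  # True if the driver can run this runtime
```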
Are you able to execute the DDP tutorial successfully?

Thanks for replying! In my env, I can run the DDP tutorial, but I still cannot get the data parallel example to pass. I just tested on another machine with the same env but with CUDA 11.8 installed, and it works. So I guess the issue may be caused by the CUDA version.

The PyTorch binaries ship with their own CUDA libs, as already mentioned, and I doubt your locally installed CUDA toolkit is related to the mentioned issue, as it won't even be used unless you build PyTorch from source or a custom CUDA extension.

Good to hear DDP is working, as it's the recommended approach since DataParallel is in maintenance mode.
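If you want to move the training script over, the core change is small: one process per GPU, launched with torchrun, with the model wrapped in DistributedDataParallel. A minimal single-node sketch (the linear model and script name are just placeholders):

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> ddp_check.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 5).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(32, 10, device=f"cuda:{local_rank}")
    out = model(x)  # each process drives its own GPU; no per-step replication
    print(f"rank {dist.get_rank()}: output shape {tuple(out.shape)}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```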