I am using torch 2.0 with CUDA 12.0. The environment was installed with

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
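For reference, this is how the installed build can be checked (torch.version.cuda reports the toolkit the wheel was compiled with, which is 11.8 for the cu118 wheel, not the driver's CUDA 12.0):

```python
import torch

# The cu118 wheel is built against CUDA 11.8; a CUDA 12.0 driver can
# still run it, since drivers are backward compatible with older toolkits.
print(torch.__version__)          # e.g. 2.0.x+cu118
print(torch.version.cuda)         # 11.8
print(torch.cuda.is_available())  # True on this machine
print(torch.cuda.device_count())  # number of visible GPUs
```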
However, I found that distributed training gets stuck with no output. GPU usage looks normal: when I increase the batch size, GPU utilization rises accordingly. Still, the model produces no output for even a single batch. The training script is here: transformer_mlm. I found it hangs at

outputs = self.parallel_apply(replicas, inputs, kwargs)

in torch/nn/parallel/data_parallel.py during the forward pass, and thus there are no outputs.
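In case it helps to reproduce the stack trace: one way to see where the process is stuck (this is just a debugging aid, not part of my training script) is to enable faulthandler and send the process a signal:

```python
import faulthandler
import signal

# Linux only: dump the Python stack of every thread when the process
# receives SIGUSR1 (run `kill -USR1 <pid>` from another shell).
# In my case the dump ends inside torch/nn/parallel/data_parallel.py,
# waiting in parallel_apply.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```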
I did a simple verification with this data parallel example as well, and I likewise got no output from

output = model(input)
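For completeness, here is roughly the verification script; it is a trimmed-down version of that tutorial (the RandomDataset, the toy Model, and the sizes are taken from it):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    """Random tensors standing in for real data, as in the tutorial."""
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class Model(nn.Module):
    """A single linear layer, as in the tutorial."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        return self.fc(input)

input_size, output_size, data_size, batch_size = 5, 2, 100, 30

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    # Replicate the model across all visible GPUs.
    model = nn.DataParallel(model)
model = model.to("cuda")

loader = DataLoader(RandomDataset(input_size, data_size), batch_size=batch_size)
for data in loader:
    input = data.to("cuda")
    output = model(input)  # <-- hangs here and never returns
    print("outside: input size", input.size(), "output size", output.size())
```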
I am just curious: does PyTorch 2.0 not support DataParallel with CUDA 12.0? Do I need to downgrade CUDA or PyTorch?
Thanks a lot in advance.