I am using torch 2.0 with CUDA 12.0. The environment was installed with

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
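For reference, this is how the installed build can be checked (torch.version.cuda reports the toolkit the wheel was compiled with, which is 11.8 for the cu118 wheel, not the driver's CUDA 12.0):

```python
import torch

# The cu118 wheel is built against CUDA 11.8; a CUDA 12.0 driver can
# still run it, since drivers are backward compatible with older toolkits.
print(torch.__version__)          # e.g. 2.0.x+cu118
print(torch.version.cuda)         # 11.8
print(torch.cuda.is_available())  # True on this machine
print(torch.cuda.device_count())  # number of visible GPUs
```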
However, I found that distributed training gets stuck with no output. GPU usage looks normal: when I increase the batch size, GPU utilization rises accordingly. Still, the model produces no output for even a single batch. The training script is here: transformer_mlm. I found it hangs at

outputs = self.parallel_apply(replicas, inputs, kwargs)

in torch/nn/parallel/data_parallel.py during the forward pass, and thus there are no outputs.
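In case it helps to reproduce the stack trace: one way to see where the process is stuck (this is just a debugging aid, not part of my training script) is to enable faulthandler and send the process a signal:

```python
import faulthandler
import signal

# Linux only: dump the Python stack of every thread when the process
# receives SIGUSR1 (run `kill -USR1 <pid>` from another shell).
# In my case the dump ends inside torch/nn/parallel/data_parallel.py,
# waiting in parallel_apply.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```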
I did a simple verification with this data parallel example as well, and I likewise got no output from

output = model(input)
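For completeness, here is roughly the verification script; it is a trimmed-down version of that tutorial (the RandomDataset, the toy Model, and the sizes are taken from it):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class RandomDataset(Dataset):
    """Random tensors standing in for real data, as in the tutorial."""
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class Model(nn.Module):
    """A single linear layer, as in the tutorial."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        return self.fc(input)

input_size, output_size, data_size, batch_size = 5, 2, 100, 30

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    # Replicate the model across all visible GPUs.
    model = nn.DataParallel(model)
model = model.to("cuda")

loader = DataLoader(RandomDataset(input_size, data_size), batch_size=batch_size)
for data in loader:
    input = data.to("cuda")
    output = model(input)  # <-- hangs here and never returns
    print("outside: input size", input.size(), "output size", output.size())
```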
I am just curious: does PyTorch 2.0 not support DataParallel with CUDA 12.0? Do I need to downgrade CUDA or PyTorch?
Thanks a lot in advance.