A single RTX 4090 can train a model normally with PyTorch CUDA, but with 2x RTX 4090s training gets stuck at the dataloader. Does NVIDIA block P2P communication between the GPUs? I cannot understand why such a serious problem is not announced publicly; if I had known, I would not have bought this kind of machine. My machine uses a ThreadRipper Pro 5955WX and 2 GPUs.
Your 4090s do not support P2P, and some NVIDIA drivers created hangs in this configuration. Drivers 525.107 and 520.86 should fix it and avoid the hang.
Thanks for the quick answer, but I have two more questions. Do all 4090 GPUs lack P2P support, or is there just no plan to enable it?
And where can I download the stable drivers? Thanks again.
Sorry, the driver should be released by the end of this month and might not be published yet. As a workaround you could use export NCCL_P2P_DISABLE=1.
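For the workaround to take effect, the variable has to be exported in the same shell (or job script) that launches training, before NCCL initializes its communicators. A minimal sketch; the launch command is illustrative, not taken from this thread:

```shell
# Disable NCCL's peer-to-peer transport so multi-GPU collectives fall back
# to shared-memory / staged PCIe copies instead of the unsupported P2P path.
export NCCL_P2P_DISABLE=1

# Confirm the variable is set so child processes will inherit it.
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

After this, start the job from the same shell, e.g. with a launcher such as `torchrun --nproc_per_node=2 train.py` (hypothetical script name); setting the variable only inside the Python process after NCCL has initialized is too late.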
P2P is not enabled on 4090s, as it’s an unsupported configuration. The driver fix doesn’t enable it; it only avoids the hangs and data corruption.
I think the release of these drivers requires more time. Is that right? On 30 March, the 525.105 driver was released.
I want to know whether the 525.105 driver is safe from the hangs and data corruption.