2x RTX 4090 cannot train a model; training gets stuck

A single RTX 4090 can train a model normally with PyTorch + CUDA, but with 2x RTX 4090 training gets stuck at the dataloader. Does NVIDIA block P2P communication between GPUs? I don't understand why such a serious problem hasn't been announced publicly; if I had known, I would not have bought this kind of machine. My machine uses a ThreadRipper Pro 5955WX and 2 GPUs.
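
For reference, this is roughly the kind of 2-GPU DDP setup that gets stuck (a minimal sketch with a toy model and random data, not the actual training script; assumes it is launched with `torchrun --nproc_per_node=2 repro.py`):

```python
# Minimal 2-GPU DDP training loop (toy model and random data as placeholders).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    dist.init_process_group(backend="nccl")  # torchrun provides the rendezvous env vars
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = DDP(nn.Linear(32, 2).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    loader = DataLoader(dataset, batch_size=64, sampler=DistributedSampler(dataset))

    for x, y in loader:  # with 2x RTX 4090 the loop stops making progress here
        x, y = x.cuda(rank), y.cuda(rank)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```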

Your 4090s do not support P2P, and some NVIDIA drivers created hangs in this configuration. Drivers 525.107 and 520.86 should contain the fix and avoid the hang.
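
If it helps, this is one way to check which driver is currently installed (a small sketch that just shells out to nvidia-smi, which must be on PATH):

```python
# Print the installed NVIDIA driver version via nvidia-smi.
import subprocess

version = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("NVIDIA driver:", version)
```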


Thanks for the quick answer. But I have another question: do all 4090 GPUs lack P2P support? Or is there no plan to allow P2P communication?

And where can I download the stable drivers? Thanks again.

Sorry, the driver should be released by the end of this month and might not be published yet. As a workaround you could use `export NCCL_P2P_DISABLE=1`.
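
The same workaround can also be set from inside the training script, as long as it happens before the NCCL process group is created (a minimal sketch, assuming the script is launched with torchrun):

```python
# Equivalent to `export NCCL_P2P_DISABLE=1`, applied from Python.
# Must be set before the NCCL communicator is created.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # NCCL now falls back to non-P2P transports
```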

P2P is not enabled on 4090s as it’s an unsupported configuration. The driver fix doesn’t enable it, but avoids hangs and data corruption.
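
If you want to confirm this on a given machine, PyTorch exposes the CUDA peer-access query; a small check, assuming two visible GPUs:

```python
# Query whether CUDA reports peer-to-peer access between the two GPUs.
# On RTX 4090s this is expected to print False in both directions.
import torch

assert torch.cuda.device_count() >= 2, "need two visible GPUs"
print("GPU0 -> GPU1:", torch.cuda.can_device_access_peer(0, 1))
print("GPU1 -> GPU0:", torch.cuda.can_device_access_peer(1, 0))
```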


I think the release of these drivers will require more time. Is that right? On March 30th, the 525.105 driver was released.

I would like to know whether the 525.105 driver is safe with respect to the hangs and data corruption.

Yes, this seems to be the case based on this comment.