A single RTX 4090 can train a model normally with PyTorch CUDA, but with 2x RTX 4090s training gets stuck at the dataloader. Does NVIDIA block P2P communication between the GPUs? I cannot understand why such a serious problem is not announced publicly; if I had known, I would not have bought this kind of machine. My machine uses a ThreadRipper Pro 5955WX and 2 GPUs.
Your 4090s do not support P2P, and some NVIDIA drivers created hangs in this configuration. Drivers 525.107 and 520.86 should fix it and avoid the hang.
Thanks for the quick answer, but I have two more questions. Do all 4090 GPUs lack P2P support, or is there just no plan to enable it?
And where can I download the stable drivers? Thanks again.
Sorry, the driver should be released by the end of this month and might not be published yet. As a workaround you could use export NCCL_P2P_DISABLE=1.
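For the workaround to take effect, the variable has to be exported in the same shell (or job script) that launches training, before NCCL initializes its communicators. A minimal sketch; the launch command is illustrative, not taken from this thread:

```shell
# Disable NCCL's peer-to-peer transport so multi-GPU collectives fall back
# to shared-memory / staged PCIe copies instead of the unsupported P2P path.
export NCCL_P2P_DISABLE=1

# Confirm the variable is set so child processes will inherit it.
echo "NCCL_P2P_DISABLE=$NCCL_P2P_DISABLE"
```

After this, start the job from the same shell, e.g. with a launcher such as `torchrun --nproc_per_node=2 train.py` (hypothetical script name); setting the variable only inside the Python process after NCCL has initialized is too late.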
P2P is not enabled on 4090s, as it’s an unsupported configuration. The driver fix doesn’t enable it; it only avoids the hangs and data corruption.
I think the release of these drivers requires more time. Is that right? On 30 March, the 525.105 driver was released.
I want to know whether the 525.105 driver is safe from the hangs and data corruption.