This could indicate an issue in your node setup and broken communication between the devices.
Could you run the NCCL-tests and see if any issues are detected while the devices communicate with each other?
Is NCCL required? I didn’t have it installed previously. I’m not using both of the GPUs for a single task. I’m simply trying to transfer data from ‘cuda:0’ to ‘cuda:1’ offline.
It is interesting to note that when running the nccl-tests, I am seeing a lot of lines with “IO_PAGE_FAULT”.
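For reference, the transfer described above can be checked in isolation with something like the following. This is only a minimal sketch, assuming PyTorch is installed and at least two CUDA devices are visible; `check_cross_gpu_copy` is a hypothetical helper name, not a PyTorch API. It copies a tensor from `cuda:0` to `cuda:1` and compares both copies on the CPU.

```python
# Sketch only: verifies a cuda:0 -> cuda:1 copy by comparing both tensors on the CPU.
try:
    import torch
except ImportError:  # PyTorch is not installed, so the check cannot run
    torch = None


def check_cross_gpu_copy(numel=1024 * 1024):
    """Return True if the copy matches, False if it does not, and None if
    the check cannot run (no PyTorch, no CUDA, or fewer than two GPUs).

    `check_cross_gpu_copy` is a hypothetical helper, not part of PyTorch.
    """
    if torch is None or not torch.cuda.is_available():
        return None
    if torch.cuda.device_count() < 2:
        return None

    src = torch.randn(numel, device="cuda:0")
    dst = src.to("cuda:1")  # the device-to-device transfer under test

    # Wait for pending work on both devices before comparing.
    torch.cuda.synchronize(0)
    torch.cuda.synchronize(1)

    # Compare through host memory so the comparison itself does not depend
    # on the GPU-to-GPU path being tested.
    return torch.equal(src.cpu(), dst.cpu())
```

Calling `print(check_cross_gpu_copy())` should print `True` on a healthy setup; `False` would mean the data arriving on `cuda:1` differs from what was sent.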
Yes, NCCL would be needed to run the tests.
The IO_PAGE_FAULT issues are also most likely shown in dmesg when you run your previous code snippet, which could point towards a faulty communication between these devices.
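One way to check this is to scan the kernel log right after running the snippet. The helper below is a rough sketch, not an official tool: `find_io_page_faults` and `read_kernel_log` are made-up names, and it assumes `dmesg` is available on the machine (it may need root privileges on some systems).

```python
import subprocess


def find_io_page_faults(log_text):
    """Return the lines of `log_text` that mention IO_PAGE_FAULT."""
    return [line for line in log_text.splitlines() if "IO_PAGE_FAULT" in line]


def read_kernel_log():
    """Return the output of `dmesg`, or an empty string if it cannot be run."""
    try:
        result = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        )
    except OSError:  # e.g. `dmesg` is not installed on this system
        return ""
    return result.stdout


def report_io_page_faults(max_lines=10):
    """Print how many IO_PAGE_FAULT lines were found and the most recent ones."""
    faults = find_io_page_faults(read_kernel_log())
    print(f"{len(faults)} IO_PAGE_FAULT line(s) found")
    for line in faults[-max_lines:]:
        print(line)
```

Running `report_io_page_faults()` before and after the transfer and comparing the counts would show whether new faults are logged while the copy runs.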
Try to disable IOMMU as described here.
In my previous comment, I was wondering whether NCCL is required in general (not only for the tests), since everything seemed to be working without it (besides my issue in this post). Could you please elaborate on how/when NCCL would be beneficial?