So what I did was grab a Docker container in which I knew distributed worked. When it failed there too, it was clear that the driver was the problem. Downgrading the NVIDIA driver helped.
Of course, it would be nice if there were a proper error message somewhere when NCCL doesn’t like my driver, but I guess that’s not a PyTorch thing.
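
For anyone hitting the same thing, here is a rough sketch of how one might collect the relevant versions to compare a working and a failing setup. The helper names `driver_version` and `collect_versions` are made up for illustration, and the sketch assumes `nvidia-smi` is on the `PATH` when a driver is installed. It only reports versions; it does not tell you which driver/NCCL combinations are compatible.

```python
import shutil
import subprocess


def driver_version():
    """Return the NVIDIA driver version reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=10,
        )
    except (OSError, subprocess.SubprocessError):
        return None
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    # With several GPUs there is one line per GPU; the driver is the same for all.
    return lines[0].strip() if lines else None


def collect_versions():
    """Gather driver, PyTorch, CUDA and NCCL versions; missing pieces stay None."""
    info = {"driver": driver_version(), "torch": None, "cuda": None, "nccl": None}
    try:
        import torch
    except ImportError:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        import torch.cuda.nccl
        import torch.distributed as dist

        if (
            torch.cuda.is_available()
            and dist.is_available()
            and dist.is_nccl_available()
        ):
            # Returns a tuple on recent PyTorch releases, an int on older ones.
            info["nccl"] = torch.cuda.nccl.version()
    except Exception:
        # Builds without NCCL support may not expose the version query.
        pass
    return info
```

Running `print(collect_versions())` on the host and again inside the container shows whether the PyTorch, CUDA and NCCL versions differ, while the driver version comes from the host in both cases. Setting the environment variable `NCCL_DEBUG=INFO` before launching the job also makes NCCL log its version and initialization steps, which can help narrow down where it gives up.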