SLURM cluster CUDA error: all CUDA-capable devices are busy or unavailable

to('cuda') calls work in both single-GPU and multi-GPU runs and push the tensor or module to the current default device.
If your distributed setup masks the devices so that each process sees only a single GPU (e.g. via CUDA_VISIBLE_DEVICES), to('cuda') will use that one visible device, while to('cuda:n') with n > 0 would fail because only cuda:0 exists from the process's point of view.
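As a minimal sketch of the masked case (assuming the launcher exports CUDA_VISIBLE_DEVICES per rank; the environment setup here is an assumption, not part of the original topic):

```python
import torch

# Assumed: the launcher masked the devices for this process,
# e.g. CUDA_VISIBLE_DEVICES=3, so only one GPU is visible here.
print(torch.cuda.device_count())  # prints 1 when a single device is visible

x = torch.randn(4, 4)
x = x.to('cuda')        # works: uses the single visible device (seen as cuda:0)
# x = x.to('cuda:1')    # would fail: no cuda:1 exists from this process's view
```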
On the other hand, if all devices are visible to every process, you might end up creating multiple CUDA contexts on the same GPU (e.g. every rank initializing cuda:0), which also seems to be the issue in this topic.
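If all GPUs stay visible, each rank should bind itself to its own device before the first CUDA call so that no extra context lands on GPU 0. A minimal sketch, assuming the local rank is exposed through a LOCAL_RANK environment variable (that variable name and the toy model are assumptions for illustration):

```python
import os
import torch

# Assumed: all GPUs are visible to every process and LOCAL_RANK tells this
# rank which GPU it should own.
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Select the device first; otherwise a plain to('cuda') from every rank would
# initialize a context on cuda:0, which can raise
# "all CUDA-capable devices are busy or unavailable" on exclusive-mode nodes.
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(8, 8).to('cuda')   # now lands on cuda:<local_rank>
# equivalently: model.to(f'cuda:{local_rank}')
```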