I’m trying to run distributed training on a cloud instance that has NVIDIA A100 80GB GPUs in MIG mode. I have seven MIG instances of type “1g.10gb”. I assigned multiple MIG instances to a container (using NVIDIA’s Kubernetes device plugin) and confirmed that the instances are visible by running
nvidia-smi inside the container, which lists the correct MIG instances.
My goal is to run distributed training on multiple MIG instances using YOLOv7. The problem is that the functions under
torch.cuda do not seem to detect the MIG instances as separate GPU devices. When I run training on a single instance (passing either
0 or its UUID to the
--device argument), training starts fine, but it throws errors as soon as I pass multiple devices to the
--device argument, so I cannot perform distributed training.
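For reference, this is roughly how I am checking device visibility from Python. The MIG UUIDs below are placeholders for the real ones reported by `nvidia-smi -L`; in my environment, even with two MIG UUIDs listed in `CUDA_VISIBLE_DEVICES`, `torch.cuda.device_count()` still reports only one device:

```python
import os

# Placeholder MIG device names -- substitute the UUIDs printed by `nvidia-smi -L`.
mig_uuids = [
    "MIG-GPU-xxxx/1/0",
    "MIG-GPU-xxxx/2/0",
]

# Must be set before the first CUDA call (i.e. before importing torch).
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(mig_uuids)

try:
    import torch

    # Expected 2 here, but it reports 1 on my setup: only the first
    # MIG instance in the list appears to be enumerated.
    print("torch.cuda.device_count() =", torch.cuda.device_count())
except ImportError:
    # torch not installed in this environment; the env-var setup above
    # is still what the training script would inherit.
    pass
```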
Is there any configuration that would allow me to run distributed training across multiple MIG instances? Any kind of help would be appreciated.