Hi,
I’m trying to run distributed training on a cloud instance with NVIDIA A100 80GB GPUs in MIG mode. I have seven MIG instances of type “1g.10gb”. I assigned multiple MIG instances to a container (using NVIDIA’s Kubernetes device plugin) and verified that the correct MIG instances are visible when running nvidia-smi inside the container.
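For reference, this is roughly how I check visibility inside the container (each MIG instance should be listed with its own MIG-&lt;UUID&gt; identifier; the second command is the PyTorch-side check mentioned below):

```shell
# List the GPUs and MIG instances the driver exposes to this container.
nvidia-smi -L

# Check how many CUDA devices PyTorch itself can enumerate.
python -c "import torch; print(torch.cuda.device_count())"
```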
My goal is to run distributed training on multiple MIG instances using YOLOv7 (this section). The problem is that the functions under torch.cuda do not seem to detect the MIG instances as separate GPU devices. When I run training on a single instance (passing either 0 or the UUID MIG-XYZ as the --device argument), training starts; however, when I pass 0,1 or MIG-XYZ,MIG-ABC as the --device argument, it throws errors and cannot perform distributed training.
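Concretely, the invocations look roughly like this (train.py stands in for the YOLOv7 training script; MIG-XYZ and MIG-ABC are placeholders for the actual UUIDs reported by nvidia-smi -L):

```shell
# Works: training on a single MIG instance, selected by index or by UUID.
python train.py --device 0
python train.py --device MIG-XYZ

# Fails: selecting two MIG instances for distributed training throws errors.
python train.py --device 0,1
python train.py --device MIG-XYZ,MIG-ABC
```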
Is there any configuration that would let me run distributed training on multiple MIG instances? Any help would be appreciated.