Hi,
I’m trying to run distributed training on a cloud instance with NVIDIA A100 80GB GPUs in MIG mode. I have seven MIG instances of type “1g.10gb”. I assigned multiple MIG instances to a container (using NVIDIA’s Kubernetes device plugin) and verified that the correct MIG instances are visible when running nvidia-smi inside the container.
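For reference, this is roughly how I check visibility inside the container (each MIG instance should be listed with its own MIG-&lt;UUID&gt; identifier; the second command is the PyTorch-side check mentioned below):

```shell
# List the GPUs and MIG instances the driver exposes to this container.
nvidia-smi -L

# Check how many CUDA devices PyTorch itself can enumerate.
python -c "import torch; print(torch.cuda.device_count())"
```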
My goal is to run distributed training on multiple MIG instances using YOLOv7 (this section). The problem is that the functions under torch.cuda do not seem to detect the MIG instances as separate GPU devices. When I run training on a single instance (passing either 0 or the UUID MIG-XYZ as the --device argument), training starts; however, when I pass 0,1 or MIG-XYZ,MIG-ABC as the --device argument, it throws errors and cannot perform distributed training.
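Concretely, the invocations look roughly like this (train.py stands in for the YOLOv7 training script; MIG-XYZ and MIG-ABC are placeholders for the actual UUIDs reported by nvidia-smi -L):

```shell
# Works: training on a single MIG instance, selected by index or by UUID.
python train.py --device 0
python train.py --device MIG-XYZ

# Fails: selecting two MIG instances for distributed training throws errors.
python train.py --device 0,1
python train.py --device MIG-XYZ,MIG-ABC
```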
Is there any configuration that would let me run distributed training on multiple MIG instances? Any help would be appreciated.