I have one server with two types of GPU (RTX 3090, TITAN RTX)

Hello
I’m a new user of this server, so first of all let me introduce my server spec:
Power: 1600W x 3
CPU: 2 x Intel® Xeon® Silver 4210R CPU @ 2.40GHz
GPU: TITAN RTX x5 + RTX 3090 x3

Now I’m trying to use my server for deep learning (object detection).
I use the PyTorch version of the EfficientDet code.
At first I had only TITAN RTX x5,
and I could train with all of the graphics cards.

But yesterday I bought another RTX 3090 x3 and mounted them in the server,
so I just changed num_gpu=8 in ./projects/proeject.yml,
and then it did not train.
I checked with nvidia-smi,
but nothing changed after each GPU had allocated almost 1.5 GB,
and PyTorch recognizes an RTX 3090 as cuda:0 even though the RTX 3090s are not on PCI bus number 0:
RTX 3090 PCI bus numbers = 1, 2, 3
TITAN RTX PCI bus numbers = 0, 4, 5, 6, 7
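
To show what I mean, here is a minimal check (a sketch, not my training code) that prints the order PyTorch sees:

# sketch: print the devices in the order PyTorch enumerates them
import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))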

So I suspected two things:

  1. CUDA 11.1 does not recognize two types of graphics card (RTX 3090, TITAN RTX)
    ==> but according to the NVIDIA Developer Forums, CUDA 11.1 can recognize two types of graphics card
  2. The DataParallel module is not compatible with two types of graphics card (see the small test sketch after this list)
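
For point 2, I think a minimal nn.DataParallel sketch like this (a toy model, not my EfficientDet code) could check whether two different cards work together at all:

# sketch: toy nn.DataParallel run across two GPUs of different types
import torch
import torch.nn as nn

# indices 0 and 1 are whatever PyTorch enumerates first; on this server they are assumed
# to be one RTX 3090 and one TITAN RTX
model = nn.DataParallel(nn.Linear(1024, 1024).cuda(0), device_ids=[0, 1])

x = torch.randn(64, 1024, device="cuda:0")
out = model(x)          # forward pass is split across both GPUs
out.sum().backward()    # backward pass gathers gradients back on cuda:0
print("finished:", out.shape)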

How can I use all of the GPUs for one task?

CUDA 11.1 does recognize Turing and Ampere GPUs, and it seems that they are indeed used, but your code seems to hang?
nn.DataParallel should also be compatible with different architectures, but note that the slower GPUs would most likely be the bottleneck and the faster ones would have to wait.

That’s expected and you can change it via export CUDA_DEVICE_ORDER=PCI_BUS_ID.
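
If you want to set it inside Python instead, it has to happen before CUDA is initialized; setting it before importing torch is the safe option (a minimal sketch):

# sketch: set the device order before importing torch so CUDA initialization picks it up
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])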

Thanks for the reply.

Yes, in my situation it hangs.

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

I used the code above in my script.
I waited more than 10 minutes, but it did not work.
If I wait longer, will it work?

Could you try to see where it hangs? Maybe add logging, or check which GPUs are busy and which are idle during training.
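
Something as simple as flushed prints with a synchronize around each step would already tell you which step never finishes; below is the pattern on a toy model (substitute your real model and data loader):

# sketch: per-step logging pattern on a toy model (replace with your training loop)
import time
import torch
import torch.nn as nn

model = nn.DataParallel(nn.Linear(1024, 1024).cuda(0),
                        device_ids=list(range(torch.cuda.device_count())))

for step in range(10):
    x = torch.randn(256, 1024, device="cuda:0")
    t0 = time.time()
    out = model(x)
    out.sum().backward()
    torch.cuda.synchronize()  # wait for the GPUs so the timing is meaningful
    print(f"step {step} took {time.time() - t0:.2f}s", flush=True)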

You can also run an experiment on only one TITAN RTX and one RTX 3090 and see whether that training completes.
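
For example by masking the devices before importing torch; with PCI_BUS_ID ordering, index 0 should be a TITAN RTX and index 1 an RTX 3090 on your server (a sketch; the indices are assumptions based on your bus numbers):

# sketch: expose only one TITAN RTX and one RTX 3090 to the process
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"   # bus 0 = TITAN RTX, bus 1 = RTX 3090 on your server

import torch
print(torch.cuda.device_count())   # should print 2
print([torch.cuda.get_device_name(i) for i in range(2)])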

If the issue is that some GPUs are faster and that leads to a work imbalance, maybe try to decrease the batch size and see if it helps.