I have one server with two types of GPU (RTX 3090, TITAN RTX)

I'm a new server user, so first let me introduce my server spec:
Power: 1600W x3
CPU: 2 x Intel® Xeon® Silver 4210R CPU @ 2.40GHz
GPU: TITAN RTX x5 + RTX 3090 x3

Now I'm trying to use my server for deep learning (object detection), with the PyTorch version of the EfficientDet code.
At first I had only the five TITAN RTX cards, and I could train with all of them.

But yesterday I bought three more RTX 3090s and mounted them in the server, so I just changed num_gpu=8 in ./projects/proeject.yml.
After that, it did not train.
I checked with nvidia-smi, but nothing changed after each GPU had allocated about 1.5 GB.
Also, PyTorch recognizes an RTX 3090 as cuda:0 even though the RTX 3090 is not on PCI bus number 0:
RTX 3090 PCI bus numbers: 1, 2, 3
TITAN RTX PCI bus numbers: 0, 4, 5, 6, 7

So I suspect two things:

  1. CUDA 11.1 does not recognize two types of graphics card (3090, TITAN)
    ==> but according to the NVIDIA Developer Forum, CUDA 11.1 can recognize two types of graphics card
  2. The DataParallel module is not compatible with two types of graphics card

How can I use all the GPUs for one task?


CUDA 11.1 does recognize Turing and Ampere GPUs, and it seems they are indeed used, but your code seems to hang?
nn.DataParallel should also be compatible with different architectures, but note that the slower GPUs would most likely be the bottleneck and the faster ones would have to wait.
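To illustrate, a minimal nn.DataParallel sketch with a hypothetical toy model (not the EfficientDet code): the module is replicated across all visible GPUs and each input batch is split along dim 0, so mixed architectures are fine, but each step finishes only when the slowest replica does.

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the real network.
model = nn.Linear(16, 4)
device = "cuda" if torch.cuda.is_available() else "cpu"

# nn.DataParallel uses all visible devices by default and scatters
# the batch across them; gradients are gathered back on device 0.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to(device)

out = model(torch.randn(8, 16, device=device))
print(out.shape)  # torch.Size([8, 4])
```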

That’s expected and you can change it via export CUDA_DEVICE_ORDER=PCI_BUS_ID.
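For example, a small sketch of setting the device order from Python; note these environment variables only take effect if set before CUDA is initialized, so ideally set them before importing torch (or export them in the shell before launching the script):

```python
import os

# Must be set before CUDA is initialized, otherwise they are ignored.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"        # enumerate GPUs by PCI bus ID
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

import torch

# Now cuda:0 should correspond to PCI bus 0 (the first TITAN RTX).
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```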

Thanks for the reply.

Yes, my situation is a hang.

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

I used the code above in my script and waited more than 10 minutes, but it did not work.
If I wait longer, will it work?

Could you try to see where it hangs, maybe add logging or see which GPUs are busy and idle during training?
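As a starting point, a minimal sketch for logging per-GPU activity from inside the training process; allocated memory alone doesn't prove progress, so pair it with a print per training step (you can also watch `nvidia-smi` in another terminal for utilization):

```python
import torch

# Log how much memory each visible GPU has allocated. A GPU stuck at a
# constant ~1.5 GB with 0% utilization in nvidia-smi suggests the process
# initialized CUDA but is hanging before or inside the first step.
for i in range(torch.cuda.device_count()):
    mb = torch.cuda.memory_allocated(i) / 1024**2
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}): {mb:.1f} MiB allocated")
```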

You can also do an experiment where you run on 1 TITAN and 1 RTX3090 only and see if such training completes.
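One way to run that experiment from the shell, assuming your training entry point is a script like `train.py` (hypothetical name; substitute your actual launch command):

```shell
# Restrict the process to one TITAN RTX and one RTX 3090.
# With CUDA_DEVICE_ORDER=PCI_BUS_ID, index 0 is the TITAN RTX on bus 0
# and index 1 is an RTX 3090 on bus 1, per the bus layout described above.
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
python train.py
```

Inside the process these two GPUs then appear as cuda:0 and cuda:1, so no code changes are needed beyond setting num_gpus to 2.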

If the issue is that some GPUs are faster, leading to a work imbalance, maybe try decreasing the batch size and see if that helps.