GPU utilization is low when using a CUDA extension on multiple GPUs

Hi, I built my CUDA extension module following this link. Everything works well when I only use 1 GPU, and its utilization is high (>90%). However, when I integrated it into my neural network and trained with 2 GPUs, the utilization of each GPU is pretty low (≈50%).
Any ideas? Thanks!
My environment:
PyTorch: 1.1
CUDA: 10.1
OS: Ubuntu 18
GPU: RTX 2080 Ti

Maybe you can analyze the running time of each part, such as data loading, the forward pass, and output processing. The slowdown may not be caused by the forward pass itself.
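For example, something along these lines can time each phase (model, inputs, criterion, and target below are just placeholders for your own objects); the torch.cuda.synchronize() calls matter because CUDA kernels run asynchronously:

import time
import torch

def timed(fn, *args):
    # CUDA launches are asynchronous, so synchronize before and after
    # to capture the real GPU time spent in each phase.
    torch.cuda.synchronize()
    start = time.time()
    out = fn(*args)
    torch.cuda.synchronize()
    return out, time.time() - start

output, fwd_time = timed(model, inputs)   # forward pass
loss = criterion(output, target)
_, bwd_time = timed(loss.backward)        # backward pass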

Hi @Mendel123, thanks for your reply.
After analyzing the running time of each part, the forward pass takes the largest share:

                          Forward    Backward   Data loading
2 GPUs (batch size 32)    1.1 s      0.27 s     0.001 s
1 GPU  (batch size 16)    0.4 s      0.27 s     0.001 s

I used Apex mixed-precision training; could this be related to Apex? Besides, I didn't implement a half() path in my module, so I just convert the inputs to float() before invoking it and convert the output back to half() afterwards.
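Roughly, the wrapper around my extension looks like this (my_cuda_ext.forward just stands in for my extension's entry point; the names are simplified):

import torch
import my_cuda_ext  # placeholder for the compiled extension module

def call_extension(x):
    # The kernel only implements float32, so cast down to float before
    # the call and back to half afterwards for mixed-precision training.
    out = my_cuda_ext.forward(x.float())
    return out.half()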

Checking line by line, I finally found that declaring variables and allocating CUDA memory inside a PyTorch CUDA extension greatly reduces GPU efficiency. After removing the following statements,

// device work buffers allocated inside the extension
float *dist_Dev;
gpuErrchk(cudaMalloc((void**)&dist_Dev, myParameter.obj_num * myParameter.cluster_num * sizeof(float)));
int *obj_num_Dev;
gpuErrchk(cudaMalloc((void**)&obj_num_Dev, myParameter.cluster_num * sizeof(int)));
int *num_per_classt;
gpuErrchk(cudaMalloc((void**)&num_per_classt, myParameter.t * myParameter.cluster_num * sizeof(int)));

the utilization of each GPU is high again. Each of these buffers is now created in PyTorch and passed to the CUDA module instead.
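As a rough sketch of the workaround (my_cuda_ext.forward and the shape arguments are placeholders), the buffers are allocated with torch.empty and handed to the extension, which presumably lets PyTorch's caching allocator reuse memory instead of calling cudaMalloc on every forward pass:

import torch
import my_cuda_ext  # placeholder for the compiled extension module

def forward_with_buffers(inputs, obj_num, cluster_num, t):
    # Work buffers that used to be cudaMalloc'ed inside the extension
    # are created here and passed to the kernels as tensors.
    dist = torch.empty(obj_num, cluster_num, dtype=torch.float32, device='cuda')
    obj_per_cluster = torch.empty(cluster_num, dtype=torch.int32, device='cuda')
    num_per_class = torch.empty(t, cluster_num, dtype=torch.int32, device='cuda')
    return my_cuda_ext.forward(inputs, dist, obj_per_cluster, num_per_class)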
However, this still doesn't explain why a single GPU works fine. I also compiled the C++/CUDA code into a .so file and invoked it via ctypes; in that setup, declaring variables as above causes no problem.
So I guess there might be an issue with the way the CUDA extension works.