GPU utilization is low when using a CUDA extension on multiple GPUs

Hi, I built my CUDA extension module following this link. Everything works well when I use only 1 GPU, and its utilization is high (>90%). However, when I integrated it into my neural network and trained with 2 GPUs, the utilization of each GPU was pretty low (≈50%).
Any ideas? Thanks!
My environment:
OS: Ubuntu 18
GPU: RTX 2080 Ti

Maybe you can measure the running time of each part, such as data loading, the model's forward pass, and output processing. The slowdown may not be caused by the forward pass.
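
A minimal sketch of how each part could be timed. The helper name `time_stage` and its `sync` parameter are hypothetical, introduced only for illustration. CUDA kernels are launched asynchronously, so when timing GPU work a synchronization call such as `torch.cuda.synchronize` should be passed as `sync`; otherwise the timer mostly measures the kernel launch, not the kernel itself.

```python
import time


def time_stage(fn, sync=None):
    """Run fn() once and return (result, elapsed_seconds).

    `sync` is an optional zero-argument callable that blocks until queued
    GPU work has finished, for example torch.cuda.synchronize.
    """
    if sync is not None:
        sync()  # wait for earlier work so it is not billed to this stage
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work launched by fn() before stopping the timer
    return result, time.perf_counter() - start


# Hypothetical usage inside a training loop (model, batch and loss are placeholders):
#   out, t_forward = time_stage(lambda: model(batch), sync=torch.cuda.synchronize)
#   _, t_backward = time_stage(lambda: loss.backward(), sync=torch.cuda.synchronize)
```

Timing the forward pass, the backward pass, and data loading separately this way gives numbers that can be compared between the 1-GPU and 2-GPU runs.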

Hi @Mendel123, thanks for your reply.
After measuring the running time of each part, I found that the forward pass takes the largest share.

                        Forward    Backward   Data
2 GPU (32 batchsize)    1.1 sec    0.27 sec   0.001
1 GPU (16 batchsize)    0.4 sec    0.27 sec   0.001

I used Apex mixed-precision training; could this be related to Apex? Besides, I didn’t implement half-precision support in my module, so I just convert the inputs with float() before invoking it and convert the outputs back with half() afterwards.
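
The float()/half() workaround described above could be wrapped like this. It is only a sketch: `call_fp32_only` is a hypothetical helper name, and `op` stands in for the custom extension function, which is not shown in this thread.

```python
import torch


def call_fp32_only(op, x):
    """Call an fp32-only op on a tensor of any floating dtype.

    The input is cast to float32 before the call, and the result is cast
    back to the input's original dtype (for example torch.half).
    """
    out = op(x.float())
    return out.to(x.dtype)
```

Note that each call adds two extra casts, so their cost is included in the forward time measured around the op.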

Checking line by line, I finally found that declaring variables and allocating CUDA memory inside the PyTorch CUDA extension greatly reduces GPU efficiency. After removing the following statements,

float *dist_Dev; 
gpuErrchk(cudaMalloc((void**)&dist_Dev, myParameter.obj_num * myParameter.cluster_num * sizeof(float)));
int *obj_num_Dev; 
gpuErrchk(cudaMalloc((void**)&obj_num_Dev, myParameter.cluster_num * sizeof(int)));
int *num_per_classt;
gpuErrchk(cudaMalloc((void**)&num_per_classt, myParameter.t * myParameter.cluster_num * sizeof(int)));

the utilization of each GPU is high. Each of these buffers is now created in PyTorch and then passed to the CUDA module.
However, this still can’t explain why a single GPU works fine. I also compiled the C++/CUDA code into a .so file and invoked it through ctypes. The result shows that declaring variables as above is fine there.
So I guess there might be a problem with the way the CUDA extension works.
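
For reference, creating the three buffers in PyTorch and passing them to the CUDA module, as described above, might look roughly like the sketch below. The function name `make_workspace` is hypothetical, and the shapes are inferred from the sizes in the removed cudaMalloc calls. Tensors created this way come from PyTorch's caching allocator, which generally reuses freed blocks instead of calling cudaMalloc on every request. Whether that fully accounts for the multi-GPU slowdown is not established in this thread.

```python
import torch


def make_workspace(points, cluster_num, t):
    """Allocate the scratch buffers on the same device as `points`.

    Assumes points.shape[0] is obj_num. The buffers are uninitialized,
    just like memory returned by cudaMalloc.
    """
    obj_num = points.shape[0]
    dev = points.device
    # Replaces dist_Dev: obj_num * cluster_num floats
    dist = torch.empty((obj_num, cluster_num), dtype=torch.float32, device=dev)
    # Replaces obj_num_Dev: cluster_num ints
    obj_num_buf = torch.empty(cluster_num, dtype=torch.int32, device=dev)
    # Replaces num_per_classt: t * cluster_num ints
    num_per_class = torch.empty((t, cluster_num), dtype=torch.int32, device=dev)
    return dist, obj_num_buf, num_per_class


# Hypothetical usage; my_ext.forward is a placeholder for the extension entry point:
#   dist, obj_num_buf, num_per_class = make_workspace(points, cluster_num, t)
#   out = my_ext.forward(points, dist, obj_num_buf, num_per_class)
```

Because the buffers take their device from the input tensor, each replica in a multi-GPU run gets memory on its own GPU.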