Hi, I built my CUDA extension module following this link. Everything works well when I use only 1 GPU, and its utilization is high (>90%). However, when I integrated it into my neural network and trained with 2 GPUs, the utilization of each GPU is pretty low (≈50%).
Any ideas? thanks!
My environments:
Pytorch:1.1
CUDA:10.1
OS:Ubuntu 18
GPU: RTX2080ti
Maybe you can analyze the running time of each part, such as data loading, model forwarding, and output processing. The bottleneck may not be the forward pass.
Hi, @Mendel123 thanks for your reply.
After analyzing the running time of each part, the forward pass takes the largest share:
| | Forward | Backward | Data |
|---|---|---|---|
| 2 GPUs (batch size 32) | 1.1 sec | 0.27 sec | 0.001 sec |
| 1 GPU (batch size 16) | 0.4 sec | 0.27 sec | 0.001 sec |
I used APEX mixed-precision training; could this be related to Apex? Besides, I didn't implement a half() path in my module, so I convert inputs to float() before invoking it and convert the output back to half() afterwards.
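The dtype round-trip described above can be sketched as a small wrapper (a minimal sketch; `ext_forward` is a hypothetical stand-in for the extension's forward function, which is not shown in the thread):

```python
import torch

def call_float_only_extension(x_half, ext_forward):
    """Wrap a custom op that only supports float32 inside an AMP model.

    `ext_forward` is assumed to take and return a float32 tensor.
    """
    x = x_half.float()    # half -> float32 before the call
    out = ext_forward(x)  # extension runs entirely in float32
    return out.half()     # float32 -> half for the rest of the AMP graph
```

The wrapper works with any callable, so it can be checked on CPU with a plain PyTorch function standing in for the extension.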
Checking line by line, I finally found that declaring variables and allocating CUDA memory inside a PyTorch CUDA extension greatly reduces GPU efficiency. By removing the following statements:
float *dist_Dev;
gpuErrchk(cudaMalloc((void**)&dist_Dev, myParameter.obj_num * myParameter.cluster_num * sizeof(float)));
int *obj_num_Dev;
gpuErrchk(cudaMalloc((void**)&obj_num_Dev, myParameter.cluster_num * sizeof(int)));
int *num_per_classt;
gpuErrchk(cudaMalloc((void**)&num_per_classt, myParameter.t * myParameter.cluster_num * sizeof(int)));
, the utilization of each GPU is high. Each of these buffers is now created in PyTorch and passed to the CUDA module instead.
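Creating the work buffers on the PyTorch side and handing them to the extension can be sketched as follows (the sizes and the `my_ext.forward` signature are assumptions based on the removed cudaMalloc calls above):

```python
import torch

# Hypothetical sizes matching the removed cudaMalloc calls
obj_num, cluster_num, t = 1024, 16, 4
device = "cuda" if torch.cuda.is_available() else "cpu"  # CPU fallback for illustration

# Buffers created via PyTorch's caching allocator, reusable across calls
dist      = torch.empty(obj_num, cluster_num, dtype=torch.float32, device=device)
obj_count = torch.empty(cluster_num, dtype=torch.int32, device=device)
per_class = torch.empty(t, cluster_num, dtype=torch.int32, device=device)

# The extension then receives tensors instead of raw device pointers, e.g.:
# my_ext.forward(inputs, dist, obj_count, per_class)  # hypothetical signature
```

One reason this helps: `torch.empty` goes through PyTorch's caching allocator and usually reuses memory, whereas a raw `cudaMalloc` on every forward call is an expensive driver operation that can serialize work across GPUs.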
However, this still can't explain why a single GPU works fine. I also compiled the C++/CUDA code into a .so file and invoked it via ctypes. The result shows that declaring variables as above is fine in that setup. So I guess there might be a problem with the way the CUDA extension mechanism works.
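For reference, the ctypes route looks like this. Since the author's .so is not shown, the C math library's `cos` is used here purely as a stand-in for any compiled shared library:

```python
import ctypes
import ctypes.util

# Load a shared library (stand-in: libm instead of the custom CUDA .so)
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature before calling, just as one would for the
# extension's entry point
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

# For a CUDA .so the call would pass device pointers the same way,
# e.g. obtained from tensor.data_ptr() on the PyTorch side.
```

With ctypes there is no PyTorch extension machinery in the call path, which is what makes this a useful comparison against the extension-based invocation.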