Libtorch on Linux cluster much slower than on my local machine

Hi, when I run a small CNN model in C++ using libtorch on a Linux cluster, I find it is much slower than on my local machine. Does anyone know the reason? The network input is 2x128x128. On my local machine the forward pass takes only 0.002 seconds, but on the Linux cluster it takes roughly 0.3 seconds. Could this issue have anything to do with loading dynamic libraries or with multithreading? I mean, since the input and network are both very small, could multithreading actually make things slower?
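For context, this is roughly how I am timing the forward pass (the model path below is a placeholder; I load a scripted model with torch::jit::load):

```cpp
#include <torch/script.h>
#include <torch/torch.h>
#include <chrono>
#include <iostream>

int main() {
    // Placeholder path; the real model is a small scripted CNN.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.eval();

    // Input matching the network: batch of 1, 2 channels, 128x128.
    torch::Tensor input = torch::randn({1, 2, 128, 128});

    torch::NoGradGuard no_grad;
    auto start = std::chrono::steady_clock::now();
    torch::Tensor out = module.forward({input}).toTensor();
    auto end = std::chrono::steady_clock::now();
    std::cout << "forward took "
              << std::chrono::duration<double>(end - start).count()
              << " s" << std::endl;
}
```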

Are you using the CPU only or also a GPU?
In the latter case, are you synchronizing the code? Since CUDA operations are asynchronous, you would have to synchronize the code via torch.cuda.synchronize() to get a valid profiling result.
If you are using a GPU, do both machines have the same GPU, CUDA, and cudnn version?
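In C++, the rough equivalent would be torch::cuda::synchronize() from torch/cuda.h (I believe it is available in recent libtorch releases); a minimal sketch:

```cpp
#include <torch/torch.h>
#include <torch/script.h>
#include <chrono>

// Sketch: timing a GPU forward pass with explicit synchronization.
// `module` is a loaded TorchScript model; `input` is assumed to
// already live on the GPU.
double timed_forward(torch::jit::script::Module& module,
                     const torch::Tensor& input) {
    torch::cuda::synchronize();  // drain any pending kernels first
    auto start = std::chrono::steady_clock::now();
    auto out = module.forward({input}).toTensor();
    torch::cuda::synchronize();  // wait until this forward has really finished
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```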

Also, how did you install PyTorch on these machines? Did you build from source or install a binary?
Are both PyTorch versions the same?

Hi, thank you for helping!

In both cases, I am using the CPU only.

I installed from a binary in both cases. On my local machine I am using the libtorch build for Mac, while on the Linux cluster I have tried several libtorch versions and always observed this issue.

I also noticed that on the cluster the running time varies a lot between runs: sometimes a forward takes 0.4 seconds, sometimes only about 0.03 seconds for the same input. In addition, after I reduced the model size from 90 M to 4 M, the speed on my local machine improved, but on the cluster it stayed almost unchanged. It feels as if there is some extra processing on the cluster that takes time. I am quite confused.

A 10x variation between runs sounds strange.
Are you seeing these 0.4 s and 0.03 s runs randomly, or is the first pass slower than the rest?
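One quick way to tell them apart is to run a few warmup passes first and then time the rest; a sketch, reusing the TorchScript module from above:

```cpp
#include <torch/torch.h>
#include <torch/script.h>
#include <chrono>
#include <iostream>

// Sketch: separate the first (warmup) passes from steady-state timings.
void profile(torch::jit::script::Module& module, const torch::Tensor& input) {
    torch::NoGradGuard no_grad;
    for (int i = 0; i < 5; ++i) {    // warmup: absorbs one-time setup costs
        module.forward({input});
    }
    for (int i = 0; i < 20; ++i) {   // timed passes
        auto start = std::chrono::steady_clock::now();
        module.forward({input});
        auto end = std::chrono::steady_clock::now();
        std::cout << "pass " << i << ": "
                  << std::chrono::duration<double>(end - start).count()
                  << " s" << std::endl;
    }
}
```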

Hi, sorry for the late reply.
First, regarding your question: I was seeing these 0.4 s and 0.03 s timings in almost every run.
Later, I found that the issue can be fixed by calling at::init_num_threads(); at the start of the program (see the sketch below).
Thanks again for your kind help!
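For anyone hitting the same problem, here is roughly where I placed the call. at::init_num_threads() comes from ATen/Parallel.h; the commented-out at::set_num_threads() line is just a suggestion for pinning the intra-op thread count explicitly:

```cpp
#include <ATen/Parallel.h>
#include <torch/script.h>
#include <torch/torch.h>

int main() {
    // Initialize ATen's intra-op thread pool up front. Without this,
    // thread setup on the cluster added large, erratic per-forward overhead.
    at::init_num_threads();

    // Optional: pin the number of intra-op threads explicitly.
    // at::set_num_threads(4);

    torch::jit::script::Module module = torch::jit::load("model.pt");
    torch::Tensor input = torch::randn({1, 2, 128, 128});
    torch::NoGradGuard no_grad;
    auto out = module.forward({input}).toTensor();
}
```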