A strange performance problem with the CPU version of libtorch

A. ENV

  1. CentOS 7
  2. Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
  3. gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
  4. Python 3.8.5 (default, Sep 4 2020, 07:30:14)
  5. libtorch 1.8.1 CPU version
  6. x86_64 4CPU

B. TEST CODE

torch::jit::script::Module module;
module = torch::jit::load(argv[1]);
std::vector<torch::jit::IValue> inputs;
at::Tensor input = torch::ones({1, 300, 40});
inputs.push_back(input);

double total_duration = 0;
for (int i = 0; i < num; i++) {
    double b = time_get_ms();
    std::cout << "module forward begin" << std::endl;
    at::Tensor output = module.forward(inputs).toTensor();
    double e = time_get_ms();
    double d = e - b;
    total_duration += d;

    std::cout << "forward output size : " << output.numel() << ", No." << i << " forward time : " << d << std::endl;

    usleep(100 * 1000);  // sleep 100ms
}
double avg_duration = total_duration / num;
std::cout << "AVG forward time : " << avg_duration << std::endl;

C. Performance

  1. When I comment out the “usleep(100 * 1000)” call, the latency is good.

  2. When I keep the “usleep(100 * 1000)” call, the latency is bad.

  3. The above behavior is fully reproducible.

D. Question

  1. Why do I get worse performance when I execute the “usleep(100 * 1000)” call?
  2. The effect is even more severe in the multi-threaded scenario (for example, at::set_num_threads(4)).
  3. The performance data is in the comments.

When I comment out the “usleep(100 * 1000)” call, the latency is good, as shown in the picture below:
good_perf

When I keep the “usleep(100 * 1000)” call, the latency is bad, as shown in the picture below:
bad_perf

I have profiled both cases, as shown in the figures below:

no sleep 100ms

sleep 100ms

I’m not sure I understand the question correctly. If you let the main thread sleep, the loop would be paused and the execution of the next iteration would be delayed. What is the expectation?

The forward time is the duration of the forward() execution itself, not its start time. When I call usleep() in the test code, each forward() call takes longer.

LOL, I’ve solved this problem, and the reason is CPU frequency scaling.

A. REASON:

  1. When I comment out the “usleep(100 * 1000)” call, the CPU stays in a high-performance state; you can check the ‘CPU MHz’ value by running the command: lscpu
  2. When I keep the “usleep(100 * 1000)” call, the CPU switches between high-performance and low-performance states, which slows down the calculation, so the ‘module.forward’ call takes longer.
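One way to observe this effect directly from the test program is to poll the current frequency from the cpufreq sysfs interface during the run. A minimal sketch (the sysfs path is standard on Linux with cpufreq enabled, but may be absent inside containers or VMs, so the helper reports failure with -1):

```cpp
#include <fstream>
#include <string>

// Read the current frequency (in kHz) of one CPU from the cpufreq sysfs
// interface; returns -1 if the file is unavailable or unreadable
// (e.g. inside a container or VM without cpufreq support).
long read_cpu_khz(int cpu) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_cur_freq");
    long khz = -1;
    f >> khz;
    return f ? khz : -1;
}
```

Calling this before and after the usleep() in the loop would show the frequency dropping while the core is idle and ramping back up when forward() starts again.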

B. HOW DID I FIND THIS?

  1. I did a lot of experiments. I was confused by the phenomenon; it was very strange. There was no reason to suspect that the ‘module.forward’ function has a bug just because ‘usleep(100*1000)’ is running;

  2. While an experiment was running, I checked the CPU state and found that the CPU MHz was changing. :boom: :boom: :boom: :laughing: :laughing: :laughing: Inspiration suddenly struck: I remembered that this slows down the calculation, because I had hit a similar problem before.

  3. So I ran the command: cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
    The output was: powersave

  4. Then I ran the command: sudo cpupower -c all frequency-set -g performance

  5. I reran the test code, and the latency was stable.
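If changing the governor is not an option (for example, no root access), a workaround is to keep the core busy between iterations so cpufreq never downclocks it. A busy-wait sketch in place of usleep() (note this trades CPU cycles for stable latency, so it is a workaround rather than a general recommendation):

```cpp
#include <chrono>

// Busy-wait for the given number of milliseconds instead of sleeping,
// so the CPU stays active (and at high frequency) between forward() calls.
void busy_wait_ms(int ms) {
    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < deadline) {
        // spin; keeping the core busy prevents the governor from downclocking
    }
}
```

Replacing `usleep(100 * 1000)` with `busy_wait_ms(100)` in the test loop should reproduce the "no sleep" latency numbers even under the powersave governor.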


That’s great debugging and thanks for sharing it here! :slight_smile: