What is the forward function doing at the beginning?

A. ENV

1. libtorch 1.8.1 CPU version
2. CentOS 7
3. Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
4. gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
5. x86_64, 4 CPUs

B. TEST CODE

#include <torch/script.h>
#include <torch/csrc/autograd/profiler.h>  // for RecordProfile
#include <iostream>
#include <vector>

double time_get_ms();  // the poster's millisecond timer helper (not shown in the post; a sketch follows below)

int main(int argc, char* argv[]) {
    torch::jit::script::Module module = torch::jit::load(argv[1]);
    std::vector<torch::jit::IValue> inputs;
    at::Tensor input = torch::ones({1, 300, 40});
    inputs.push_back(input);

    torch::NoGradGuard guard;        // inference only, no autograd
    at::set_num_interop_threads(1);  // single inter-op thread
    at::set_num_threads(1);          // single intra-op thread

    // record profile
    torch::autograd::profiler::RecordProfile perf_guard(argv[2]);

    const int num = 10;  // matches No.0 .. No.9 in the latency data below
    double total_duration = 0;
    for (int i = 0; i < num; i++) {
        double b = time_get_ms();
        std::cout << "module forward begin" << std::endl;
        at::Tensor output = module.forward(inputs).toTensor();
        double e = time_get_ms();
        double d = e - b;
        std::cout << "forward output size : " << output.numel()
                  << ", No." << i << " forward time : " << d << std::endl;
        total_duration = total_duration + d;
    }

    double avg_duration = total_duration / num;
    std::cout << "AVG forward time : " << avg_duration << std::endl;
    return 0;
}
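time_get_ms() is not defined in the post; a minimal stand-in using std::chrono could look like this (a hypothetical sketch, the original helper may differ):

#include <chrono>

// Hypothetical implementation of the poster's timer helper:
// wall-clock time in milliseconds as a double.
double time_get_ms() {
    using namespace std::chrono;
    return duration<double, std::milli>(steady_clock::now().time_since_epoch()).count();
}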

C. LATENCY

See the comments for the specific data.

  1. No.0 forward duration : 188.196 ms;
  2. No.1 forward duration : 106.588 ms;
  3. No.2 ~ No.9 forward duration : about 70 ms;

D. PERFORMANCE

[performance analysis chart attached in the original post]

E. QUESTION

  1. What is the ‘module.forward’ function doing at the beginning?
  2. I guess that when the ‘module.forward’ function runs for the first time, it may be doing some kind of warm-up work?
  3. If warm-up is needed during the first ‘module.forward’ run, why does the function also spend so much time at the beginning of the second run, while the subsequent runs do not?

[chart: time of each run (“good_perf”)]

The JIT is optimizing the model and its graph in the first iterations, which adds some overhead to them. Generally, the first iteration would also see some overhead from the CUDA memory allocation (which would then be cached), etc.
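A practical consequence (a minimal sketch reusing the variables from the test code above; the warm-up count of 3 is an arbitrary assumption, not a documented constant) is to exclude a few warm-up iterations from the measurement so the one-time optimization cost does not skew the average:

// Run a few untimed warm-up passes first ...
const int kWarmup = 3;  // assumption: enough runs for the JIT to settle
for (int i = 0; i < kWarmup; i++) {
    module.forward(inputs);
}

// ... then measure only the steady state.
double total_ms = 0;
for (int i = 0; i < num; i++) {
    double b = time_get_ms();
    module.forward(inputs);
    total_ms += time_get_ms() - b;
}
std::cout << "steady-state AVG forward time : " << total_ms / num << std::endl;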

Thank you very much, but I still have two questions:

  1. I built libtorch with the option ‘USE_CUDA=OFF’. I don’t know whether libtorch will allocate CUDA memory or not?
  2. If the JIT is optimizing the model and its graph in the first iterations, why does libtorch take some time at the beginning of the second iteration? You can see the time consumed in the performance analysis chart.

Have a good day, thanks.

  1. I would assume that the GPU is not usable if CUDA is deactivated.

  2. I think the first 3(?) iterations are used for the optimizations.

  1. As the performance analysis chart shows, the first three iterations are used for the optimizations. I am confused about this phenomenon: what is the second iteration doing at the beginning?

  2. Generally speaking, the first iterations take some time to do the optimizations, and then the same work does not need to be repeated later, apart from any additional optimization passes. I don’t know whether this assumption is correct?

  3. I did an experiment a moment ago: I ran ‘torch::jit::GraphOptimizerEnabledGuard _optimize_guard(true);’ before the ‘module.forward’ call, hoping to make the module perform the optimizations eagerly, but it didn’t seem to work.

We would have to wait for a dev working on the JIT, as they know in detail which optimizations are performed. However, to the best of my knowledge, multiple passes are needed for the different stages of optimization, so a single forward pass won’t be enough.
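For what it is worth, this multi-pass behavior matches my reading of the profiling graph executor in libtorch 1.8: the first run(s) profile the graph, and a later run compiles the specialized, optimized graph, which would also explain extra time at the beginning of the second iteration. The knobs below are internal, undocumented APIs; the header paths and defaults are assumptions from reading the 1.8 sources, not guarantees:

#include <torch/csrc/jit/runtime/graph_executor.h>
#include <torch/csrc/jit/runtime/profiling_graph_executor_impl.h>

// Internal knobs of the profiling executor (assumed from the sources):
torch::jit::getProfilingMode() = true;  // enable profiling-based optimization
torch::jit::getExecutorMode() = true;   // use the profiling graph executor
torch::jit::getNumProfiledRuns() = 1;   // runs spent profiling before optimizing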

Thank you. I have done another experiment: I set the parameter of the GraphOptimizerEnabledGuard to false instead of true, and it works.

code:

torch::jit::GraphOptimizerEnabledGuard optimize_guard(false);
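
For context (my understanding of the guard, not something stated in the thread): GraphOptimizerEnabledGuard is an RAII guard, so the setting only applies within its scope, and disabling optimization trades the slow first iterations for a slower, unoptimized steady state:

{
    torch::jit::GraphOptimizerEnabledGuard no_opt(false);  // saves the old setting
    module.forward(inputs);  // runs on the unoptimized graph
}  // previous optimizer setting is restored here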

I’m looking forward to learning the details of the optimization.