What is the forward function doing at the beginning?

JeremyGong · April 14, 2021, 11:31am

A. ENV

1. libtorch 1.8.1 CPU version
2. centos 7
3. Linux localhost.localdomain 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
4. gcc version 9.3.1 20200408 (Red Hat 9.3.1-2) (GCC)
5. x86_64 4 CPU

B. TEST CODE

torch::jit::script::Module module;
module = torch::jit::load(argv[1]);
std::vector<torch::jit::IValue> inputs;
at::Tensor input = torch::ones({1, 300, 40});
inputs.push_back(input);

torch::NoGradGuard guard;
at::set_num_interop_threads(1);
at::set_num_threads(1);

// record profile
torch::autograd::profiler::RecordProfile perf_guard(argv[2]);

for (int i=0; i < num; i++) {
    b = time_get_ms();
    std::cout << "module forward begin" << std::endl;
    at::Tensor output = module.forward(inputs).toTensor();
    e = time_get_ms();
    d = e - b;
    std::cout << "forward output size : " << output.numel() << ", No." << i <<" forward time : " << d << std::endl;
    total_duration = total_duration + d;
}

avg_duration = total_duration / num ;
std::cout << "AVG forward time : " << avg_duration << std::endl;

C. LATENCY

See the comment below for the per-run timings.

No.0 forward duration : 188.196 ms;
No.1 forward duration : 106.588 ms;
No.2 ~ No.9 forward duration : about 70 ms;

D. PERFORMANCE

E. QUESTION

What is the ‘module.forward’ function doing at the beginning?
I guess that when ‘module.forward’ runs for the first time, it may be doing some kind of warm-up.
If warm-up is needed during the first ‘module.forward’ run, why does the function also spend so much time at the beginning of the second run, while the subsequent runs do not take as long?

JeremyGong · April 14, 2021, 11:55am

Time of each run

(image: good_perf)

ptrblck · April 15, 2021, 6:16am

The JIT is optimizing the model and its graph in the first iterations, which adds some overhead to them. Generally, the first iteration would also see an overhead from the CUDA memory allocation (which would then be cached) etc.
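
Since the first iterations carry this one-time overhead, a common way to measure steady-state latency is to run a few untimed warm-up iterations before timing. Below is a minimal sketch of that pattern. The helper names time_after_warmup and mean_ms are hypothetical and are not part of libtorch, and the code is plain C++ so it can be tried without a model.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a libtorch API): calls fn() `warmup` times without
// recording anything, then `iters` more times, recording each wall-clock
// duration in milliseconds.
template <typename Fn>
std::vector<double> time_after_warmup(Fn&& fn, std::size_t warmup,
                                      std::size_t iters) {
    using clock = std::chrono::steady_clock;
    for (std::size_t i = 0; i < warmup; ++i) {
        fn();  // untimed: absorbs one-time costs such as JIT optimization
    }
    std::vector<double> ms;
    ms.reserve(iters);
    for (std::size_t i = 0; i < iters; ++i) {
        const auto b = clock::now();
        fn();
        const auto e = clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(e - b).count());
    }
    return ms;
}

// Arithmetic mean of the recorded durations; returns 0 for an empty vector.
inline double mean_ms(const std::vector<double>& ms) {
    if (ms.empty()) return 0.0;
    double total = 0.0;
    for (double v : ms) total += v;
    return total / static_cast<double>(ms.size());
}
```

With the test code from the first post, fn would be something like [&] { module.forward(inputs); }, and a warmup of 2 or 3 would be a reasonable starting point given the timings reported above. The right number is an assumption to verify per model.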

JeremyGong · April 15, 2021, 6:41am

Thank you very much, but I still have two questions:

I use libtorch with the option ‘USE_CUDA = OFF’. Will libtorch still allocate CUDA memory in that case?
If the JIT is optimizing the model and its graph in the first iterations, why does libtorch also take extra time at the beginning of the second iteration? You can see the time consumed in the performance analysis chart.

Have a good day, thanks.

ptrblck · April 15, 2021, 6:57am

I would assume that the GPU is not usable, if CUDA is deactivated.
I think the first 3(?) iterations are used for the optimizations.

JeremyGong · April 15, 2021, 7:13am

As the performance analysis chart shows, the third iteration is used for the optimizations. I am confused by this: what is the second iteration doing at the beginning?
Generally speaking, the first iteration takes some time to do the optimizations, and after that there should be no need to repeat them, apart from any other optimizations that remain. I don’t know whether this assumption is correct.
I ran an experiment a moment ago: I added ‘torch::jit::GraphOptimizerEnabledGuard _optimize_guard(true);’ before the ‘module.forward’ call, hoping to make the module perform the optimizations up front, but it didn’t seem to work.

ptrblck · April 15, 2021, 7:20am

We would have to wait for a dev working on the JIT as they know in detail which optimizations are performed. However, to the best of my knowledge, multiple passes are needed for different stages of optimizations so a single forward pass won’t be enough.

JeremyGong · April 15, 2021, 7:41am

Thank you. I ran the experiment again, this time setting the parameter of GraphOptimizerEnabledGuard to false instead of true, and it works.

Code:

torch::jit::GraphOptimizerEnabledGuard optimize_guard(false);

I’m looking forward to getting the details of the optimization.