{method 'run_backward' of 'torch._C._EngineBase' objects} is very slow

Dear all,

I am developing a mathematical model in PyTorch and use torch.autograd to provide the gradient, Jacobian, and Hessian to an optimization solver. The solver is cyipopt, a Python wrapper for IPOPT.
When I tested my script, the optimization step was very slow.
I profiled the script with cProfile:
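
A minimal sketch of this kind of setup is shown below. The objective and constraints are hypothetical stand-ins, not the actual model, and the function names are chosen only for illustration. It shows one way torch.autograd can produce the gradient, Jacobian, and Hessian as NumPy arrays that a solver callback could return:

```python
import torch
from torch.autograd.functional import hessian, jacobian


# Hypothetical objective and constraints (assumptions, not the real model).
def objective(x):
    return (x ** 2).sum() + x[0] * x[1]


def constraints(x):
    return torch.stack([x[0] + x[1], x[0] * x[1]])


def gradient(x_values):
    # Gradient of the objective via a single backward pass.
    x = torch.tensor(x_values, dtype=torch.float64, requires_grad=True)
    (grad,) = torch.autograd.grad(objective(x), x)
    return grad.numpy()


def constraint_jacobian(x_values):
    # Dense Jacobian of the constraint vector with respect to x.
    x = torch.tensor(x_values, dtype=torch.float64)
    return jacobian(constraints, x).numpy()


def objective_hessian(x_values):
    # Dense Hessian of the objective with respect to x.
    x = torch.tensor(x_values, dtype=torch.float64)
    return hessian(objective, x).numpy()


print(gradient([1.0, 2.0]))
print(constraint_jacobian([1.0, 2.0]))
print(objective_hessian([1.0, 2.0]))
```

Each solver iteration would typically call functions like these once or more, so every call triggers at least one pass through the autograd engine.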
python -m cProfile -s 'tottime' hoge.py
and got the following output:

         340915 function calls (332141 primitive calls) in 6.423 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       43    4.547    0.106    4.547    0.106 {method 'run_backward' of 'torch._C._EngineBase' objects}
    94/92    1.292    0.014    1.294    0.014 {built-in method _imp.create_dynamic}
      715    0.109    0.000    0.109    0.000 {method 'read' of '_io.BufferedReader' objects}
      715    0.044    0.000    0.044    0.000 {built-in method marshal.loads}
     3232    0.029    0.000    0.029    0.000 {built-in method posix.stat}

Apparently, {method 'run_backward' of 'torch._C._EngineBase' objects} is the bottleneck: its 43 calls account for 4.547 s of the 6.423 s total.
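
The whole-script profile also includes import overhead such as _imp.create_dynamic. One way to keep that out of the measurement is to enable cProfile only around the optimization step. This is a minimal sketch using only the standard library; solve_step is a hypothetical placeholder for the actual solver call, not a function from the script:

```python
import cProfile
import io
import pstats


def solve_step():
    # Hypothetical placeholder for the optimization call being measured.
    return sum(i * i for i in range(10000))


profiler = cProfile.Profile()
profiler.enable()    # start collecting only here, after all imports are done
result = solve_step()
profiler.disable()   # stop before any teardown code runs

# Sort by internal time, matching the -s 'tottime' option used on the command line.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("tottime")
stats.print_stats(10)
report = stream.getvalue()
print(report)
```

Comparing such a restricted profile between the two ways of launching the script would show whether the difference is confined to run_backward itself.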

Interestingly, when I profile the same script with:
python -m torch.utils.bottleneck hoge.py
the optimization step finishes in a reasonable amount of CPU time, and the profiler reports:

--------------------------------------------------------------------------------
  cProfile output
--------------------------------------------------------------------------------
         80654 function calls (78746 primitive calls) in 0.160 seconds

   Ordered by: internal time
   List reduced from 945 to 15 due to restriction <15>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       26    0.018    0.001    0.018    0.001 {built-in method prod}
      180    0.014    0.000    0.014    0.000 {built-in method marshal.loads}
       50    0.012    0.000    0.012    0.000 {built-in method _imp.create_dynamic}
       43    0.011    0.000    0.011    0.000 {method 'run_backward' of 'torch._C._EngineBase' objects}
    50/45    0.006    0.000    0.017    0.000 {built-in method _imp.exec_dynamic}
        1    0.005    0.005    0.045    0.045 {method 'solve' of 'ipopt_wrapper.Problem' objects}

Here, the same 43 calls to {method 'run_backward' of 'torch._C._EngineBase' objects} take only 0.011 s in total.

Where does this discrepancy come from? Why does the script run at a reasonable speed under torch.utils.bottleneck but not without the profiler? How can I get it to run at that speed normally?

Thank you very much in advance.