Question about backward propagation speed in the C++ extension from the tutorial

Hello, everyone. I have a question about the speed of backward propagation when writing a C++ extension.

I downloaded the source code for the PyTorch tutorial “Custom C++ and CUDA extensions” (https://github.com/pytorch/extension-cpp).

When I ran benchmark.py, I found that backward propagation takes much longer with the C++ extension than with the Python version (about 536 us vs. 366 us), even though the forward pass is faster. I do not know why this happens.

The results are listed below. I used the original, unmodified code from the extension-cpp repo.

(base) wmf997@wmf997-E743-Q7C08:~/extension-cpp-master$ python benchmark.py py
Forward: 246.525/250.680 us | Backward 365.973/376.601 us
(base) wmf997@wmf997-E743-Q7C08:~/extension-cpp-master$ python benchmark.py cpp
Forward: 178.099/180.986 us | Backward 536.442/549.297 us

I also ran the benchmark with 10000 runs; the results are listed below. (The results above are from 100 runs.)

(base) wmf997@wmf997-E743-Q7C08:~/extension-cpp-master$ python benchmark.py --runs 10000 py
Forward: 245.571/334.904 us | Backward 363.827/486.081 us
(base) wmf997@wmf997-E743-Q7C08:~/extension-cpp-master$ python benchmark.py --runs 10000 cpp 
Forward: 177.860/290.786 us | Backward 537.395/834.295 us
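
For context, here is a minimal sketch of the kind of timing loop that produces pairs like the ones above. It assumes each pair is minimum/average time per run, which is how I understand benchmark.py to report them. It uses only the standard library; time_min_avg and dummy_workload are hypothetical names for illustration, not code from the repo.

```python
import time


def time_min_avg(fn, runs=100, scale=1e6):
    """Call fn() `runs` times and return (min, average) elapsed time.

    With scale=1e6 the values are in microseconds.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    best = float("inf")
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
        total += elapsed
    return best * scale, total / runs * scale


def dummy_workload():
    # Hypothetical stand-in for a forward or backward pass.
    return sum(i * i for i in range(1000))


fwd_min, fwd_avg = time_min_avg(dummy_workload, runs=100)
print(f"Forward: {fwd_min:.3f}/{fwd_avg:.3f} us")
```

In the real benchmark, fn would wrap the forward call or the backward call on the LLTM module. The minimum is less sensitive to occasional slow runs than the average, which may explain why the first number in each pair barely changes between 100 and 10000 runs while the second grows.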

I ran the code on Kubuntu 20.04 with Python 3.8.5 (Anaconda) and PyTorch 1.7.0+cpu, installed via pip. The CPU is an Intel Core i7-3540M.