I have a custom PyTorch model whose forward method I moved from Python to C++ to speed it up. However, I notice that the pure Python version uses multiple cores during training, while the C++ version uses only one core.
Is there a setting that specifically controls this behaviour and that I need to add to my code, or is my C++ implementation simply inefficient?
The code for both the Python and C++ versions is shown below: