Caffe2 vs Torch performance

Hi, I recently threw myself into the realm of machine learning because I need some kind of non-linear function estimation. I ultimately need to run everything in C++ because it will be used for robot control at frequencies up to 1 kHz.

I started by building, training and testing a network in Python using Caffe2, then exported it so I could load and run it in C++. The C++ part was a bit painful because of the clear lack of documentation, but I finally managed to get it working and was very pleased with the performance.
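In case it's useful, this is roughly the loading/prediction path I ended up with on the C++ side. Treat it as a sketch: it follows the pre-merge Caffe2 layout, the headers and the `run()` signature changed between releases, and the `.pb` file names are just placeholders for whatever the Python export step produced.

```cpp
#include <iostream>
#include "caffe2/core/init.h"
#include "caffe2/core/predictor.h"  // "caffe2/predictor/predictor.h" in later versions
#include "caffe2/utils/proto_utils.h"

int main(int argc, char** argv) {
  caffe2::GlobalInit(&argc, &argv);

  // The two protobufs produced by the Python export (placeholder names).
  caffe2::NetDef init_net, predict_net;
  CAFFE_ENFORCE(caffe2::ReadProtoFromFile("init_net.pb", &init_net));
  CAFFE_ENFORCE(caffe2::ReadProtoFromFile("predict_net.pb", &predict_net));

  caffe2::Predictor predictor(init_net, predict_net);

  // A single 1x2 float input (the two network inputs).
  caffe2::TensorCPU input;
  input.Resize(1, 2);
  float* data = input.mutable_data<float>();
  data[0] = 0.5f;
  data[1] = -0.5f;

  caffe2::Predictor::TensorVector input_vec{&input};
  caffe2::Predictor::TensorVector output_vec;
  predictor.run(input_vec, &output_vec);

  std::cout << "prediction: " << output_vec[0]->data<float>()[0] << std::endl;
  return 0;
}
```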

While struggling with the Caffe2 C++ API, I found out that the PyTorch C++ API has some decent documentation and examples and seemed easier to use. So I rebuilt, trained and ran the same network architecture (everything directly in C++ this time) and observed that it is around 6x slower than the Caffe2 implementation.

To give some numbers, a prediction (on CPU) takes around 8 µs with Caffe2 and 46 µs with Torch. That is still pretty fast, but since I intend to expand my network architecture to add more variables and take more parameters into account, I fear that the difference in performance might become a problem.
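For what it's worth, this is roughly how I measure the Torch latency (`Net` is the model sketched at the end of the post). The `NoGradGuard`, `eval()` and single-thread settings are there because, for a network this small, autograd bookkeeping and the intra-op thread pool can otherwise dominate the per-call cost.

```cpp
#include <torch/torch.h>
#include <chrono>
#include <iostream>

void benchmark(Net& net) {
  torch::NoGradGuard no_grad;  // skip autograd bookkeeping during inference
  net.eval();                  // inference mode
  at::set_num_threads(1);      // avoid thread-pool overhead on a tiny net

  torch::Tensor input = torch::rand({1, 2});

  // Warm up once so lazy initialization doesn't skew the timing.
  net.forward(input);

  constexpr int kIters = 10000;
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIters; ++i) {
    net.forward(input);
  }
  auto stop = std::chrono::steady_clock::now();

  auto total_us =
      std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
  std::cout << "mean latency: " << total_us / static_cast<double>(kIters)
            << " us" << std::endl;
}
```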

Is this difference in performance between the two libraries a known thing, or should they both perform about the same, meaning the issue is probably in my code?

For reference, my network currently has 2 inputs, 1 output and 3 hidden fully connected layers with 50, 30 and 10 neurons, with a tanh function after each layer; a LibTorch sketch of it is below.
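Concretely, the LibTorch version looks along these lines (the struct and layer names are arbitrary, and I've left the output layer linear; add a tanh after `out` if the output is squashed too):

```cpp
#include <torch/torch.h>

// 2 inputs -> 50 -> 30 -> 10 -> 1 output, tanh after each hidden layer.
struct Net : torch::nn::Module {
  Net() {
    fc1 = register_module("fc1", torch::nn::Linear(2, 50));
    fc2 = register_module("fc2", torch::nn::Linear(50, 30));
    fc3 = register_module("fc3", torch::nn::Linear(30, 10));
    out = register_module("out", torch::nn::Linear(10, 1));
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::tanh(fc1->forward(x));
    x = torch::tanh(fc2->forward(x));
    x = torch::tanh(fc3->forward(x));
    return out->forward(x);
  }

  torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr}, out{nullptr};
};
```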