Currently working on porting an implementation of a couple of popular reinforcement learning algorithms from TensorFlow to PyTorch, and the PyTorch code is noticeably slower (up to 50%). We believe it's because some algorithms, such as Soft Actor-Critic, have multiple semi-independent neural networks (e.g. policy, value, and Q-function) that during the update step each need to be evaluated, have their losses computed, and have backpropagation performed on them. In TensorFlow these could all be computed with a single session.run call, and they would be parallelized across multiple cores. In PyTorch, however, we are limited to evaluating and updating each network sequentially.
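To make the pattern concrete, here is a toy sketch of the sequential update step I mean; the network shapes and the squared-output losses are hypothetical placeholders, not our actual SAC losses:

```python
import torch
import torch.nn as nn

# Three small, semi-independent networks (placeholder sizes)
policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 2))
value = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
q_fn = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
opts = [torch.optim.Adam(n.parameters(), lr=3e-4) for n in (policy, value, q_fn)]

obs = torch.randn(256, 8)
act = torch.randn(256, 2)

# Each network is evaluated, its loss computed, and backprop + step run
# one after another -- no cross-network parallelism on CPU.
losses = []
inputs = (obs, obs, torch.cat([obs, act], dim=1))
for net, opt, inp in zip((policy, value, q_fn), opts, inputs):
    opt.zero_grad()
    loss = net(inp).pow(2).mean()  # placeholder loss, not the real SAC objective
    loss.backward()
    opt.step()
    losses.append(loss.item())
```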
Note that our networks are quite small, so we don't expect much benefit from running on GPU. That said, CUDA's asynchronous execution does close much of the performance gap relative to TensorFlow, but at the end of the day we still need to support CPU.
I was wondering what the “right” way to do this is for PyTorch on CPU. I've played around with using Python threading to evaluate the networks (PyTorch does release the GIL, right?) as well as tuning the KMP_BLOCKTIME setting, with some success, but I'm still seeing the ~50% performance gap. Any guidance would be much appreciated, thanks!
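For reference, this is roughly what my threading experiment looks like: since PyTorch ops release the GIL while their C++ kernels run, forward/backward passes on independent networks can in principle overlap across Python threads. The network sizes are again hypothetical, and pinning intra-op parallelism to one thread is just my attempt to stop the OpenMP pool from fighting the Python threads:

```python
from concurrent.futures import ThreadPoolExecutor
import torch
import torch.nn as nn

# Limit intra-op threads so the Python threads don't oversubscribe cores
torch.set_num_threads(1)

nets = [nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
        for _ in range(3)]
opts = [torch.optim.Adam(n.parameters(), lr=3e-4) for n in nets]
obs = torch.randn(256, 8)

def update(i):
    # Forward, loss, backward, and step for one network; the heavy work
    # happens in C++ with the GIL released, so threads can run concurrently.
    opts[i].zero_grad()
    loss = nets[i](obs).pow(2).mean()  # placeholder loss
    loss.backward()
    opts[i].step()
    return loss.item()

with ThreadPoolExecutor(max_workers=3) as pool:
    losses = list(pool.map(update, range(3)))
```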