Gaussian agent 46% slower in PyTorch compared to TensorFlow

We are currently migrating our department's code base from TensorFlow to PyTorch. For most of our code, this migration is done. In the process we greatly simplified our code base without sacrificing performance :rocket:.

For one agent, however, we noticed that the PyTorch version is about 46% slower than its TensorFlow counterpart (when the tf.function decorator is used). As expected, when the tf.function decorator is not used, PyTorch is about 10 times faster than TensorFlow. I profiled both versions using functiontrace, and the biggest difference appears in the forward and backward passes through the network. This difference is most pronounced in the actor network, which contains the following components:

  • A pass through a fully connected network.
  • An rsample operation from a normal distribution (performed explicitly in TensorFlow).
  • An operation that calculates the log probabilities of these sampled actions.
  • An operation that applies the squashing function to the normal distribution.
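To make the four components concrete, here is a minimal PyTorch sketch of such an actor. The class name, layer sizes, and the log-std clamp range are illustrative assumptions, not the code from the repository; only the log-probability correction follows the formula quoted later in this post.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal


class TinySquashedGaussianActor(nn.Module):
    """Hypothetical minimal actor; names and sizes are illustrative only."""

    def __init__(self, obs_dim=4, act_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.mu_layer = nn.Linear(hidden, act_dim)
        self.log_std_layer = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        # 1. Pass through the fully connected network.
        features = self.net(obs)
        mu = self.mu_layer(features)
        # Clamp range is an assumption made for numerical safety.
        log_std = torch.clamp(self.log_std_layer(features), -20.0, 2.0)
        dist = Normal(mu, torch.exp(log_std))
        # 2. Reparameterized sample from the normal distribution.
        u = dist.rsample()
        # 3. Log probability of the unsquashed sample.
        logp = dist.log_prob(u).sum(dim=-1)
        # 4. Squash with tanh and correct the log probability accordingly.
        correction = 2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))
        logp = logp - correction.sum(dim=-1)
        return torch.tanh(u), logp


actor = TinySquashedGaussianActor()
action, logp = actor(torch.randn(8, 4))
```

Each numbered comment corresponds to one bullet above, so the steps can be timed separately.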

A timed comparison of each of these components between PyTorch and TensorFlow can be found in [this issue](https://github.com/rickstaa/torch-tf2-lac-speed-compare/issues/1). The code used for these tests can be found in the accompanying repository.
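For readers who want to reproduce this kind of per-component timing, a minimal sketch of a timing helper is given here. It is not the code from the repository; the function name and parameters are hypothetical. The warm-up calls matter because tf.function traces the function on its first call, and the optional sync hook is there because GPU kernels run asynchronously (for PyTorch one could pass torch.cuda.synchronize).

```python
import time


def time_op(fn, repeats=100, warmup=10, sync=None):
    """Return the mean wall-clock seconds per call of fn.

    sync is an optional callable run before each clock read, so that
    queued asynchronous work is included in the measurement.
    """
    # Warm-up calls keep one-off costs (tracing, caching) out of the timing.
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - start) / repeats


# Example with a cheap pure-Python workload standing in for a network pass.
mean_seconds = time_op(lambda: sum(range(1000)), repeats=50)
```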

The conclusion from these tests is as follows:

  • The sample operation is significantly faster in PyTorch.
  • The log probability operation has the same performance in both versions.
  • The squashing operation is significantly slower in PyTorch.

We can therefore see that the difference is caused by the squashing operation. This is also the part that differs the most between the two versions. To my knowledge, no bijector class exists in PyTorch, so the squashing operation has to be performed explicitly (see L108-L117 of gaussian actor_torch):

logp_pi = pi_distribution.log_prob(pi_action).sum(axis=-1)
logp_pi -= (2 * (np.log(2) - pi_action - F.softplus(-2 * pi_action))).sum(
    axis=sum_axis
)
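The subtracted term is the log-determinant of the Jacobian of tanh, written in a numerically stable form: log(1 - tanh(x)^2) = 2 * (log 2 - x - softplus(-2x)). The sketch below checks this identity with the standard library only; the helper names are illustrative.

```python
import math


def softplus(z):
    """Numerically stable softplus: log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def stable_log_det(x):
    """Stable form used in the snippet above: 2 * (log 2 - x - softplus(-2x))."""
    return 2.0 * (math.log(2.0) - x - softplus(-2.0 * x))


def naive_log_det(x):
    """Textbook form log(1 - tanh(x)^2); loses precision for large |x|."""
    return math.log(1.0 - math.tanh(x) ** 2)


# Both forms agree for moderate inputs.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert math.isclose(
        stable_log_det(x), naive_log_det(x), rel_tol=1e-9, abs_tol=1e-12
    )

# The stable form stays finite where tanh saturates to exactly 1.0 in floating point.
large_input_value = stable_log_det(20.0)
```

For large |x| the naive form evaluates the logarithm of zero, which is why the stable form is preferred in both implementations.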

The TensorFlow version, however, uses the TensorFlow Probability bijector class to perform the squashing operation. This bijector is defined in the squashed_bijector class:

"""Creates a squash (tanh) bijector for use with the reparameterization trick.
"""

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp


class SquashBijector(tfp.bijectors.Bijector):
    """A squash bijector used to keep track of the distribution properties when the
    distribution is transformed using the tanh squash function."""

    def __init__(self, validate_args=False, name="tanh"):
        super(SquashBijector, self).__init__(
            forward_min_event_ndims=0, validate_args=validate_args, name=name
        )

    def _forward(self, x):
        return tf.nn.tanh(x)
        # return x

    def _inverse(self, y):
        return tf.atanh(y)

    def _forward_log_det_jacobian(self, x):
        return 2.0 * (
            np.log(2.0) - x - tf.nn.softplus(-2.0 * x)
        )  # IMPROVE: Speed check

This bijector is then used in L121-L137 of the gaussian actor to perform the squashing operation:

# Create bijectors (Used in the re-parameterization trick)
squash_bijector = SquashBijector()
affine_bijector = tfp.bijectors.Shift(mu)(tfp.bijectors.Scale(sigma))

# Transform distribution back to the original policy distribution
reparm_trick_bijector = tfp.bijectors.Chain((squash_bijector, affine_bijector))
distribution = tfp.distributions.TransformedDistribution(
    distribution=base_distribution, bijector=reparm_trick_bijector
)

I therefore think that it is the bijector class in combination with the tf.function wrapper that allows TensorFlow to perform the squashing operation more efficiently. I tried to use the TorchScript wrapper to speed up this computation. However, this is not yet supported, because TorchScript does not support the normal distributions that are used in the Gaussian actor (see pytorch/pytorch#18094 and pytorch/pytorch#29843).
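As a point of comparison, recent PyTorch releases provide torch.distributions.TransformedDistribution and transforms.TanhTransform, which play a role similar to the TFP bijector chain. The sketch below assumes the installed PyTorch version ships TanhTransform, and it only checks that both routes give the same log probabilities. It makes no claim about speed, which would have to be measured. The tensor shapes and the use of float64 are illustrative choices.

```python
import math

import torch
import torch.nn.functional as F
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import TanhTransform

torch.manual_seed(0)

# Illustrative mean and standard deviation; float64 keeps the atanh round trip tight.
mu = torch.zeros(8, 2, dtype=torch.float64)
sigma = torch.ones(8, 2, dtype=torch.float64)
base = Normal(mu, sigma)

# Route 1: explicit squashing, as in the PyTorch snippet earlier in this post.
u = base.rsample()
action = torch.tanh(u)
logp_explicit = base.log_prob(u).sum(dim=-1)
logp_explicit = logp_explicit - (
    2.0 * (math.log(2.0) - u - F.softplus(-2.0 * u))
).sum(dim=-1)

# Route 2: let a transformed distribution apply the tanh change of variables.
squashed = TransformedDistribution(base, [TanhTransform()])
logp_transform = squashed.log_prob(action).sum(dim=-1)

max_abs_diff = (logp_explicit - logp_transform).abs().max().item()
```

Route 2 inverts the action with atanh before evaluating the base density, so it can be less accurate for actions very close to -1 or 1.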

Since I am quite new to TensorFlow I was wondering if a more experienced user can comment on my reasoning and see if I missed something. Thanks a lot in advance!

Additional Information:

The full code base of the PyTorch and TF2 implementations of the agent can be found in this repository. I’m using the following system:

  • PC: HP ZBook x360
  • OS: Ubuntu 20.04
  • CPU: Intel® Core™ i7-8750H CPU @ 2.20GHz
  • GPU: Quadro P1000
  • PyTorch version: 1.7.1
  • TensorFlow version: 2.3.0
  • TensorFlow Probability version: 0.11.0