How to improve PyTorch training speed -- same TensorFlow code trains 40x faster

Hi all,

I converted some TensorFlow code (link to tf code) to PyTorch and it all works fine (the results are comparable, in my opinion). However, the PyTorch code is far slower than the TensorFlow code: training takes only 1 minute in TensorFlow, whereas the PyTorch version takes over 40 minutes (I can’t use CUDA on my laptop). Both use the same number of iterations, optimizer, loss function, etc. Do you know what might slow down the PyTorch code this much, and is there anything I could do about it?

Here is a brief overview of what I did. If you need a code snippet, I can post that too, of course.
I translated the tf code to torch literally, but I do not have much experience with PyTorch.

  • Everything (loss, optimizer, initialization, training, forward pass,…) is still written in a (nn) class, as is done in the tf code, although I read it is not convenient to do that in PyTorch.
  • The loss function is written as a method of the class.
  • I wrote down the training ‘session’ as:
loss = self.loss()
loss.backward(retain_graph = True)
  • The derivatives in the differential equation are computed as follows (where sigma is the normalization factor):
u_x = torch.autograd.grad(outputs = u.sum(), inputs = x, create_graph = True)[0]/self.sigma_x
u_xx = torch.autograd.grad(outputs = u_x.sum(), inputs = x, create_graph = True)[0]/self.sigma_x

I can understand that the ‘create_graph=True’ option slows down the training, but surely not by a factor of 40.

Do you have any general advice on making the PyTorch code faster?

Could you check whether the memory usage increases in each iteration when you use retain_graph=True? If so, each iteration should also get slower, which might be the reason for the huge performance difference.

I’m not familiar with your specific use case, but do you see an error if you remove retain_graph=True?
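
One way to run this check is to time each training iteration and compare early iterations with late ones. The snippet below is a minimal sketch, not the original poster's code: the small `model`, the input `x`, and the loss are hypothetical stand-ins, and it measures time only (memory can be watched in a system monitor while it runs).

```python
import time
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the real network: one input, one output.
model = torch.nn.Sequential(
    torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.linspace(-1.0, 1.0, 64).reshape(-1, 1).requires_grad_(True)

iteration_times = []
for step in range(20):
    start = time.perf_counter()
    optimizer.zero_grad()
    u = model(x)
    # Same double-grad pattern as in the question (without the sigma scaling).
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    loss = torch.mean(u_xx ** 2)
    loss.backward(retain_graph=True)  # the setting under suspicion
    optimizer.step()
    iteration_times.append(time.perf_counter() - start)

# Compare the average duration of the first and last five iterations.
first = sum(iteration_times[:5]) / 5
last = sum(iteration_times[-5:]) / 5
print(f"mean of first 5 iterations: {first:.6f} s")
print(f"mean of last 5 iterations:  {last:.6f} s")
```

If the late iterations are consistently slower than the early ones, something is probably accumulating across iterations; if the two averages are similar, the slowdown likely lies elsewhere.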

Thank you for your reply!
Leaving out retain_graph = True in the backward() call has no effect on the training speed or on the performance. As you asked, I also looked at the memory usage. The processor usage is slightly higher with retain_graph = True. However, it does not change the training speed significantly.

I thought you might want to see some code snippets that show the neural network.
I implemented the linear layers myself (instead of using nn.Linear()) in the forward_pass() function. I also tried the predefined nn.Linear but, unfortunately, it did not speed up the training.

The code snippet below shows the forward pass, which represents the neural network: it takes a vector x as input and outputs the solution ‘u’ to the PDE u_{xx} = f(x).
The function net_u() then returns the solution u from forward_pass(). The function net_r() returns the residual of the PDE, residual = u_xx - f(x), which should be equal to zero in the end.

    # Evaluate the forward pass
    def forward_pass(self, H):
        num_layers = len(self.layers)
        for l in range(0,num_layers-2): 
            W = self.weights[l]
            b = self.biases[l]
            H = torch.tanh(torch.add(torch.matmul(H,W), b))
        W = self.weights[-1]
        b = self.biases[-1]
        H = torch.add(torch.matmul(H,W), b)
        return H

    # Forward pass for u
    def net_u(self, x):
        u = self.forward_pass(x)
        return u

    # Residual of the PDE: u_xx - f(x)
    def net_r(self, x):
        # LHS
        x.requires_grad = True 
        u = self.net_u(x)
        u_x = torch.autograd.grad(outputs = u.sum(), inputs = x, create_graph = True)[0]/self.sigma_x
        u_xx = torch.autograd.grad(outputs = u_x.sum(), inputs = x, create_graph = True)[0]/self.sigma_x
        x.requires_grad = False
        # RHS
        f = self.f(x*self.sigma_x + self.mu_x)

        residual = u_xx - f
        return residual
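
As a sanity check of the double-grad pattern used in net_r(), the same two torch.autograd.grad calls can be applied to a function whose derivatives are known. This is only a sketch under stated assumptions: `torch.sin` stands in for the network output, and the sigma_x normalization is left out.

```python
import torch

# Column vector of sample points, in float64 for a tight comparison.
x = torch.linspace(0.0, 3.0, 50, dtype=torch.float64).reshape(-1, 1)
x.requires_grad_(True)

u = torch.sin(x)  # known function standing in for the network output

# Summing is valid here because each u[i] depends only on x[i].
u_x = torch.autograd.grad(outputs=u.sum(), inputs=x, create_graph=True)[0]
u_xx = torch.autograd.grad(outputs=u_x.sum(), inputs=x)[0]

# Analytic derivatives: d/dx sin(x) = cos(x), d2/dx2 sin(x) = -sin(x).
max_err_first = (u_x - torch.cos(x)).abs().max().item()
max_err_second = (u_xx + torch.sin(x)).abs().max().item()
print(max_err_first, max_err_second)
```

Both errors should be at the level of floating-point round-off. Note that create_graph=True is only needed on a call whose result is differentiated again later.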

With this in mind, we can write our loss function (mean squared error) as:

    def loss_r(self):
        r_pred = self.net_r(self.X_r)
        loss_r = torch.mean(torch.square(r_pred))
        return loss_r

    def loss_u(self):
        u_pred = self.net_u(self.X_u)
        loss_u = torch.mean(torch.square(self.Y_u - u_pred))
        return loss_u

    def loss(self):
        loss = self.loss_u() + self.loss_r()
        return loss

During training, loss() is minimized.
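
For comparison, here is a minimal sketch of how such a training loop is often laid out in PyTorch, with the network as an nn.Module and the optimizer and loss kept outside the class. Everything specific is an assumption made for illustration: the `Net` class, the `residual` helper, the layer width, and the test problem u_xx = -sin(x) on [0, pi] with zero boundary values are not taken from the original code, and the sigma_x normalization is omitted.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

class Net(nn.Module):
    """Hypothetical small fully connected network u(x)."""
    def __init__(self, width=20):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(1, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),
        )

    def forward(self, x):
        return self.body(x)

def residual(model, x, f):
    """PDE residual u_xx - f(x), using the double-grad pattern from the post."""
    u = model(x)
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_xx - f(x.detach())

model = Net()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Assumed test problem: u_xx = -sin(x) on [0, pi], u(0) = u(pi) = 0.
X_r = torch.linspace(0.0, math.pi, 100).reshape(-1, 1).requires_grad_(True)
X_u = torch.tensor([[0.0], [math.pi]])
Y_u = torch.zeros(2, 1)
f = lambda x: -torch.sin(x)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss_u = torch.mean((Y_u - model(X_u)) ** 2)        # data / boundary loss
    loss_r = torch.mean(residual(model, X_r, f) ** 2)   # PDE residual loss
    loss = loss_u + loss_r
    loss.backward()  # the graph is rebuilt every step, so no retain_graph
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Because the graph is rebuilt on every forward pass, a plain loss.backward() is sufficient in this layout.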

I have a similar experience with poor PyTorch performance compared to TensorFlow. Here’s a simple script that you can test for yourself.

TensorFlow runs at least 4x faster than PyTorch in this notebook on GPU, TPU, and CPU in Google Colab.

I used the following script to time the training cells:

import time
# ----------------------------
start_time = time.time()
# ----------------------------
_ = modelTF.fit(x_trainTF, y_trainTF, epochs=epochs, batch_size=batch_size, verbose=0)
print("--- %s seconds ---" % (time.time() - start_time))

# ----------------------------
start_time = time.time()
# ----------------------------
for e in range(epochs):
    for images, labels in xy_trainPT_loader:
        images = images.view(images.shape[0], -1) 
        loss = criterion(modelPT(images), labels)        
print("--- %s seconds ---" % (time.time() - start_time))

The slowdown seems to be triggered by the data loading pipeline.
If you remove the DataLoader usage, the time will drop from ~47s to ~10s (TF uses ~17s).
I don’t know what kind of nodes Colab uses, but you might want to play around with the num_workers argument of the DataLoader to use multiprocessing, or create the data as one big tensor (as done in TF) instead of loading each image separately, since no data augmentation is used.
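
The "one big tensor" variant could look roughly like the sketch below. It is an illustration under assumptions, not the notebook's code: the random `images` and `labels`, the linear `model`, and the `iterate_batches` helper are all hypothetical stand-ins.

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-in data: 1,000 flattened 28x28 images with 10 classes.
images = torch.rand(1000, 28 * 28)
labels = torch.randint(0, 10, (1000,))
batch_size = 100

def iterate_batches(images, labels, batch_size, shuffle=True):
    """Yield (images, labels) batches by indexing the full tensors directly."""
    n = images.shape[0]
    order = torch.randperm(n) if shuffle else torch.arange(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield images[idx], labels[idx]

model = torch.nn.Linear(28 * 28, 10)
criterion = torch.nn.CrossEntropyLoss()

num_batches = 0
for batch_images, batch_labels in iterate_batches(images, labels, batch_size):
    loss = criterion(model(batch_images), batch_labels)
    num_batches += 1

print(num_batches, loss.item())
```

Indexing one preloaded tensor avoids the per-sample loading and collation work that a DataLoader performs, which only pays off when the whole dataset fits in memory and no per-sample augmentation is needed.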