Training slows down when using the `.clone()` operation

I am new to PyTorch and want to get a better understanding of its autograd functionality, so I implemented a tiny network with two hidden layers. The network’s architecture is as follows:

    n_inputs = 2
    n_hidden_1 = 2
    n_hidden_2 = 2
    n_outputs = 2

The fully connected network therefore has only 8 weights and 4 biases (12 trainable parameters in total). I implemented the vector-matrix multiplication explicitly in order to work at a lower level and get a better understanding (see the code below).

When I use the first version of my implementation (# Feedforward version 1), I get the following error:

    RuntimeError: one of the variables needed for gradient computation has been modified by an
    inplace operation: [torch.FloatTensor [1]] is at version 2; expected version 1 instead.

I have noticed that when I use the `.clone()` operation (# Feedforward version 2), the error no longer occurs. Now, however, the optimization slows down significantly with each iteration. Why is this the case? What am I doing wrong?
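
Just to check my understanding of the error message itself: it can be reproduced in isolation with a much smaller snippet (this is only an illustration, not my actual network), by modifying a tensor in place after it has been saved for the backward pass:

    import torch

    w = torch.ones(1, requires_grad=True)
    y = torch.sigmoid(w)  # sigmoid saves its output y for the backward pass
    y.add_(1.0)           # the in-place add bumps y's version counter
    y.sum().backward()    # RuntimeError: one of the variables needed for gradient
                          # computation has been modified by an inplace operation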

Here is my code:

    import time
    import torch
    import random

    def test_tiny_network():

        torch.autograd.set_detect_anomaly(True)

        # Parameters
        n_samples = 10
        lr = 0.001
        n_inputs = 2
        n_hidden_1 = 2
        n_hidden_2 = 2
        n_outputs = 2
        sigma = 0.1
        n_steps = 10000

        # Get some data with labels
        x_data = torch.rand(size=(n_samples, n_inputs))
        y_data = torch.rand(size=(n_samples, n_outputs))

        # Placeholder for activations
        a = [[torch.zeros(size=(1,)) for _ in range(n_hidden_2 + 1)] for _ in range(n_hidden_1)]

        # Trainable parameters
        w_1 = [[torch.normal(mean=0.0, std=sigma, size=(1,), requires_grad=True)
                for _ in range(n_hidden_2)] for _ in range(n_hidden_1)]
        w_2 = [[torch.normal(mean=0.0, std=sigma, size=(1,), requires_grad=True)
                for _ in range(n_hidden_2)] for _ in range(n_hidden_1)]
        b = [[torch.normal(mean=0.0, std=sigma, size=(1,), requires_grad=True)
              for _ in range(n_hidden_2)] for _ in range(n_hidden_1)]

        for n in range(n_steps):

            t0 = time.time()

            # Get data
            rand_idx = random.randint(0, n_samples - 1)
            x = x_data[rand_idx]
            y = y_data[rand_idx]

            # Assign data to input layer
            for i in range(n_hidden_1):
                a[i][0] = x[i]

            # Feedforward version 1
            #a[0][1] = torch.sigmoid(w_1[0][0] * a[0][0] + w_1[0][1] * a[1][0] + b[0][0])
            #a[1][1] = torch.sigmoid(w_1[1][0] * a[0][0] + w_1[1][1] * a[1][0] + b[1][0])
            #a[0][2] = torch.sigmoid(w_2[0][0] * a[0][1] + w_2[0][1] * a[1][1] + b[0][1])
            #a[1][2] = torch.sigmoid(w_2[1][0] * a[0][1] + w_2[1][1] * a[1][1] + b[1][1])

            # Feedforward version 2
            a[0][1] = torch.sigmoid(w_1[0][0].clone() * a[0][0].clone() +
                                    w_1[0][1].clone() * a[1][0].clone() + b[0][0].clone())
            a[1][1] = torch.sigmoid(w_1[1][0].clone() * a[0][0].clone() +
                                    w_1[1][1].clone() * a[1][0].clone() + b[1][0].clone())
            a[0][2] = torch.sigmoid(w_2[0][0].clone() * a[0][1].clone() +
                                    w_2[0][1].clone() * a[1][1].clone() + b[0][1].clone())
            a[1][2] = torch.sigmoid(w_2[1][0].clone() * a[0][1].clone() +
                                    w_2[1][1].clone() * a[1][1].clone() + b[1][1].clone())

            # Loss computation
            loss = ((y[0] - a[0][2])**2 + (y[1] - a[1][2])**2)

            # Backpropagation
            loss.backward(retain_graph=True)

            # Gradient descent
            with torch.no_grad():
                w_1[0][0].sub_(lr * w_1[0][0].grad)
                w_1[1][0].sub_(lr * w_1[1][0].grad)
                w_1[0][1].sub_(lr * w_1[0][1].grad)
                w_1[1][1].sub_(lr * w_1[1][1].grad)

                w_2[0][0].sub_(lr * w_2[0][0].grad)
                w_2[1][0].sub_(lr * w_2[1][0].grad)
                w_2[0][1].sub_(lr * w_2[0][1].grad)
                w_2[1][1].sub_(lr * w_2[1][1].grad)

                b[0][0].sub_(lr * b[0][0].grad)
                b[1][0].sub_(lr * b[1][0].grad)
                b[0][1].sub_(lr * b[0][1].grad)
                b[1][1].sub_(lr * b[1][1].grad)

            t1 = time.time()

            if n % 100 == 0:
                print(f"n {n} loss {loss} time {(t1-t0)}")


    if __name__ == '__main__':
        test_tiny_network()

It’s expected that the clone operations have some performance impact, since you are creating new tensors in every iteration. Because your model is tiny, you might notice this overhead more than usual.

I would try to avoid the in-place operations and store the results of the torch.sigmoid calls in new tensors instead of writing them into the `a` “placeholder”.
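
Here is a minimal sketch of that suggestion (the function name `test_tiny_network_v3` is made up; dropping `retain_graph=True`, leaving out `set_detect_anomaly(True)`, and zeroing the gradients after each update are my additions, not something from the original code). The sigmoid outputs go into fresh local tensors, so no `.clone()` calls are needed and a new graph is built in every iteration:

    import time
    import torch
    import random

    def test_tiny_network_v3():
        # Same setup as in the question, without the `a` placeholder and
        # without torch.autograd.set_detect_anomaly(True)
        n_samples = 10
        lr = 0.001
        n_hidden = 2
        sigma = 0.1
        n_steps = 10000

        x_data = torch.rand(size=(n_samples, 2))
        y_data = torch.rand(size=(n_samples, 2))

        def param():
            return torch.normal(mean=0.0, std=sigma, size=(1,), requires_grad=True)

        w_1 = [[param() for _ in range(n_hidden)] for _ in range(n_hidden)]
        w_2 = [[param() for _ in range(n_hidden)] for _ in range(n_hidden)]
        b = [[param() for _ in range(n_hidden)] for _ in range(n_hidden)]

        for n in range(n_steps):
            t0 = time.time()

            rand_idx = random.randint(0, n_samples - 1)
            x = x_data[rand_idx]
            y = y_data[rand_idx]

            # Feedforward into fresh local tensors: every torch.sigmoid call
            # returns a new tensor, so nothing is overwritten and no .clone()
            # is needed
            h_0 = torch.sigmoid(w_1[0][0] * x[0] + w_1[0][1] * x[1] + b[0][0])
            h_1 = torch.sigmoid(w_1[1][0] * x[0] + w_1[1][1] * x[1] + b[1][0])
            o_0 = torch.sigmoid(w_2[0][0] * h_0 + w_2[0][1] * h_1 + b[0][1])
            o_1 = torch.sigmoid(w_2[1][0] * h_0 + w_2[1][1] * h_1 + b[1][1])

            loss = (y[0] - o_0) ** 2 + (y[1] - o_1) ** 2

            # A fresh graph is built in every iteration, so retain_graph=True
            # is no longer needed
            loss.backward()

            # Gradient descent, then clear the gradients so they do not
            # accumulate across iterations
            with torch.no_grad():
                for group in (w_1, w_2, b):
                    for row in group:
                        for p in row:
                            p.sub_(lr * p.grad)
                            p.grad.zero_()

            t1 = time.time()

            if n % 100 == 0:
                print(f"n {n} loss {loss.item():.6f} time {(t1 - t0):.6f}")

    if __name__ == '__main__':
        test_tiny_network_v3()

With this structure no state from previous iterations is kept around, so each iteration should take roughly constant time.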