I am trying to create a neural network with linear layers that are parallel to each other (f1(x), f2(x), ...), where each f is a linear layer and x is the same input fed to all of them. In this case, would a separate copy of x be stored and used for backpropagation of each layer? Also, how do I view the implementation (code) of AddmmBackward()?
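For reference, a rough sketch of the kind of setup I mean (illustrative only):

import torch
import torch.nn as nn

x = torch.rand(5, 10)                                     # the same input fed to every branch
fs = nn.ModuleList(nn.Linear(10, 1) for _ in range(3))    # f1, f2, f3
outs = [f(x) for f in fs]                                 # f1(x), f2(x), f3(x)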
Hi Akash!
No. Linear does need its input for the backward pass, but it only keeps a reference to the original input tensor. So each of f1, f2, ... keeps its own (Python) reference, but all of these references refer to the same single tensor – no copies are made.
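A minimal sketch of my own (assuming PyTorch >= 1.10 for torch.autograd.graph.saved_tensors_hooks) that checks this: the tensor saved for backward shares storage with x.

import torch
import torch.nn as nn

x = torch.rand(5, 10, requires_grad=True)
f1 = nn.Linear(10, 1)
f2 = nn.Linear(10, 1)

saved = []
def pack(t):
    saved.append(t)   # record every tensor saved for backward
    return t

with torch.autograd.graph.saved_tensors_hooks(pack, lambda t: t):
    out = f1(x) + f2(x)

# the saved inputs point at the same storage as x -- no copy was made
print(any(t.data_ptr() == x.data_ptr() for t in saved))   # should print True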
(As an aside, if – at the cost of memory – Linear did make a copy of x, we would avoid the dreaded "has been modified by an inplace operation" errors.)
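For completeness, a small sketch of my own showing that error when the shared, saved input is modified in place before backward:

import torch
import torch.nn as nn

x0 = torch.rand(5, 10, requires_grad=True)
x = x0 * 2                      # non-leaf tensor used as the shared input
f1 = nn.Linear(10, 1)
f2 = nn.Linear(10, 1)

out = (f1(x) + f2(x)).sum()     # both layers save a reference to the same x
x.add_(1.0)                     # in-place change bumps x's version counter
out.backward()                  # RuntimeError: ... has been modified by an inplace operation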
Best.
K. Frank
Effectively, you’re trying to perform multiple parallel linear operations on the same input. In that case, I would think this calls for using the Parameter class: we can make one 3-dimensional tensor that, when matrix-multiplied with the inputs, gives an operation identical to using separate linear layers. The code below, which you can run yourself, demonstrates that the two methods are equivalent.
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter

num_para_layers = 3
input_size = 10
output_size = 1

# create uniform model weights that will be used in both methods
model_weight = torch.rand(num_para_layers, output_size, input_size)

# create a parameter tensor that will act as your parallel model weights,
# as a control to check against
model = Parameter(model_weight)

# create a module list of separate linear layers with matching weights
layers = nn.ModuleList()
for i in range(num_para_layers):
    layer_weight = model_weight[i, ...]
    layer = nn.Linear(input_size, output_size, bias=False)
    layer.weight.data = layer_weight
    layers.append(layer)

# create a dummy dataset
batch_size = 5
dummy_inputs = torch.rand((batch_size, input_size))

# get the parameter model outputs; shape (num_para_layers, batch_size, output_size).
# If using a bias, make it a separate Parameter of size (num_para_layers, output_size)
# and add it to the expression below.
outputs1 = (model @ dummy_inputs.T).permute(0, 2, 1)

# get the separate linear layer outputs
outputs2 = torch.empty((0, batch_size, output_size))
for i in range(num_para_layers):
    subout = layers[i](dummy_inputs)
    outputs2 = torch.cat([outputs2, subout.unsqueeze(0)])

# print and compare the results
print(outputs1, outputs2)

# if this prints True, the math is effectively identical
print(torch.isclose(outputs2, outputs1).all())
Note that the Parameter class effectively creates a layer that is captured by autograd for training and can be optimized, just like any other layer. This method can be highly useful when you wish to create a new layer type from scratch and take advantage of batched, parallel CUDA operations.
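To illustrate that last point, here is a rough sketch of my own (the ParallelLinear name is just illustrative) that registers the stacked weight inside an nn.Module so an optimizer picks it up like any other parameter:

import torch
import torch.nn as nn
from torch.nn.parameter import Parameter

class ParallelLinear(nn.Module):
    def __init__(self, num_layers, in_features, out_features):
        super().__init__()
        # one stacked weight acts as num_layers independent linear maps
        self.weight = Parameter(torch.rand(num_layers, out_features, in_features))

    def forward(self, x):
        # x: (batch, in_features) -> (num_layers, batch, out_features)
        return (self.weight @ x.T).permute(0, 2, 1)

model = ParallelLinear(3, 10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(5, 10)
loss = model(x).sum()
loss.backward()
optimizer.step()    # the stacked weight is updated like any other parameter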