I am trying to create a neural network with linear layers that are parallel to each other (f1(x), f2(x), …), where each f is a linear layer and x is the same input to all layers. In this case, would a separate copy of x be stored and used for backpropagation of each layer? Also, how do I view the implementation (code) of AddmmBackward()?
Linear does need its input for the backward pass, but it keeps a reference to the original input tensor rather than a copy. So all of the f1, f2, ... keep their own (Python) reference, but all of these references refer to the same single tensor; no copies are made.
(As an aside, if Linear did make a copy of x, at the cost of memory, we would avoid the dreaded "has been modified by an inplace operation" errors.)
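One way to see that the layers hold on to the original tensor, and not a private copy, is to modify x in place after the forward pass and then call backward. This is only a minimal sketch: the sizes and variable names are made up for illustration, and the exact name of the backward node and the wording of the error can differ between PyTorch versions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes: a batch of 5 inputs with 10 features each.
x = torch.rand(5, 10)          # shared input to both layers
f1 = nn.Linear(10, 1)
f2 = nn.Linear(10, 1)

out1, out2 = f1(x), f2(x)

# The backward node that saved x; its name depends on the PyTorch version.
print(out1.grad_fn)

loss = out1.sum() + out2.sum()

# Modify the shared input in place AFTER the forward pass.
x.add_(1.0)

# If each layer had saved its own copy of x, backward would not notice the
# change. Because they saved a reference to x, autograd detects the modification.
error_message = None
try:
    loss.backward()
except RuntimeError as err:
    error_message = str(err)

print(error_message)
```

If the error is raised, the saved input was the same tensor object that was later modified, which is consistent with references being stored instead of copies.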
Effectively, you're trying to perform multiple independent linear operations in parallel on the same input. I would think this calls for using the Parameter class. We can make one 3-dimensional tensor of shape (num_para_layers, output_size, input_size) that, when matrix-multiplied with the inputs, gives the same result as using separate linear layers. The code below, which you can run yourself, demonstrates that the two methods produce identical outputs.
import torch
import torch.nn as nn
from torch.nn.parameter import Parameter

num_para_layers = 3
input_size = 10
output_size = 1

# create one set of random weights (uniform on [0, 1)) that will be used in both methods
model_weight = torch.rand(num_para_layers, output_size, input_size)

# create a parameter tensor that will act as your parallel model weights
model = Parameter(model_weight)

# create a module list of separate linear layers with matching weights, as a control to check against
layers = nn.ModuleList()
for i in range(num_para_layers):
    layer = nn.Linear(input_size, output_size, bias=False)
    # copy the matching weights in without recording the copy in autograd
    with torch.no_grad():
        layer.weight.copy_(model_weight[i])
    layers.append(layer)

# create a dummy dataset
batch_size = 5
dummy_inputs = torch.rand((batch_size, input_size))

# get the parameter model outputs, shape (num_para_layers, batch_size, output_size).
# If using a bias, make that a separate parameter of size (num_para_layers, output_size)
# and add bias.unsqueeze(1) to the result below so it broadcasts over the batch dimension
outputs1 = (model @ dummy_inputs.T).permute(0, 2, 1)

# get the separate linear layer outputs, stacked to the same shape
outputs2 = torch.stack([layer(dummy_inputs) for layer in layers])

# print and compare the results
print(outputs1, outputs2)

# if this prints True, the two methods give the same result
print(torch.isclose(outputs2, outputs1).all())
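Note that a Parameter is a tensor with requires_grad=True, so autograd tracks it, and it can be passed to an optimizer and trained just like the weights of any other layer. If you assign it as an attribute of an nn.Module, it is also registered in that module's parameters(). This method can be highly useful when you wish to create a new layer type from scratch and have all of the parallel layers computed by a single batched matrix multiplication, which parallelizes well on CUDA.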
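To make the idea of a custom layer concrete, here is a rough sketch of how the stacked weights could be wrapped in a module. The class name ParallelLinear, its constructor arguments, and the zero-initialized bias are all hypothetical choices for illustration and are not part of PyTorch; the einsum call is just an equivalent way of writing the batched matrix multiplication used above.

```python
import torch
import torch.nn as nn


class ParallelLinear(nn.Module):
    """Hypothetical layer: num_layers independent linear maps applied to one input."""

    def __init__(self, num_layers, in_features, out_features):
        super().__init__()
        # One stacked weight tensor instead of num_layers separate nn.Linear modules.
        self.weight = nn.Parameter(torch.rand(num_layers, out_features, in_features))
        # One bias row per parallel layer (zero init is an arbitrary choice here).
        self.bias = nn.Parameter(torch.zeros(num_layers, out_features))

    def forward(self, x):
        # x: (batch, in_features) -> out: (num_layers, batch, out_features)
        out = torch.einsum("loi,bi->lbo", self.weight, x)
        # unsqueeze(1) lets the bias broadcast over the batch dimension.
        return out + self.bias.unsqueeze(1)


torch.manual_seed(0)
layer = ParallelLinear(num_layers=3, in_features=10, out_features=1)
x = torch.rand(5, 10)

out = layer(x)
out.sum().backward()

# Both parameters are registered on the module and received gradients.
param_names = [name for name, _ in layer.named_parameters()]
print(param_names, out.shape, layer.weight.grad.shape)

# Because they are registered, an optimizer can update them like any other layer.
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)
optimizer.step()
```

Keeping the weights in a single tensor means one matrix multiplication covers every parallel layer, at the cost of requiring all of them to share the same input and output sizes.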