# Does it matter at which stage to average?

This is a rather fundamental question and, to be honest, not really a PyTorch-specific question.

Let’s say that I have two linear layers:

``````python
layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)
``````

I’m going to populate the input tensor like this:

``````python
input = torch.randn(batch_size, arbitrary_size, input_size)
``````

Now, consider two different ways of applying the two layers to the input:

``````python
hidden = layer1(input)
output = layer2(hidden).mean(dim=1)
``````

and

``````python
hidden = layer1(input).mean(dim=1)
output = layer2(hidden)
``````

In both cases the output will be of shape `[batch_size, output_size]`. But in the first case, we’ll have:

``````[batch_size, arbitrary_size, input_size]
-> [batch_size, arbitrary_size, hidden_size]
-> [batch_size, arbitrary_size, output_size]
-> [batch_size, output_size]
``````

While in the second case, it’s going to be:

``````[batch_size, arbitrary_size, input_size]
-> [batch_size, arbitrary_size, hidden_size]
-> [batch_size, hidden_size]
-> [batch_size, output_size]
``````

My question is: What are the differences (if any) between the two cases? Is the learning capacity the same? How about the computational demand?

Hi Mehran!

If you have two `Linear`s, one after the other without any intervening
nonlinear activation (such as `relu()` or `sigmoid()`), then they collapse,
in effect, into a single `nn.Linear(input_size, output_size)`.
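To make the collapse concrete, here's a quick sketch (sizes chosen arbitrarily) that builds the equivalent single layer directly from the two weight matrices, using the identity `W2 (W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, output_size = 3, 7, 2

layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)

# Build the equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2.
combined = nn.Linear(input_size, output_size)
with torch.no_grad():
    combined.weight.copy_(layer2.weight @ layer1.weight)
    combined.bias.copy_(layer2.weight @ layer1.bias + layer2.bias)

x = torch.randn(4, input_size)
print(torch.allclose(layer2(layer1(x)), combined(x), atol=1e-5))  # prints True
```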

The `output` in the two cases will be numerically equal, up to some floating-point
round-off error, because `mean()` is itself a linear operation and therefore
commutes with the affine layers (the bias passes through unchanged, since the
average of a constant is that constant).
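You can verify the equality directly; this is a minimal check with made-up sizes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, arbitrary_size = 4, 5
input_size, hidden_size, output_size = 3, 7, 2

layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)
x = torch.randn(batch_size, arbitrary_size, input_size)

out_mean_last = layer2(layer1(x)).mean(dim=1)   # case 1: mean after layer2
out_mean_first = layer2(layer1(x).mean(dim=1))  # case 2: mean after layer1

# The two orderings agree up to floating-point round-off.
print(torch.allclose(out_mean_last, out_mean_first, atol=1e-5))  # prints True
```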

The number of parameters in the two cases will be the same, which is an
argument that they will have the same learning capacity. Backpropagation
takes a different path through the graph in the two cases (because you
inject the `.mean()` in a different place), but since the outputs are the
same function of the parameters, the gradients also agree up to round-off,
and training will progress in essentially the same way.
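Because the two outputs are the same function of the parameters, you can also compare the gradients empirically (again with arbitrary sizes):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(3, 7)
layer2 = nn.Linear(7, 2)
x = torch.randn(4, 5, 3)
params = list(layer1.parameters()) + list(layer2.parameters())

# Same scalar loss, computed with the mean in the two different places.
loss1 = layer2(layer1(x)).mean(dim=1).sum()
grads1 = torch.autograd.grad(loss1, params)

loss2 = layer2(layer1(x).mean(dim=1)).sum()
grads2 = torch.autograd.grad(loss2, params)

# The parameter gradients match up to floating-point round-off.
print(all(torch.allclose(g1, g2, atol=1e-5) for g1, g2 in zip(grads1, grads2)))
```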

The computational demand will be less in the second case (where `.mean()`
is injected after the first layer, rather than the second), because the size and
dimensionality of the computation is reduced earlier in the forward pass.
(However, given how PyTorch, especially on GPUs, operates on entire
tensors, the realized time savings is likely to be significantly less than the
actual reduction in the number of individual floating-point operations.)
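For a rough sense of the arithmetic saved, you can count multiply-accumulates for the two orderings (the sizes below are made-up, purely for illustration):

```python
# Rough multiply-accumulate counts for the two orderings.
batch_size, arbitrary_size = 32, 100
input_size, hidden_size, output_size = 64, 128, 10

# Case 1: mean after layer2 -- both matmuls see the full arbitrary_size axis.
case1 = batch_size * arbitrary_size * (input_size * hidden_size
                                       + hidden_size * output_size)

# Case 2: mean after layer1 -- layer2 runs on the reduced tensor.
case2 = (batch_size * arbitrary_size * input_size * hidden_size
         + batch_size * hidden_size * output_size)

print(case1, case2)  # case2 is smaller; most of the cost is in layer1 either way
```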

But again, there is no real point in stacking the two layers like this. If this
is the computation you want, just use a single `nn.Linear(input_size, output_size)`.

Best.

K. Frank


Thanks, Frank.

This was my intuition too, but I started doubting myself when I saw OpenAI's Whisper model (on Hugging Face) using the first approach! I thought there might be something to it that I didn't understand. But now that it's not just me, I feel more confident that my knowledge is not lacking.