This is a rather fundamental question and, to be honest, not really a PyTorch-specific one.
Let’s say that I have two linear layers:
layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)
I’m going to populate the input tensor like this:
input = torch.randn(batch_size, arbitrary_size, input_size)
Now, consider two different ways of applying the two layers to the input:
hidden = layer1(input)
output = layer2(hidden).mean(dim=1)
and
hidden = layer1(input).mean(dim=1)
output = layer2(hidden)
In both cases the output will be of shape [batch_size, output_size]. But in the first case, the shapes go:
[batch_size, arbitrary_size, input_size]
-> [batch_size, arbitrary_size, hidden_size]   (after layer1)
-> [batch_size, arbitrary_size, output_size]   (after layer2)
-> [batch_size, output_size]                   (after the mean over dim=1)
While in the second case, it’s going to be:
[batch_size, arbitrary_size, input_size]
-> [batch_size, arbitrary_size, hidden_size]   (after layer1)
-> [batch_size, hidden_size]                   (after the mean over dim=1)
-> [batch_size, output_size]                   (after layer2)
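To double-check the shapes, here is a minimal runnable sketch of both variants (the concrete sizes are just placeholders I made up for illustration):

import torch
import torch.nn as nn

# Toy sizes, only for illustration; the actual values don't matter.
batch_size, arbitrary_size, input_size, hidden_size, output_size = 4, 7, 16, 32, 8

layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)
x = torch.randn(batch_size, arbitrary_size, input_size)

# Case 1: apply both layers position-wise, then average over dim=1.
h1 = layer1(x)                   # [batch_size, arbitrary_size, hidden_size]
out1 = layer2(h1).mean(dim=1)    # [batch_size, output_size]

# Case 2: average right after the first layer, then apply the second layer.
h2 = layer1(x).mean(dim=1)       # [batch_size, hidden_size]
out2 = layer2(h2)                # [batch_size, output_size]

print(out1.shape, out2.shape)    # both: torch.Size([4, 8])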
My question is: What are the differences (if any) between the two cases? Is the learning capacity the same? How about the computational demand?
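To make the computational-demand part of the question concrete, here is my rough back-of-the-envelope count of the multiply-accumulates in the two linear layers (ignoring the biases and the mean itself; same toy sizes as in the snippet above):

# Toy sizes, as above.
batch_size, arbitrary_size, input_size, hidden_size, output_size = 4, 7, 16, 32, 8

# Case 1: layer2 is applied to every one of the arbitrary_size positions.
case1_macs = batch_size * arbitrary_size * (input_size * hidden_size + hidden_size * output_size)

# Case 2: layer2 is applied only once per sample, after the mean.
case2_macs = batch_size * (arbitrary_size * input_size * hidden_size + hidden_size * output_size)

print(case1_macs, case2_macs)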