This is a rather fundamental question and, to be honest, not really a PyTorch-specific one.

Let’s say that I have two linear layers:

```
layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)
```

I’m going to populate the input tensor like this:

```
input = torch.randn(batch_size, arbitrary_size, input_size)
```

Now, consider two different ways of applying the two layers to the input:

```
hidden = layer1(input)
output = layer2(hidden).mean(dim=1)
```

and

```
hidden = layer1(input).mean(dim=1)
output = layer2(hidden)
```

In both cases the output will be of shape `[batch_size, output_size]`. But in the first case, we'll have:

```
[batch_size, arbitrary_size, input_size]
-> [batch_size, arbitrary_size, hidden_size]
-> [batch_size, arbitrary_size, output_size]
-> [batch_size, output_size]
```

While in the second case, it’s going to be:

```
[batch_size, arbitrary_size, input_size]
-> [batch_size, hidden_size]
-> [batch_size, output_size]
```
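To make the two cases concrete, here is a small self-contained sketch (the sizes are arbitrary, chosen just for illustration). One thing worth noting: since `nn.Linear` is an affine map and `mean` is linear, the two orderings actually produce the same result up to floating-point error when there is no nonlinearity between the layers, as the snippet checks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (placeholders, not from the original post)
batch_size, arbitrary_size = 4, 7
input_size, hidden_size, output_size = 10, 16, 3

layer1 = nn.Linear(input_size, hidden_size)
layer2 = nn.Linear(hidden_size, output_size)

x = torch.randn(batch_size, arbitrary_size, input_size)

# Case 1: apply both layers, then average over dim 1
out1 = layer2(layer1(x)).mean(dim=1)

# Case 2: average over dim 1 between the two layers
out2 = layer2(layer1(x).mean(dim=1))

print(out1.shape, out2.shape)  # both torch.Size([4, 3])
print(torch.allclose(out1, out2, atol=1e-5))
```

If an activation such as `ReLU` were applied between the layers, the two orderings would no longer agree, since the nonlinearity does not commute with the mean.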

My question is: What are the differences (if any) between the two cases? Is the learning capacity the same? And what about the computational cost?