# Is there any benefit to stacking multiple GRUs one after the other?

I’m working on time series problems.

On the web I saw two types of models:

• Models that use one `GRU` with multiple layers.

`nn.GRU(num_layers=4, …)`

• Models that use multiple `GRU`s (one after the other), each with fewer (or equal) layers.
```
nn.GRU(num_layers=2, ...)
nn.GRU(num_layers=2, ...)
```

I understand that using one `GRU` or multiple `GRU`s changes the number of weights to compute and optimize,

but I didn’t find any explanation of the trade-offs between the two options above:

1. What is the benefit of (or when would we want to use) multiple `GRU`s versus one `GRU`?
2. Is there a benefit to using one `GRU` with a high number of layers versus multiple `GRU`s with fewer layers?

The two above are practically identical; the only difference is semantics. The GRU structure for each layer is described in the docs:

https://pytorch.org/docs/stable/generated/torch.nn.GRU.html

You can think of the above as one building block, with additional GRU layers just repeating the same gated algorithm. So whether you want to give each a different label, or put them all under one label, is a matter of preference. The math works out the same.
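To make the "math works out the same" point concrete, here is a minimal sketch that copies the weights of one 4-layer `nn.GRU` into two 2-layer `nn.GRU`s and compares the outputs. The sizes and variable names are arbitrary choices for illustration, and it assumes `input_size == hidden_size` so the layers line up one-to-one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
size = 8  # assumption: input_size == hidden_size, so every layer has the same shapes

big = nn.GRU(size, size, num_layers=4, batch_first=True)
first = nn.GRU(size, size, num_layers=2, batch_first=True)
second = nn.GRU(size, size, num_layers=2, batch_first=True)

# Copy layers 0-1 of `big` into `first` and layers 2-3 into `second`.
with torch.no_grad():
    for layer in range(4):
        target = first if layer < 2 else second
        local = layer % 2
        for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
            src = getattr(big, f"{name}_l{layer}")
            getattr(target, f"{name}_l{local}").copy_(src)

x = torch.randn(3, 5, size)  # (batch, seq_len, features)

out_big, h_big = big(x)             # h0 defaults to zeros for all 4 layers
mid, h_first = first(x)             # h0 defaults to zeros
out_split, h_second = second(mid)   # h0 defaults to zeros

same_output = torch.allclose(out_big, out_split, atol=1e-5)
same_hidden = torch.allclose(h_big, torch.cat([h_first, h_second], dim=0), atol=1e-5)
print(same_output, same_hidden)
```

With shared weights and the same initial hidden states, both flags should come out `True`, since the second GRU simply consumes the output sequence of the first, exactly as layer 3 consumes the output of layer 2 inside the single module.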

I could see there being a benefit to splitting it up for very large models, as it’s much easier to assign each “layer” to different GPUs/TPUs during initialization and training. If you put them all in one nn.GRU unit, it’s going to take a lot more code and effort to assign each layer to separate GPUs.
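As a rough sketch of that idea, the split version lets you call `.to(device)` on each GRU separately and move the activations between them in `forward`. The class name `SplitGRU` and the sizes are made up for illustration, and the code falls back to CPU when fewer than two GPUs are present, so it only shows the pattern rather than a tuned setup.

```python
import torch
import torch.nn as nn

# Use two GPUs when available; otherwise fall back to CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")


class SplitGRU(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, size, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        # Each GRU lives on its own device.
        self.first = nn.GRU(size, size, num_layers=2, batch_first=True).to(dev0)
        self.second = nn.GRU(size, size, num_layers=2, batch_first=True).to(dev1)

    def forward(self, x):
        out, _ = self.first(x.to(self.dev0))
        # Move the intermediate sequence to the second device before continuing.
        out, _ = self.second(out.to(self.dev1))
        return out


model = SplitGRU(8, dev0, dev1)
y = model(torch.randn(3, 5, 8))
print(y.shape, y.device)
```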

Thanks.
If we use two `GRU`s, the `H0` of the second `GRU` may be initialised with zeros (or other values) instead of getting the last hidden state of the first `GRU`,
whereas with one multi-layer `GRU` the hidden state is passed to the next layer, across all layers.

Is that something that can change the results dramatically?

Right, it would change the result. You would need to keep track of two hidden states, one for each separate unit. Notice that the number of layers is part of the hidden state's dimensions: for a unidirectional GRU it has shape `(num_layers, batch, hidden_size)`.
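A quick way to see this is to print the shapes of the returned hidden states. This is a small sketch with arbitrary sizes (batch 3, sequence length 5, 8 features); the variable names are only for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(3, 5, 8)  # (batch, seq_len, features), arbitrary sizes

# One GRU with 4 layers: a single hidden state covering all 4 layers.
one = nn.GRU(8, 8, num_layers=4, batch_first=True)
_, h_one = one(x)

# Two GRUs with 2 layers each: two hidden states to keep track of.
first = nn.GRU(8, 8, num_layers=2, batch_first=True)
second = nn.GRU(8, 8, num_layers=2, batch_first=True)
mid, h_first = first(x)
_, h_second = second(mid)

shape_one = tuple(h_one.shape)        # (4, 3, 8)
shape_first = tuple(h_first.shape)    # (2, 3, 8)
shape_second = tuple(h_second.shape)  # (2, 3, 8)
print(shape_one, shape_first, shape_second)
```

Note that the hidden state keeps the layer dimension first even with `batch_first=True`; that flag only affects the input and output sequences.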

If you set the initial hidden state to zeros for the second GRU, it would no longer be equivalent to passing a full hidden state to one multi-layer GRU. It would be the same as using one GRU and setting the corresponding half of its initial hidden state to zeros.
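Here is a hedged sketch of that claim, again with made-up sizes and with weights shared between a 4-layer GRU and two 2-layer GRUs so the comparison is meaningful. The first GRU receives a real initial state, the second starts from zeros, and the result is compared against the single GRU with layers 2-3 of `h0` zeroed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
size, batch, steps = 8, 3, 5  # arbitrary sizes; input_size == hidden_size

big = nn.GRU(size, size, num_layers=4, batch_first=True)
first = nn.GRU(size, size, num_layers=2, batch_first=True)
second = nn.GRU(size, size, num_layers=2, batch_first=True)

# Share weights: layers 0-1 of `big` -> `first`, layers 2-3 -> `second`.
with torch.no_grad():
    for layer in range(4):
        target = first if layer < 2 else second
        for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
            src = getattr(big, f"{name}_l{layer}")
            getattr(target, f"{name}_l{layer % 2}").copy_(src)

x = torch.randn(batch, steps, size)
h0 = torch.randn(4, batch, size)  # a non-zero initial state for all 4 layers

# Split model: the first GRU gets its part of h0, the second starts from zeros.
mid, _ = first(x, h0[:2].contiguous())
out_split, _ = second(mid)

# Single GRU with the second half of h0 zeroed, and with the full h0.
h0_half = h0.clone()
h0_half[2:] = 0
out_half, _ = big(x, h0_half)
out_full, _ = big(x, h0)

matches_half_zero = torch.allclose(out_split, out_half, atol=1e-5)
differs_from_full = not torch.allclose(out_split, out_full, atol=1e-5)
print(matches_half_zero, differs_from_full)
```

If you want the split model to match the single GRU with the full state, pass `h0[2:]` to the second GRU instead of relying on the zero default.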
