I’m working on time series problems.
On the web I saw two types of models:
- Models that use one GRU with multiple layers.
- Models that use multiple GRUs (one after the other) with fewer (or an equal number of) layers.
I understand that using one GRU or multiple GRUs changes the number of weights to calculate and optimize, but I didn’t find any explanation of the trade-offs between the 2 options above:
- What is the benefit of using multiple GRUs versus one GRU, and when would we want each?
- Is there a benefit to using one GRU with a large number of layers versus multiple GRUs with fewer layers?
The two options above are practically identical; the only difference is semantics. Every GRU layer has the same structure: a reset gate, an update gate, a candidate state, and the resulting hidden state.
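For reference, the per-layer update can be written in the form used by the PyTorch `nn.GRU` documentation. This is assumed to be the formulation meant here, not a quote from the original post:

```latex
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\bigl(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\bigr) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
```

Here $x_t$ is the layer's input at time $t$. For every layer above the first, that input is the hidden state $h_t$ of the layer below.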
You can think of the above as one building block, with additional GRU layers just repeating the same gated algorithm. So whether you want to give each a different label, or put them all under one label, is a matter of preference. The math works out the same.
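A small check can make this concrete. The sketch below assumes default settings (no dropout, unidirectional, zero initial hidden state) and uses made-up sizes. It copies the weights of a 2-layer `nn.GRU` into two single-layer `nn.GRU` modules and compares the results:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes, chosen arbitrarily for this sketch.
input_size, hidden_size, seq_len, batch = 4, 8, 5, 3

# Option A: one GRU module with two stacked layers.
stacked = nn.GRU(input_size, hidden_size, num_layers=2, batch_first=True)

# Option B: two single-layer GRU modules applied one after the other.
first = nn.GRU(input_size, hidden_size, num_layers=1, batch_first=True)
second = nn.GRU(hidden_size, hidden_size, num_layers=1, batch_first=True)

# Copy each stacked layer's parameters into the matching separate module.
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(first, f"{name}_l0").copy_(getattr(stacked, f"{name}_l0"))
        getattr(second, f"{name}_l0").copy_(getattr(stacked, f"{name}_l1"))

x = torch.randn(batch, seq_len, input_size)

with torch.no_grad():
    out_stacked, h_stacked = stacked(x)       # h_stacked: (2, batch, hidden_size)
    out_first, h_first = first(x)             # h_first:   (1, batch, hidden_size)
    out_second, h_second = second(out_first)  # h_second:  (1, batch, hidden_size)

# The top-layer outputs and the per-layer final hidden states should agree.
same_output = torch.allclose(out_stacked, out_second, atol=1e-5)
same_hidden = torch.allclose(
    h_stacked, torch.cat([h_first, h_second], dim=0), atol=1e-5
)
print(same_output, same_hidden)
```

With dropout enabled in the stacked module, the separate modules would need an explicit dropout layer between them to stay comparable.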
I could see there being a benefit to splitting it up for very large models, as it’s much easier to assign each “layer” to different GPUs/TPUs during initialization and training. If you put them all in one nn.GRU unit, it’s going to take a lot more code and effort to assign each layer to separate GPUs.
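As a rough sketch of that idea, the hypothetical `TwoDeviceGRU` module below (the name and sizes are invented for illustration) keeps each GRU on its own device and moves the activations between them. It falls back to a single CPU device when two GPUs are not available:

```python
import torch
import torch.nn as nn


class TwoDeviceGRU(nn.Module):
    """Two single-layer GRUs, each placed on its own device (illustrative only)."""

    def __init__(self, input_size, hidden_size, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        self.gru0 = nn.GRU(input_size, hidden_size, batch_first=True).to(dev0)
        self.gru1 = nn.GRU(hidden_size, hidden_size, batch_first=True).to(dev1)

    def forward(self, x):
        # Run the first GRU on its device, then move its output to the second.
        out, _ = self.gru0(x.to(self.dev0))
        out, h = self.gru1(out.to(self.dev1))
        return out, h


# Use two GPUs when present; otherwise run both parts on the CPU.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

model = TwoDeviceGRU(input_size=4, hidden_size=8, dev0=dev0, dev1=dev1)
out, h = model(torch.randn(3, 5, 4))
print(out.shape, h.shape)
```

This is plain manual placement; it does not overlap computation across devices, so it mainly helps when a model does not fit on one device.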
If we use 2 GRUs, the H0 of the second GRU may be initialised with zeros (or different values) instead of getting the last hidden state of the first GRU, while with 1 GRU with multiple layers the hidden state is passed to the next layer, across all layers.
Is that something that can change the results dramatically?
Right, it would change the result. You would need to keep track of 2 hidden states, one for each separate unit. Notice how the hidden state includes the number of layers in its dimensions: for a unidirectional `nn.GRU` it has shape `(num_layers, batch, hidden_size)`.
If you set the hidden state to zeros for the second layer, it would no longer be the equivalent of using multiple layers in one GRU. It would be the same as if you used one GRU and set half of the hidden state to zeros.
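The sketch below illustrates this under assumed sizes and a random, non-zero initial hidden state. It compares three cases: separate GRUs that each receive their own slice of the hidden state, separate GRUs where the second starts from zeros, and one stacked GRU with the second layer's slice zeroed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch = 4, 8, 6, 3  # arbitrary sizes

stacked = nn.GRU(input_size, hidden_size, num_layers=2, batch_first=True)
first = nn.GRU(input_size, hidden_size, batch_first=True)
second = nn.GRU(hidden_size, hidden_size, batch_first=True)

# Give the separate modules the same weights as the stacked layers.
with torch.no_grad():
    for name in ("weight_ih", "weight_hh", "bias_ih", "bias_hh"):
        getattr(first, f"{name}_l0").copy_(getattr(stacked, f"{name}_l0"))
        getattr(second, f"{name}_l0").copy_(getattr(stacked, f"{name}_l1"))

x = torch.randn(batch, seq_len, input_size)
h0 = torch.randn(2, batch, hidden_size)  # one slice per layer

with torch.no_grad():
    out_ref, _ = stacked(x, h0)

    # Case 1: each separate GRU gets its own slice of h0.
    mid, _ = first(x, h0[0:1])
    out_tracked, _ = second(mid, h0[1:2])

    # Case 2: the second GRU starts from zeros (the default when h0 is omitted).
    out_zeroed, _ = second(mid)

    # Case 3: one stacked GRU with the second layer's slice set to zeros.
    h0_half = h0.clone()
    h0_half[1].zero_()
    out_half, _ = stacked(x, h0_half)

tracked_matches_ref = torch.allclose(out_ref, out_tracked, atol=1e-5)
zeroed_matches_ref = torch.allclose(out_ref, out_zeroed, atol=1e-5)
zeroed_matches_half = torch.allclose(out_zeroed, out_half, atol=1e-5)
print(tracked_matches_ref, zeroed_matches_ref, zeroed_matches_half)
```

When no initial hidden state is passed at all, every layer starts from zeros in both setups, so the difference only appears once a non-zero state is supplied or carried over between calls.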