Calculation of gate outputs in GRU layer

Hey, I am trying to figure out the calculations that take place in a GRU layer.

I obtained a pre-trained model and it has a GRU layer defined as GRU(96, 96, bias=True).

I checked the dimensions of the weights and bias:
weight_ih_l0 = [288, 96]
weight_hh_l0 = [288, 96]
bias_ih_l0 = [288]
bias_hh_l0 = [288]

The input that is fed to the layer is of size [1000, 8, 96]
The batch_first variable is ‘False’, so this would mean:
Sequence Length = 1000
Batch size = 8
Input size = 96

I tried to follow the equations in GRU — PyTorch 1.9.0 documentation
In r(t) we multiply W with X, but my W is 2-dimensional and X is 3-dimensional, which seems to make them incompatible for matrix multiplication.

I know that there are multiple time steps involved, but how is the input X (which is 3D) split so that it is compatible for multiplication with the weight matrix?

In optimized implementations it is not sliced; instead a batched matrix multiplication is done: (1000, 8, 96) @ (96, hs*3) = (1000, 8, hs*3). This gives the Wx summands for all three gates and all timesteps at once (hs = hidden size, which can differ from the input size 96; the (96, hs*3) operand is weight_ih_l0 transposed). This is a “precalculation”, as the Wh summands must be computed step by step.
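
For reference, here is a rough sketch of that calculation, checked against nn.GRU. It assumes a single-layer, unidirectional GRU with a zero initial hidden state and the gate order reset, update, new inside the stacked weights, as given in the PyTorch GRU docs. The helper name manual_gru and the short sequence length of 5 (instead of 1000) are only for illustration, and the layer is randomly initialized rather than your pre-trained one.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Same layer shape as in the question; seq_len shortened to keep the demo fast.
seq_len, batch, input_size, hidden_size = 5, 8, 96, 96
gru = nn.GRU(input_size, hidden_size, bias=True)  # batch_first=False by default
x = torch.randn(seq_len, batch, input_size)


def manual_gru(x, gru):
    """Hypothetical helper: recompute a 1-layer GRU forward pass by hand."""
    w_ih, w_hh = gru.weight_ih_l0, gru.weight_hh_l0  # each (3*hs, in) / (3*hs, hs)
    b_ih, b_hh = gru.bias_ih_l0, gru.bias_hh_l0      # each (3*hs,)
    hs = gru.hidden_size

    # "Precalculation": Wx summands for all gates and all timesteps at once.
    # (seq_len, batch, in) @ (in, 3*hs) -> (seq_len, batch, 3*hs)
    gi_all = x @ w_ih.T + b_ih

    h = x.new_zeros(x.shape[1], hs)  # assumed zero initial hidden state
    outputs = []
    for t in range(x.shape[0]):
        # Wh summands depend on the previous hidden state, so they go step by step.
        gh = h @ w_hh.T + b_hh                          # (batch, 3*hs)
        i_r, i_z, i_n = gi_all[t].chunk(3, dim=1)       # reset, update, new
        h_r, h_z, h_new = gh.chunk(3, dim=1)
        r = torch.sigmoid(i_r + h_r)                    # reset gate
        z = torch.sigmoid(i_z + h_z)                    # update gate
        n = torch.tanh(i_n + r * h_new)                 # candidate state
        h = (1 - z) * n + z * h                         # new hidden state
        outputs.append(h)
    return torch.stack(outputs), h.unsqueeze(0)


with torch.no_grad():
    ref_out, ref_h = gru(x)
    out, final_h = manual_gru(x, gru)

print(out.shape)
print(torch.allclose(out, ref_out, atol=1e-5))
```

So the 3D input never has to be cut up for the Wx part: the matmul broadcasts over the leading (sequence, batch) dimensions, and only the recurrent part loops over time, using the [batch, 96] slice for each step.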

@googlebot Thank you. I got the calculations.