In the PyTorch GRU implementation, the weight parameters are initialized with shapes (3*hidden_size, hidden_size) and (3*hidden_size, input_size). I assume the GRU internally slices each parameter into three matrices, corresponding to the reset gate, the update gate, and the candidate activation h, respectively. But the input matrix is typically (None, embed_size), so it seems the GRU computes W*xT rather than x*W + b?

W*xT: (hidden_size, input_size) * (batch, embed_size)T → (hidden_size, batch)

x*W: (batch, embed_size) * (embed_size, hidden_size) → (batch, hidden_size)
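For what it's worth, PyTorch stores weights in (out_features, in_features) layout and applies them as x @ W.T + b (i.e. `torch.nn.functional.linear`), so both views describe the same computation. A small sketch that reproduces `GRUCell` by hand from its stored parameters, following the gate order (reset, update, new) documented for PyTorch:

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, batch = 4, 3, 2
cell = torch.nn.GRUCell(input_size, hidden_size)

x = torch.randn(batch, input_size)
h = torch.randn(batch, hidden_size)

# Weights are stored as (3*hidden_size, in_features) and applied
# as x @ W.T + b, not x @ W.
W_ih, W_hh = cell.weight_ih, cell.weight_hh   # (3H, I), (3H, H)
b_ih, b_hh = cell.bias_ih, cell.bias_hh       # (3H,), (3H,)

gi = x @ W_ih.T + b_ih                        # (batch, 3H)
gh = h @ W_hh.T + b_hh                        # (batch, 3H)
i_r, i_z, i_n = gi.chunk(3, dim=1)            # gate order: reset, update, new
h_r, h_z, h_n = gh.chunk(3, dim=1)

r = torch.sigmoid(i_r + h_r)                  # reset gate
z = torch.sigmoid(i_z + h_z)                  # update gate
n = torch.tanh(i_n + r * h_n)                 # candidate activation
h_new = (1 - z) * n + z * h

print(torch.allclose(h_new, cell(x, h), atol=1e-6))  # True
```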

I need to implement a GRU cell that handles variable-length hidden states. One candidate solution is to reuse the PyTorch GRU: set the weight matrices to a maximum shape, then apply a {0,1} mask matrix to the GRU's output. This is only possible if the GRU behaves like x*W + b internally.
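A minimal sketch of the masking idea described above, assuming hypothetical per-example effective hidden sizes (`eff_sizes` is an invented name for illustration):

```python
import torch

torch.manual_seed(0)
input_size, max_hidden, batch, seq_len = 4, 6, 2, 5
gru = torch.nn.GRU(input_size, max_hidden, batch_first=True)

x = torch.randn(batch, seq_len, input_size)
out, h_n = gru(x)                         # out: (batch, seq_len, max_hidden)

# Hypothetical effective hidden sizes per example, each <= max_hidden
eff_sizes = torch.tensor([4, 2])
# {0,1} mask: 1 for the first eff_sizes[i] units, 0 for the rest
mask = (torch.arange(max_hidden)[None, :] < eff_sizes[:, None]).float()  # (batch, max_hidden)

masked_out = out * mask[:, None, :]       # zero the unused hidden units
```

One caveat worth noting: masking only the output does not stop the masked units from feeding back through the recurrence, since the full max-size hidden state is still carried between time steps. To fully emulate a smaller GRU, you would likely need to step the sequence manually (e.g. with `GRUCell`) and mask the hidden state at every step.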