xW + b or Wx + b in GRU matrix multiplication?

I found that in the GRU implementation, the weight parameters are initialized with shapes (3*hidden_size, hidden_size) and (3*hidden_size, input_size). I guess the GRU internally slices each parameter into three matrices, corresponding to the reset gate, the update gate, and the candidate activation h, respectively. But the input matrix is usually (None, embed_size), so it seems the GRU performs the W*x^T computation rather than xW + b?

W*x^T: (hidden_size, input_size) * (batch, embed_size)^T

xW: (batch, embed_size) * (embed_size, hidden_size)
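For what it's worth, this can be checked directly: PyTorch stores `weight_ih` as (3*hidden_size, input_size) but applies it as x @ W^T + b (i.e. `F.linear` semantics), with the three gate blocks stacked in the order reset, update, new. A small sketch verifying a manual computation against `nn.GRUCell`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 4, 5, 3
cell = nn.GRUCell(input_size, hidden_size)

# weight_ih: (3*hidden_size, input_size), weight_hh: (3*hidden_size, hidden_size)
x = torch.randn(batch, input_size)   # (batch, input_size), no transpose needed
h = torch.randn(batch, hidden_size)

# The weights are applied as x @ W^T + b, so the (batch, input_size)
# layout works directly against the (3*hidden_size, input_size) parameter.
gi = x @ cell.weight_ih.t() + cell.bias_ih
gh = h @ cell.weight_hh.t() + cell.bias_hh
i_r, i_z, i_n = gi.chunk(3, dim=1)   # gate block order: reset, update, new
h_r, h_z, h_n = gh.chunk(3, dim=1)

r = torch.sigmoid(i_r + h_r)
z = torch.sigmoid(i_z + h_z)
n = torch.tanh(i_n + r * h_n)
h_manual = (1 - z) * n + z * h

assert torch.allclose(h_manual, cell(x, h), atol=1e-6)
```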

I need to implement a GRU cell that handles variable-length hidden states. One of my candidate solutions is to reuse the PyTorch GRU, set the weight matrices to a maximum shape, then apply a {0,1} mask matrix to the output of the GRU. This only works if the GRU computes xW + b internally.
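A minimal sketch of that masking idea (the per-sample hidden sizes here are hypothetical, just for illustration): allocate the cell at the maximum hidden size, zero the padded hidden units on the way in, and mask the output so they stay zero. Since the padded entries of h are zero, they contribute nothing to h @ W_hh^T, so the active units behave like a smaller GRU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
max_hidden, input_size, batch = 8, 4, 3
cell = nn.GRUCell(input_size, max_hidden)  # allocated at the maximum size

# Hypothetical per-sample effective hidden sizes (an assumption for this sketch)
sizes = torch.tensor([3, 8, 5])
# mask[i, j] = 1 while j is an active hidden unit of sample i
mask = (torch.arange(max_hidden).unsqueeze(0) < sizes.unsqueeze(1)).float()

x = torch.randn(batch, input_size)
h = torch.randn(batch, max_hidden) * mask  # zero out the padded hidden units
h_new = cell(x, h) * mask                  # keep the padded units at zero
```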

I think if you set batch_first=True when you create the GRU, it should process the input as you want.

Emmm, I realize that it makes no difference whether it is xW or Wx. Just zero-fill and slice.
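To confirm the zero-fill-and-slice idea, here is a sketch checking that a masked max-size `GRUCell` matches a smaller `GRUCell` whose parameters are slices of the big one. The one subtlety is that the three gate blocks are stacked along dim 0, so each block must be sliced separately:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
max_hidden, input_size, k = 8, 4, 3   # k: effective hidden size for one sample
big = nn.GRUCell(input_size, max_hidden)

x = torch.randn(1, input_size)
h = torch.zeros(1, max_hidden)
h[:, :k] = torch.randn(1, k)          # padded hidden units stay zero

mask = torch.zeros(1, max_hidden)
mask[:, :k] = 1.0
h_big = big(x, h) * mask              # masked max-size cell

# Take the first k rows of each stacked gate block (reset, update, new),
# and the first k columns of weight_hh for the active hidden units.
idx = torch.cat([b * max_hidden + torch.arange(k) for b in range(3)])
small = nn.GRUCell(input_size, k)
with torch.no_grad():
    small.weight_ih.copy_(big.weight_ih[idx])
    small.weight_hh.copy_(big.weight_hh[idx][:, :k])
    small.bias_ih.copy_(big.bias_ih[idx])
    small.bias_hh.copy_(big.bias_hh[idx])

h_small = small(x, h[:, :k])
assert torch.allclose(h_big[:, :k], h_small, atol=1e-6)
```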