Parallel GRUCell computation

Say I have a tensor A of shape [B, n, 64, 64]. I want n separate GRUCell units to process this tensor along dim=1, one cell per slice.

That is, GRUCell_0 gets A[:, 0], GRUCell_1 gets A[:, 1], and so on.

Is there a way to run all n cells in parallel instead of looping over them?

Thanks!
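
One possible approach, sketched here under some assumptions: stack the weights of the n cells along a new leading dimension and apply the standard GRU update with a batched matrix product (torch.einsum), so that all cells are evaluated in one call. The question does not say how each [B, 64, 64] slice is fed to a cell, so this sketch assumes each slice is flattened to a vector of length 64*64. The hidden size H, the random initial hidden state, and the helper name batched_gru_step are likewise illustrative choices, not something given in the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

B, n = 4, 3          # batch size and number of cells (example values)
D = 64 * 64          # assumption: each 64x64 slice is flattened to one vector
H = 32               # assumption: hidden size, not given in the question

# n independent GRUCells, as in the loop-based version.
cells = nn.ModuleList([nn.GRUCell(D, H) for _ in range(n)])

A = torch.randn(B, n, 64, 64)
x = A.flatten(2)                 # [B, n, D]
h = torch.randn(B, n, H)         # one hidden state per cell

# Stack the per-cell parameters along a new leading "cell" dimension.
W_ih = torch.stack([c.weight_ih for c in cells])  # [n, 3H, D]
W_hh = torch.stack([c.weight_hh for c in cells])  # [n, 3H, H]
b_ih = torch.stack([c.bias_ih for c in cells])    # [n, 3H]
b_hh = torch.stack([c.bias_hh for c in cells])    # [n, 3H]


def batched_gru_step(x, h):
    """Hypothetical helper: one GRU step for all n cells at once."""
    # Each cell i multiplies its own slice x[:, i] by its own weights.
    gi = torch.einsum("bnd,ngd->bng", x, W_ih) + b_ih   # [B, n, 3H]
    gh = torch.einsum("bnh,ngh->bng", h, W_hh) + b_hh   # [B, n, 3H]
    # nn.GRUCell stores the gates in the order reset, update, new.
    i_r, i_z, i_n = gi.chunk(3, dim=-1)
    h_r, h_z, h_n = gh.chunk(3, dim=-1)
    r = torch.sigmoid(i_r + h_r)
    z = torch.sigmoid(i_z + h_z)
    cand = torch.tanh(i_n + r * h_n)
    return (1 - z) * cand + z * h                        # [B, n, H]


with torch.no_grad():
    # Reference: the for loop over cells described in the question.
    h_loop = torch.stack(
        [cells[i](x[:, i], h[:, i]) for i in range(n)], dim=1
    )
    h_par = batched_gru_step(x, h)

print(h_par.shape)  # torch.Size([4, 3, 32])
print(torch.allclose(h_loop, h_par, atol=1e-4))
```

The second print compares the batched result against the plain loop, up to floating-point tolerance. For training, it would probably be simpler to create the stacked tensors directly as nn.Parameter objects inside a custom module rather than stacking existing cells on every step. Recent PyTorch versions also provide torch.func.vmap together with stack_module_state and functional_call for this kind of "ensemble of identical modules" pattern, which may be worth trying if you prefer to keep using nn.GRUCell itself.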