If you organise things sequence first, then each timestep, which is much like a regular layer (linear on hidden + linear on input + nonlinearities + gating) operates on contiguous bits of data, and you have good caching properties etc.
Best regards
Thomas