Why is the 'batch_first' parameter needed?

If you organise the data sequence-first, then each timestep, which works much like a regular layer (linear on hidden + linear on input + nonlinearities + gating), operates on a contiguous block of memory, so you get good caching properties.
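To make the contiguity point concrete, here is a rough sketch in plain Python that computes where the elements of one timestep land in a flat, row-major buffer for the two layouts. The helper names (`timestep_offsets`, `is_contiguous`) and the sizes are made up for illustration; this is only index arithmetic, not how PyTorch implements anything internally.

```python
def timestep_offsets(seq_len, batch, features, t, batch_first):
    """Flat row-major offsets of every element that belongs to timestep t."""
    offsets = []
    for b in range(batch):
        for f in range(features):
            if batch_first:
                # Layout (batch, seq, feature): timesteps of one sample sit together.
                offsets.append((b * seq_len + t) * features + f)
            else:
                # Layout (seq, batch, feature): all samples of one timestep sit together.
                offsets.append((t * batch + b) * features + f)
    return offsets


def is_contiguous(offsets):
    """True if the offsets form one unbroken run of consecutive addresses."""
    return all(nxt - cur == 1 for cur, nxt in zip(offsets, offsets[1:]))


# Hypothetical sizes: 4 timesteps, batch of 3, 2 features; look at timestep 1.
seq_first_offsets = timestep_offsets(4, 3, 2, t=1, batch_first=False)
batch_first_offsets = timestep_offsets(4, 3, 2, t=1, batch_first=True)

print(seq_first_offsets, is_contiguous(seq_first_offsets))
print(batch_first_offsets, is_contiguous(batch_first_offsets))
```

With the sequence-first layout the timestep occupies a single run of addresses, while with batch-first the same timestep is scattered in strides of `seq_len * features` across the buffer.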

Best regards

Thomas
