Why are the input/output dimensions of an LSTM oriented the way they are?

RNNs run a loop that unrolls the time dimension, so with a time-major layout each step reads one contiguous block of memory. How much that matters depends on the implementation, the device, and the tensor size (i.e. whether a step fits in cache).
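A minimal NumPy sketch of the layout difference (the sizes `T`, `B`, `F` are arbitrary illustrative values): in a time-major tensor of shape `(T, B, F)`, slicing out one timestep yields a contiguous block, while in a batch-major tensor of shape `(B, T, F)` the same per-timestep read is strided.

```python
import numpy as np

T, B, F = 5, 3, 4  # illustrative sizes: time steps, batch, features

# Time-major layout: shape (T, B, F). In row-major (C) order, each
# timestep slice x[t] is one contiguous slab of B*F values.
time_major = np.zeros((T, B, F))
print(time_major[0].flags["C_CONTIGUOUS"])      # True: contiguous slab

# Batch-major layout: shape (B, T, F). Reading one timestep across the
# whole batch gathers B strided chunks of F values each.
batch_major = np.zeros((B, T, F))
print(batch_major[:, 0].flags["C_CONTIGUOUS"])  # False: strided access
```

Frameworks that default to time-major (or expose an option for it, such as a `batch_first`-style flag) are exploiting exactly this: the per-step slice the recurrence loop touches is a single cache-friendly block.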