How to process a sequence of multi-dimensional inputs in nn.GRU?

According to the PyTorch documentation, nn.GRU expects input of shape (seq_len, batch, input_size) when batch_first is False, or (batch, seq_len, input_size) when batch_first is True, where input_size is an integer. However, if I am dealing with a sequence of multi-dimensional tensors, is it possible to use nn.GRU() without explicitly flattening the input at each time step? If so, what should input_size be, given that it only accepts an integer and not a tuple of integers?

For example, say I have a grayscale video of 16 frames, each of size 128x128. I can easily set seq_len to 16, but how do I deal with the shape of each frame at each time step? Do I need to flatten each frame into a single dimension of size 128*128?
(I understand that passing a video directly to a GRU may not be a good idea; I just want to know how nn.GRU deals with multi-dimensional input at each time step.)
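
To make the question concrete, here is a minimal sketch of the flattening approach I am asking about. The batch size of 2, the hidden_size of 32, and batch_first=True are arbitrary choices of mine for illustration; only the 16 frames of 128x128 come from the example above.

```python
import torch
import torch.nn as nn

# Illustrative sizes; batch and hidden_size are arbitrary assumptions.
batch, seq_len, height, width = 2, 16, 128, 128
hidden_size = 32

# A batch of grayscale videos: (batch, seq_len, height, width)
video = torch.randn(batch, seq_len, height, width)

# Flatten each frame so the input becomes (batch, seq_len, height * width)
frames = video.flatten(start_dim=2)

# input_size is a single integer: the number of features per time step
gru = nn.GRU(input_size=height * width, hidden_size=hidden_size, batch_first=True)
output, h_n = gru(frames)

print(frames.shape)  # torch.Size([2, 16, 16384])
print(output.shape)  # torch.Size([2, 16, 32])
print(h_n.shape)     # torch.Size([1, 2, 32])
```

This runs, but it treats every pixel as an independent feature, which is why I am asking whether nn.GRU can accept the unflattened (128, 128) frames instead.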