Why are some vanilla RNNs initialized with a hidden state with sequence_length=1 for MNIST image classification?

I came across several examples of classifying MNIST digits with an RNN. What is the reason for initializing the hidden state with sequence_length=1? If you were doing one-step-ahead prediction of video frames, how would you initialize it?

def init_hidden(self, x, device=None):
    # x is a 4D input tensor: (batch size, channels, width, height)
    # initialize the hidden state to zeros
    # shape as labeled in the example: (number of layers, sequence length, number of hidden nodes)
    if self.bidirectional:
        # a bidirectional RNN needs one hidden state per direction and layer
        h0 = torch.zeros(2 * self.n_layers, 1, self.n_hidden)
    else:
        h0 = torch.zeros(self.n_layers, 1, self.n_hidden)

    if device is not None:
        h0 = h0.to(device)
    self.hidden = h0
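
For reference, here is a minimal sketch of the shapes involved, assuming the examples use PyTorch's `torch.nn.RNN` with its default `batch_first=False`. The sizes and variable names are hypothetical. As far as I can tell from the documentation, the middle dimension of the initial hidden state is the batch size, so the `1` above may correspond to a batch of one rather than to a sequence length:

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
n_layers, n_hidden, image_width = 1, 16, 28

# batch_first defaults to False, so inputs are (sequence length, batch, input size).
rnn = nn.RNN(input_size=image_width, hidden_size=n_hidden, num_layers=n_layers)

seq_len, batch = 56, 1
x = torch.zeros(seq_len, batch, image_width)   # (sequence length, batch, input size)
h0 = torch.zeros(n_layers, batch, n_hidden)    # (layers * directions, batch, hidden size)

out, hn = rnn(x, h0)
print(out.shape)  # torch.Size([56, 1, 16])
print(hn.shape)   # torch.Size([1, 1, 16])
```

The hidden state has no sequence dimension at all: one hidden vector per layer, direction, and batch element is carried from step to step.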

The input is usually reshaped as follows:

inputs = inputs.view(batch_size*image_height, 1, image_width)
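
To check what this reshape does, here is a small sketch on a toy tensor. It assumes the images are stored as (batch, channels, height, width) with a single channel, which is an assumption on my part and not something the examples state. Under that layout, each slice along the first dimension of the reshaped tensor is one image row of length `image_width`, and the rows of all images in the batch are laid end to end:

```python
import torch

# Tiny hypothetical sizes so the result is easy to inspect.
batch_size, image_height, image_width = 2, 3, 4

# Assumed layout: (batch, channels, height, width) with one channel.
images = torch.arange(batch_size * image_height * image_width, dtype=torch.float32)
images = images.view(batch_size, 1, image_height, image_width)

inputs = images.view(batch_size * image_height, 1, image_width)
print(inputs.shape)  # torch.Size([6, 1, 4])

# Slice 0 is the first row of the first image.
first_row_matches = torch.equal(inputs[0, 0], images[0, 0, 0, :])
# Slice image_height is the first row of the second image.
next_image_matches = torch.equal(inputs[image_height, 0], images[1, 0, 0, :])
print(first_row_matches, next_image_matches)  # True True
```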

In the example above, are the images passed column-wise? Is there another way to represent input images for an RNN? And how does that relate to how one initializes the hidden state?
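
One alternative I can think of, sketched below under my own assumptions (hypothetical sizes, `batch_first=True`, a single-channel (batch, channels, height, width) layout), is to treat each image as its own sequence. Rows or columns then become the time steps, and the hidden state gets one entry per image in the batch:

```python
import torch
import torch.nn as nn

# Hypothetical sizes; MNIST images are 28 x 28.
batch_size, image_height, image_width = 8, 28, 28
n_layers, n_hidden = 1, 32

images = torch.rand(batch_size, 1, image_height, image_width)  # assumed (N, C, H, W)

# With batch_first=True the input is (batch, sequence length, input size).
rnn = nn.RNN(input_size=image_width, hidden_size=n_hidden,
             num_layers=n_layers, batch_first=True)

# Row-wise: each time step is one row of the image.
rows = images.squeeze(1)                      # (batch, height, width)
# Column-wise: each time step is one column of the image.
cols = rows.transpose(1, 2).contiguous()      # (batch, width, height)

# The hidden state is (layers * directions, batch, hidden size) even when batch_first=True.
h0 = torch.zeros(n_layers, batch_size, n_hidden)

out_rows, hn_rows = rnn(rows, h0)
# The same RNN accepts the column-wise input only because height == width here.
out_cols, hn_cols = rnn(cols, h0)

print(out_rows.shape)  # torch.Size([8, 28, 32])
print(hn_rows.shape)   # torch.Size([1, 8, 32])
```

In this layout the final hidden state `hn_rows[-1]` has shape (batch, hidden size) and could be fed to a linear layer for classification, but I am not sure whether that is what the examples intend.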