I came across several examples of classifying MNIST digits using an RNN. What is the reason for initializing the hidden state with sequence_length=1? If you were doing one-step-ahead prediction of video frames, how would you initialize it?
def init_hidden(self, x, device=None):  # x: 4D input tensor (batch size, channels, width, height)
    # initialize the hidden and cell state to zero
    # vectors: (number of layers, sequence length, number of hidden nodes)
    if self.bidirectional:
        h0 = torch.zeros(2 * self.n_layers, 1, self.n_hidden)
    else:
        h0 = torch.zeros(self.n_layers, 1, self.n_hidden)
    if device is not None:
        h0 = h0.to(device)
    self.hidden = h0
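To make the shapes concrete, here is a minimal sketch of how I understand them, assuming `nn.LSTM` with the default `batch_first=False`. All sizes are hypothetical and chosen only for illustration. As far as I can tell from the PyTorch documentation, the middle dimension of `h0` is documented as the batch size rather than the sequence length, which is part of what confuses me about the comment in the code above.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
n_layers, n_hidden, input_size = 2, 16, 28
batch = 1
num_directions = 2  # because bidirectional=True below

rnn = nn.LSTM(input_size, n_hidden, num_layers=n_layers, bidirectional=True)

# Initial hidden and cell state: (n_layers * num_directions, batch, n_hidden)
h0 = torch.zeros(n_layers * num_directions, batch, n_hidden)
c0 = torch.zeros(n_layers * num_directions, batch, n_hidden)

# Input layout with batch_first=False: (seq_len, batch, input_size)
x_long = torch.randn(28, batch, input_size)
x_short = torch.randn(5, batch, input_size)

# The same h0/c0 are accepted for both sequence lengths.
out_long, (hn, cn) = rnn(x_long, (h0, c0))
out_short, _ = rnn(x_short, (h0, c0))

print(out_long.shape)   # (seq_len, batch, num_directions * n_hidden)
print(out_short.shape)
print(hn.shape)         # same shape as h0
```

In this sketch the same `h0` works for sequences of length 28 and 5, so the `1` does not seem to be tied to the sequence length.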
The input is usually represented as:
inputs = inputs.view(batch_size*image_height, 1, image_width)
In the example above, are the images passed column-wise? Is there another way to represent input images for an RNN? And how does that relate to how one initializes the hidden state?
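To check what the `view` call actually does, I tried a small sketch with made-up sizes, where each pixel value encodes its own position. It assumes a contiguous tensor of shape (batch, height, width) and an RNN using the default (seq_len, batch, input_size) layout. The column-wise variant is my own guess at an alternative, not something taken from the examples.

```python
import torch

# Tiny hypothetical sizes so the result is easy to read.
batch_size, image_height, image_width = 2, 3, 4

# Each pixel value encodes its position: 100 * image + 10 * row + column.
imgs = torch.tensor([[[100 * b + 10 * r + c for c in range(image_width)]
                      for r in range(image_height)]
                     for b in range(batch_size)])

# Same reshape as in the question: each index along dim 0 is one image row.
rows = imgs.view(batch_size * image_height, 1, image_width)

# Possible column-wise alternative: swap height and width before flattening.
cols = imgs.transpose(1, 2).reshape(batch_size * image_width, 1, image_height)

print(rows[1, 0].tolist())  # row 1 of image 0
print(cols[1, 0].tolist())  # column 1 of image 0
print(rows[3, 0].tolist())  # row 0 of image 1 follows the last row of image 0
```

If I read this correctly, the `view` feeds one image row per step, and with a middle dimension of 1 the rows of all images in the batch end up chained into a single sequence.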