I am trying to implement a many to many RNN network. I have defined the model and forward prop like below:
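import torch.nn as nn

class Model(nn.Module):  # class name is a placeholder; the original post does not show it
    def __init__(self, output_dim, hid_size=100):
        # output_dim has no default, so it must come before hid_size
        super().__init__()
        self.hidden_size = hid_size
        self.output_dim = output_dim
        self.rnn = nn.LSTM(input_size=4, hidden_size=self.hidden_size, num_layers=1, batch_first=True)
        self.fc1 = nn.Linear(self.hidden_size, self.output_dim)

    def forward(self, inputs):
        rnn_out, _ = self.rnn(inputs)  # rnn_out: (batch_size, seq_len, hidden_size)
        # Map the hidden state of the last time step to output_dim values
        output = self.fc1(rnn_out[:, -1])
        return output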
I can set the value of output_dim when initializing the class, and then the value is fixed. However, in my application the output dimension keeps changing, so I can't fix the value of output_dim beforehand.
My question is, how can one construct a many-to-many RNN with output dimension that is flexible? I am open to changing the architecture of the model to achieve this functionality. Any help is appreciated.
You need to fix the value of output_dim beforehand so that you always use the same output weight matrix (and so that weight updates are informative). The way I would do this is to guess the largest output size you will need (the maximum over the training data, 1.5 times that, …?) and then mask the output for smaller targets. That is, you multiply the last k scalars of your output, the ones you don't need, by 0. When exactly to mask depends on when you know the current output size. Simple example: if you already have a target tensor of a specific size, you could do:
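max_size = big  # upper bound on the output size, chosen beforehand
for x, y in batches:
    target_size = y.shape[-1]
    mask_last_k = max_size - target_size  # number of trailing outputs to ignore
    mask = torch.ones(max_size)
    mask[target_size:] = 0  # zero out the last mask_last_k outputs
    predictions = model(x)  # shape (batch_size, max_size)
    predictions_masked = predictions * mask
    # Pad the target with zeros so it has the same shape as the predictions
    y_padded = torch.nn.functional.pad(y, (0, mask_last_k))
    loss = loss_fn(predictions_masked, y_padded)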
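As a self-contained variant of the masking idea, here is a minimal sketch that assumes a mean-squared-error loss (the thread does not say which loss is used). The function name masked_mse is hypothetical. It averages only over the unmasked entries, so short targets are not down-weighted by the padding.

```python
import torch
import torch.nn.functional as F

def masked_mse(predictions, target, max_size):
    """MSE over the first target.shape[-1] outputs; the remaining ones are ignored.

    predictions: (batch_size, max_size), target: (batch_size, target_size).
    """
    target_size = target.shape[-1]
    mask = torch.zeros(max_size, dtype=predictions.dtype)
    mask[:target_size] = 1.0
    # Zero-pad the target on the right so the shapes match
    target_padded = F.pad(target, (0, max_size - target_size))
    squared_error = (predictions * mask - target_padded) ** 2
    # Normalise by the number of unmasked entries only
    return squared_error.sum() / (mask.sum() * predictions.shape[0])

predictions = torch.ones(2, 5, requires_grad=True)  # stand-in for model(x)
target = torch.zeros(2, 3)                          # only 3 of the 5 outputs are needed
loss = masked_mse(predictions, target, max_size=5)
loss.backward()
```

Because the masked outputs are multiplied by 0, they receive zero gradient, so the unused positions do not influence the weight update for this batch.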
If you actually somehow generate output tensors of different sizes, you can of course also pad the difference up to the required length:
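predictions = model(x)
# Number of missing entries along the last dimension
pad_last_k = y.shape[-1] - predictions.shape[-1]
pad = predictions.new_zeros((*predictions.shape[:-1], pad_last_k))
predictions_padded = torch.cat([predictions, pad], dim=-1)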
If you get autograd errors from creating the pad/mask tensor yourself, create it inside a with torch.no_grad() block.
Hope this helps
Hey @marvosyntactical, thank you so much for your reply!
I already know the maximum and minimum size of the output. What I wonder about is this: RNNs are powerful in that they can be trained with one input sequence length and will still work for shorter or longer input sequences. I was trying to find out whether the same can be said for the output of an RNN or of any other architecture.
For example, if the architecture learns to predict three numbers in a sequence during training, is it possible for the network to predict, say, five numbers during inference?
Oh, then I misunderstood. What you ask is still doable pretty much the way I described: for example, you can save the output of each step in a list, create a tensor from that list, and pad it to a standard length if you need to compute the loss on the entire sequence. In many applications this isn't needed, though. In language models, for example, you usually compute the loss at each step (greedy training), so in that case you don't need to deal with variable sizes at all.
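A minimal sketch of the "collect each step in a list, then pad" idea follows. Everything here is an assumption for illustration: the class name StepwiseDecoder, the LSTM encoder plus LSTMCell layout, the zero start value, and feeding each prediction back in as the next input are not taken from the thread.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepwiseDecoder(nn.Module):
    """Hypothetical model: encodes the input, then emits one scalar per step."""

    def __init__(self, num_inputs=4, hidden_size=100):
        super().__init__()
        self.encoder = nn.LSTM(num_inputs, hidden_size, batch_first=True)
        self.cell = nn.LSTMCell(1, hidden_size)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, inputs, out_len):
        _, (h, c) = self.encoder(inputs)       # h, c: (1, batch_size, hidden_size)
        h, c = h[0], c[0]
        step_input = inputs.new_zeros(inputs.shape[0], 1)  # assumed start value
        outputs = []
        for _ in range(out_len):
            h, c = self.cell(step_input, (h, c))
            step_input = self.head(h)          # (batch_size, 1), fed back next step
            outputs.append(step_input)
        return torch.cat(outputs, dim=-1)      # (batch_size, out_len)

model = StepwiseDecoder()
x = torch.randn(8, 10, 4)                      # (batch_size, seq_len, num_inputs)
short = model(x, out_len=3)                    # e.g. the length used in training
long = model(x, out_len=5)                     # a longer rollout at inference
max_len = 6
short_padded = F.pad(short, (0, max_len - short.shape[-1]))  # pad to a standard length
```

Since the same weights are reused at every step, the output length is just the number of loop iterations. Whether the predictions stay accurate beyond the lengths seen in training depends on the task.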
@marvosyntactical perfect, this is what I came up with before you wrote this answer. I have one question, though. Suppose my input has size (batch_size, seq_len, num_inputs) and I need an output of length len, that is, the expected output size is (batch_size, len). I run a loop len times and in each iteration roll my input to the RNN (the same input would produce the same output, so I had to do something to the input). Do you recommend doing it this way? Or is there a better way to change the inputs to get the required sequence length?
Thanks a lot for your time!
So if I understand correctly, seq_len does not necessarily vary, but len can vary and you know it ahead of time. I think it depends on the task: is there some kind of termination criterion in your task? In language, for example, we predict discrete tokens and have a special termination token which, when generated, ends the sequence. We still use a maximum sequence length as I described above, and pad the predictions up to that maximum once the limit given by your 'len' or the termination token has been reached.
I would recommend looking at MAX_LENGTH in the PyTorch tutorial "NLP From Scratch: Translation with a Sequence to Sequence Network and Attention" (PyTorch Tutorials 1.7.1 documentation).
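The combination of a termination criterion and a maximum length can be sketched as a generic loop. This is not the tutorial's code: the helper names generate, step_fn and stop_fn, the value MAX_LENGTH = 10, and the toy step function are all assumptions made for illustration.

```python
import torch

MAX_LENGTH = 10  # hard cap on the number of generated steps (assumed value)

def generate(step_fn, first_input, state, stop_fn, max_length=MAX_LENGTH):
    """Call step_fn until stop_fn fires or max_length outputs were produced."""
    outputs = []
    step_input = first_input
    for _ in range(max_length):
        step_output, state = step_fn(step_input, state)
        outputs.append(step_output)
        if stop_fn(step_output):   # task-specific termination criterion
            break
        step_input = step_output   # feed the prediction back in
    return torch.stack(outputs)

# Toy step function standing in for one RNN step: it doubles its input.
def toy_step(value, state):
    return value * 2, state

# Stops as soon as an output exceeds 20: yields 2, 4, 8, 16, 32
sequence = generate(toy_step, torch.tensor(1.0), None, lambda out: out.item() > 20)

# With a criterion that never fires, the loop ends at MAX_LENGTH steps
capped = generate(toy_step, torch.tensor(1.0), None, lambda out: False)
```

In a real model, step_fn would wrap one call to the recurrent cell and output layer, and stop_fn would check for the termination token or for the known target length.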