Hi,
I have a related problem if you don’t mind…
I want to process a sequence of tokens with an LSTM by hand, meaning that I go through the sequence with a for-loop instead of giving the whole sequence to the LSTM and letting it process everything at once.
The reason for doing this is not important (but I can tell it if you are curious).
Also, I want to use character features for each token.
So, in my network I have a first LSTM (call it charLSTM) that computes character-level representations for tokens.
Each such representation is the hidden state of charLSTM once it has processed the whole token.
I save character-level representations in a Variable which looks like:
char_rep = autograd.Variable( torch.zeros(sequence_length, batch_size, character_features) )
I fill this variable in a for-loop which looks like:
for i in range(sequence_length):
    char_features = init_char_features()
    lstm_out, char_features = charLSTM(char_input, char_features)  # char_input goes through the whole token
    char_rep[i,:,:] = char_features
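To make the shapes concrete, here is a minimal self-contained sketch of that first loop (the dimensions, the char_input construction and the body of init_char_features are just placeholders, not my real code; in my real code too, what I store is the last hidden state h_n of charLSTM):

import torch
import torch.nn as nn
from torch import autograd

token_length, sequence_length, batch_size = 12, 30, 16
char_emb_dim, character_features = 25, 50

charLSTM = nn.LSTM(char_emb_dim, character_features)

def init_char_features():
    # fresh (h0, c0) for every token; first dim is num_layers * num_directions
    return (autograd.Variable(torch.zeros(1, batch_size, character_features)),
            autograd.Variable(torch.zeros(1, batch_size, character_features)))

char_rep = autograd.Variable(torch.zeros(sequence_length, batch_size, character_features))
for i in range(sequence_length):
    char_features = init_char_features()
    # char_input: all the characters of the i-th token, for the whole batch
    char_input = autograd.Variable(torch.randn(token_length, batch_size, char_emb_dim))
    lstm_out, char_features = charLSTM(char_input, char_features)
    # keep h_n as the character-level representation of the i-th token
    char_rep[i, :, :] = char_features[0].squeeze(0)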
So now I have character features, and I can also compute token features as embeddings, and then process the sequence using both kinds of features.
These two features are given as input to a second LSTM. The hidden state of this second LSTM is then used to compute the final output of the network.
So, I save the hidden state of the LSTM in a similar way as I do for the character-level representations:
hidden_state = autograd.Variable( torch.zeros(sequence_length, batch_size, hidden_features) )
And I fill this variable in a similar way:
for i in range(sequence_length):
    lstm_input = torch.cat( [ word_embeddings[i,:,:], char_rep[i,:,:] ] )
    lstm_out, hidden_features = tokenLSTM(lstm_input, hidden_features)  # process one sequence position, but for the whole batch
    hidden_state[i,:,:] = hidden_features
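And again as a self-contained sketch with placeholder dimensions, continuing the one above, this second loop is roughly the following (I concatenate along the feature dimension and keep h_n at each position):

word_emb_dim, hidden_features_dim = 100, 200
tokenLSTM = nn.LSTM(word_emb_dim + character_features, hidden_features_dim)

word_embeddings = autograd.Variable(torch.randn(sequence_length, batch_size, word_emb_dim))
hidden_state = autograd.Variable(torch.zeros(sequence_length, batch_size, hidden_features_dim))

token_hidden = (autograd.Variable(torch.zeros(1, batch_size, hidden_features_dim)),
                autograd.Variable(torch.zeros(1, batch_size, hidden_features_dim)))
for i in range(sequence_length):
    # word and character features side by side, plus the length-1 "sequence" dim that nn.LSTM expects
    lstm_input = torch.cat([word_embeddings[i, :, :], char_rep[i, :, :]], dim=1).unsqueeze(0)
    lstm_out, token_hidden = tokenLSTM(lstm_input, token_hidden)
    # keep h_n for this position
    hidden_state[i, :, :] = token_hidden[0].squeeze(0)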
After the forward step of the network, I call backward on the whole sequence, actually on the whole batch of sequences. I process batch_size sequences at the same time.
I have actually 2 questions:
- Should I also detach some hidden state at some point?
- Does filling Variables the way I do, e.g. char_rep or hidden_state, break the computation graph?
I mean, I know there are issues with in-place Variable operations, that’s why I’m asking this question.
My guess for the 1st question is no. However, I'm experiencing a huge memory utilisation, much more than what I expected, and also more than what I (roughly) computed it should be.
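Just to be explicit about what I mean by detaching: something like the line below (continuing the sketch above), which I am currently not doing anywhere:

# repackage the hidden state so that backward stops here
token_hidden = (token_hidden[0].detach(), token_hidden[1].detach())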
My guess for the 2nd question is also no. However, the results are not convincing to me. I'm trying to replicate networks I have already coded in the past with other frameworks, and I would like to move to PyTorch because I think it would be much faster. But at the moment it isn't faster, and the results are actually much worse than what I got with the other framework. The latter, though, may be due to my own mistakes in coding the network.
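For what it's worth, the only alternative to the in-place filling that I can think of is collecting the per-position states in a Python list and stacking them at the end, e.g. (continuing the sketch above), but I don't know whether this is actually necessary:

outputs = []
for i in range(sequence_length):
    lstm_input = torch.cat([word_embeddings[i, :, :], char_rep[i, :, :]], dim=1).unsqueeze(0)
    lstm_out, token_hidden = tokenLSTM(lstm_input, token_hidden)
    outputs.append(token_hidden[0].squeeze(0))
hidden_state = torch.stack(outputs, dim=0)  # (sequence_length, batch_size, hidden_features)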
Any answer would be appreciated.
Thank you in advance.