Iteratively passing elements of different sequences through LSTM

Hi,

My goal is to use an LSTM in a dataloader. I am filling up a volume in the dataloader with LSTM states. At some iteration, I take a feature vector and assign it to a position in this space. Then I use the h_n and c_n I have assigned to this space (from a previous iteration) as inputs to my LSTM, in addition to the feature vector. I save the output in the volume as well, and additionally I save the new h_n and c_n in the volume. After this process is complete, I return the final outputs of the LSTM.
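
Here is a rough sketch of what I mean; the module sizes, shapes, and the `step` helper are all made up for illustration:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32)  # hypothetical sizes

# One slot per position in the volume: (output, h_n, c_n).
volume = {}

def step(pos, feature):  # feature: (seq_len=1, batch=1, input_size)
    if pos in volume:
        _, h, c = volume[pos]
        out, (h_n, c_n) = lstm(feature, (h, c))  # reuse the stored states
    else:
        out, (h_n, c_n) = lstm(feature)          # first visit: default states
    volume[pos] = (out, h_n, c_n)                # grad_fn is kept as long as nothing is detached
    return out
```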

After loading from the dataloader, I use these outputs as inputs to another NN to make a prediction, and finally I want to backpropagate to update both the prediction NN's and the LSTM's parameters.

So of course when I am saving the LSTM outputs in a volume, I need to keep their grad_fn. However, must I also save the grad_fn info of h_n and c_n when saving them in my volume and using them as inputs to the LSTM in the upcoming iteration?

I’m not sure why you are moving the LSTM execution to the DataLoader, as it would complicate the state assignments. Wouldn’t it work to use the DataLoader in the common way to yield the samples, and to use the LSTM as well as the other model inside the DataLoader loop?
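
I.e. something like this, where all module and variable names are just placeholders:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=16, hidden_size=32)   # placeholder sizes
net = nn.Linear(32, 1)                          # stands in for the other model
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(net.parameters()))

for features, target in loader:      # loader: your usual DataLoader
    out, _ = lstm(features)          # the LSTM runs inside the training loop
    pred = net(out[-1])              # the other model consumes the summary
    loss = criterion(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```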

Hi, thanks for your response. I am sure I can avoid using a DataLoader, but it makes my framework simpler (unless it really does “complicate the state assignments”). Everything in the DataLoader is part of the preprocessing of the data, and that includes the LSTM. The task is simple: for some point in space, I accumulate information temporally in the form of vectors. The LSTM’s task is to give me one feature vector summarizing this sequence. The result is my training data, ready for my actual net responsible for making the predictions. The two main reasons why I want my framework to work as described are:

  1. DataLoader makes it convenient to parallelize the preprocessing, and that includes the LSTM.
  2. As this will be computationally intensive, I do not want to load all relevant vectors of the sequences into memory; instead I want to do this incrementally.

I believe I have found the answer to my original question. I had not realized that what I was looking for was called a “stateful LSTM”, and once I realized this, it was easy to search for examples of how to do it. However, you mentioned that there are difficulties in executing the LSTM in the DataLoader. Is there a way to do this?

I’m sure there is a way to achieve what you want by writing a custom dataset, sampler, etc.
The main complication I see is that the common workflow of using indices to load and process a data sample wouldn’t work anymore, as your LSTM would need to get the previous states as well.
I.e., assuming you want to use multiple workers in the DataLoader, I guess you would want to track and provide the last LSTM state in each copy of the dataset instance, or are you working on a way to use a “global” state and share it between the workers?
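
The per-worker approach could look something like this rough sketch (all names are made up):

```python
import torch
from torch.utils.data import Dataset

class StatefulDataset(Dataset):
    # Each worker holds its own copy of the dataset, so self.state is a
    # per-worker state rather than a global one shared between workers.
    def __init__(self, files, lstm, hidden_size):
        self.files = files
        self.lstm = lstm
        self.hidden_size = hidden_size
        self.state = None  # lazily created, so every worker starts fresh

    def __getitem__(self, idx):
        x = torch.load(self.files[idx])  # (seq_len, batch=1, input_size)
        if self.state is None:
            zeros = torch.zeros(1, 1, self.hidden_size)
            self.state = (zeros, zeros.clone())
        out, self.state = self.lstm(x, self.state)  # carry the states forward
        return out[-1]

    def __len__(self):
        return len(self.files)
```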

Thanks for your reply! Yes, it is important for me to be able to use multiple workers in the DataLoader. I am not sure I see your point. I will write out what I think is happening so that you can tell me where I am going wrong.

What I imagined:

  1. The DataLoader workers with the LSTM load data [data1, …, datan], and these tensors would then carry the LSTM autograd states [LSTM1, …, LSTMn].
  2. DataLoader returns [(data1 w/ LSTM1), …, (datan w/ LSTMn)], where (datai w/ LSTMi) just means LSTMi states are stored in the datai tensor.
  3. Then through SomeNet: [(data1 w/ LSTM1), …, (datan w/ LSTMn)] → [(data1 w/ [LSTM1, SomeNet1]), …, (datan w/ [LSTMn, SomeNetn])]

So if this works, the optimizer optimizing over [LSTM.parameters(), SomeNet.parameters()] should have access to the full history of autograd states. Considering the data generated by each worker gets passed independently through SomeNet, it should be okay that the autograd states of datai are independent of those of dataj. Which step/assumption is incorrect?

Perhaps there was a misunderstanding? When I talked about passing sequences incrementally through an LSTM, what I meant was that, within one worker, I am loading data incrementally and thus the LSTM must be stateful; I didn’t mean that multiple workers work on different sections of the same sequence.

Looking forward to your reply.

Your general workflow might work, but:

The DataLoader workers with the LSTM load data [data1, …, datan], and these tensors would then carry the LSTM autograd states [LSTM1, …, LSTMn].

An LSTM expects the data input as well as the cell and hidden states. If you are not passing the latter, they will be initialized with their default (zero) values, and this wouldn’t match “and thus the LSTM must be stateful”.
So how are you passing the cell and hidden state from the previous iteration to the current forward pass of the LSTM inside each worker?
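
For reference, the difference looks like this (toy sizes):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8)
x = torch.randn(5, 1, 4)  # (seq_len, batch, input_size)

# Without passing states, h_0 and c_0 are zero-initialized internally,
# so nothing is carried over from a previous call.
out, (h_n, c_n) = lstm(x)

# Stateful usage: explicitly feed the previous states into the next call.
out, (h_n, c_n) = lstm(x, (h_n, c_n))
```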

Right, so here is what I am doing:

I made a custom dataset for my DataLoader; inside the load_file function of this dataset, I incrementally load batches of a sequence (via another DataLoader w/o shuffle). For example, when the outer DataLoader would be loading data-i, inside the dataset I load [batch-i.1, …, batch-i.m] in a loop. Furthermore, inside the dataset I also initialize two tensors of zeros, b and c. The code for the loop looks as follows:

```python
# batches: the incrementally loaded [batch-i.1, ..., batch-i.m]
# b, c: the two zero tensors created in the dataset (hidden and cell state)
for j in range(m - 1):                    # batch-i.1 ... batch-i.(m-1)
    _, (b, c) = lstm(batches[j], (b, c))  # feed the states back in each step
data_i, _ = lstm(batches[m - 1], (b, c))  # keep the output of the last batch
```

So hidden states get passed iteratively back into the LSTM and all of this is handled by one worker. What do you think?

About this loop, however, I did see the post Training Stateful LSTM in Pytorch cause runtime error, where the accepted answer says “I think you need to detach both hiddens because the hiddens that are output from the LSTM will require grad.” Wouldn’t you say this information is counterintuitive? If we are removing the gradient states from b and c from previous iterations using .detach(), then how do the gradient states get stored throughout the entire sequence?
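
For concreteness, this is the pattern I understand the accepted answer to suggest (a sketch with placeholder names and sizes):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=4, hidden_size=8)  # placeholder sizes
criterion = nn.MSELoss()
b = torch.zeros(1, 1, 8)
c = torch.zeros(1, 1, 8)

for batch, target in stream:            # stream: the incremental batches
    out, (b, c) = lstm(batch, (b, c))
    loss = criterion(out[-1], target)
    loss.backward()
    b, c = b.detach(), c.detach()       # drop the states' autograd history before reuse
```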