I have minibatches in which each sequence is the record of one particular student solving exercises. This record is one-hot encoded. I have 124 exercises, and I want to know at any given step of the sequence whether a student has mastered them. So each sequence has 248 columns ("answer is true" or "answer is false" per exercise) and as many rows (i.e., time steps) as the student has done exercises over the entire observation period. This can range from 1 to over 1000.
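To make the encoding concrete, here is a minimal sketch of how one time step could be built. The column layout (first 124 columns for correct answers, last 124 for incorrect ones) and the helper name `one_hot_step` are assumptions for illustration only; the layout in the original repo may differ.

```python
NUM_EXERCISES = 124  # number of distinct exercises in the dataset


def one_hot_step(exercise_id, correct, num_exercises=NUM_EXERCISES):
    """Encode one (exercise, outcome) pair as a row of 2 * num_exercises columns.

    Assumed layout (hypothetical): columns [0, num_exercises) mean
    "answer is true", columns [num_exercises, 2 * num_exercises) mean
    "answer is false".
    """
    row = [0] * (2 * num_exercises)
    offset = 0 if correct else num_exercises
    row[offset + exercise_id] = 1
    return row


# A student's sequence is then one such row per attempted exercise.
sequence = [one_hot_step(3, True), one_hot_step(3, False), one_hot_step(57, True)]
```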
I feed, e.g., 5 of these sequences into the LSTM as one minibatch. So I padded the sequences and got the code to at least formally run like that.
I am basically trying to rewrite old source code that was written using Keras and Theano: dkt/dkt.py at master · mmkhajah/dkt · GitHub (I am also using the dataset provided in the repo). There is one additional thing they do there that I am unable to reproduce. They split their already-padded minibatches along the time axis and use only one selection of sequence timesteps (a time window) in every training step. E.g., if you have a minibatch with actual sequence lengths [1210, 342, 26, 17, 2] and a time window of 100, they pad to the smallest multiple of 100 greater than max(lengths), i.e., every sequence is padded to 1300 timesteps. They then train on the first 100 steps of each of the 5 sequences of the minibatch, then on steps 100 to 200, … until they reach 1300, at which point they start training on a new minibatch.
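The padding and windowing arithmetic can be sketched in plain Python. The helper names `padded_length` and `window_bounds` are made up for this post, and I am assuming that a maximum length that is already an exact multiple of the window needs no extra padding.

```python
import math


def padded_length(lengths, window):
    """Smallest multiple of `window` that covers the longest sequence.

    Assumption: an exact multiple (e.g. max length 1200) is left as is.
    """
    return math.ceil(max(lengths) / window) * window


def window_bounds(total_len, window):
    """(start, end) index pairs of consecutive windows over the time axis."""
    return [(start, start + window) for start in range(0, total_len, window)]


lengths = [1210, 342, 26, 17, 2]
window = 100
total = padded_length(lengths, window)  # 1300 for this example
bounds = window_bounds(total, window)   # (0, 100), (100, 200), ..., (1200, 1300)
```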
I know how to slice the minibatch to get these windows. My problem is how to pack the 100-step sub-sequences for the LSTM, more specifically which lengths to pass to pack_padded_sequence. The most natural approach seemed to be to calculate, for each window, how many non-padded steps each sequence has in that slice. Taking the example above, at window 1100:1200 the lengths would be (100, 0, 0, 0, 0), and at window 1200:1300 they would be (10, 0, 0, 0, 0).
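The per-window length calculation I have in mind looks roughly like this (a sketch; `lengths_in_window` is a hypothetical helper name, not something from the repo):

```python
def lengths_in_window(lengths, start, end):
    """Number of non-padded steps each sequence has inside [start, end)."""
    return [max(0, min(length, end) - start) for length in lengths]


lengths = [1210, 342, 26, 17, 2]
first = lengths_in_window(lengths, 0, 100)          # [100, 100, 26, 17, 2]
middle = lengths_in_window(lengths, 300, 400)       # [100, 42, 0, 0, 0]
late = lengths_in_window(lengths, 1100, 1200)       # [100, 0, 0, 0, 0]
last = lengths_in_window(lengths, 1200, 1300)       # [10, 0, 0, 0, 0]
```

Already in the window 300:400, three of the five sequences contain nothing but padding.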
So the problem is that at some point I will have sequences that are technically of length 0 because only padding is left in the window, and pack_padded_sequence does not accept sequences of length 0. My naive workaround, keeping the lengths the same over all slices even though I am truncating the sequences those lengths refer to, didn't work either and gave me a "Start (u) + length (v) exceeds dimension size (u+v)" error. How can I best deal with this situation, i.e., how can I feed the network the minibatch slice by slice while keeping the same 5 students per minibatch?
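One idea I can sketch, without being sure it is the idiomatic way: in each window, pack only the rows that still have real steps, and carry the hidden and cell state of every student across windows. The function name `run_windows` is hypothetical, and the `.detach()` reflects my assumption that gradients should not flow across windows (since each window is its own training step); I have not checked that this matches the Keras code exactly.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence


def run_windows(lstm, batch, lengths, window):
    """Run a batch_first LSTM over `batch` window by window.

    batch:   (B, T, F) padded tensor, T a multiple of `window`
    lengths: true (unpadded) length of each of the B sequences
    Returns a (B, T, hidden_size) tensor; padded positions stay zero.
    """
    B, T, _ = batch.shape
    lengths = torch.as_tensor(lengths)  # must live on the CPU for packing
    h = torch.zeros(lstm.num_layers, B, lstm.hidden_size)
    c = torch.zeros(lstm.num_layers, B, lstm.hidden_size)
    outputs = torch.zeros(B, T, lstm.hidden_size)

    for start in range(0, T, window):
        # Non-padded steps per sequence inside this window.
        win_len = (lengths - start).clamp(min=0, max=window)
        active = torch.nonzero(win_len > 0).squeeze(1)
        if active.numel() == 0:
            break  # only padding is left for every sequence

        packed = pack_padded_sequence(
            batch[active, start:start + window], win_len[active],
            batch_first=True, enforce_sorted=False)
        state = (h[:, active].contiguous(), c[:, active].contiguous())
        out, (h_act, c_act) = lstm(packed, state)
        out, _ = pad_packed_sequence(out, batch_first=True, total_length=window)

        outputs[active, start:start + window] = out
        # Assumption: no gradient flow across windows; drop .detach() otherwise.
        h[:, active] = h_act.detach()
        c[:, active] = c_act.detach()
    return outputs
```

Because the state is carried over, the forward outputs should match running the LSTM once over the full packed sequences; only the gradient flow differs. Whether a training step should be taken per window, as in the original code, is a separate choice.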
I hope that this is understandable. I am only starting out with PyTorch, and I struggled a lot already to generally understand the five-year-old Keras code. Maybe there are more PyTorchian ways to achieve the same thing.