I am working on a time-series dataset. There are 13 time series: the first 10 are input features and the last 3 are ground-truth targets that the model needs to learn to predict. I am using a mini-batch size of 1024 and a window size of 200, so the dataloader returns mini-batches of shape [1024, 200, 13].
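For concreteness, the windowing is roughly like this (just a sketch; `WindowDataset`, `data`, and the exact column split are placeholders matching the shapes above, not my actual code):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class WindowDataset(Dataset):
    """Sliding windows over a [T, 13] series: columns 0-9 are features, 10-12 are targets."""
    def __init__(self, data, window_size=200):
        self.data = torch.as_tensor(data, dtype=torch.float32)  # [T, 13]
        self.window_size = window_size

    def __len__(self):
        return self.data.shape[0] - self.window_size + 1

    def __getitem__(self, idx):
        return self.data[idx:idx + self.window_size]             # [200, 13]

# loader = DataLoader(WindowDataset(data), batch_size=1024, shuffle=True)
# each batch: [1024, 200, 13]
```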
Now I have a new requirement. During inference, I may not get ground-truth readings for the targets. So I want to train the model on its own past predictions instead of the ground-truth values for past time steps, so that the model learns to work even when there is no ground-truth reading for the targets.
So instead of mini-batching, I could train on one window at a time: do a forward and backward pass, take the next window, replace its last sample's Y with the previous forward pass's prediction, do another forward and backward pass, and so on (sketched below). But I feel that training on a single window at a time will make the model difficult to converge. It will also take excessively long, since it will not utilize all the GPU cores in parallel.
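A rough sketch of that per-window loop (assuming the model maps a [1, 200, 13] window to a [1, 3] prediction; `model`, `criterion`, `optimizer`, `consecutive_windows`, and `target_for` are placeholders, not actual code):

```python
prev_pred = None                                 # prediction from the previous window, [1, 3]

for window in consecutive_windows:               # each window: [1, 200, 13]
    window = window.clone()
    if prev_pred is not None:
        # overwrite the last time step's 3 target columns with the previous prediction
        window[:, -1, 10:] = prev_pred

    pred = model(window)                         # [1, 3]
    loss = criterion(pred, target_for(window))   # target_for: hypothetical label lookup

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    prev_pred = pred.detach()
```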
However, I am unable to figure out how to do mini-batching with this.
First, I need mini-batches in sequence, so that the previous window's prediction can be fed into the current window; therefore I cannot shuffle while creating mini-batches. (That's why I have not shuffled in the tabular image.)
Now consider that I have processed minibatch 1's window 1. Its predictions are to be used for the next window, which turns out to be minibatch 1's window 2. But we process a whole mini-batch in one go, i.e. the forward and backward passes of all windows in minibatch 1 are done in parallel on the GPU. So I cannot create mini-batches as shown in the image. What I thought instead is this: I will divide the whole dataset into 1024 parts (1024 being the batch size), then create a mini-batch by picking one element from each of these parts successively. So new-minibatch-1 will contain [minibatch]-1-window-1, [minibatch]-2-window-1, and so on ([minibatch] in square brackets refers to the mini-batches displayed in the tabular image). Once I complete new-minibatch-1 (containing window 1 of all [minibatches]), I will use its predictions to replace the last sample's three target values in new-minibatch-2, which will contain window 2 of all [minibatches].
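The training loop over these re-ordered batches would then carry predictions from one new-minibatch into the next, roughly like this (a sketch, assuming `aligned_loader` yields batches in the new-minibatch order above together with their targets, and that the 3 target columns are the last ones):

```python
prev_preds = None                            # predictions from the previous aligned batch, [1024, 3]

for windows, targets in aligned_loader:      # windows: [1024, 200, 13]; slot i continues slot i of the previous batch
    windows = windows.clone()
    if prev_preds is not None:
        # overwrite each window's last time step's 3 target columns
        # with the prediction made for the same slot in the previous batch
        windows[:, -1, 10:] = prev_preds

    preds = model(windows)                   # [1024, 3]
    loss = criterion(preds, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    prev_preds = preds.detach()              # fed into the next aligned batch
```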
There are some challenges with this approach too:
- How can I implement it in PyTorch? Do I have to write a custom DataLoader sampler? (See the sketch after this list.)
- What if the last part has fewer than 1024 elements? I guess in that case I just won't process the last new-minibatch, right?
- This dataset is made up of several sessions of operation of a machine. Different sessions contain different numbers of samples: some may have a few hundred, others several thousand. Predictions made on a window from one session should not be used in windows from another session. I believe I cannot handle this constraint with the approach described above, right?
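Regarding the first bullet, my guess is that a custom batch sampler passed to DataLoader via `batch_sampler=` could build these position-aligned batches and also keep sessions separate. Here is a rough sketch (`session_lengths`, the number of windows per session in dataset order, is something I would have to compute myself; incomplete trailing batches are simply dropped):

```python
from torch.utils.data import DataLoader

class AlignedBatchSampler:
    """Batch t holds the t-th window of each of `batch_size` contiguous streams,
    so predictions from batch t can be carried into batch t+1 slot by slot."""
    def __init__(self, session_lengths, batch_size=1024):
        streams, start = [], 0
        for n in session_lengths:                          # one stream per session here;
            streams.append(list(range(start, start + n)))  # long sessions could be split further
            start += n
        # distribute the streams round-robin over `batch_size` slots
        self.slots = [[] for _ in range(batch_size)]
        for i, s in enumerate(streams):
            self.slots[i % batch_size].extend(s)
        self.batch_size = batch_size

    def __iter__(self):
        t = 0
        while True:
            batch = [slot[t] for slot in self.slots if t < len(slot)]
            if len(batch) < self.batch_size:               # drop incomplete trailing batches
                return
            yield batch
            t += 1

    def __len__(self):
        return min(len(slot) for slot in self.slots)

# loader = DataLoader(dataset, batch_sampler=AlignedBatchSampler(session_lengths))
```

Where a slot concatenates windows from more than one session, the training loop would still have to detect the session boundary (e.g. by also looking up the session id of each index) and skip the prediction carry-over at that point, so this only partially answers the third bullet.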
I have thought of another approach: the mini-batches will be formed by shuffling windows. The dataset will also return the window index and a flag saying whether the window is the starting window of a session. Once a prediction is made, I will store it in a map with the window index as the key. When a window comes out of the dataloader, I will check whether it is the starting window of a session. If not, I will check whether the prediction for the window at the previous index is available in the map. If it is, I will use it to replace the ground truth of the current window's last sample; if not, I will fall back to the ground truth. The only issue with this approach is that many windows may not have the previous window's prediction available, since the previous window may not have been processed yet.
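Roughly, the training loop for this variant would look like this (again just a sketch; it assumes `__getitem__` also returns the window index and an is-session-start flag, and `targets_for` is a hypothetical label lookup):

```python
pred_cache = {}                                     # window index -> detached prediction, [3]

for windows, idxs, is_session_start in loader:      # windows: [B, 200, 13], shuffled order
    windows = windows.clone()
    for i in range(windows.shape[0]):
        prev_idx = int(idxs[i]) - 1
        if not is_session_start[i] and prev_idx in pred_cache:
            # previous window already processed: use its prediction instead of ground truth
            windows[i, -1, 10:] = pred_cache[prev_idx]

    preds = model(windows)                          # [B, 3]
    loss = criterion(preds, targets_for(idxs))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    for i in range(windows.shape[0]):
        pred_cache[int(idxs[i])] = preds[i].detach()
```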
Looking at the above options, I feel the last approach (with shuffling and the window map) is the most feasible, right?
I know all this sounds a bit complex, but what other options do I have?