I am working on time series forecasting using a GPT-like model. The idea is similar to training a language model: given the beginning of a sentence, the model generates the rest of it.
“I’m hungry, I want a hamburger and a cup of cola.”
Input: I’m hungry
Predict: I want a hamburger and a cup of cola.
An autoregressive language model will generate words step by step.
I’m hungry, I
I’m hungry, I want
I’m hungry, I want a
I’m hungry, I want a hamburger and a cup of cola.
That is, each newly generated word is appended to the end of the previous input sequence to form the new input sequence. During training, I compute the loss on the generated content “I want a hamburger and a cup of cola” and use back-propagation to update the model parameters. The generation process is implemented with a for-loop around a “decoder-only” module.
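Here is a minimal sketch of the loop I mean. The model (`TinyDecoder`), the window size, and the sequence lengths are placeholders, not my actual code; my real model is a decoder-only Transformer, but a linear layer keeps the sketch runnable:

```python
import torch
import torch.nn as nn

# Placeholder for the decoder-only module: predicts the next value
# from the last `window` elements of the sequence.
class TinyDecoder(nn.Module):
    def __init__(self, window=8):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(window, 1)

    def forward(self, seq):
        # seq: (batch, length) -> next-step prediction (batch, 1)
        return self.proj(seq[:, -self.window:])

model = TinyDecoder()
seq = torch.randn(4, 8)       # observed prefix (batch, prefix_len)
target = torch.randn(4, 100)  # the ~100 future elements to predict

preds = []
for _ in range(100):                     # ~100 autoregressive steps
    nxt = model(seq)                     # predict the next element
    preds.append(nxt)
    seq = torch.cat([seq, nxt], dim=1)   # append prediction to the input

preds = torch.cat(preds, dim=1)          # (batch, 100)
loss = nn.functional.mse_loss(preds, target)
loss.backward()  # the autograd graph spans all 100 steps
```

Because every step stays in the autograd graph, the activations of all 100 forward passes are kept alive until `backward()` runs.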
However, GPU memory usage always spikes inside this for-loop and causes an out-of-GPU-memory error. If I decorate the generation function with “@torch.no_grad()”, the problem disappears, so I suspect it is caused by the intermediate activations stored for back-propagation.
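For comparison, this is the inference-only variant I tried, where memory stays flat (again with a placeholder linear model standing in for the decoder-only module):

```python
import torch
import torch.nn as nn

# Placeholder stand-in for the decoder-only module.
model = nn.Linear(8, 1)

@torch.no_grad()  # no autograd graph is recorded, so memory stays flat
def generate(seq, steps=100, window=8):
    for _ in range(steps):
        nxt = model(seq[:, -window:])       # predict the next element
        seq = torch.cat([seq, nxt], dim=1)  # append it to the input
    return seq

out = generate(torch.randn(4, 8))
print(out.shape)          # torch.Size([4, 108])
print(out.requires_grad)  # False: no activations kept for backward
```

Of course, with `@torch.no_grad()` I cannot back-propagate through the generated sequence, which is exactly what training needs.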
Do you think my implementation is the right way to generate word sequences? Do you have any suggestions for optimizing my implementation?
My time series forecasting sequence contains around 100 elements, so the for-loop repeats about 100 times.