I’m training a time series transformer on 60GB of data. Previously, when my model wasn’t improving after a few epochs, I would stop it, make some tweaks to the model, data, or hyperparameters, and re-run it. However, that was when I was training on much less data. Now that I’m training on a much larger dataset, I can’t tell whether the model isn’t learning because it lacks capacity, or simply because it needs more time to go through the data. For context, it takes my model about a week to complete one epoch, which is a long time to leave a broken model running. I unfortunately don’t have access to more computing power, and I have already parallelized and increased the batch size to my system’s limits.
When I went through previous training runs, there were instances where the loss would plateau at the beginning and then start dropping after a few hours of training. My concern with stopping a run before it has completed one epoch is that I may be killing a good model too soon. On the other hand, if my model has been running for a few days with no real improvement, is that always indicative of a poor model? Any ideas would be much appreciated.
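For reference, the kind of plateau check I’ve been eyeballing by hand could be sketched like this (a minimal sketch, assuming I log per-step training losses to a list; the window size and improvement threshold are made-up numbers, not values I’ve validated):

```python
import statistics

def loss_plateaued(losses, window=500, rel_improvement=0.01):
    """Return True if the mean loss over the most recent `window` steps
    improved by less than `rel_improvement` (relative) versus the
    window of steps just before it.

    `losses` is a hypothetical list of per-step training losses;
    `window` and `rel_improvement` are illustrative defaults only.
    """
    if len(losses) < 2 * window:
        return False  # not enough history to judge yet
    prev = statistics.fmean(losses[-2 * window:-window])
    recent = statistics.fmean(losses[-window:])
    # Relative improvement of the recent window over the previous one.
    return (prev - recent) / prev < rel_improvement
```

Even with a check like this, my worry from earlier runs stands: a flat window early on doesn’t distinguish "needs more time" from "will never learn", so I’d only use it as a signal to look closer, not as an automatic kill switch.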