I used to put a lot of effort into training networks carefully. However, after talking to colleagues who train networks to beat benchmarks in academia, it sounds like a lot of what I did may be redundant given modern optimizers and what we now know about deep learning theory. I am hoping for a brief discussion. Here are the things I used to do which might not be necessary; they were inspired by traditional ML classes rather than deep-learning-specific ones:
1. Using a train-val-test split instead of just train-test: during training I used an additional validation split to check whether training was converging or overfitting. I have heard from colleagues that they simply use the running loss on the training set to check for convergence (and don't need to check for overfitting). A sketch of this check appears after the note on patience below.
2. Using learning rate schedulers: my colleagues no longer do this, because optimizers such as Adam have their own per-parameter learning rate adaptation. I know this to be true, but I still used a scheduler because it seemed like general good practice (minimal example after the list).
3. Changing training at a validation plateau (a combined sketch of both variants follows this list):
    3.1. I used ReduceLROnPlateau from PyTorch: when the validation loss stops improving for a number of epochs, the learning rate is reduced.
    3.2. I saved checkpoints until reaching a plateau, reset to the best checkpoint when the validation loss increased, then decreased the learning rate and continued.
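For item 2, this is roughly what I mean by a scheduler on top of Adam. A minimal sketch: the model, learning rate, and the choice of cosine annealing are placeholders for illustration, not my actual setup.

```python
import torch

# Placeholder model; any nn.Module would do here.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# An explicit schedule layered on top of Adam's per-parameter adaptation.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one training epoch over the data goes here ...
    scheduler.step()  # decay the base learning rate once per epoch
```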
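And a minimal sketch of item 3, combining both variants: ReduceLROnPlateau handles 3.1, and the rewind to the best checkpoint (3.2) happens whenever the scheduler actually cuts the learning rate. The linear model and random data are stand-ins so the snippet runs on its own.

```python
import copy
import torch
import torch.nn as nn

# Synthetic stand-ins so the sketch runs; replace with real data and model.
torch.manual_seed(0)
X_train, y_train = torch.randn(256, 10), torch.randn(256, 1)
X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# 3.1: halve the LR after `patience` epochs without validation improvement.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10
)

# 3.2: keep the best weights so far, so we can rewind instead of
# continuing from an overfit state.
best_val, best_state = float("inf"), copy.deepcopy(model.state_dict())

for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()

    if val_loss < best_val:
        best_val, best_state = val_loss, copy.deepcopy(model.state_dict())

    prev_lr = optimizer.param_groups[0]["lr"]
    scheduler.step(val_loss)  # 3.1: reduce LR on plateau
    if optimizer.param_groups[0]["lr"] < prev_lr:
        # 3.2: the scheduler just cut the LR, so rewind to the best
        # checkpoint and continue training from there.
        model.load_state_dict(best_state)
```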
Notice that for items 1 and 3 I made these decisions based on a patience period: for example, declare overfitting/convergence if the validation loss hasn't decreased after 10 epochs.
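To make the patience logic concrete, here is a minimal sketch of the item-1 check as I ran it. Again, the model, data, and the patience of 10 are placeholders for illustration.

```python
import torch
import torch.nn as nn

# Synthetic stand-ins; in practice these come from the train/val split.
torch.manual_seed(0)
X_train, y_train = torch.randn(256, 10), torch.randn(256, 1)
X_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

patience = 10  # epochs without val improvement before declaring a plateau
best_val, epochs_without_improvement = float("inf"), 0

for epoch in range(1000):
    model.train()
    optimizer.zero_grad()
    criterion(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()

    if val_loss < best_val:
        best_val, epochs_without_improvement = val_loss, 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            # Declare convergence/overfitting: no val improvement
            # for `patience` consecutive epochs.
            print(f"stopping at epoch {epoch}, best val loss {best_val:.4f}")
            break
```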
What do you think? What is redundant? Is anything potentially harmful?