Training Loss Randomly Varying on Startup

Hello, I am currently training a CNN model with the AdamW optimizer. I notice that when I start the training process, the loss is often unacceptably high. I work around this by simply restarting the training script, up to three times. With every execution, the loss varies significantly until I hit a reasonably low starting point. I have concluded that the optimizer must randomly initialize internal values on every run, which can result in a high loss or not. Is this correct?

I would rather guess the model's random parameter initialization is not optimal and you might want to change it. What makes you think the optimizer is at fault? If the very first loss output is already high, the optimizer would not have been used at that point.

@ptrblck That cannot be the case, as my model is saved to a file and loaded on each startup. The only component I can identify that might be randomly initialized is the optimizer. Even so, if I save the optimizer's state and load it in the next training session, the loss still varies randomly.
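For reference, saving and restoring both the model state and the optimizer state together might look like this minimal sketch (a toy `nn.Linear` stands in for the actual CNN, and the checkpoint filename is hypothetical):

```python
import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real CNN.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Save both state dicts together in a single checkpoint file.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
}
torch.save(checkpoint, "checkpoint.pt")

# On the next startup, restore both before training continues.
restored = torch.load("checkpoint.pt")
model.load_state_dict(restored["model"])
optimizer.load_state_dict(restored["optimizer"])
```

With both states restored, neither the parameters nor AdamW's running moment estimates start from scratch.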

This would mean that you are not checking the initial loss, but the loss after a few update steps?
If not, how can the optimizer influence the initial loss?
Also, could you explain which part of AdamW is random and which parameters are initialized in this optimizer?

@ptrblck Let me explain. My training program operates as shown:

  1. Import modules; define the model, dataset, and other functions.
  2. Set up the model and load its state dictionary from file if possible.
  3. Create the AdamW optimizer and the MSELoss criterion.
  4. Train on batches while tracking the average loss per epoch.
  5. After each epoch, print the average loss.
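In outline, the loop looks something like this minimal sketch (the model, dataset, and epoch count are hypothetical stand-ins for my real setup):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins for the real CNN and dataset.
model = nn.Linear(8, 1)
dataset = TensorDataset(torch.randn(32, 8), torch.randn(32, 1))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

criterion = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

num_epochs = 2  # the real run uses 50000
for epoch in range(num_epochs):
    running_loss = 0.0
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    # Average loss over all batches in this epoch.
    avg_loss = running_loss / len(loader)
    print(f"Epoch {epoch + 1}/{num_epochs} finished with an "
          f"average training loss of: {avg_loss:.5f}")
```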

The average loss, especially in the first epochs, is what varies the most between training sessions. For example, I could get this on the first execution:

Epoch 1/50000 finished with an average training loss of: 0.00483 and an average test loss of: 0.00799

Too high. Restart training script:

Epoch 1/50000 finished with an average training loss of: 0.00371 and an average test loss of: 0.00782

That’s more like it. :+1: The exact same model file, dataset, loss function, and optimizer class are used in both runs. I changed nothing. How do you explain this?
As for your question regarding which parts of the optimizer are randomly generated… I haven’t an inkling. :person_shrugging:

Any other random operation, e.g. the shuffling in the DataLoader or transformations, could cause it.
Let me know once you have narrowed down what exactly is randomly initialized in the optimizer that could be causing the difference in loss.
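To rule out these other sources of randomness, the usual approach is to fix every seed at the start of the script. A minimal sketch (the `seed_everything` helper is hypothetical, not a PyTorch API):

```python
import random
import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Fix the common sources of randomness so consecutive runs
    # see the same shuffling, initialization, and augmentations.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(0)
a = torch.randperm(10)  # e.g. the order a DataLoader would shuffle in
seed_everything(0)
b = torch.randperm(10)
assert torch.equal(a, b)  # identical across "restarts"
```

If the first-epoch loss still varies after seeding, the remaining nondeterminism is likely elsewhere (e.g. nondeterministic CUDA kernels), not in AdamW, which initializes its moment buffers to zeros rather than random values.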