Training accuracy (loss) increases (decreases) in a zigzag way?

My training log looks strange: the first batch of every epoch always performs better than the later batches.

Here is my training log

As you can see, the values decrease (increase) in a very regular zigzag pattern, which puzzles me.

I’ve shuffled the training data:

train_loader = torch.utils.data.DataLoader(
    train_dataset, ..., shuffle=True,
    num_workers=12, pin_memory=True)

So why does this happen? Any tips?

Do you plot the loss of the current batch or are you somehow summing / averaging it?
Could you post the code regarding the accuracy and loss calculation?

The complete code is here

Actually, I do plot the averaged loss/precision.
But I am still confused about why the first batch always performs best.


Especially since you are shuffling the data.
I couldn’t find any issues by skimming through your code.
Do you see the same effect by just storing the batch losses (without AverageMeter)?
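To make the difference between the two kinds of logging concrete, here is a minimal sketch. The `AverageMeter` below is a small reimplementation in the style of the one from the PyTorch ImageNet example, not the code from this thread, and `log_epoch` and the loss values are made up for illustration.

```python
class AverageMeter:
    """Tracks the latest value and the running average since the last reset."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0
        self.avg = 0.0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def log_epoch(batch_losses):
    """Return (instantaneous, running-average) loss curves for one epoch."""
    meter = AverageMeter()  # a fresh meter per epoch, so the average restarts
    instant, averaged = [], []
    for loss in batch_losses:
        meter.update(loss)
        instant.append(meter.val)   # value of the current batch only
        averaged.append(meter.avg)  # mean over all batches seen this epoch
    return instant, averaged


# Hypothetical per-batch losses, for illustration only.
instant, averaged = log_epoch([1.0, 3.0, 2.0])
```

If the zigzag shows up in `instant` as well as in `averaged`, the averaging itself is probably not the cause.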

I just removed the AverageMeter and now store the instantaneous values. Here is the complete code:

Strangely, the loss (and top-1 accuracy) still oscillates in a zigzag manner: the first batch of each epoch always has the highest accuracy and the lowest loss.

Thanks for the code! I’ll have a look at it.
Which script is behaving strangely, or

Could you tell me what resolution your dataset (casia) has? I would try a random dataset first to check whether there are any obvious mistakes.

Do the default values of the other arguments reproduce the strange behavior?

I didn’t go through your code, but in general a possible cause for this type of behaviour is a feature set that isn’t normalised properly.

For example, if one of the features isn’t normalised at all and has values between 0 and 255, the optimiser’s weight updates overshoot along that feature, so the weights do not move in a straight line toward the optimum (hence the zigzag).

I assume you reduce the learning rate at around 8000 iterations, so the weight updates get smaller and so does the “overshoot” (hence the smaller zigzag).
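The overshoot argument can be illustrated with a toy problem. This is only a sketch under assumptions of my own: a two-parameter quadratic loss whose curvature along each weight stands in for the scale of the corresponding feature. The names `gradient_descent` and `sign_flips`, the scales and the learning rates are all made up for illustration and have nothing to do with the code in this thread.

```python
def gradient_descent(scales, lr, steps=20):
    """Minimise f(w) = 0.5 * sum(s_i * w_i**2) and return the path of w."""
    w = [1.0 for _ in scales]
    path = [list(w)]
    for _ in range(steps):
        # The gradient of f with respect to w_i is s_i * w_i.
        w = [wi - lr * s * wi for wi, s in zip(w, scales)]
        path.append(list(w))
    return path


def sign_flips(path, i):
    """Count how often coordinate i jumps across the optimum (w_i = 0)."""
    return sum(1 for a, b in zip(path, path[1:]) if a[i] * b[i] < 0)


# One direction is 100x steeper than the other (poorly scaled feature).
raw = gradient_descent(scales=[1.0, 100.0], lr=0.019)
# Same problem after rescaling so both directions have equal curvature.
normalised = gradient_descent(scales=[1.0, 1.0], lr=0.019)
# Unscaled problem again, but with a 10x smaller learning rate.
decayed = gradient_descent(scales=[1.0, 100.0], lr=0.0019)

print(sign_flips(raw, 1), sign_flips(normalised, 1), sign_flips(decayed, 1))
```

In the unscaled run the steep coordinate jumps over the optimum on every step, which is the zigzag. Rescaling or lowering the learning rate removes the jumps in this toy setting. Whether this is what happens in the real model here is a separate question.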

both of and behave strangely.

In the latter one I removed the AverageMeter and store the instantaneous values.

The plot of the logs is here; the ExLoss decreases in a typical zigzag way.

The model did finally converge and seems to work well, but I just want to know what caused the strange behavior.

I am seeing the same zigzag loss values in one of my projects.
There, I am dealing with multi-label (multi-hot encoded) targets of size 600.
I think this rather large multi-hot binary output is the reason for the zigzag.

So, the large variations in the input/output could be the reason for this phenomenon. I would not worry about it as long as the model is achieving convergence.

Thanks for the information.

Actually my model converged as well, but I still want to know what causes the strange behavior.

There’s a similar post here, “Strange behavior with SGD momentum training”, which also shows a saw-toothed loss. One suggestion there, by smth, is to sample with replacement.
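For anyone unsure what the difference is, here is a framework-free sketch of the two sampling schemes. The function names are hypothetical and only index lists are generated. In PyTorch itself, as far as I know, the usual route is `torch.utils.data.RandomSampler(dataset, replacement=True)` passed to the `DataLoader` via `sampler=` in place of `shuffle=True`.

```python
import random


def epoch_without_replacement(n, rng):
    """Shuffled pass: every index appears exactly once per epoch."""
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


def epoch_with_replacement(n, rng):
    """Draw n indices independently: some may repeat, some may be skipped."""
    return [rng.randrange(n) for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
without = epoch_without_replacement(10, rng)
with_repl = epoch_with_replacement(10, rng)
```

With `shuffle=True` (no replacement) an epoch is a clean pass over the data, so the start of each epoch is a special point in the schedule. With replacement there is no such boundary, so comparing the loss curves under the two schemes is a cheap way to test whether the saw-tooth is tied to epoch boundaries.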