Unable to pause and resume training (loading a model) without getting a small jump in training loss

I am not able to figure out the reason for the jump in training loss that I get after loading from a saved checkpoint. I am using the Adam optimizer.

Model base - Load pretrained vgg16 model weights

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def base_model_vgg16(num_freeze_top):
    # Use the VGG16 convolutional layers (dropping the final max-pool) as the feature extractor
    vgg16 = models.vgg16(pretrained=True)
    vgg_feature_extracter = vgg16.features[:-1]

    # Freeze learning of the top few conv layers
    for layer in vgg_feature_extracter[:num_freeze_top]:
        for param in layer.parameters():
            param.requires_grad = False

    return vgg_feature_extracter.to(device)

Actual Model - create new model

class YOLONetwork(nn.Module):
    def __init__(self, extractor):
        super().__init__()
        self.extractor = extractor
        self.conv1 = nn.Conv2d(512, 1024, 3, 1, 1)
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(1024, 1024, 3, 1, 1)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.lin1 = nn.Flatten()
        self.drop1 = nn.Dropout(p=0.5)
        self.lin2 = nn.Linear(7*7*1024, 7*7*(num_classes + anchors_per_box*5))

    def forward(self, x):
        out = self.extractor(x)                     # VGG16 features
        out = self.pool1(F.relu(self.conv1(out)))
        out = self.pool2(F.relu(self.conv2(out)))   # expected (N, 1024, 7, 7) so lin2 matches
        out = self.drop1(F.relu(self.lin1(out)))    # flatten, then dropout
        out = torch.sigmoid(self.lin2(out))

        num = out.shape[0]
        return out.contiguous().view(num, 7, 7, -1)  # (N, 7, 7, num_classes + anchors_per_box*5)

Creating new model and optimiser

extractor = base_model_vgg16(10)
net = YOLONetwork(extractor).to(device)
loss_hist = []
valid_hist = []
best_valid_loss = 100000
optimizer = optim.Adam(net.parameters(), lr=0.00001)
epoch_start = 0

Saving model

PATH = 'drive/My Drive/saved_models/current.pt'
torch.save({
    'net_state_dict': net.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss_hist': loss_hist,
    'valid_hist': valid_hist,
    'best_valid_loss': best_valid_loss,
    'epoch_start': epoch
}, PATH)

Loading model

checkpoint = torch.load(load_model, map_location=device)
net.load_state_dict(checkpoint['net_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
loss_hist = checkpoint['loss_hist']
valid_hist = checkpoint['valid_hist']
best_valid_loss = checkpoint['best_valid_loss']
epoch_start = checkpoint['epoch_start']

net.train()
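
Training then continues from the saved epoch index. The loop itself is not shown in the post, so the following is only a sketch with hypothetical names (num_epochs, train_one_epoch, train_loader):

# Hypothetical resume loop; num_epochs, train_one_epoch and train_loader are
# placeholders, not code from the original post.
for epoch in range(epoch_start, num_epochs):
    net.train()
    train_loss = train_one_epoch(net, optimizer, train_loader)
    loss_hist.append(train_loss)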


I am unable to figure out the reason for the training loss jump once the training is resumed from a checkpoint.

Thank you very much! 🙂

I am training the network on a Google Colab GPU. If I interrupt the training and continue it by loading the saved weights, I do not see this problem.

But if I factory reset the runtime and then reload the weights, I do see a jump in the training loss.

I am also setting the same global random seed in my Colab notebook.
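
The seeding is along these lines (a minimal sketch; the exact calls in the notebook may differ):

# Minimal seeding sketch; the seed value is a placeholder
import random
import numpy as np
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)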

It seems strange that one workflow works on Colab while another fails.
Are you also restoring the optimizer.state_dict() in both use cases?

Could you run a small test and compare the output for a fixed input, e.g. torch.ones()?
If you get exactly the same outputs, I would guess that the way you are restoring the data loading pipeline might create the difference.
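
Something like this rough sketch would do (the input shape is just a placeholder):

# Rough sketch of the suggested check: feed a constant input and compare the
# printed values before saving and after restoring the checkpoint.
# The input shape (1, 3, 448, 448) is a placeholder.
with torch.no_grad():
    x = torch.ones(1, 3, 448, 448, device=device)
    print(net(x)[0, 0, 0])  # e.g. the first row of the (N, 7, 7, C) output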

Thanks for your response.

Based on your suggestion, this is what I tried.

  1. Trained the network for 1 epoch and saved the network weights after the epoch. With the existing weights (without loading from the saved checkpoint), ran the network on torch.ones(), setting the network to net.train() before running.
First row of output - tensor([0.5039, 0.5054, 0.4700, 0.4680, 0.3996, 0.5174, 0.5194, 0.4724, 0.4739,
        0.4014, 0.4767, 0.4717, 0.4892, 0.4578, 0.4734, 0.4513, 0.4582, 0.4603],
       device='cuda:0', grad_fn=<SelectBackward>)
  2. Loaded the network weights from the saved checkpoint, set it to net.train(), and ran the network on torch.ones(). The output I get is:
First row of output - tensor([0.5027, 0.5097, 0.4698, 0.4635, 0.4030, 0.5184, 0.5228, 0.4709, 0.4676,
        0.4011, 0.4730, 0.4778, 0.4874, 0.4604, 0.4683, 0.4380, 0.4620, 0.4640],
       device='cuda:0', grad_fn=<SelectBackward>)

So, the values that I obtain as output in both these cases are different.

I take my last comment back. Since a dropout layer is used in the network, the output after net.train() will not be exactly the same.

Setting the net to net.eval() before running gives the same output in both cases. So at least the network weights are being loaded correctly.

I need to check whether the issue is caused by the optimizer loading or by a change in the data loading after the runtime is reset. Thanks
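
One way to sanity-check the optimizer side (a rough sketch, not from the original thread) is to compare the restored state against the checkpoint directly:

# Rough sketch: verify that Adam's per-parameter state (step, exp_avg, exp_avg_sq)
# was actually restored from the checkpoint.
restored = optimizer.state_dict()
saved = checkpoint['optimizer_state_dict']

assert len(restored['state']) == len(saved['state'])
for idx, entry in saved['state'].items():
    for name, value in entry.items():
        other = restored['state'][idx][name]
        if torch.is_tensor(value):
            assert torch.equal(value.cpu(), other.cpu()), (idx, name)
        else:
            assert value == other, (idx, name)
print('optimizer state matches the checkpoint')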

After spending a few hours trying to figure out the problem, I found the root cause. It has nothing to do with model/optimizer loading and saving at all.

What I needed to change is

lab_to_val = {j:i for i,j in enumerate(training_classes)}

To

lab_to_val = {j:i for i,j in enumerate(sorted(training_classes))}

Since the label-to-value conversion was not happening on a sorted list, a given label could take a different value each time my runtime was reset. For example, if the person class took the value 1 in one run, it could take the value 4 after restarting the runtime and re-running the code.

After fixing this small error, I no longer observe any jumps in the training loss.
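
To illustrate the failure mode (the class names below are made up): if training_classes comes from something without a guaranteed order, such as a set or a directory listing, the mapping can differ between runtime restarts, whereas sorting first pins each label to a fixed index:

# Illustration with made-up class names: a set (or directory listing) has no
# guaranteed iteration order across runs, so the label -> index mapping can change.
training_classes = {'person', 'car', 'dog'}

unstable = {j: i for i, j in enumerate(training_classes)}        # order not guaranteed
stable = {j: i for i, j in enumerate(sorted(training_classes))}  # deterministic

print(stable)  # always {'car': 0, 'dog': 1, 'person': 2}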

I face the same problem. However, I don’t get your point here: where is lab_to_val = {j:i for i,j in enumerate(training_classes)}? I could not find it in your code.