Average epoch training loss not recovered by checkpoints

Dear community,

The problem is that the average epoch training loss of my network converges nicely (say to .005 mean SmoothL1Loss), but after I save a checkpoint and resume from it, the average epoch training loss is back at .05 (10 times worse). This repeats when loading from the next checkpoint: the model never picks up where it left off.

The data input has been checked, it’s all neat. Otherwise, the model wouldn’t converge this well anyway.

Because of all the preprocessing steps, the average epoch loss should be incredibly consistent (which it is within a run, but not between checkpoints).

Using another optimizer gives exactly the same problem. I’ve tried loading and saving in many different ways, exhausting all online resources, but to no avail.

I would call it an obscure PyTorch checkpoint problem.

What I use to save:

# Save FusionNet checkpoint
torch.save({
    'model_module_state_dict': FusionNet.module.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'scheduler_state_dict': scheduler.state_dict()
}, f'{models_folder}FusionNet_snapshot{load_snapshot+iE+1}.tar')

What I use to load:

# Initiate FusionNet

FusionNet = FusionGenerator(1,1,64)

if load_snapshot:
    model_path = f'{models_folder}FusionNet_snapshot{load_snapshot}.tar'
    checkpoint = torch.load(model_path, map_location=nn_handler_device)
    check = FusionNet.load_state_dict(checkpoint['model_module_state_dict'])

FusionNet = nn.DataParallel(FusionNet)  # any further arguments are not shown

# Define optimizer and send to GPU
optimizer = torch.optim.Adam(FusionNet.parameters(), lr=lr, weight_decay=weight_decay)
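if load_snapshot:
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    # optimizer_to is my own helper that moves the optimizer state to the device
    optimizer_to(optimizer, torch.device(nn_handler_device))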

# Define scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=gamma)

# Optional model snapshot loading
if load_snapshot:
    print(f'\tSnapshot of model {model} at epoch {load_snapshot} restored...')
    print(f'\tUsing network to train on images from {data_set}/{data_subset}...')

# Make sure training mode is enabled
FusionNet.train()

I would try to verify that the right values are indeed loaded by removing the nn.DataParallel wrapper (to simplify the code) and by comparing the outputs for a static input (e.g. torch.ones) before saving and after loading the model (additionally calling model.eval() in both cases). If these values differ, then the loading itself might have some trouble. On the other hand, if these values are equal (up to floating point precision), then the data processing might be different in the validation script (or something else).
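A minimal sketch of this static-input check, assuming a small stand-in network (the real FusionGenerator is not shown in this thread) and an in-memory buffer instead of a .tar file on disk. The names build_model and 'model_state_dict' are placeholders for illustration only:

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)


def build_model():
    # Hypothetical stand-in for the real network
    return nn.Sequential(nn.Conv2d(1, 4, 3, padding=1), nn.BatchNorm2d(4), nn.Tanh())


model = build_model()
model.eval()  # disable dropout and use batchnorm running stats

static_input = torch.ones(1, 1, 8, 8)
with torch.no_grad():
    out_before = model(static_input)

# Save to an in-memory buffer (a file path would work the same way)
buffer = io.BytesIO()
torch.save({'model_state_dict': model.state_dict()}, buffer)
buffer.seek(0)

# Load into a freshly constructed model, as a resumed run would do
restored = build_model()
checkpoint = torch.load(buffer, map_location='cpu')
restored.load_state_dict(checkpoint['model_state_dict'])
restored.eval()
with torch.no_grad():
    out_after = restored(static_input)

outputs_match = torch.allclose(out_before, out_after)
print('outputs match after reload:', outputs_match)
```

If outputs_match is False for the real model, the save/load path is the first suspect. If it is True, the weights themselves survived the round trip.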

Dear @ptrblck,

The results of your test:
network 1
before saving and loading, mode eval: … 0.23162976, 0.08113185, 0.46776736 …
after saving and loading, mode eval: … 0.23162976, 0.08113185, 0.46776736 …

network 2
before saving and loading, mode train: … -3.0789476e-17, -3.0789476e-17, -3.0789476e-17 …
after saving and loading, mode train: … -3.0789476e-17, -3.0789476e-17, -3.0789476e-17 …
after saving and loading, mode train before, mode eval after: … 0.80313516, -0.21686335, 0.5515088 …

Note that these scores don’t involve the optimizer in any way, though I did check that the optimizer states are loaded, and they are. I had also checked earlier that the model parameters were loaded by directly comparing some of the model weights before and after loading. You mention the validation script, but what I’m talking about is that the average epoch training loss doesn’t even pick up where it left off (a measure even closer to the problem than validation). For instance, last night, I trained a model for 8000 epochs until it had .005 mean SmoothL1Loss, but then when I reloaded it, it started from .05, 10X higher. It always seems to start somewhere around there.

I’m really at a loss here. I have to add, when I visually inspect the network outputs, they get markedly worse after loading (not back to initialization, but much worse).

Could it be something within the network? What would you check next?

I tried doing the same test with the DataParallel wrapper (the result was the same), but I noted that in the output tensor grad_fn, related to the autograd, was instead of the tanh one when I tried it without the DataParallel wrapper. Would that have any significance?

Thanks for the update!
I’m not sure I understand the outputs. Are these showing differences or the output values?
In the former case, it seems that using eval() changes the behavior before saving vs. after loading?

The outputs are just samples from the output tensor to show you whether they are equal. Using train() for one epoch before saving and then eval() after being loaded to give the output indeed changes the behavior, but this is to be expected, right?

Just to clarify: I initialized 2 networks (1 and 2). The first one I put in eval mode and checked the output tensor to the static input of ones. A sample of the output is shown in the first line. Then, before letting the model parameters step, I saved it, loaded it, still in eval(), and checked whether the loaded model gave the same output to static ones, which is shown in the second line. It did.

I did the same with the second model in the mode train() which is shown on the 3rd and 4th output line. Then finally, after saving the second n

This approach doesn’t let the model step (because that would definitely change the output), and also doesn’t check the optimizer behavior. It does however show that the network itself is loaded and saved, right?

Yes, the outputs between model.train() and model.eval() are expected to be different, as the behavior of some layers changes. E.g. dropout is disabled during eval(), and batchnorm layers use their running stats instead of the batch statistics.
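As a small illustration of why train() and eval() outputs differ, here is a sketch with standalone batchnorm and dropout layers. The layer sizes and input are arbitrary choices, not taken from the model in this thread:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(16, 3) * 5 + 2  # arbitrary input with non-zero mean and non-unit variance

# Batchnorm: batch statistics in train(), running statistics in eval()
bn = nn.BatchNorm1d(3)
bn.train()
out_train = bn(x)  # normalizes with the batch mean/var and updates the running stats
bn.eval()
out_eval = bn(x)   # normalizes with the running stats accumulated so far
bn_differs = not torch.allclose(out_train, out_eval)

# Dropout: random masking in train(), identity in eval()
drop = nn.Dropout(p=0.5)
drop.eval()
dropout_is_identity_in_eval = torch.equal(drop(x), x)

print('batchnorm train/eval outputs differ:', bn_differs)
print('dropout is identity in eval mode:', dropout_is_identity_in_eval)
```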

That’s good as it shows that saving and loading the model works fine.

I think some sentences are dropped here.
You should be careful about comparing the outputs in train() mode as e.g. dropout could be used (depends of course on your model) which would then change the outputs.
If I understand the explanation correctly, you are indeed able to save and load the model, but are seeing now a large difference between the train() and eval() mode?

If I understand the explanation correctly, you are indeed able to save and load the model, but are seeing now a large difference between the train() and eval() mode?

Not at all, the problem is that, even though the model is apparently saved correctly, the average epoch training loss, which should be very consistent, is way higher after loading everything (model, optimizer, scheduler) than before. Like I said, the method with static ones doesn’t let the model step, so we can’t really check what’s going on there with it. I’m totally unclear as to why that is happening.

I’ve tried simplifying the problem so that I don’t use DataParallel. I’ve tried saving and loading the whole model. Could it be something within the model itself? Is it possible the optimizer somehow doesn’t work fully?

I think the static input test showed that you are able to save and properly load the model.
In the majority of similar reported issues the difference comes from the data loading pipeline and in fact some issues were found where the processing, normalization, dataset creation etc. was different.
I agree that the issue is not solved yet, but I disagree that the loading/saving of the model is wrong.

Since you suspect that the optimizer might somehow “not work”, I would probably verify its behavior by repeating the previous experiment with an optimizer step:

  • disable random ops (e.g. dropout) via model.eval()
  • store the state_dict of the model and optimizer (sd1)
  • perform a single update step with static input data
  • store the updated state_dicts (sd2)
  • restore the model and optimizer from sd1
  • perform the same update step and compare the updated objects to sd2

If something in the usage or loading of the optimizer is wrong, this test should show a difference.
On the other hand, if this is still showing the same results, I would check the data again and make sure it’s indeed right.
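The steps above can be sketched roughly as follows. This assumes a tiny stand-in model and Adam with SmoothL1Loss as mentioned in the thread; the learning rate, weight decay and the warm-up step are my own choices so that the optimizer has non-empty state to restore:

```python
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in model; the real network is not shown in this thread
model = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.SmoothL1Loss()
model.eval()  # disable random ops such as dropout

# Static input and target
x = torch.ones(2, 4)
y = torch.zeros(2, 1)


def update_step():
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()


update_step()  # warm-up so that Adam's moment estimates exist

# sd1: state before the update (deepcopy, since state_dict() returns references)
sd1 = {
    'model': copy.deepcopy(model.state_dict()),
    'optimizer': copy.deepcopy(optimizer.state_dict()),
}

update_step()
sd2 = copy.deepcopy(model.state_dict())  # state after the update

# Restore sd1 and repeat the same update
model.load_state_dict(sd1['model'])
optimizer.load_state_dict(sd1['optimizer'])
update_step()

updates_match = all(
    torch.allclose(sd2[name], value) for name, value in model.state_dict().items()
)
print('repeated update matches:', updates_match)
```

Leaving out the optimizer.load_state_dict call is a quick way to see what a mismatch looks like, since Adam then takes the repeated step with stale moment estimates.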

So I found what it was: I used persistent workers in combination with torch randomness in my data transforms. Even though I used torch randomness, which reseeds on every worker automatically, the persistent workers still caused an issue.

What I’m doing now is killing the workers after every epoch (persistent_workers=False), but that is taking a lot of extra time. Is there a way to reset the seed on every dataloader __getitem__() call within the transforms themselves, or at least within every batch? In other words, how do I update the random seed of a persistent worker without shutting it down?

That part is quite new to me, so maybe you could give a quick suggestion if it’s not too much trouble?

Thanks a lot for responding so quickly, and pointing me in the right direction time and time again! Extremely appreciated!


Good to hear you’ve isolated the issue!

I think you could use the seed from each worker and reset the current seed using it and an offset based on the index and epoch.
Here is a small code snippet to check the currently used seed:

import torch

class MyDataset(torch.utils.data.Dataset):
    def __init__(self):
        self.data = torch.arange(10)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # Print the seed of the worker that loads this sample
        print('index {}, seed {}\n'.format(index, torch.utils.data.get_worker_info().seed))
        x = self.data[index]
        return x

dataset = MyDataset()

# Default: new workers are created at the start of every epoch
loader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2)
for epoch in range(2):
    print('epoch {}'.format(epoch))
    for data in loader:
        pass

# Persistent workers: the same workers are reused across epochs
loader = torch.utils.data.DataLoader(dataset, batch_size=2, shuffle=True, num_workers=2, persistent_workers=True)
for epoch in range(2):
    print('epoch {}'.format(epoch))
    for data in loader:
        pass

Let me know if changing the seed inside __getitem__ works for you.
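One way this could look is sketched below. It derives a per-sample seed from the worker seed, the epoch and the index, and feeds it to a local torch.Generator. The class name, set_epoch and the shared-memory epoch counter are assumptions for illustration: a persistent worker holds its own copy of the dataset, so a plain Python attribute set in the main process would not reach it.

```python
import torch
from torch.utils.data import Dataset, get_worker_info


class ReseededDataset(Dataset):
    """Hypothetical dataset that reseeds its randomness per sample and epoch."""

    def __init__(self, n=10, base_seed=1234):
        self.data = torch.arange(n, dtype=torch.float32)
        self.base_seed = base_seed
        # Shared-memory counter so that long-lived workers can see epoch updates
        self.epoch = torch.zeros(1, dtype=torch.long).share_memory_()

    def set_epoch(self, epoch):
        self.epoch.fill_(epoch)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        info = get_worker_info()
        # In the main process (num_workers=0) there is no worker info
        seed = self.base_seed if info is None else info.seed
        offset = int(self.epoch.item()) * len(self) + index
        generator = torch.Generator()
        generator.manual_seed((seed + offset) % (2 ** 63))
        # Stand-in for a random transform: add noise drawn from the local generator
        noise = torch.randn((), generator=generator)
        return self.data[index] + noise


dataset = ReseededDataset()
dataset.set_epoch(0)
first = dataset[3]
again = dataset[3]       # same epoch and index: same seed, same sample
dataset.set_epoch(1)
next_epoch = dataset[3]  # new epoch: different seed, different noise
print(first.item(), again.item(), next_epoch.item())
```

With a DataLoader, set_epoch would be called in the main process at the start of each epoch. I have only shown direct indexing here, so the worker path should be checked with the seed-printing snippet above before relying on it.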

So the seeds inside the __getitem__ function are definitely not refreshed, as your snippet shows, and could in theory be reset, but what I realized is that it shouldn’t even matter given the way I programmed my data transforms. You see, I perform all image transforms on the base dataset once per epoch on the GPU, then send the epoch data to the CPU and sample from that during the epoch (even if the random seed of the workers in __getitem__ is not refreshed, the shuffling should still be random, right?).

I created a custom epoch_pretransform function inside my dataset() as such:

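def epoch_pretransform(self):

    # Perform online epoch pretransforms
    self.epoch_dataset = self.epoch_pretransforms(self.dataset)
    # Send to CPU (the dictionary values are not shown above; .cpu() is assumed here)
    self.epoch_dataset = {
        key: self.epoch_dataset[key].cpu()
        for key in self.dataset.keys()
    }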
def __getitem__(self, idx):
    """Upon subscription or iteration

    Args:
        idx (int/tensor): sample/annotation index
    """
    # If annotation data is relevant
    if not self.deploy:

        # Match image and annotation identifier
        identifier = self.image_filenames[idx]
        annot_idx = self.annot_filenames.index(identifier)

        # Make image/annotation sample
        sample = {
            'image': self.epoch_dataset['image'][idx, 0:1, :, :],
            'annot': self.epoch_dataset['annot'][annot_idx, 0:1, :, :]
        }

    # The branch for deploy mode is not shown
    return sample

How can it be that turning persistent workers off still solved my problem even though I’m not performing transformations within __getitem__? The custom dataset() functions aren’t executed by workers, right? Can I check the random seed within the image transforms themselves with torch.initial_seed() or some other function? Is there some explanation why my problem got solved just by turning persistent workers off?

Even if I’m using this within every image transform class call function:

# Reset torch randomness
torch.cuda.manual_seed_all(torch.initial_seed() + epoch)
torch.manual_seed(torch.initial_seed() + epoch)

It still doesn’t solve the issue when persistent_workers are on. Assuming this code does indeed change the seed all the time, I’m struggling to understand what it is about persistent workers that could change anything at all, and even more so if it isn’t the randomness…

Could it have anything to do with gradients being calculated during the online epoch transforms? Regardless whether it could be related, does it make sense to disable gradients during the tensor transforms?

I wrote my own custom DataLoader class that can fetch batches all within the GPU now, decreasing CPU-GPU writing time as a bonus. That solved the issue as well, but I’m still curious what caused it.
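For anyone curious, a minimal sketch of what such a loader could look like, assuming the whole epoch dataset fits in memory as a dictionary of tensors with a shared first dimension. The class name TensorBatchLoader and its arguments are hypothetical, not my actual implementation:

```python
import torch


class TensorBatchLoader:
    """Hypothetical in-memory loader that slices batches directly on one device."""

    def __init__(self, tensors, batch_size, shuffle=True, device='cpu'):
        # Move everything to the target device once (e.g. 'cuda' if available)
        self.tensors = {key: value.to(device) for key, value in tensors.items()}
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.device = device
        self.n = next(iter(self.tensors.values())).shape[0]

    def __len__(self):
        # Number of batches, counting a smaller final batch
        return (self.n + self.batch_size - 1) // self.batch_size

    def __iter__(self):
        # A fresh permutation is drawn every time iteration starts, i.e. every epoch
        if self.shuffle:
            order = torch.randperm(self.n, device=self.device)
        else:
            order = torch.arange(self.n, device=self.device)
        for start in range(0, self.n, self.batch_size):
            idx = order[start:start + self.batch_size]
            yield {key: value[idx] for key, value in self.tensors.items()}


# Toy data standing in for the image/annotation tensors
data = {
    'image': torch.arange(10, dtype=torch.float32).view(10, 1),
    'annot': torch.arange(10, dtype=torch.float32).view(10, 1),
}
loader = TensorBatchLoader(data, batch_size=4)
batches = list(loader)
print(len(loader), [batch['image'].shape[0] for batch in batches])
```

There are no worker processes involved, so there is no worker state that can persist between epochs. The trade-off is that everything has to fit on the device at once.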

TL;DR: Writing your own DataLoader class takes one tenth of the time it will take to debug any obscure issue with the standard class caused by custom pieces of your code that interact unexpectedly with it