Save a PyTorch model, reload it for transfer learning, but the evaluation result is bad

Dear community,
I am a newbie to PyTorch, so forgive whatever mistakes I may make. I have run into a problem that I hope somebody can give some advice on.

I am using the CIFAR-10 dataset and a partitioner tool to partition the train loader; I am pretty sure that I am using the same train loader/test loader for all the tasks below.
This is what I did:

  1. Train the model for, say, 10 epochs; I get an evaluation accuracy of about 40%. Then I save the model.
  2. Next, load the model from the previously saved file and start training again. In principle, I expect to start with an evaluation accuracy of ~40%. Instead, I only get something like 15%… (the evaluation check I have in mind is sketched right after this list)
    Note that if I train the model from scratch, I get ~10% evaluation accuracy for the 1st epoch – this is as expected, but it also indicates that the loaded file does help a little, i.e. it is better than a totally random starting model.
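For reference, this is a minimal sketch of the evaluation check I mean in step 2, assuming the device and test_loader from my script (the exact names may differ):

# accuracy check on the (re)loaded model (a sketch, not my exact code)
model.eval()
correct, total = 0, 0
with torch.no_grad():
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.size(0)
print(f"evaluation accuracy: {correct / total:.2%}")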
  • I wonder if I made a mistake with the model save/reload? The save/load functions are pretty simple and follow the standard PyTorch method, as shown below.
  • The model is trained on the GPU (cuda); after loading the model from the file, I do call model.to(device).
  • I wrap the model save function in my model class:
import torch

def import_parameters_from_pt_file(file: str, model):
    '''load model parameters from a .pt file on disk; `file` should include the path'''
    #print(f"load the file: {file}")
    model.load_state_dict(torch.load(file))
    model.eval()
    return model

def save_parameters_to_pt_file(file: str, model):
    '''save model parameters to a .pt file on disk; `file` should include the path'''
    torch.save(model.state_dict(), file)
    print(f"saving the file: {file}")

# in my cifar10 class I wrap the model save method:
class cifar10():
...
    def save_model_to_file(self, file_name: str):
        utl.save_parameters_to_pt_file(file_name, self.model)

# this is how I call the model save at the end of training:
<...training completed...>
    file_name = f"./{local.device}-final.pth"
    cifar10.save_model_to_file(file_name)
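
For reference, this is roughly how I reload at the start of the second run (just a sketch; the exact instance and variable names in my script may differ):

    file_name = f"./{local.device}-final.pth"
    cifar10.model = utl.import_parameters_from_pt_file(file_name, cifar10.model)
    cifar10.model.to(device)  # move the reloaded model back to the GPU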

Greetings, try saving your optimizer’s state dict per the docs here, and see if that helps.

torch.save({
            'epoch': epoch,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),  # <-- this part
            'loss': loss,
            ...
            }, PATH)

If you’re using an optimizer like Adam, it is learning stuff along the way and might benefit from not being reset.
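
For completeness, the matching load side of that checkpoint pattern looks roughly like this (PATH, model, and optimizer as above):

model = TheModelClass(*args, **kwargs)
optimizer = TheOptimizerClass(*args, **kwargs)

checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']
loss = checkpoint['loss']

model.train()  # resume training, or model.eval() for inference only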

I am using SGD as the optimizer.
I am trying to follow your suggestion. I wrapped a new method as below:

    def save_fullmodel_to_file(self, file_name: str):
        model_dict = {
                    #'epoch': epoch,
                    'model_state_dict':     self.model.state_dict(),
                    'optimizer_state_dict': self.optim.state_dict(),
                    #'loss': loss,
                    }
        utl.save_full_model(file_name, model_dict)

But when I call this method I get an error report:

    'optimizer_state_dict': self.optim.state_dict(),
TypeError: state_dict() missing 1 required positional argument: 'self'

self.optim is an attribute of the cifar10 class, not a class itself, so I just don’t understand this error complaint…

Could you show where you define self.optim and also the definition of utl.save_full_model?

self.optimz is one of the components of the cifar10 class; it is initialized to None in __init__ and then assigned before training from a passed-in parameter:

self.optimz  = optim.SGD

And please forget utl.save_full_model; for now I simply use torch.save as below:

torch.save({self.model.state_dict(), self.optimz.state_dict()}, file_name)

The reported error stays the same:

   torch.save({self.model.state_dict(), self.optimz.state_dict()}, file_name)
TypeError: state_dict() missing 1 required positional argument: 'self'


Looks like you never initialized the optimizer; your self.optimz is merely a reference to the (un-instantiated) optim.SGD class. Calling state_dict() on the class itself means there is no instance bound, which is why Python complains about the missing 'self' argument.

Try this instead and see if this helps:

self.optimz = optim.SGD(params=self.model.parameters(), lr=1e-4)

Also for the save, I think you should save a dict, not a set. Something like this:

torch.save({"model": self.model.state_dict(), "optim": self.optimz.state_dict()}, file_name)

Great, I am able to save the model + optimizer state_dicts, many thanks.
Next I am trying to load them and check whether I get matching performance.
<…>
Update: Yes, following the suggestion, I think my problem is solved.
Remaining questions:

  1. Could you please explain a little: why does the optimizer matter? Originally I thought the model file was all I needed…
  2. If I am just interested in inference but not transfer learning, loading a model file would be enough, correct? Because a pure test does not involve the optimizer at all…

Hi Leonmac, great to hear this solved your issue.

  1. If you were using a very basic optimizer, it may not matter. However, a “stateful” optimizer learns things about the model / data as the training process unfolds and applies that knowledge to take better optimization steps. If you restart the training process and don’t save / reload the optimizer state, the optimizer loses all of its learnings and has to start over from scratch, which means it’s going to take some awkward steps at first, damaging performance. Adam, for example, keeps track of a couple of coefficients per parameter, in essence learning the appropriate learning rate to use for each model parameter (based on moments of the gradient, detail here), so it’s helpful not to lose that. There is a small sketch of what Adam stores per parameter just after this list.
  2. That’s right: load the model file, set the model to eval(), and you’re good to go.
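
To make the “stateful” part concrete, here is a small standalone sketch (not code from this thread) showing the per-parameter buffers Adam accumulates after a single step; these buffers are exactly what gets lost if only the model state_dict is saved:

import torch
from torch import nn, optim

model = nn.Linear(4, 2)
opt = optim.Adam(model.parameters(), lr=1e-3)

# one dummy training step so the optimizer builds up its internal state
loss = model(torch.randn(8, 4)).sum()
loss.backward()
opt.step()

# each parameter now has running first/second moment estimates
state = opt.state_dict()["state"][0]
print(sorted(state.keys()))  # ['exp_avg', 'exp_avg_sq', 'step']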