Saving and Loading Optimizer Params

Hi,
I’m trying to save and load the optimizer parameters the same way I do for a model, but although I’ve tried many different approaches, I still can’t get it to work. Here is the code:

best_model_wts = copy.deepcopy(model.state_dict())
best_optim_pars = copy.deepcopy(optimizer.state_dict())

for epoch in range(num_epochs):
    for phase in ['train', 'val']:
        if phase == 'train':
            model.train()
        else:
            model.eval()

        running_loss = 0.0
        running_corrects = 0

        if epoch < 20:
            error_sigma = 2.0
        elif 19 < epoch < 40:
            error_sigma = 1.5
            if epoch == 20:
                model.load_state_dict(best_model_wts)
                optimizer.load_state_dict(best_optim_pars)

        for inputs, labels, _ in data_loaders[phase]:
            batch_size = len(labels)
            converted_inputs = dictionary_to_tensor(inputs, batch_size)
            converted_inputs = numpy2tensor(converted_inputs)
            labels = labels.to(device)

            optimizer.zero_grad()
            with torch.set_grad_enabled(phase == 'train'):
                outs = model(converted_inputs)
                loss = criterion(outs, labels)
                target_err = torch.sign(loss - error_sigma)

                _, preds_all = torch.max(outs, 1)

                if phase == 'train':
                    loss.backward()
                    optimizer.step(target_err)

            running_loss += loss.item() * converted_inputs.size(1)
            running_corrects += torch.sum(preds_all == labels.data)

        if phase == 'train':
            scheduler.step()

        data_size = len(data_loaders[phase].dataset)

        epoch_loss = running_loss / data_size
        epoch_acc = running_corrects.double() / data_size

        if phase == 'val':
            val_acc_history.append(epoch_acc)
            if epoch_acc > best_acc:
                best_acc = epoch_acc
                best_epoch = epoch
                best_loss = epoch_loss
                best_model_wts = copy.deepcopy(model.state_dict())
                best_optim_pars = copy.deepcopy(optimizer.state_dict())  # this line gives the error

Then I get the following error:
raise RuntimeError("Only Tensors created explicitly by the user "
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

Any idea about how to solve it? Thank you very much.

Hi. Did you try the official PyTorch tutorial?
Does it work for you?

import torch
import torch.nn as nn

m = nn.Linear(10, 2)
opt = torch.optim.Adam(m.parameters())
best = {'optimizer_state_dict': opt.state_dict()}

opt.zero_grad()
opt.step()

opt = torch.optim.Adam(m.parameters())
opt.load_state_dict(best['optimizer_state_dict'])

This dummy example is working fine for me.
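For completeness, the save/load pattern from the official tutorial looks roughly like this — a minimal sketch, where the 'checkpoint.pth' filename is just a placeholder:

import torch
import torch.nn as nn

m = nn.Linear(10, 2)
opt = torch.optim.Adam(m.parameters())

# save both state dicts in one checkpoint file
torch.save({'model_state_dict': m.state_dict(),
            'optimizer_state_dict': opt.state_dict()}, 'checkpoint.pth')

# restore into freshly constructed objects
m = nn.Linear(10, 2)
opt = torch.optim.Adam(m.parameters())
checkpoint = torch.load('checkpoint.pth')
m.load_state_dict(checkpoint['model_state_dict'])
opt.load_state_dict(checkpoint['optimizer_state_dict'])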


Hi, thank you for the follow-up. Yes, I tried the official solution, but it still didn’t work. There’s no problem with the model save/load; the problem is the optimizer state save/load. In my current case, the line below raises an error:

best_optim_pars = copy.deepcopy(optimizer.state_dict())

When I update the code by removing copy.deepcopy, as below:

best_optim_pars = optimizer.state_dict()

then this line raises an error instead:

optimizer.load_state_dict(best_optim_pars)

import torch
import torch.nn as nn

X = torch.rand(2, 10)
y = torch.tensor([[0,1], [1,0]], dtype=torch.float32)

m = nn.Linear(10, 2)
opt = torch.optim.Adam(m.parameters())
state = opt.state_dict()
crit = nn.BCEWithLogitsLoss()

for i in range(5):
    if i == 2:
        # load
        opt.load_state_dict(state)
    opt.zero_grad()
    out = m(X)
    loss = crit(out, y)  # criterion takes (input, target)
    loss.backward()
    opt.step()
    # save
    state = opt.state_dict()

Updated to a less dummy example. It also works fine.

If you are running this in a notebook, I would suggest restarting the kernel to get rid of the stale best_optim_pars = copy.deepcopy(optimizer.state_dict()) variable.

Also, I am running this on the current version of PyTorch.

I actually tried all of these different approaches, but the error remains. The optimizer is SGD with momentum:

optimizer_ft = optim.SGD(model_ft.parameters(), lr=params.lr, momentum=params.momentum)
optimizer_ft.zero_grad()

I also tried it without copy.deepcopy, as in your solution:

best_model_wts = copy.deepcopy(model.state_dict())
best_optim_pars = optimizer.state_dict()

And then, at the point in the code where I load:

 if epoch == 20:
     model.load_state_dict(best_model_wts)
     # this is for initialization; not sure if it's needed?
     optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
     optimizer.zero_grad()
     optimizer.load_state_dict(best_optim_pars)

For saving the best model and the current optim params:

 if phase == 'val':
     val_acc_history.append(epoch_acc)
     if epoch_acc > best_acc:
         best_acc = epoch_acc
         best_epoch = epoch
         best_loss = epoch_loss
         best_model_wts = copy.deepcopy(model.state_dict())
         best_optim_pars = optimizer.state_dict()

Still getting this error:

File “/media/alic/ssdmain/Projects/crandrnn/src/train_multi_level_model.py”, line 83, in train_model
optimizer.load_state_dict(best_optim_pars)
File “/home/alic/anaconda3/envs/crandrnn/lib/python3.7/site-packages/torch/optim/optimizer.py”, line 105, in load_state_dict

File “/home/alic/anaconda3/envs/crandrnn/lib/python3.7/site-packages/torch/tensor.py”, line 23, in deepcopy
raise RuntimeError("Only Tensors created explicitly by the user "
RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment.

That is strange; I can’t reproduce your error using deepcopy. It may be Python-version or even system dependent.

import copy
import torch
import torch.nn as nn

X = torch.rand(2, 10)
y = torch.tensor([[0,1], [1,0]], dtype=torch.float32)

m = nn.Linear(10, 2)
opt = torch.optim.SGD(m.parameters(), lr=0.1)

m_state = copy.deepcopy(m.state_dict())
state = opt.state_dict()

crit = nn.BCEWithLogitsLoss()

for i in range(5):
    if i == 2:
        # load
        m.load_state_dict(m_state)
        opt = torch.optim.SGD(m.parameters(), lr=0.1)
        opt.load_state_dict(state)
    opt.zero_grad()
    out = m(X)
    loss = crit(out, y)  # criterion takes (input, target)
    loss.backward()
    opt.step()
    # save
    m_state = copy.deepcopy(m.state_dict())
    state = opt.state_dict()

Perhaps, as the error suggests, you can get rid of all the deepcopy calls.
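For reference, here is a minimal sketch of what the error message is about: deepcopy only works on leaf tensors, so any tensor that is still attached to an autograd graph raises exactly that RuntimeError:

import copy
import torch

x = torch.rand(2, requires_grad=True)  # a leaf tensor, created by the user
y = x * 2                              # a non-leaf tensor, part of a graph

copy.deepcopy(x)  # works
copy.deepcopy(y)  # RuntimeError: Only Tensors created explicitly by the user ...

So if something in your optimizer state is a non-leaf tensor, both your explicit deepcopy and load_state_dict (which deepcopies internally, as your traceback shows) would fail on it.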


Also, you could take a look at this thread. Perhaps you are doing something similar with your model or optimizer.

Thank you for your help. I’m actually doing the same thing you suggest in your solution:
Initial:

best_model_wts = copy.deepcopy(model.state_dict())
best_optim_state = optimizer.state_dict()

Updating in the loop:

if epoch < 20:
    error_sigma = 2.0
elif 19 < epoch < 40:
    error_sigma = 1.5
    if epoch == 20:
        model.load_state_dict(best_model_wts)
        optimizer = optim.SGD(model.parameters(), lr=0.001)
        optimizer.load_state_dict(best_optim_state)

And saving the best models and optimizer state:

if phase == 'val':
    val_acc_history.append(epoch_acc)
    if epoch_acc > best_acc:
        best_acc = epoch_acc
        best_epoch = epoch
        best_loss = epoch_loss
        best_model_wts = copy.deepcopy(model.state_dict())
        best_optim_state = optimizer.state_dict()

However, I’m still getting the same error on the line that loads the optimizer state.
I tried your solution and it runs without any problem in my environment (Python 3.7.5, PyTorch 1.2), but my code still gives the above-mentioned runtime error.

Did you try not using deepcopy at all in your code?
Also, did you check whether your code updates or reassigns model parameters somewhere (as in the thread I linked before)?
Can you update to the current PyTorch version?
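As a quick sanity check, you could also print your version and scan the optimizer state for non-leaf tensors, since those are exactly what the deepcopy inside load_state_dict chokes on. find_nonleaf_state below is just a hypothetical helper, not a PyTorch API:

import torch
import torch.nn as nn

print(torch.__version__)

# Hypothetical helper: list optimizer state tensors that are non-leaf,
# since non-leaf tensors are the ones that break deepcopy.
def find_nonleaf_state(optimizer):
    bad = []
    for param, state in optimizer.state.items():
        for name, value in state.items():
            if torch.is_tensor(value) and not value.is_leaf:
                bad.append((name, value.shape))
    return bad

# dummy setup; call it on your own optimizer instead
m = nn.Linear(10, 2)
opt = torch.optim.SGD(m.parameters(), lr=0.1, momentum=0.9)
m(torch.rand(2, 10)).sum().backward()
opt.step()  # populates the momentum buffers in opt.state
print(find_nonleaf_state(opt))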


Thank you very much. It works after updating to the current PyTorch version.

  • I tried not using deepcopy at all, but it failed again while loading the optimizer state, with the same error. There was no problem with the model load.
  • I checked the linked thread, but in their code they reproduce the error deliberately; I couldn’t see the same pattern in my own case.

Anyway, as soon as I updated the PyTorch version, it worked without any problem. I still don’t understand why, but it’s working.
Thanks again.

Glad I could help. Perhaps some internal of PyTorch’s deepcopy works differently in the new version; that’s the only reason I can see. Apparently, all tensors now “support the deepcopy protocol”, not only “graph leaves” :slight_smile:
