Error in `optimizer.step()` when resuming training from a pretrained model

When I load a pretrained model and try to continue training, `optimizer.step()` raises the following error:

File "/home/f523/anaconda3/envs/rsy/lib/python3.6/site-packages/torch/optim/adam.py", line 110, in step
p.addcdiv_(exp_avg, denom, value=-step_size)
RuntimeError: output with shape [1, 256, 1, 1] doesn't match the broadcast shape [2, 256, 1, 1]

So I wrapped `p.addcdiv_` in a try-except to inspect it.
When the breakpoint fires in the except branch, I print `exp_avg` and `denom`, and they have the same shape:

denom.shape
Out[2]: torch.Size([2, 256, 1, 1])
exp_avg.shape
Out[3]: torch.Size([2, 256, 1, 1])

However, `p.addcdiv_` still raises the same error.
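For reference, the probe looks roughly like this (a sketch of what I patched into torch/optim/adam.py, not the exact code):

try:
    p.addcdiv_(exp_avg, denom, value=-step_size)
except RuntimeError:
    import pdb; pdb.set_trace()  # inspect exp_avg, denom, and p here
    raise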


Does the "output" in the error refer to something different from `exp_avg` or `denom`?
I trained it on two GPUs, and I have already tried reloading on both one GPU and two GPUs; both fail.
So what should I do?

Can you please show us a snippet of your code which is throwing the error?

Also please check if this error persists if SGD is used instead of Adam. If that’s the case, then there might be an issue with batch size mismatch.
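For example, something along these lines (a sketch; use whatever learning rate you normally train with):

import torch.optim as optim

# temporarily swap Adam for SGD to see if resuming still breaks step()
optimizer = optim.SGD(model.parameters(), lr=1e-3)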

Sorry for the late reply. I am not sure which part of the code I should post; here are my weight-loading and loss-backward snippets:

# load weights part
    def load_my_model(self, args):
        # self.model, self.optimizer, start_epoch = self.load_model(self.model, self.optimizer, args.model_dir)
        weights = torch.load(args.model_dir)
        pretrained_dict = weights['model_state_dict']
        # only load keys that exist in the current model
        model_dict = self.model.state_dict()
        # copy matching pretrained keys into model_dict
        for k in list(pretrained_dict.keys()):
            # keep the model's own 'hm' head weights
            if k[:2] == 'hm':
                continue
            if 'module.' + k in model_dict:
                model_dict['module.' + k] = pretrained_dict[k]
            elif 'module_list.' + k in model_dict:
                model_dict['module_list.' + k] = pretrained_dict[k]
        print('model weights loaded')
       
        # update the existing model_dict
        # model_dict.update(pretrained_dict)
        self.model.load_state_dict(model_dict)

        op_dict = self.optimizer.state_dict()
        pretrained_dict = weights['optimizer_state_dict']
        # merge entries whose keys match the current optimizer state dict
        for k in list(pretrained_dict.keys()):
            if k in op_dict:
                op_dict[k] = pretrained_dict[k]
            elif 'module.' + k in op_dict:
                op_dict['module.' + k] = pretrained_dict[k]
            elif 'module_list.' + k in op_dict:
                op_dict['module_list.' + k] = pretrained_dict[k]

        self.optimizer.load_state_dict(op_dict)
        print('optimizer loaded ')

        return weights['epoch']

# loss step: gradients accumulate over self.step mini-batches before each update
loss = criterion(pr_decs, data_dict)
loss.backward()
if (i+1) % self.step == 0:
    self.optimizer.step()
    self.optimizer.zero_grad()
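
For comparison, a plain loading path (what the commented-out self.load_model call presumably does) would look roughly like this sketch, assuming the checkpoint keys already match the model's keys, including the `module.` prefix:

# straightforward resume (sketch)
weights = torch.load(args.model_dir)
self.model.load_state_dict(weights['model_state_dict'])
self.optimizer.load_state_dict(weights['optimizer_state_dict'])
start_epoch = weights['epoch']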

And it’s really strange: when I run it from PyCharm the model updates its weights properly, but when I launch it from the shell with `python main.py` it raises the error above. What’s going on?
As for using SGD, do you mean training the model with SGD instead of Adam and then reloading it?


PS:
Following your advice, I swapped Adam for SGD as the optimizer, saved a model after one epoch, and reloaded it to check whether it works. With SGD the weights reload and update properly, both in PyCharm and from the shell.
So the question is: what difference between Adam and SGD could cause this?
And if I did anything wrong in my weight-loading script or training loop, I would really appreciate it if you pointed it out.
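
My guess is that Adam keeps per-parameter buffers (exp_avg, exp_avg_sq) shaped like each parameter, while plain SGD without momentum stores no such state, so only Adam would notice if the saved state were mapped to the wrong parameter. For reference, here is a quick check I can run against my checkpoint (a sketch; it assumes the param groups line up in the same order load_state_dict uses):

# sketch: compare each saved exp_avg buffer with the parameter it maps to
ckpt = torch.load(args.model_dir)
saved_state = ckpt['optimizer_state_dict']['state']
params = [p for g in self.optimizer.param_groups for p in g['params']]
for idx, state in saved_state.items():
    if state['exp_avg'].shape != params[idx].shape:
        print(idx, state['exp_avg'].shape, params[idx].shape)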