A RuntimeError occurs when resuming training with a loaded optimizer state.
Here is the code snippet:
# define optimizer
# args.lr | args.momentum | args.weight_decay | args.adam_beta
logger.info('=> setting adam solver')
parameters = set(dispnet.parameters())
optimizer = torch.optim.Adam(parameters, args.lr,
                             betas=(args.momentum, args.adam_beta),
                             weight_decay=args.weight_decay)
# load optimizer checkpoint
# args.solve_state | args.start_epoch
if args.solve_state is not None and args.solve_state != '':
    logger.info("=> resume from a solver state")
    logger.info("=> loading {} ......".format(args.solve_state))
    loaded = torch.load(args.solve_state)
    optimizer.load_state_dict(loaded['state_dict'])
    args.start_epoch = loaded['epoch']
    logger.info("=> loading done.")
# network training
# args.saveprefix
for epoch in range(args.start_epoch, args.epochs):
    # train for one epoch
    train(train_loader, dispnet, optimizer, epoch)
    # save checkpoint
    timestamp = datetime.datetime.now().strftime("%m-%d-%H:%M")
    save_path_net = "{}_epoch_{}_{}_dispnet.pth".format(args.saveprefix, epoch, timestamp)
    logger.info('=> saving DirectDispNet to: {}'.format(save_path_net))
    torch.save({'state_dict': dispnet.state_dict()}, save_path_net)
    save_path_optimizer = "{}_epoch_{}_{}_optimizer.pth".format(args.saveprefix, epoch, timestamp)
    logger.info('=> saving Solver State to: {}'.format(save_path_optimizer))
    torch.save({'state_dict': optimizer.state_dict(), 'epoch': epoch + 1}, save_path_optimizer)
If we load the optimizer state (--solve_state), we get the following runtime error:
[2017-12-19 22:12:58,154 INFO train_directdisp.py line 81 15505] => using pre-trained weights for DirectDispNet
[2017-12-19 22:12:58,154 INFO train_directdisp.py line 82 15505] => loading /home/e/pytorch-ws/exp_example/try_epoch_0_12-19-21:52_dispnet.pth ......
[2017-12-19 22:12:59,756 INFO train_directdisp.py line 85 15505] => loading done.
[2017-12-19 22:12:59,796 INFO train_directdisp.py line 90 15505] => setting adam solver
[2017-12-19 22:12:59,796 INFO train_directdisp.py line 99 15505] => resume from a solver state
[2017-12-19 22:12:59,796 INFO train_directdisp.py line 100 15505] => loading /home/e/pytorch-ws/exp_example/try_epoch_0_12-19-21:52_optimizer.pth ......
[2017-12-19 22:12:59,981 INFO train_directdisp.py line 104 15505] => loading done.
Traceback (most recent call last):
File "train_directdisp.py", line 186, in <module>
main()
File "train_directdisp.py", line 110, in main
train(train_loader, dispnet, optimizer, epoch)
File "train_directdisp.py", line 166, in train
optimizer.step()
File "/home/e/miniconda3/lib/python3.6/site-packages/torch/optim/adam.py", line 69, in step
exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: invalid argument 3: sizes do not match at /home/e/Pytorch/pytorch/aten/src/THC/generated/../generic/THCTensorMathPointwise.cu:271
Process finished with exit code 1
If we do not load the saved state with --solve_state, no error occurs.
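For context, here is a minimal standalone sketch of the save/resume cycle that load_state_dict assumes: the saved per-parameter state (exp_avg, etc.) is matched back to the optimizer's parameters by position, not by name. One guess at what may be relevant in our snippet above is that it wraps the parameters in set(), whose iteration order is not stable across Python processes, so the restored state could line up with differently shaped parameters. This sketch passes the parameter iterator directly instead (the two-layer model is just an illustration, not our real network):

```python
import torch
import torch.nn as nn

# A tiny stand-in model with parameters of different shapes.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))

# Pass the parameter iterator directly: this gives a deterministic order.
# Wrapping it in set() (as in the snippet above) does not, because set
# iteration order depends on object hashes, which vary between runs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Take one step so the optimizer actually holds per-parameter state.
loss = model(torch.randn(3, 4)).sum()
loss.backward()
optimizer.step()

# Save, then restore into a freshly constructed optimizer built over the
# parameters in the SAME order; load_state_dict matches saved state to
# parameters positionally.
saved = {'state_dict': optimizer.state_dict(), 'epoch': 1}
resumed = torch.optim.Adam(model.parameters(), lr=1e-3)
resumed.load_state_dict(saved['state_dict'])
start_epoch = saved['epoch']
```

With a stable parameter order, the restored exp_avg buffers have the same shapes as the parameters they are applied to, so the subsequent step() succeeds.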