When dropout layers are added, the checkpoint won't resume!


(Shisho Sama) #1

I trained a model without any dropout, and now I want to add some dropout layers and resume the training using the last checkpoint I have. However, when I try to do so, I get these errors:

FLOPs: 387.73M, Params: 1.15M
=> loading checkpoint './snapshots/imagenet/simpnets/1mil/nodrp/chkpt_simpnet_imgnet_1m_nodrp_s1_2018-07-08_17-00-55.pth.tar'
Traceback (most recent call last):
  File "imagenet_train.py", line 537, in <module>
    main()
  File "imagenet_train.py", line 122, in main
    model.load_state_dict(checkpoint['state_dict'])
  File "/home/shishosama/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 721, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DataParallel:
	Missing key(s) in state_dict: "module.features.5.weight", "module.features.5.bias", "module.features.5.running_mean", "module.features.5.running_var", "module.features.8.weight", "module.features.8.bias", "module.features.9.running_mean", "module.features.9.running_var", "module.features.21.weight", "module.features.21.bias", "module.features.22.running_mean", "module.features.22.running_var", "module.features.30.weight", "module.features.30.bias", "module.features.30.running_mean", "module.features.30.running_var", "module.features.34.weight", "module.features.34.bias", "module.features.34.running_mean", "module.features.34.running_var", "module.features.37.weight", "module.features.37.bias", "module.features.38.running_mean", "module.features.38.running_var", "module.features.42.weight", "module.features.42.bias", "module.features.43.weight", "module.features.43.bias", "module.features.43.running_mean", "module.features.43.running_var", "module.features.46.weight", "module.features.46.bias", "module.features.47.weight", "module.features.47.bias", "module.features.47.running_mean", "module.features.47.running_var", "module.features.50.weight", "module.features.50.bias", "module.features.51.weight", "module.features.51.bias", "module.features.51.running_mean", "module.features.51.running_var". 
	Unexpected key(s) in state_dict: "module.features.3.weight", "module.features.3.bias", "module.features.4.running_mean", "module.features.4.running_var", "module.features.6.weight", "module.features.6.bias", "module.features.7.weight", "module.features.7.bias", "module.features.7.running_mean", "module.features.7.running_var", "module.features.10.weight", "module.features.10.bias", "module.features.10.running_mean", "module.features.10.running_var", "module.features.19.weight", "module.features.19.bias", "module.features.20.weight", "module.features.20.bias", "module.features.20.running_mean", "module.features.20.running_var", "module.features.23.weight", "module.features.23.bias", "module.features.23.running_mean", "module.features.23.running_var", "module.features.28.weight", "module.features.28.bias", "module.features.29.running_mean", "module.features.29.running_var", "module.features.32.weight", "module.features.32.bias", "module.features.33.running_mean", "module.features.33.running_var", "module.features.35.weight", "module.features.35.bias", "module.features.36.weight", "module.features.36.bias", "module.features.36.running_mean", "module.features.36.running_var", "module.features.39.weight", "module.features.39.bias", "module.features.39.running_mean", "module.features.39.running_var". 
	While copying the parameter named "module.features.4.weight", whose dimensions in the model are torch.Size([80, 60, 3, 3]) and whose dimensions in the checkpoint are torch.Size([80]).
	While copying the parameter named "module.features.9.weight", whose dimensions in the model are torch.Size([80]) and whose dimensions in the checkpoint are torch.Size([80, 80, 3, 3]).
	While copying the parameter named "module.features.12.weight", whose dimensions in the model are torch.Size([80, 80, 3, 3]) and whose dimensions in the checkpoint are torch.Size([85, 80, 3, 3]).
	While copying the parameter named "module.features.12.bias", whose dimensions in the model are torch.Size([80]) and whose dimensions in the checkpoint are torch.Size([85]).
	While copying the parameter named "module.features.13.weight", whose dimensions in the model are torch.Size([80]) and whose dimensions in the checkpoint are torch.Size([85]).
	While copying the parameter named "module.features.13.bias", whose dimensions in the model are torch.Size([80]) and whose dimensions in the checkpoint are torch.Size([85]).
	While copying the parameter named "module.features.13.running_mean", whose dimensions in the model are torch.Size([80]) and whose dimensions in the checkpoint are torch.Size([85]).
	While copying the parameter named "module.features.13.running_var", whose dimensions in the model are torch.Size([80]) and whose dimensions in the checkpoint are torch.Size([85]).
	While copying the parameter named "module.features.16.weight", whose dimensions in the model are torch.Size([85, 80, 3, 3]) and whose dimensions in the checkpoint are torch.Size([85, 85, 3, 3]).
	While copying the parameter named "module.features.22.weight", whose dimensions in the model are torch.Size([85]) and whose dimensions in the checkpoint are torch.Size([90, 90, 3, 3]).
	While copying the parameter named "module.features.22.bias", whose dimensions in the model are torch.Size([85]) and whose dimensions in the checkpoint are torch.Size([90]).
	While copying the parameter named "module.features.25.weight", whose dimensions in the model are torch.Size([90, 85, 3, 3]) and whose dimensions in the checkpoint are torch.Size([90, 90, 3, 3]).
	While copying the parameter named "module.features.29.weight", whose dimensions in the model are torch.Size([90, 90, 3, 3]) and whose dimensions in the checkpoint are torch.Size([110]).
	While copying the parameter named "module.features.29.bias", whose dimensions in the model are torch.Size([90]) and whose dimensions in the checkpoint are torch.Size([110]).
	While copying the parameter named "module.features.33.weight", whose dimensions in the model are torch.Size([90, 90, 3, 3]) and whose dimensions in the checkpoint are torch.Size([110]).
	While copying the parameter named "module.features.33.bias", whose dimensions in the model are torch.Size([90]) and whose dimensions in the checkpoint are torch.Size([110]).
	While copying the parameter named "module.features.38.weight", whose dimensions in the model are torch.Size([110]) and whose dimensions in the checkpoint are torch.Size([150, 127, 3, 3]).
	While copying the parameter named "module.features.38.bias", whose dimensions in the model are torch.Size([110]) and whose dimensions in the checkpoint are torch.Size([150]).

What should I do?
Thanks in advance


(Thomas V) #2

Inserting layers into an nn.Sequential changes the indices of the layers (the numbers in the parameter names).
What you need to do is go through the state dict before loading it and adapt the layer indices.
For example, your first dropout layer seems to be number 3 (the fourth module), so module.features.3.weight needs to be mapped to module.features.4.weight (don’t do it in place if you go in ascending order), and all subsequent layers’ parameter indices need to be increased by at least one as well.

Best regards

Thomas


(Shisho Sama) #3

Thank you very much. Are there any examples I can use to do this? I’m a newbie in PyTorch!


(Thomas V) #4

You just do this at the Python level, as in:

m1 = torch.nn.Sequential(
    torch.nn.Linear(10, 10),
    torch.nn.Linear(10, 10),
    torch.nn.Linear(10, 10),
)
m2 = torch.nn.Sequential(
    torch.nn.Linear(10, 10),
    torch.nn.Dropout(p=0.2),
    torch.nn.Linear(10, 10),
    torch.nn.Dropout(p=0.2),
    torch.nn.Linear(10, 10),
)

# map each old layer index to its new index, skipping the dropout layers
mapping = {}
idxold = 0
for i, l in enumerate(m2):
    if not isinstance(l, torch.nn.Dropout):
        mapping[idxold] = i
        idxold += 1

# rewrite the keys of the old state dict using the new indices
sd1 = m1.state_dict()
sd2 = {}
for k in sd1:
    ksplit = k.split('.')
    ksplit[0] = str(mapping[int(ksplit[0])])
    knew = '.'.join(ksplit)
    sd2[knew] = sd1[k]

m2.load_state_dict(sd2)

(Shisho Sama) #5

Thank you very much sir.
I changed my resume section as follows:

    ...
    #loading the model in gpu
    model = mysimplenet(1000)
    model = torch.nn.DataParallel(model).cuda() 

    if args.resume:
        if os.path.isfile(args.resume):
            print_log("=> loading checkpoint '{}'".format(args.resume), log)
            checkpoint = torch.load(args.resume)
            args.start_epoch = checkpoint['epoch']
            best_prec1 = checkpoint['best_prec1']
            if 'best_prec5' in checkpoint:
                best_prec5 = checkpoint['best_prec5']
            else:
                best_prec5 = 0.00

            #model.load_state_dict(checkpoint['state_dict'])
            
            mapping = {}
            idxold = 0
            for i,l in enumerate(model): # model contains dropout
                if not isinstance(l, torch.nn.Dropout):
                    mapping[idxold] = i
                    idxold += 1
            
            sd1 = checkpoint['state_dict'] # load from existing checkpoint state_dict which does not have dropout
            sd1.keys()
            sd2 = {}
            for k in sd1:
                ksplit = k.split('.')
                ksplit[0] = str(mapping[int(ksplit[0])])
                knew = '.'.join(ksplit)
                sd2[knew] = sd1[k]

            model.load_state_dict(sd2)

            #loading scheduler state
            if no_scheduler_stat:
                scheduler.load_state_dict(tmp)   
            else:      
                scheduler.load_state_dict(checkpoint['scheduler'])

            optimizer.load_state_dict(checkpoint['optimizer'])
            model.eval()
            print_log("=> loaded checkpoint '{}' (epoch {})".format(args.resume, checkpoint['epoch']), log)
        else:
            print_log("=> no checkpoint found at '{}'".format(args.resume), log)

However, I’m now running into TypeError: 'DataParallel' object is not iterable
What am I doing wrong here? What should I do?


(Thomas V) #6

Think about your model structure. The snippet above applies to an nn.Sequential.
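For what it’s worth, one way to adapt the loop to a DataParallel-wrapped model is to unwrap it with .module and iterate the inner Sequential (here assumed to live in a features attribute, matching the param names in the errors above). Note that the checkpoint keys then look like module.features.N.weight, so the layer index sits at position 2 of the split key, not position 0. A minimal sketch under those assumptions, with a toy stand-in for the real network:

```python
import torch

# Hypothetical network whose layers live in a `features` Sequential,
# mirroring the param names (module.features.N.weight) in the errors above.
class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Linear(10, 10),
            torch.nn.Dropout(p=0.2),
            torch.nn.Linear(10, 10),
        )

    def forward(self, x):
        return self.features(x)

model = torch.nn.DataParallel(TinyNet())

# DataParallel itself is not iterable; unwrap it with .module and
# iterate the Sequential that actually holds the layers.
mapping = {}
idxold = 0
for i, layer in enumerate(model.module.features):
    if not isinstance(layer, torch.nn.Dropout):
        mapping[idxold] = i
        idxold += 1

# Stand-in for the old (no-dropout) checkpoint: layer indices 0 and 1.
sd_old = {
    "module.features.0.weight": torch.randn(10, 10),
    "module.features.0.bias": torch.randn(10),
    "module.features.1.weight": torch.randn(10, 10),
    "module.features.1.bias": torch.randn(10),
}

# Keys look like module.features.N.weight, so the index is segment 2.
sd_new = {}
for k, v in sd_old.items():
    parts = k.split('.')
    parts[2] = str(mapping[int(parts[2])])
    sd_new['.'.join(parts)] = v

model.load_state_dict(sd_new)
```

With the dropout at index 1, the mapping comes out as {0: 0, 1: 2}, so the old layer-1 parameters land on module.features.2 of the new model.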


(Shisho Sama) #7

Sorry to sound stupid for not getting your point! :confused:
My model does use nn.Sequential. However, the script that I’m using (Link here) uses DataParallel so that we can use the GPU for training. I have no idea what to do at this point!
I tried removing the DataParallel part and rerunning the model, but that introduced other errors complaining about floats:

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same


(Arulkumar) #8

I think you almost solved the problem.

The above error says that one of the model and the data is on the GPU while the other is on the CPU.
Use .cuda() on both the model and the data.
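As a rough illustration of that device mismatch (with a toy model rather than the script’s), the fix is to move both the model and the input to the same device; the .to(device) form below also keeps the snippet runnable on CPU-only machines:

```python
import torch

model = torch.nn.Linear(10, 2)
x = torch.randn(4, 10)

# Pick one device and move BOTH the model and the input there;
# a cuda input with cpu weights raises exactly the error quoted above.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
x = x.to(device)

out = model(x)  # parameters and input now share a device
```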