AttributeError: 'DataParallel' object has no attribute 'copy'

While trying to load a checkpoint into a ResNet model, I get this error.
What is wrong here?
This is the snippet that causes the error:

def _init_model(self):

        if self.device == 'cpu':
            checkpoint = torch.load(self.model_checkpoint_path, map_location=torch.device('cpu'))
        else:
            checkpoint = torch.load(self.model_checkpoint_path)

        if self.model_name == 'r18':
            self.model = resnet18(pretrained=False, use_se=False)
        elif self.model_name == 'r50':
            self.model = resnet50(pretrained=False, use_se=False)
        elif self.model_name == 'r101':
            self.model = resnet101(pretrained=False, use_se=False)
        else:
            raise Exception(f"Model name: '{self.model_name}' is not recognized.")

        # load the model weights
        self.model.load_state_dict(checkpoint['model'])
        self.model = self.model.to(self.device)
        self.model.eval()
        ....

And this is the error message I get:

Mar 17 20:47:27 ubuntu python3[20862]:   File "/home/user1/anaconda3/lib/python3.7/site-packages/FV/F_V.py", line 58, in __init__
Mar 17 20:47:27 ubuntu python3[20862]:     self._init_model()
Mar 17 20:47:27 ubuntu python3[20862]:   File "/home/user1/anaconda3/lib/python3.7/site-packages/FV/F_V.py", line 84, in _init_model
Mar 17 20:47:27 ubuntu python3[20862]:     self.model.load_state_dict(checkpoint['model'])
Mar 17 20:47:27 ubuntu python3[20862]:   File "/home/user1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 803, in load_state_dict
Mar 17 20:47:27 ubuntu python3[20862]:     state_dict = state_dict.copy()
Mar 17 20:47:27 ubuntu python3[20862]:   File "/home/user1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 576, in __getattr__
Mar 17 20:47:27 ubuntu python3[20862]:     type(self).__name__, name))
Mar 17 20:47:27 ubuntu python3[20862]: AttributeError: 'DataParallel' object has no attribute 'copy'

I want to know how I can get around this error other than by doing something like this, which works:

self.model = checkpoint['model'].module

Thanks a lot in advance.

You could probably wrap the model into nn.DataParallel after loading the state_dict.

Thanks a lot.
Do you mean simply doing:

self.model = nn.DataParallel(self.model)
self.model.load_state_dict(checkpoint['model'])

But wouldn’t it crash when run in CPU mode?

(It still generates the same error, by the way.)

Ah, sorry, I misunderstood the error message and thought you were already wrapping the model in nn.DataParallel before.

If you want device-agnostic code (that runs on CPU, GPU, or multiple GPUs), I would recommend storing the state_dict of the unwrapped model, i.e. model.module.state_dict(), so that its keys carry no "module." prefix. That would be the cleanest approach.
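A minimal sketch of that recommendation, assuming a tiny stand-in network in place of the ResNet and two hypothetical helpers (unwrap and save_weights are illustrative names, not PyTorch APIs):

```python
import os
import tempfile

import torch
import torch.nn as nn


def unwrap(model):
    """Return the underlying module if `model` is wrapped in nn.DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model


def save_weights(model, path, **extra):
    """Save only the unwrapped state_dict plus any small metadata."""
    torch.save({'state_dict': unwrap(model).state_dict(), **extra}, path)


# Tiny stand-in network (hypothetical; replaces the thread's ResNet).
net = nn.Sequential(nn.Linear(4, 2))
wrapped = nn.DataParallel(net)  # constructing the wrapper also works without a GPU

path = os.path.join(tempfile.mkdtemp(), 'weights.tar')
save_weights(wrapped, path, use_se=False)

# Loading needs only the model definition, not the original pickled object.
checkpoint = torch.load(path, map_location='cpu')
fresh = nn.Sequential(nn.Linear(4, 2))
fresh.load_state_dict(checkpoint['state_dict'])
```

Because the saved keys have no "module." prefix, the same file loads into a plain model on CPU or into a model that is wrapped in nn.DataParallel afterwards.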

Thanks, but my current issue is that the pretrained model I’m dealing with was saved like this by someone else; I didn’t train it myself! Now I’m looking for a way to get around this issue.

In that case your suggested approach using checkpoint['model'].module seems to be the best way.
Also, was the nn.DataParallel model stored directly or just the state_dict?

It was stored directly, like this:

def save_checkpoint(epoch, epochs_since_improvement, model, metric_fc, optimizer, acc, is_best):
    print('saving checkpoint ...')
    state = {'epoch': epoch,
             'epochs_since_improvement': epochs_since_improvement,
             'acc': acc,
             'model': model,
             'metric_fc': metric_fc,
             'optimizer': optimizer}
    # filename = 'checkpoint_' + str(epoch) + '_' + str(loss) + '.tar'
    filename = 'checkpoint.tar'
    torch.save(state, filename)
    # If this checkpoint is the best so far, store a copy so it doesn't get overwritten by a worse checkpoint
    if is_best:
        torch.save(state, 'BEST_checkpoint.tar')

The issue with checkpoint['model'].module is that I don’t instantiate the model myself, and I can’t seem to find a way to know exactly which model is being loaded. That’s why I’m trying to instantiate the model myself and then load the weights. But as you can see, I hit a wall here!

I also tried:

self.model.load_state_dict(checkpoint['model'].module.state_dict())

which fails with:

File "/home/hossein/anaconda3/lib/python3.7/site-packages/FV/F_V.py", line 90, in _init_model
Mar 18 07:53:28 ubuntu python3[30473]:     self.model.load_state_dict(checkpoint['model'].module.state_dict())
Mar 18 07:53:28 ubuntu python3[30473]:   File "/home/user1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
Mar 18 07:53:28 ubuntu python3[30473]:     self.__class__.__name__, "\n\t".join(error_msgs)))
Mar 18 07:53:28 ubuntu python3[30473]: RuntimeError: Error(s) in loading state_dict for ResNet:
Mar 18 07:53:28 ubuntu python3[30473]:         Unexpected key(s) in state_dict: "layer1.0.se.fc.0.weight", "layer1.0.se.fc.0.bias", "layer1.0.se.fc.1.weight", etc...

even though all the keys are there!

Yeah, I see. Storing the model directly is not recommended, as you would need to recreate the same file and folder structure to load the model.
model.load_state_dict(checkpoint['model']) won’t work here, as checkpoint['model'] contains the whole model object rather than a state_dict.

Do you get errors using the following?

model = checkpoint['model']

Based on your description it seems you don’t have access to the model definition?

Doing

self.model = checkpoint['model']

doesn’t give any errors, but then the model weights don’t seem to be initialized (they look like random weights, as the network’s predictions are completely off). I’m not using CUDA here, by the way.

By the way, I do have access to the model definitions.

checkpoint['model'].module.state_dict() would then yield the same parameters as the stored model.
If the model wasn’t stored properly, its state_dict will contain those same incorrect parameters.
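One related detail: calling state_dict() on the nn.DataParallel wrapper itself, rather than on .module, prefixes every key with "module.". A minimal sketch of stripping that prefix, assuming a hypothetical helper named strip_module_prefix and placeholder values in place of tensors:

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix='module.'):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    return OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )


# Placeholder values stand in for tensors; the key names are illustrative.
wrapped_sd = OrderedDict([('module.conv1.weight', 1), ('module.fc.bias', 2)])
clean_sd = strip_module_prefix(wrapped_sd)
print(list(clean_sd))
```

Keys that do not start with the prefix are passed through unchanged, so the helper is safe to apply to an already-clean state_dict.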

OK, here is the answer.

self.model.load_state_dict(checkpoint['model'].module.state_dict())

actually works. It was failing earlier because I instantiated the model differently (assuming use_se was False, as it was in the original training script), so the keys differed.
Finding this out, thanks to dear God, solved the issue.
Here is the changed snippet:

use_se = False
# To see whether the use_se switch was used in the saved model,
# look for SE-block keys such as "layer1.0.se.fc.0.weight":
for k in checkpoint['model'].module.state_dict():
    if '.se.' in k:
        use_se = True
        break

if self.model_name == 'r18':
    self.model = resnet18(pretrained=False, use_se=use_se)
elif self.model_name == 'r50':
    self.model = resnet50(pretrained=False, use_se=use_se)
elif self.model_name == 'r101':
    self.model = resnet101(pretrained=False, use_se=use_se)
else:
    raise Exception(f"Model name: '{self.model_name}' is not recognized.")

# load the model weights
# since the model was trained using DataParallel, we need to use module
# to access the actual model and its parameters.
self.model.load_state_dict(checkpoint['model'].module.state_dict())
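For anyone debugging a similar "Unexpected key(s)" error, the mismatch can be made visible by comparing the two key sets before calling load_state_dict. This is only a sketch: diff_keys is a hypothetical helper and the key names below are made up for illustration.

```python
def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, in the spirit of load_state_dict's report."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)     # expected by the model, absent from the checkpoint
    unexpected = sorted(checkpoint_keys - model_keys)  # in the checkpoint, unknown to the model
    return missing, unexpected


# Hypothetical key names; in practice pass self.model.state_dict().keys()
# and checkpoint['model'].module.state_dict().keys().
model_keys = ['layer1.0.conv1.weight', 'fc.weight']
checkpoint_keys = ['layer1.0.conv1.weight', 'layer1.0.se.fc.0.weight', 'fc.weight']

missing, unexpected = diff_keys(model_keys, checkpoint_keys)
print('missing:', missing)
print('unexpected:', unexpected)
```

Unexpected keys that all share a component such as ".se." suggest the freshly built model is missing a whole block type, which is what happened here.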

Update:
As @ptrblck said, doing this still has issues, such as the dependency on the original module/file/directory hierarchy. To load such models properly, either convert them to TorchScript and load that, or re-save only the state_dict.
That is:

Solution 1:

Try converting your model to TorchScript and then use that instead:

def convert_model(model, example_input=torch.rand(1, 3, 112, 112)):
    # Trace the (unwrapped) model in eval mode and save it as TorchScript
    model.eval()
    traced = torch.jit.trace(model, example_input)
    torch.jit.save(traced, '/home/Rika/Documents/models/model.tjm')

and then load this version instead:

# load the model 
self.model = torch.jit.load('/home/Rika/Documents/models/model.tjm')
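A self-contained round trip may make this clearer. It is only a sketch: a tiny stand-in model replaces the ResNet, and the file goes to a temporary directory rather than the path above.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in model (hypothetical); the thread's model is a ResNet taking 1x3x112x112 input.
model = nn.Sequential(nn.Conv2d(3, 4, kernel_size=3), nn.ReLU()).eval()
example = torch.rand(1, 3, 112, 112)

# Trace with an example input and save the TorchScript archive.
traced = torch.jit.trace(model, example)
path = os.path.join(tempfile.mkdtemp(), 'model.tjm')
torch.jit.save(traced, path)

# torch.jit.load needs no Python class definition for the model.
loaded = torch.jit.load(path, map_location='cpu')
with torch.no_grad():
    same = torch.allclose(loaded(example), model(example))
print(same)
```

Note that tracing records the operations run for the example input, so models with data-dependent control flow may need torch.jit.script instead.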

Solution 2:

Simply save the model’s state_dict() again and use that instead.
I myself ended up doing:

self.model = checkpoint['model'].module
# create the new checkpoint based on what you need 
torch.save({'state_dict' : self.model.state_dict(), 'use_se':True},
            '/home/Rika/Documents/BEST_checkpoint_r18_2.tar')

and started using the new checkpoint; so far everything has been good.