ERROR - Exception caught while uploading models: Error(s) in loading state_dict for Ensemble: Unexpected key(s) in state_dict

I was trying to load yolov5s.pt on the CPU first and then transfer it to the GPU in the following way:

import torch
import torch.nn as nn


class Ensemble(nn.ModuleList):
    # Ensemble of models
    def __init__(self):
        super(Ensemble, self).__init__()

    def forward(self, x, augment=False):
        y = []
        for module in self:
            y.append(module(x, augment)[0])  # collect each member's inference output
        # y = torch.stack(y).max(0)[0]  # max ensemble
        # y = torch.stack(y).mean(0)  # mean ensemble
        y = torch.cat(y, 1)  # nms ensemble
        return y, None  # inference, train output

model = Ensemble()
params = torch.load('yolov5s.pt', map_location='cpu')
model.load_state_dict(params)
ckpt = model.to('cuda:0')

I referenced this from the link, but it gives me the following error:

ERROR - Exception caught while uploading models: Error(s) in loading state_dict for Ensemble:
Unexpected key(s) in state_dict: "epoch", "best_fitness", "training_results", "model", "optimizer", "wandb_id".

Help needed, Thanks

params seems to be a dict object containing the model's state_dict in its ['model'] key as well as other objects, such as the optimizer's state_dict in ['optimizer'], the training results, etc., as given in the error message.
Index into the 'model' key and it might work, e.g.:
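
A minimal sketch of what I mean, assuming the 'model' entry actually holds a state_dict:

params = torch.load('yolov5s.pt', map_location='cpu')
model.load_state_dict(params['model'])  # pass the nested state_dict, not the whole checkpoint dict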

Hi, thanks for your response.
I tried the above solution, but it did not work.

Could you describe a bit more what exactly did not work?

Yes, while checking the above solution, I got the following error:

ERROR - Exception caught while uploading models: 'Model' object has no attribute 'copy'

File "/home/user/anaconda3/envs/dali/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1455, in load_state_dict
    state_dict = state_dict.copy()
  File "/home/user/anaconda3/envs/dali/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Model' object has no attribute 'copy'

Can you please help and check?

It seems you've saved the model object directly instead of its state_dict, so you could try to assign the model directly via model = params['model'].
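
A minimal sketch of this suggestion, assuming the checkpoint pickled the whole model object under 'model':

params = torch.load('yolov5s.pt', map_location='cpu')
model = params['model']  # use the pickled model object directly
model.to('cuda:0')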

I have got the above dict (with keys such as 'model', 'optimizer', etc.) as params from torch.load('yolov5s.pt', map_location='cpu').
I have used model = params['model'] and got the following issue:

 File "models/load.py", line 73, in attempt_load
    model.append(ckpt['ema' if ckpt.get('ema') else 'model'].float().fuse().eval()) 
  File "/home/user/anaconda3/envs/dali/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1177, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Model' object has no attribute 'append'

I assume models/load.py and attempt_load are custom methods, so you would need to check your code to see how the model loading is expected to work.
Based on the new error it seems as if the used model object is supposed to be a list, e.g.:
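
A minimal sketch of the expected flow, pieced together from the traceback above (.float()/.eval() mirror your attempt_load snippet; .fuse() is omitted here as it is specific to the YOLOv5 Model class):

model = Ensemble()  # keep the Ensemble (an nn.ModuleList); don't reassign it
ckpt = torch.load('yolov5s.pt', map_location='cpu')
model.append(ckpt['ema' if ckpt.get('ema') else 'model'].float().eval())  # append the loaded model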


While loading the model on the CPU, it only uses 17.5 MiB of memory;
see the following profiling result (check line number 35):

Line Mem usage Increment Occurrences Line Contents

30    386.0 MiB    386.0 MiB           1   @profile
31                                         def attempt_load(weights, map_location=None):
32                                             # Loads an ensemble of models weights=[a,b,c] or a single model weights=[a] or weights=a
33    386.0 MiB      0.0 MiB           1       model = Ensemble()
34    422.7 MiB      0.0 MiB           2       for w in weights if isinstance(weights, list) else [weights]:
35    403.6 MiB     17.5 MiB           1           ckpt = torch.load(w, map_location='cpu')

Now, I tried to load the same model on the GPU (cuda:0), but the memory behavior changed drastically:
the CPU RAM surges by 2713.8 MiB; see the profiling result below (line no. 35):

Line Mem usage Increment Occurrences Line Contents

30    385.5 MiB    385.5 MiB           1   @profile
31                                         def attempt_load(weights, map_location=None):
32                                             # Loads an ensemble of models weights=[a,b,c] or a single model weights=[a] or weights=a
33    385.5 MiB      0.0 MiB           1       model = Ensemble()
34   3099.6 MiB      0.0 MiB           2       for w in weights if isinstance(weights, list) else [weights]:
35  3099.3 MiB   2713.8 MiB            1           ckpt = torch.load(w, map_location='cuda:0')

That's why I was looking for a solution (referred from the link) to load the model on the CPU first and then transfer it to the GPU. Following the above discussion I was trying to solve it via the state_dict, but haven't resolved that yet. I want to understand why the memory surges even though I changed the load device.

Assuming you are profiling the host RAM, the memory increase would be caused by loading all CUDA libs and is expected.
The linked issue targets an OOM on the device, which I don’t know if you are running into.
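
A quick way to see this (a sketch, assuming psutil is available): the host-RAM jump happens as soon as the CUDA context is created, independent of the checkpoint size.

import os

import psutil
import torch

proc = psutil.Process(os.getpid())
print(f"before CUDA init: {proc.memory_info().rss / 2**20:.1f} MiB")
_ = torch.empty(1, device='cuda')  # forces CUDA context creation and loads the CUDA libs
print(f"after CUDA init: {proc.memory_info().rss / 2**20:.1f} MiB")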

Hi @ptrblck, thanks for the response.
As discussed and suggested, I tried the following way to get the state_dict:

model.load_state_dict(params['model'].state_dict(), strict=False)

With the above approach I was able to load the state_dict, which contains the weights and biases, into the existing model via load_state_dict. This resolved my above-mentioned state_dict issue.
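
For reference, load_state_dict returns the incompatible keys, so with strict=False one can check what was actually matched:

incompatible = model.load_state_dict(params['model'].state_dict(), strict=False)
print(incompatible.missing_keys)     # keys the model expected but the checkpoint lacked
print(incompatible.unexpected_keys)  # checkpoint keys the model did not match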

Now I want to move the same model, which I already loaded on the CPU, to the GPU, but the model does not migrate from CPU to GPU. I don't get any error, but the model stays on the CPU even though I tried to move it in the following way:

params = torch.load('yolov5s.pt', map_location='cpu')
device = torch.device("cuda")
model.load_state_dict(params['model'].state_dict(), strict=False)
model.to(device)

Can you suggest how I should move the CPU-loaded model to the GPU?

model.to(device) will move all registered parameters, buffers, and submodules to the specified device as seen here:

from torchvision import models

model = models.resnet18()

for name, param in model.named_parameters():
    print(name, param.device)
# > conv1.weight cpu
#   bn1.weight cpu
#   bn1.bias cpu
#   ...

for name, buf in model.named_buffers():
    print(name, buf.device)
# > bn1.running_mean cpu
#   bn1.running_var cpu
#   bn1.num_batches_tracked cpu
#   ...

model.to('cuda')

for name, param in model.named_parameters():
    print(name, param.device)
# > conv1.weight cuda:0
#   bn1.weight cuda:0
#   bn1.bias cuda:0
#   ...

for name, buf in model.named_buffers():
    print(name, buf.device)
# > bn1.running_mean cuda:0
#   bn1.running_var cuda:0
#   bn1.num_batches_tracked cuda:0
#   ...

If this mechanism is not working in your model, I guess you have somehow “unregistered” the parameters, buffers, etc.

I have tested the model in the above-mentioned way.
The observation is: model.named_parameters() and model.named_buffers() show blank results; there are no weights and biases inside. It looks like, as you mentioned, they are “unregistered”.
So how can I resolve this and register those?

Did you manipulate the internal parameters, buffers, or submodules somehow, e.g. by using setattr?
Parameters, buffers, and submodules (containing both) are directly registered once they are assigned to the parent module via:

self.param_name = nn.Parameter(torch.randn(shape))  # registers a parameter
self.register_buffer('name', torch.randn(shape))  # registers a buffer
self.submodule = nn.Linear(1, 1)  # registers a submodule (and its own params/buffers)
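
For completeness, a small self-contained sketch of these three registration mechanisms (the module and attribute names are made up for illustration); once registered this way, model.to(device) will move everything:

import torch
import torch.nn as nn


class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(2, 2))  # registered parameter
        self.register_buffer('running_stat', torch.zeros(2))  # registered buffer
        self.submodule = nn.Linear(1, 1)  # registered submodule


m = MyModule()
print(list(dict(m.named_parameters()).keys()))
# ['weight', 'submodule.weight', 'submodule.bias']
print(list(dict(m.named_buffers()).keys()))
# ['running_stat']

If named_parameters() is empty on your Ensemble, it is most likely still an empty ModuleList, i.e. no submodule was ever appended to it, so there was nothing to register or to move.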