On a CPU device, how do I load a checkpoint saved on a GPU device?

apaszke · February 5, 2017, 11:23am

You can remap the Tensor location at load time using the map_location argument to torch.load. For example this will forcefully remap everything onto CPU:

torch.load('my_file.pt', map_location=lambda storage, location: 'cpu')

While this will only map storages from GPU0:

torch.load('my_file.pt', map_location={'cuda:0': 'cpu'})

mromaniuk · February 17, 2017, 12:50pm

I’m trying to load a GPU-trained model onto a CPU with the code you suggested:

torch.load('my_file.pt', map_location=lambda storage, location: 'cpu')

… and I get this error:

Traceback (most recent call last):
  File "net_predict.py", line 146, in <module>
    net = torch.load(f_net, map_location=(lambda storage, location: 'cpu'))
  File "/home/[...]/anaconda2/lib/python2.7/site-packages/torch/serialization.py", line 248, in load
    return _load(f, map_location, pickle_module)
  File "/home/[...]/anaconda2/lib/python2.7/site-packages/torch/serialization.py", line 340, in _load
    tensor = tensor_type._new_with_metadata_file(f, storage)
AttributeError: type object 'str' has no attribute '_new_with_metadata_file'

(I replaced my username with […])

Any idea what I’m doing wrong?

apaszke · February 17, 2017, 12:59pm

I’m sorry, my bad. This should work:

torch.load('my_file.pt', map_location=lambda storage, loc: storage)

mromaniuk · February 17, 2017, 1:42pm

It works - brilliant!

Out of curiosity: could you explain what this does? I’m not sure how it knows to remap storage to CPU, since the lambda returns the storage it got as an argument.

apaszke · February 17, 2017, 2:46pm

Sure. map_location can be either a dict or a function. With a dict, storages whose serialized location matches a key are remapped to the corresponding value. With a function, it will be called with a CPU storage and that storage's serialized location, and it should return a storage to use in place of the CPU one. If you just want to load everything onto the CPU, you can simply return the first argument, but you could also do crazier things, such as sending all CUDA tensors to the next GPU by parsing the original device out of the loc argument.
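A rough sketch of the "next GPU" idea described above might look like the following. It assumes saved CUDA locations have the form "cuda:<index>" and that wrapping around to GPU 0 after the last device is acceptable; next_gpu_index and make_map_location are hypothetical helper names, not PyTorch functions. The only PyTorch call it relies on is storage.cuda(index).

```python
import re

# Serialized locations look like "cpu" or "cuda:<index>" (e.g. "cuda:0").
_CUDA_LOCATION = re.compile(r"cuda:(\d+)")


def next_gpu_index(loc, num_gpus):
    """Return the index of the GPU after the one named in loc, wrapping around.

    Returns None when loc is not a CUDA location or no GPU is available.
    """
    match = _CUDA_LOCATION.fullmatch(loc)
    if match is None or num_gpus == 0:
        return None
    return (int(match.group(1)) + 1) % num_gpus


def make_map_location(num_gpus):
    """Build a map_location callable for torch.load (hypothetical helper).

    torch.load calls it as map_location(storage, loc), where storage has
    already been deserialized on the CPU and loc is its saved location tag.
    """
    def map_location(storage, loc):
        target = next_gpu_index(loc, num_gpus)
        if target is None:
            return storage  # keep the CPU copy, like lambda storage, loc: storage
        return storage.cuda(target)  # move this storage to the chosen GPU
    return map_location
```

It would be used as torch.load('my_file.pt', map_location=make_map_location(torch.cuda.device_count())). On a machine without GPUs, device_count() returns 0, so every storage stays on the CPU, which is the same behaviour as returning the first argument.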

cyyyyc123 · March 7, 2017, 3:23am

@apaszke Hi! Sorry to reopen this thread. I ran into a problem when I used the method above to load a GPU-trained model in CPU mode. The code fragment is:

import torch

encoder = torch.load('encoder.pt', map_location=lambda storage, loc: storage)
decoder = torch.load('decoder.pt', map_location=lambda storage, loc: storage)

encoder.cpu()
decoder.cpu()

And the error I met was:

The full code can be viewed at seq2seq-translation/eval.py

How can I load a GPU-trained model on a CPU device (without any GPUs) correctly? Thank you for your great work!

apaszke · March 7, 2017, 11:12am

Hey, no problem! I only have a couple more questions:

What’s your PyTorch version? Does torch.__version__ exist in your install? If not, when did you install it?
When did you create that checkpoint?

cyyyyc123 · March 7, 2017, 12:29pm

Good morning!

My PyTorch version is 0.1.9+b46d5e0. I compiled PyTorch from source because I wanted to try half tensors with stateless methods. (You mentioned it in this pull request. Excellent!)
I created the checkpoint about 12 hours ago, also with PyTorch 0.1.9+b46d5e0.

Thank you very much!

cyyyyc123 · March 7, 2017, 12:42pm

I have uploaded some test data to my GitHub repo. If you have time, maybe you can try it:

train a model:
python train_attn.py
load the model and do some inferences:
python eval.py

Maybe this provides some useful information for solving the problem. Thank you!

cyyyyc123 · March 7, 2017, 2:16pm

@apaszke I suspect it may be a version problem: I can’t reproduce this error with PyTorch version 0.1.9_2. Thanks!

gaoking132 · March 31, 2017, 2:01am

Sorry to reopen the thread.

After running the code:

params = torch.load(input_file, lambda storage, loc: storage)

I ran into the same problem Yangyu reported earlier. The error message is:

TypeError: set_ received an invalid combination of arguments - got (torch.FloatStorage, int, tuple, tuple), but expected one of:
  no arguments
  (torch.cuda.FloatTensor source)
  (torch.cuda.FloatStorage storage)
  (torch.cuda.FloatStorage sourceStorage, int storage_offset, int … size)
      didn’t match because some of the arguments have invalid types: (!torch.FloatStorage!, int, !tuple!, !tuple!)
  (torch.cuda.FloatStorage sourceStorage, int storage_offset, torch.Size size)
  (torch.cuda.FloatStorage sourceStorage, int storage_offset, torch.Size size, tuple strides)

I just updated PyTorch to the latest version on the master branch. The version number is 0.1.11+761eef1. Any idea why?

Thanks,
Yaozong

denis_64 · March 31, 2017, 11:49am

Hello, I tried to load a snapshot from GPU training to run it in CPU mode, but ran into the same problem described above. Of course, I tried the advice given here, but it had no effect.

torch.load('./snapshots/cpu_final_snapshot.pth', map_location=lambda storage, loc: storage)

I have the following traceback:

Traceback (most recent call last):
  File "predict.py", line 39, in <module>
    params = torch.load('./snapshots/cpu_final_snapshot.pth', map_location=lambda storage, loc: storage)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/torch/serialization.py", line 222, in load
    return _load(f, map_location, pickle_module)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/torch/serialization.py", line 370, in _load
    result = unpickler.load()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/torch/cuda/__init__.py", line 279, in __new__
    _lazy_init()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/torch/cuda/__init__.py", line 96, in _lazy_init
    _check_driver()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/torch/cuda/__init__.py", line 63, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")

torch.__version__
'0.1.10_1'

Any help would be appreciated.

gaoking132 · March 31, 2017, 5:31pm

I think I found what causes the “invalid combination of arguments” error.

Yesterday I took a model trained with PyTorch 0.1.9 and loaded it onto the CPU with the latest version, 0.1.11. The error appeared.

Today I retrained the model with 0.1.11 and loaded it with the same version. Everything works.

So I suspect that models saved by different PyTorch versions are not fully compatible with each other.

Eugenio_Culurciello · July 29, 2017, 3:33pm

We trained an AlexNet with the PyTorch ImageNet example (https://github.com/pytorch/examples/blob/master/imagenet/main.py) and struggled to convert the model for CPU-only inference. Here is a solution for AlexNet:
https://github.com/e-lab/pytorch-toolbox/blob/master/convert-save-load.md
It would be nice to have something more generic…
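One possibly more generic step is renaming the parameters. The ImageNet example wraps models in torch.nn.DataParallel (for AlexNet, only model.features), and that wrapper adds a "module" component to every parameter name, so the saved names no longer match a plain, unwrapped model. The sketch below assumes that is the mismatch being fought; strip_data_parallel_prefix is a hypothetical helper, not a PyTorch API.

```python
from collections import OrderedDict


def strip_data_parallel_prefix(state_dict):
    """Drop the "module" name components that torch.nn.DataParallel inserts.

    Handles a fully wrapped model ("module.fc.weight" -> "fc.weight") and a
    wrapped submodule ("features.module.0.weight" -> "features.0.weight").
    Caveat: a submodule that is genuinely named "module" would also be renamed.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        parts = [part for part in key.split(".") if part != "module"]
        cleaned[".".join(parts)] = value
    return cleaned
```

Assuming the checkpoint is a dict with a 'state_dict' entry (the layout the ImageNet example saves), the CPU-side flow would be roughly: checkpoint = torch.load(path, map_location=lambda storage, loc: storage), then model.load_state_dict(strip_data_parallel_prefix(checkpoint['state_dict'])) on an unwrapped model such as torchvision.models.alexnet().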

hazelnutsgz · January 10, 2019, 7:30am

With torch 1.0.0, the code given above produces the following:
torch.load('save/best_BiLSTMCRF_pos_2019-01-10 12-42-50', map_location=lambda storage, location: 'cpu')
Traceback (most recent call last):
  File "", line 1, in
  File "/home/jiaxin/.local/lib/python3.6/site-packages/torch/serialization.py", line 367, in load
    return _load(f, map_location, pickle_module)
  File "/home/jiaxin/.local/lib/python3.6/site-packages/torch/serialization.py", line 538, in _load
    result = unpickler.load()
  File "/home/jiaxin/.local/lib/python3.6/site-packages/torch/_utils.py", line 135, in _rebuild_tensor_v2
    tensor = _rebuild_tensor(storage, storage_offset, size, stride)
  File "/home/jiaxin/.local/lib/python3.6/site-packages/torch/_utils.py", line 129, in _rebuild_tensor
    module = importlib.import_module(storage.module)
AttributeError: 'str' object has no attribute 'module'

Is anything wrong with the new version of PyTorch?

colllin · February 22, 2019, 3:00am

I had the same problem. See the replies above about using map_location=lambda storage, location: storage instead of returning 'cpu'.

Zayd · April 4, 2019, 5:49am

If you want to force everything onto the CPU, you can drop the lambda and simply pass map_location='cpu':

torch.load('save/best_BiLSTMCRF_pos_2019-01-10 12-42-50', map_location='cpu')

This is discussed in the report for issue #9139.

Willem · December 12, 2019, 7:39pm

Sorry for reviving this post. I have a closely related question: I want to do exactly the same thing, but with the C++ front-end. That is, I want to save a model trained with the C++ front-end on a GPU, and then load it with the C++ front-end on a CPU device.

Is this possible? The documentation for torch::load does not mention a map_location option. Thanks for any help.

Towsif_Ahamed · May 15, 2020, 11:36am

torch.load(WEIGHTS_FILE, map_location=torch.device('cpu'))
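The replies in this thread give several spellings of the same request. Without a GPU at hand, a round trip through an in-memory buffer (io.BytesIO) at least checks that each spelling is accepted and yields CPU tensors. Because the tensors here start out on the CPU, this is a syntax check rather than a test of GPU-to-CPU remapping. It assumes a PyTorch version recent enough to accept the string and torch.device forms; round_trip is a hypothetical helper name.

```python
import io

import torch


def round_trip(obj, map_location):
    """Save obj to an in-memory buffer and load it back with map_location."""
    buffer = io.BytesIO()
    torch.save(obj, buffer)
    buffer.seek(0)
    return torch.load(buffer, map_location=map_location)


checkpoint = {"weight": torch.randn(2, 3), "bias": torch.zeros(3)}

# Ways of asking torch.load to place every storage on the CPU.
cpu_forms = [
    "cpu",                         # plain device string
    torch.device("cpu"),           # torch.device object
    lambda storage, loc: storage,  # callable: keep the CPU copy it receives
    {"cuda:0": "cpu"},             # dict: only remaps storages saved on GPU 0
]

for form in cpu_forms:
    loaded = round_trip(checkpoint, form)
    assert all(t.device.type == "cpu" for t in loaded.values())
```

Note that the dict form only covers the keys it lists, so a checkpoint saved on, say, cuda:1 would need its own entry, while the string, torch.device and callable forms apply to every storage.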

newman_chiang · May 26, 2021, 3:39pm

Hi,

I’m using torch 1.8.1+cu101.
I train a model on GPU, save it with torch.save(…), and load it back on CPU with torch.load(…, map_location='cpu').
However, the predictions on CPU are completely different from those on GPU.
When I checked the loaded model parameters, they also differ between CPU and GPU.
Why is that?
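To narrow down where the parameters diverge, one could compare the GPU model's state_dict (taken just before torch.save) with the state_dict loaded on the CPU, name by name. The sketch below is a hypothetical helper (compare_state_dicts is not a PyTorch function) and assumes both dictionaries map parameter names to tensors; it moves everything to the CPU in double precision before comparing.

```python
import torch


def compare_state_dicts(sd_a, sd_b, atol=1e-6):
    """Report names missing from either dict and names whose tensors differ."""
    only_a = sorted(set(sd_a) - set(sd_b))
    only_b = sorted(set(sd_b) - set(sd_a))
    differing = []
    for name in sorted(set(sd_a) & set(sd_b)):
        # Compare on the CPU in float64 so device and dtype do not get in the way.
        a = sd_a[name].detach().cpu().double()
        b = sd_b[name].detach().cpu().double()
        if a.shape != b.shape or not torch.allclose(a, b, atol=atol):
            differing.append(name)
    return only_a, only_b, differing
```

If all three lists come back empty for model.state_dict() on the GPU and torch.load(path, map_location='cpu') on the CPU, the saved and loaded parameters match, which would point away from the save/load step as the source of the difference.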

Thanks