There is some chatter online that I can’t deepcopy a model… Is this right?
Additionally, is there a way after loading a model to move it between cpu and gpu?
You can deepcopy a model:
import copy
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
model_copy = copy.deepcopy(model)
with torch.no_grad():
    model.weight.fill_(1.)
print(model.weight)
> Parameter containing:
tensor([[1.]], requires_grad=True)
print(model_copy.weight)
> Parameter containing:
tensor([[-0.5596]], requires_grad=True)
To move a model, just call:
model.to('cuda:0') # moves model (its parameters) to GPU0
model.to('cpu') # moves model to CPU
Hey! Many thanks for the quick reply
Pleased about the deepcopy.
So the model.to syntax…
I was under the impression this didn’t move the model, just changed its formatting, as the documentation here (https://pytorch.org/tutorials/beginner/saving_loading_models.html) suggests:
“converts the initialized model to a CUDA optimized model using model.to(torch.device('cuda')).”
It sounds like the map_location bit in torch.load(PATH, map_location=device) specifies where the model is?
The map_location argument specifies where to put the loaded parameters.
E.g. if you saved the model.state_dict() of a model that had been pushed to the GPU and would like to load this state_dict on a CPU-only machine, you could specify map_location='cpu' to restore the parameters.
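As a rough illustration of the map_location behaviour described above, here is a minimal sketch. It assumes a tiny nn.Linear model and uses an in-memory io.BytesIO buffer as a stand-in for a checkpoint file; both are placeholders, not part of the original question.

```python
import io

import torch
import torch.nn as nn

# Save a state_dict into an in-memory buffer (a stand-in for a file path).
model = nn.Linear(1, 1)
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps every stored tensor to the given device while loading,
# so a checkpoint written from GPU tensors can be read on a CPU-only machine.
state_dict = torch.load(buffer, map_location="cpu")

restored = nn.Linear(1, 1)
restored.load_state_dict(state_dict)
devices = {str(p.device) for p in restored.parameters()}
print(devices)
```

With a real file you would pass the path instead of the buffer; the map_location argument is used the same way.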
Does that mean that if I load a model from file (saved using torch.save(model) on either a CPU or a GPU) and set map_location='cpu' when loading, I don’t need to call .to if I’m using it on the CPU? I only need to call .to if I’m moving it onto the GPU again?
I would recommend saving and loading the model.state_dict(), not the model directly.
That being said, I prefer to push the model to the CPU first before saving the state_dict.
This approach makes sure that I’m able to restore the model on all systems, even when no GPU is available.
After loading the model, I use model.to('cuda') to push it to the GPU again.
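The workflow above could look roughly like the sketch below. The nn.Linear model and the in-memory buffer are placeholders, and copying each tensor with .cpu() is just one possible way to get a CPU-only state_dict without moving the model itself.

```python
import io

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(1, 1).to(device)

# Copy every tensor of the state_dict to the CPU before saving;
# the model itself stays on whatever device it was using.
cpu_state = {name: tensor.cpu() for name, tensor in model.state_dict().items()}
buffer = io.BytesIO()  # stand-in for a checkpoint file
torch.save(cpu_state, buffer)
buffer.seek(0)

# Restore: rebuild the model on the CPU, load the weights,
# then push it to the GPU only if one is available.
restored = nn.Linear(1, 1)
restored.load_state_dict(torch.load(buffer))
restored.to(device)
```

Because the saved tensors already live on the CPU, loading needs no map_location here.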
this is sufficient:
model = nn.Linear(1, 1)
model_copy = copy.deepcopy(model)
What about this code:
`
model = nn.Linear(1, 1)
model.to('cuda:0')
model_copy = copy.deepcopy(model)
`
Will this create a copy on the GPU, or will I now have two models pointing to the same location on the GPU? (Hopefully not.)
thanks
Hi, I have a custom model that cannot be deepcopied. What are the main reasons a model cannot be deepcopied? Could you shed some light on this so I can debug my model? Thanks a lot for your help.
The custom model is the SqueezeNet SSD-Lite model from this repo (GitHub - qfgaohao/pytorch-ssd: MobileNetV1, MobileNetV2, VGG based SSD/SSD-lite implementation in Pytorch 1.0 / Pytorch 0.4).
The squeezenet model is defined under vision/nn
The ssd model is defined under vision/ssd
if I do:
base_net = squeezenet1_1(False).features
I can deep copy base_net no problem:
dbcpy = copy.deepcopy(base_net)
But if I do
net = create_squeezenet_ssd_lite(no_classes, is_test=True)
then I cannot deepcopy the net:
dnet = copy.deepcopy(net)
It reports the following error:
Traceback (most recent call last):
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 3129, in <module>
    main()
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 3122, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 2195, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 2202, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/_pydev_imps/_pydev_execfile.py", line 25, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/paul/pytorch/od-ssd/ssd-quantized/test_3d.py", line 50, in <module>
    dnet = copy.deepcopy(net)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle module objects
The reason I am asking is that when I do QAT (Quantization Aware Training) and try to save the quantized model using:
net.eval()
net_int8 = torch.quantization.convert(net)
net_int8.save(model_path)
I encounter the above deepcopy error. So what should I do? Fix the model to make it deepcopyable? Or does QAT not work for the SSD model? Could you point me in the right direction? Thanks a lot for your help!
Based on the error message, it seems that pickle fails to copy the object because one of its attributes references a Python module. I don’t know if this comes from QAT or from your custom model, but could you check for module references inside your model?
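To make the suspected cause concrete, here is a small sketch using only the standard library. The Holder class and the find_module_attributes helper are hypothetical names made up for this illustration; Holder stands in for a model that stores a module object as an attribute.

```python
import copy
import math
import types


class Holder:
    """Hypothetical stand-in for a model that keeps a module as an attribute."""

    def __init__(self):
        self.scale = 2.0
        self.config = math  # a module object, not a plain value


def find_module_attributes(obj):
    """Return the names of instance attributes that are Python modules."""
    return [
        name
        for name, value in vars(obj).items()
        if isinstance(value, types.ModuleType)
    ]


holder = Holder()
try:
    copy.deepcopy(holder)
    deepcopy_failed = False
except TypeError as exc:
    deepcopy_failed = True
    print("deepcopy failed:", exc)

module_attrs = find_module_attributes(holder)
print(module_attrs)
```

For an nn.Module, one option would be to run such a helper on every submodule returned by model.modules(), since each submodule keeps its own attribute dictionary.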
Thanks for your help. I looked into the create_squeezenet_ssd_lite() code below. Do you think ModuleList() is causing the problem? If yes, how do I fix this kind of issue and make the model created by this code deepcopyable? This style of model creation is typical for SSD networks. Does this imply QAT doesn’t work for SSD-type networks? Thanks again for your help!
def create_squeezenet_ssd_lite(num_classes, is_test=False):
    base_net = squeezenet1_1(False).features  # disable dropout layer

    source_layer_indexes = [
        12
    ]
    extras = ModuleList([
        Sequential(
            Conv2d(in_channels=512, out_channels=256, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=256, out_channels=512, kernel_size=3, stride=2, padding=2),
        ),
        Sequential(
            Conv2d(in_channels=512, out_channels=256, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=256, out_channels=512, kernel_size=3, stride=2, padding=1),
        ),
        Sequential(
            Conv2d(in_channels=512, out_channels=128, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2, padding=1),
        ),
        Sequential(
            Conv2d(in_channels=256, out_channels=128, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2, padding=1),
        ),
        Sequential(
            Conv2d(in_channels=256, out_channels=128, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2, padding=1)
        )
    ])

    regression_headers = ModuleList([
        SeperableConv2d(in_channels=512, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * 4, kernel_size=3, padding=1),
        Conv2d(in_channels=256, out_channels=6 * 4, kernel_size=1),
    ])

    classification_headers = ModuleList([
        SeperableConv2d(in_channels=512, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * num_classes, kernel_size=3, padding=1),
        Conv2d(in_channels=256, out_channels=6 * num_classes, kernel_size=1),
    ])

    return SSD(num_classes, base_net, source_layer_indexes,
               extras, classification_headers, regression_headers, is_test=is_test, config=config)
The nn.ModuleList looks correct, but I also don’t know if QAT might be causing this issue. Are you seeing the same error without using QAT?
“Are you seeing the same error without using QAT?”
Yes.
I found the issue “Can’t save a model with torch.save if model has a torch.Device attr” (#7545), so I tried removing the self.device attribute from the SSD class, but it still does not work.
I also tried running the pickle_trick() code you suggested in the link to examine where the error comes from:
net = create_squeezenet_ssd_lite(no_classes, is_test=True)
print(pf(pickle_trick(net)))
But the code crashes, reporting an exception caught within the exception, even when I increase the maximum depth to 10,000.
After a line-by-line elimination test, I finally found the root cause that makes SSD() not deepcopyable:
class SSD(nn.Module):
    def __init__(self, num_classes: int, base_net: nn.ModuleList, source_layer_indexes: List[int],
                 extras: nn.ModuleList, classification_headers: nn.ModuleList,
                 regression_headers: nn.ModuleList, is_test=False, config=None, device=None):
        """Compose a SSD model using the given components.
        """
        super(SSD, self).__init__()

        self.num_classes = num_classes
        self.base_net = base_net
        self.source_layer_indexes = source_layer_indexes
        self.extras = extras
        self.classification_headers = classification_headers
        self.regression_headers = regression_headers
        self.is_test = is_test
        self.config = config
If I comment out the “self.config = config” line, then SSD() is deepcopyable.
Where the config is a module imported from:
from .config import squeezenet_ssd_config as config
And the squeezenet_ssd_config is as below:
import numpy as np

from vision.utils.box_utils import SSDSpec, SSDBoxSizes, generate_ssd_priors

image_size = 300
image_mean = np.array([127, 127, 127])  # RGB layout
image_std = 128.0
iou_threshold = 0.45
center_variance = 0.1
size_variance = 0.2

specs = [
    SSDSpec(17, 16, SSDBoxSizes(60, 105), [2, 3]),
    SSDSpec(10, 32, SSDBoxSizes(105, 150), [2, 3]),
    SSDSpec(5, 64, SSDBoxSizes(150, 195), [2, 3]),
    SSDSpec(3, 100, SSDBoxSizes(195, 240), [2, 3]),
    SSDSpec(2, 150, SSDBoxSizes(240, 285), [2, 3]),
    SSDSpec(1, 300, SSDBoxSizes(285, 330), [2, 3])
]

priors = generate_ssd_priors(specs, image_size)
How should I modify the code to keep the “config” and still make SSD() deepcopyable? Any advice will be highly appreciated.
Thanks a lot for your help in advance.
Does squeezenet_ssd_config contain only the posted code, or also class definitions?
Based on the import statement I would assume it’s a class or some other object, but the posted code shows just executable Python code with more imports.
You could try to deepcopy each imported class and see if one of them fails for the previously mentioned reason.
Thank you so much for your help.
Eventually I worked around the problem as follows: instead of storing the module with “self.config = config”, I copy the values I need from it directly.
Was:
# in __init__
self.config = config
# … later in the body, the config was referenced as:
self.config.center_variance
self.config.size_variance
Changed to:
# in __init__
# self.config = config  # commented out
self.config_center_variance = config.center_variance
self.config_size_variance = config.size_variance
# … later in the code, no more references to self.config:
self.config_center_variance
self.config_size_variance
Then SSD() is deepcopyable, but QAT still has problems…
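A variation on this workaround that keeps a single config attribute is to snapshot the module’s plain values into a namespace object. The sketch below is only an illustration under assumptions: config_module is a fake module built on the fly to stand in for squeezenet_ssd_config, and snapshot_config and Detector are hypothetical names, not part of the pytorch-ssd repo.

```python
import copy
import types

# Hypothetical stand-in for the imported squeezenet_ssd_config module.
config_module = types.ModuleType("fake_ssd_config")
config_module.center_variance = 0.1
config_module.size_variance = 0.2
config_module.image_size = 300


def snapshot_config(module):
    """Copy a module's public, non-module, non-callable values into a plain namespace."""
    values = {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_")
        and not isinstance(value, types.ModuleType)
        and not callable(value)
    }
    return types.SimpleNamespace(**values)


class Detector:
    """Hypothetical stand-in for the SSD class."""

    def __init__(self, config):
        # Store plain data instead of the module object itself.
        self.config = snapshot_config(config)


detector = Detector(config_module)
detector_copy = copy.deepcopy(detector)
print(detector_copy.config.center_variance)
```

Existing accesses such as self.config.center_variance would keep working with this approach, because the namespace exposes the same attribute names as the module.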
Although SSD() is now deepcopyable, I still have a problem running QAT: torch.save(net_int8) fails with the following error:
    torch.save(net_int8, model_path)  # to save the entire model
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 370, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 443, in _legacy_save
    pickler.dump(obj)
AttributeError: Can't pickle local object '_with_args.<locals>._PartialWrapper'
The problematic code is:
net.eval()
net_int8 = torch.quantization.convert(net)
# torch.save(net_int8.state_dict(), model_path)
torch.save(net_int8, model_path)  # to save the entire model
If I only save the state_dict(), there is no problem. The problem only shows up when saving the entire model.
Debugging further, I found that the loaded net can be saved with torch.save, but once net.qconfig is set it cannot be saved anymore:
torch.save(net, "./net_before_qconfig")
net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.save(net, "./net_after_qconfig")
As shown above, torch.save(net, "./net_before_qconfig") works, but after setting net.qconfig the net can no longer be saved with torch.save… ;((
It fails with the same error:
    torch.save(net, "./net_after_qconfig")
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 370, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 443, in _legacy_save
    pickler.dump(obj)
AttributeError: Can't pickle local object '_with_args.<locals>._PartialWrapper'
Is this a QAT problem? Can torch.save not save the entire model after QAT? I need the entire model saved so I can port it to a microprocessor for speed-up testing. If this is a QAT limitation, what would be the workaround to still save the entire quantized model (not just the state_dict)? Thank you again for your help!
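The “Can't pickle local object” message is a general pickle limitation for objects whose class is defined inside a function, and it can be reproduced without PyTorch. The sketch below uses made-up names (make_wrapper, LocalWrapper, can_pickle) and only mimics the shape of the error; it does not show how PyTorch’s qconfig is implemented.

```python
import functools
import pickle


def make_wrapper():
    """Mimics a factory that defines a class locally and returns an instance of it."""

    class LocalWrapper:
        pass

    return LocalWrapper()


def can_pickle(obj):
    """Return True if pickle.dumps succeeds, False if pickle rejects the object."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError) as exc:
        print("pickle failed:", exc)
        return False


# An instance of a class defined inside a function cannot be pickled,
# because pickle cannot look the class up by its qualified name.
local_ok = can_pickle(make_wrapper())

# A functools.partial over an importable callable pickles fine.
partial_ok = can_pickle(functools.partial(int, base=2))
```

Since torch.save pickles the whole object when given a model, any attribute holding such a local object would presumably trigger the same failure, whereas a state_dict contains only tensors and plain values.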
Based on your latest debugging, the issue might be related to QAT.
I’m not deeply familiar with QAT or with its support for using torch.save on the model directly.
However, I would generally not recommend saving the model directly, as it can break in various ways.
The better way would be to save the state_dict (and additional configs etc.). When reloading, you would then recreate the model object and use load_state_dict.
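A minimal sketch of that recommendation, under assumptions: build_model is a hypothetical factory (in this thread its role would be played by create_squeezenet_ssd_lite), the checkpoint keys are arbitrary names, and an in-memory buffer stands in for the checkpoint file.

```python
import io

import torch
import torch.nn as nn


def build_model(in_features, out_features):
    """Hypothetical factory that recreates the model from plain arguments."""
    return nn.Linear(in_features, out_features)


model_kwargs = {"in_features": 4, "out_features": 2}
model = build_model(**model_kwargs)

# Store the weights together with the arguments needed to rebuild the model.
checkpoint = {"model_kwargs": model_kwargs, "state_dict": model.state_dict()}
buffer = io.BytesIO()  # stand-in for a checkpoint file
torch.save(checkpoint, buffer)
buffer.seek(0)

# Reload: recreate the model object first, then load the weights into it.
loaded = torch.load(buffer, map_location="cpu")
restored = build_model(**loaded["model_kwargs"])
restored.load_state_dict(loaded["state_dict"])
```

For a quantized model, the recreated object would presumably need to go through the same quantization preparation steps before load_state_dict, so that its parameter names and types match the saved ones.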
I am using PyTorch’s swa_utils, which internally calls deepcopy.
However, I get the following error:
> Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
This seems to be caused by the weight dropout scheme I am using, where the original weight matrix is moved into _raw. The code for weight dropout is here:
import torch
from torch.nn import Parameter
import torch.nn.functional as F


class WeightDrop(object):
    def __init__(self, name, dropout):
        self.name = name
        self.dropout = dropout

    def compute_weight(self, module):
        return F.dropout(
            getattr(module, self.name + "_raw"),
            p = self.dropout,
            training = module.training,
            inplace = False
        )

    @staticmethod
    def apply(module, name, dropout):
        for k, hook in module._forward_pre_hooks.items():
            if isinstance(hook, WeightDrop) and hook.name == name:
                raise RuntimeError(f"Cannot register two weight_dropout hooks with name '{name}'")
        fn = WeightDrop(name, dropout)
        weight = getattr(module, name)
        del module._parameters[name]
        ## creating _raw parameter
        module.register_parameter(name + "_raw", Parameter(weight.data))
        setattr(module, name, fn.compute_weight(module))
        module.register_forward_pre_hook(fn)
        return fn

    def remove(self, module):
        weight = module._parameters[self.name + "_raw"]
        delattr(module, self.name)
        del module._parameters[self.name + "_raw"]
        module.register_parameter(self.name, Parameter(weight.data))

    def __call__(self, module, inputs):
        if self.name in module._parameters:
            del module._parameters[self.name]
        setattr(module, self.name, self.compute_weight(module))


def weight_drop(module, name, dropout):
    WeightDrop.apply(module, name, dropout)
    return module


def apply_weight_drop(module, name, dropout):
    wdrop = weight_drop(module, name, dropout)
    fp = module.flatten_parameters

    def decorator(*args, **kwargs):
        device = getattr(module, name + "_raw").device
        setattr(module, name, getattr(module, name).to(device))
        return fp(*args, **kwargs)

    module.flatten_parameters = decorator
    return wdrop
Here is the code for applying deep copy on a GRU:
import copy
gru = torch.nn.GRU(10, 10)
gru_wd = apply_weight_drop(gru, "weight_hh_l0", 0.2)
gru_wd_copy = copy.deepcopy(gru_wd)
> RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment
Any ideas how to fix this error?
Thank you!
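The error message can be reproduced in isolation, which may help narrow things down. The sketch below is not a fix for swa_utils; it only shows, under the assumption that the attribute set via setattr(module, name, fn.compute_weight(module)) is the output of F.dropout on a Parameter, that such a derived tensor is not a graph leaf and that a detached version can be copied. The names raw and derived are illustrative.

```python
import copy

import torch
import torch.nn.functional as F

raw = torch.nn.Parameter(torch.ones(3))          # created by the user: a graph leaf
derived = F.dropout(raw, p=0.2, training=True)   # result of an op on raw: not a leaf

copy.deepcopy(raw)  # leaf tensors support deepcopy

try:
    copy.deepcopy(derived)
    derived_copy_failed = False
except RuntimeError as exc:
    derived_copy_failed = True
    print("deepcopy failed:", exc)

# Detaching cuts the tensor out of the autograd graph, so it can be copied.
detached_copy = copy.deepcopy(derived.detach())
```

One possible direction, untested against the GRU case above, would be to make sure the module holds only parameters and detached tensors at the moment deepcopy is called, and to let the forward pre-hook recompute the dropped weight afterwards.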
Any updates on this question?
I checked by doing the following:
model = nn.Linear(1, 1)
model.to('cuda:0')
model_copy = copy.deepcopy(model)
print(next(model_copy.parameters()).is_cuda)
Output:
True
So the copy should be on CUDA as well.
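To also address the worry about two models pointing to the same memory, the check can be extended a little. This is a sketch that falls back to the CPU when no GPU is present; the variable names are illustrative.

```python
import copy

import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(1, 1).to(device)
model_copy = copy.deepcopy(model)

# The copy lives on the same device as the original ...
same_device = model_copy.weight.device == model.weight.device
# ... but its weight tensor starts at a different memory address.
shares_memory = model_copy.weight.data_ptr() == model.weight.data_ptr()

# Changing the original in place leaves the copy untouched.
with torch.no_grad():
    model.weight.fill_(10.)
copy_unchanged = not torch.equal(model.weight, model_copy.weight)

print(same_device, shares_memory, copy_unchanged)
```

If the copy shared storage with the original, the in-place fill_ would show up in model_copy as well.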