Can I deepcopy a model?

ONTDave · July 31, 2019, 12:40pm

There is some chatter online that I can’t deepcopy a model… Is this right?
Additionally, is there a way after loading a model to move it between cpu and gpu?

ptrblck · July 31, 2019, 12:58pm

You can deepcopy a model:

import copy
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
model_copy = copy.deepcopy(model)

# Change the original in place; the copy must not be affected.
with torch.no_grad():
    model.weight.fill_(1.)

print(model.weight)
> Parameter containing:
tensor([[1.]], requires_grad=True)

print(model_copy.weight)
> Parameter containing:
tensor([[-0.5596]], requires_grad=True)

To move a model, just call:

model.to('cuda:0')  # moves model (its parameters) to GPU0
model.to('cpu')  # moves model back to the CPU
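
For reference, a minimal sketch of moving a module between devices. It falls back to the CPU when no GPU is visible; the `device` variable and the checks around it are only there for illustration. As far as I know, `nn.Module.to()` moves the parameters in place and returns the module itself, unlike `Tensor.to()`, which returns a new tensor.

```python
import torch
import torch.nn as nn

# Pick the GPU when one is visible, otherwise stay on the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = nn.Linear(1, 1)
returned = model.to(device)       # moves the parameters in place...
same_object = returned is model   # ...and returns the module itself

# A module has no .device attribute; inspect one of its parameters instead.
param_device = next(model.parameters()).device
print(same_object, param_device)

model.to("cpu")  # move it back
back_on_cpu = next(model.parameters()).device.type == "cpu"
```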

ONTDave · July 31, 2019, 1:05pm

Hey! Many thanks for the quick reply
Pleased about the deepcopy.
So the model.to syntax…
I was under the impression this didn’t move the model, just changed its formatting as the documentation here (https://pytorch.org/tutorials/beginner/saving_loading_models.html) suggests:
“converts the initialized model to a CUDA optimized model using model.to(torch.device('cuda')) .”
It sounds like the map_location bit in torch.load(PATH, map_location=device) specifies where the model is?

ptrblck · July 31, 2019, 1:17pm

The map_location argument specifies where to put the loaded parameters.
E.g. if you saved the model.state_dict() of a model, which was pushed to the GPU, and would like to load this state_dict on a CPU-only machine, you could specify map_location='cpu' to restore the parameters.
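
A small sketch of that scenario, assuming an in-memory buffer in place of a real file path (the buffer and the `restored` name are just for illustration). On a machine without a GPU the model simply stays on the CPU, so the example runs either way.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(1, 1)
if torch.cuda.is_available():
    model.to("cuda")  # the state_dict will then hold CUDA tensors

# Save only the state_dict; the buffer stands in for a file on disk.
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps every stored tensor to the CPU while loading.
state_dict = torch.load(buffer, map_location="cpu")

restored = nn.Linear(1, 1)            # recreate the architecture first
restored.load_state_dict(state_dict)  # then load the parameters
loaded_devices = {t.device.type for t in state_dict.values()}
print(loaded_devices)
```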

ONTDave · July 31, 2019, 1:21pm

Does that mean that if I load a model from file (saved using torch.save(model) on either a CPU or GPU) and set map_location='cpu' when loading, I don't need to call .to if I'm using it on the CPU? I only need to call .to if I'm moving it onto the GPU again?

ptrblck · July 31, 2019, 1:26pm

I would recommend saving and loading the model.state_dict(), not the model directly.
That being said, I prefer to push the model to CPU first before saving the state_dict.
This approach makes sure that I’m able to restore the model on all systems, even when no GPU was found.

After loading the model, I use model.to('cuda') to push it to the GPU again.
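
One way this workflow might look, as a sketch only. Instead of moving the whole model to the CPU, it copies each tensor of the state_dict to the CPU before saving, which should give the same device-independent checkpoint; the buffer again stands in for a file.

```python
import io
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(1, 1).to(device)

# Copy every tensor to the CPU before saving, so the checkpoint itself
# contains no CUDA tensors. The model stays on its current device.
cpu_state = {name: t.cpu() for name, t in model.state_dict().items()}
buffer = io.BytesIO()  # stands in for a file on disk
torch.save(cpu_state, buffer)
buffer.seek(0)

# Restore: rebuild the model, load the weights, then push to the device.
restored = nn.Linear(1, 1)
restored.load_state_dict(torch.load(buffer))
restored.to(device)
```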

pinocchio · August 19, 2020, 8:17pm

this is sufficient:

model = nn.Linear(1, 1)
model_copy = copy.deepcopy(model)

Ofri_Masad · December 14, 2020, 3:05pm

what about this code:

model = nn.Linear(1, 1)
model.to('cuda:0')
model_copy = copy.deepcopy(model)

Will this create a copy on the GPU, or will I now have two models pointing to the same location on the GPU? (Hopefully not.)

thanks

ynjiun_wang · March 18, 2021, 11:25pm

Hi, I have a customized model that cannot be deepcopied. What are the major causes of a model not being deepcopyable? Could you shed some light on how to debug my model? Thanks a lot for your help.

The customized model is the squeezenet ssd lite model in this repo (GitHub - qfgaohao/pytorch-ssd).

The squeezenet model is defined under vision/nn
The ssd model is defined under vision/ssd

If I do:
base_net = squeezenet1_1(False).features
I can deepcopy base_net with no problem:
dbcpy = copy.deepcopy(base_net)

But if I do:
net = create_squeezenet_ssd_lite(no_classes, is_test=True)
then I have trouble deepcopying the net:
dnet = copy.deepcopy(net)

It reports the following error:
Traceback (most recent call last):
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 3129, in <module>
    main()
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 3122, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 2195, in run
    return self._exec(is_module, entry_point_fn, module_name, file, globals, locals)
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/pydevd.py", line 2202, in _exec
    pydev_imports.execfile(file, globals, locals) # execute the script
  File "/home/paul/.eclipse/360744286_linux_gtk_x86_64/plugins/org.python.pydev.core_7.5.0.202001101138/pysrc/_pydev_imps/_pydev_execfile.py", line 25, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/paul/pytorch/od-ssd/ssd-quantized/test_3d.py", line 50, in <module>
    dnet = copy.deepcopy(net)
  File "/usr/lib/python3.6/copy.py", line 180, in deepcopy
    y = _reconstruct(x, memo, *rv)
  File "/usr/lib/python3.6/copy.py", line 280, in _reconstruct
    state = deepcopy(state, memo)
  File "/usr/lib/python3.6/copy.py", line 150, in deepcopy
    y = copier(x, memo)
  File "/usr/lib/python3.6/copy.py", line 240, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/lib/python3.6/copy.py", line 169, in deepcopy
    rv = reductor(4)
TypeError: can't pickle module objects

The reason I am asking is that when I do QAT (Quantization Aware Training) and try to save the quantized model using:
net.eval()
net_int8 = torch.quantization.convert(net)
net_int8.save(model_path)
I encounter the above deepcopy error. So what should I do? Fix the model to make it deepcopyable? Or does QAT not work for SSD models? Could you point me in the right direction? Thanks a lot for your help!

ptrblck · March 19, 2021, 4:26am

Based on the error message, it seems that pickle fails to copy the object because it has an attribute that references a module. I don't know whether this is the case for QAT or whether it is caused by your custom model, but could you check for module references inside your model?
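
To illustrate the kind of reference meant here, a minimal sketch with a hypothetical `Holder` class standing in for the model. It stores an imported module (`math`, chosen arbitrarily) as an attribute, and `copy.deepcopy` then fails with a `TypeError`; the exact message wording differs between Python versions.

```python
import copy
import math  # any imported module works for the demonstration

class Holder:
    """Hypothetical stand-in for a model that keeps a module as an attribute."""
    def __init__(self, config):
        self.config = config

try:
    copy.deepcopy(Holder(math))        # attribute is a module object
    error = None
except TypeError as exc:               # wording differs between Python versions
    error = exc

values_only = Holder({"pi": math.pi})  # plain values copy without trouble
copied = copy.deepcopy(values_only)
print(type(error).__name__, copied.config)
```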

ynjiun_wang · March 19, 2021, 3:39pm

Thanks for your help. I looked into the create_squeezenet_ssd_lite() code below. Do you think that ModuleList() is causing the problem? If yes, how do I fix this type of issue and make the model created by this code deepcopyable? This type of model creation is typical for SSD networks. Does this imply that QAT doesn't work for SSD-type networks? Thanks again for your help!

def create_squeezenet_ssd_lite(num_classes, is_test=False):
    base_net = squeezenet1_1(False).features  # disable dropout layer

    source_layer_indexes = [
        12
    ]
    extras = ModuleList([
        Sequential(
            Conv2d(in_channels=512, out_channels=256, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=256, out_channels=512, kernel_size=3, stride=2, padding=2),
        ),
        Sequential(
            Conv2d(in_channels=512, out_channels=256, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=256, out_channels=512, kernel_size=3, stride=2, padding=1),
        ),
        Sequential(
            Conv2d(in_channels=512, out_channels=128, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2, padding=1),
        ),
        Sequential(
            Conv2d(in_channels=256, out_channels=128, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2, padding=1),
        ),
        Sequential(
            Conv2d(in_channels=256, out_channels=128, kernel_size=1),
            ReLU(),
            SeperableConv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2, padding=1)
        )
    ])

    regression_headers = ModuleList([
        SeperableConv2d(in_channels=512, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * 4, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * 4, kernel_size=3, padding=1),
        Conv2d(in_channels=256, out_channels=6 * 4, kernel_size=1),
    ])

    classification_headers = ModuleList([
        SeperableConv2d(in_channels=512, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=512, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * num_classes, kernel_size=3, padding=1),
        SeperableConv2d(in_channels=256, out_channels=6 * num_classes, kernel_size=3, padding=1),
        Conv2d(in_channels=256, out_channels=6 * num_classes, kernel_size=1),
    ])

    return SSD(num_classes, base_net, source_layer_indexes,
               extras, classification_headers, regression_headers, is_test=is_test, config=config)

ptrblck · March 20, 2021, 4:52am

The nn.ModuleList looks correct, but I also don’t know if QAT might be causing this issue. Are you seeing the same error without using QAT?

ynjiun_wang · March 20, 2021, 4:08pm

“Are you seeing the same error without using QAT?”

Yes.

I found this issue, "Can't save a model with torch.save if model has a torch.Device attr #7545", and tried removing the self.device attribute in the SSD class, but it still does not work.

I also tried running the pickle_trick() code you suggested in that link to find where the error is:

net = create_squeezenet_ssd_lite(no_classes, is_test=True)
print(pf(pickle_trick(net)))

But the code crashes, reporting an exception caught within the exception, even when I increase the maximum depth to 10,000.

ynjiun_wang · March 24, 2021, 3:56am

@ptrblck

After line-by-line elimination testing, I finally found the root cause that makes SSD() not deepcopyable:

class SSD(nn.Module):
    def __init__(self, num_classes: int, base_net: nn.ModuleList, source_layer_indexes: List[int],
                 extras: nn.ModuleList, classification_headers: nn.ModuleList,
                 regression_headers: nn.ModuleList, is_test=False, config=None, device=None):
        """Compose a SSD model using the given components.
        """
        super(SSD, self).__init__()
        self.num_classes = num_classes
        self.base_net = base_net
        self.source_layer_indexes = source_layer_indexes
        self.extras = extras
        self.classification_headers = classification_headers
        self.regression_headers = regression_headers
        self.is_test = is_test
        self.config = config

If I comment out the “self.config = config” line, then SSD() is deepcopyable.

Here, config is a module imported via:

from .config import squeezenet_ssd_config as config

And the squeezenet_ssd_config is as below:

import numpy as np

from vision.utils.box_utils import SSDSpec, SSDBoxSizes, generate_ssd_priors

image_size = 300
image_mean = np.array([127, 127, 127]) # RGB layout
image_std = 128.0
iou_threshold = 0.45
center_variance = 0.1
size_variance = 0.2

specs = [
    SSDSpec(17, 16, SSDBoxSizes(60, 105), [2, 3]),
    SSDSpec(10, 32, SSDBoxSizes(105, 150), [2, 3]),
    SSDSpec(5, 64, SSDBoxSizes(150, 195), [2, 3]),
    SSDSpec(3, 100, SSDBoxSizes(195, 240), [2, 3]),
    SSDSpec(2, 150, SSDBoxSizes(240, 285), [2, 3]),
    SSDSpec(1, 300, SSDBoxSizes(285, 330), [2, 3])
]

priors = generate_ssd_priors(specs, image_size)

How should I modify the code to keep the “config” and still make the SSD() deepcopyable? Any advice will be highly appreciated.

Thanks a lot for your help in advance.

ptrblck · March 24, 2021, 5:13am

Does squeezenet_ssd_config contain only the posted code, or also class definitions?
Based on the import statement I would assume it's a class or some other object, but the posted code shows just executable Python code with more imports.
You could try to deepcopy each imported class and see if one of them fails for the previously mentioned reason.
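
A rough sketch of that kind of check, assuming a hypothetical helper `find_uncopyable` that tries to deepcopy each instance attribute separately and collects the failures. The `Demo` class is made up and only mimics an object holding a module reference.

```python
import copy
import math

def find_uncopyable(obj):
    """Return {attribute name: error} for attributes that fail copy.deepcopy."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            copy.deepcopy(value)
        except Exception as exc:  # report every failure rather than stop at the first
            failures[name] = exc
    return failures

class Demo:
    """Hypothetical object with one good and one problematic attribute."""
    def __init__(self):
        self.num_classes = 21
        self.config = math  # a module reference

failures = find_uncopyable(Demo())
print(failures)
```

On an nn.Module, vars() also lists internal dicts such as _modules and _parameters, so a failure inside a submodule would show up under _modules and the helper would have to be applied again one level down.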

ynjiun_wang · March 24, 2021, 5:48pm

@ptrblck

Thank you so much for your help.

Eventually I worked around the problem as follows: avoid the "self.config = config" reference and copy the needed values directly instead.

Was:

# in __init__
self.config = config
# ... later in the body, self.config was referenced as:
self.config.center_variance
self.config.size_variance

Changed to:

# in __init__
# self.config = config  # commented out
self.config_center_variance = config.center_variance
self.config_size_variance = config.size_variance
# ... later in the code, no more references to self.config:
self.config_center_variance
self.config_size_variance

Then SSD() is deepcopyable, but QAT still has problems…
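
A possible variation on this workaround, sketched under assumptions: rather than copying values one by one, snapshot the module's public attributes into a plain namespace so the attribute-style access keeps working. `snapshot_config` and `fake_config` are hypothetical names; `fake_config` is built in memory only to stand in for squeezenet_ssd_config.

```python
import copy
import types

def snapshot_config(module):
    """Copy a module's public, non-module attributes into a plain namespace."""
    values = {
        name: value
        for name, value in vars(module).items()
        if not name.startswith("_") and not isinstance(value, types.ModuleType)
    }
    return types.SimpleNamespace(**values)

# Hypothetical stand-in for squeezenet_ssd_config, built in memory.
fake_config = types.ModuleType("fake_config")
fake_config.center_variance = 0.1
fake_config.size_variance = 0.2
fake_config.np = types  # mimics an `import numpy as np` living in the module

config = snapshot_config(fake_config)
config_copy = copy.deepcopy(config)  # works: no module object inside
print(config_copy.center_variance, config_copy.size_variance)
```

With this, `self.config = snapshot_config(config)` would keep `self.config.center_variance` unchanged in the rest of the class, assuming every remaining value in the config is itself deepcopyable.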

ynjiun_wang · March 24, 2021, 11:22pm

@ptrblck

Although SSD() is now deepcopyable, I still have a problem running QAT: torch.save(net_int8) fails and complains:

    torch.save(net_int8, model_path) #to save the entire model
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 370, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 443, in _legacy_save
    pickler.dump(obj)
AttributeError: Can't pickle local object '_with_args.<locals>._PartialWrapper'

The problem area in the code is:

net.eval()
net_int8 = torch.quantization.convert(net)

#torch.save(net_int8.state_dict(), model_path)
torch.save(net_int8, model_path) #to save the entire model

If I save only the state_dict(), there is no problem. The problem only shows up when saving the entire model.

Debugging further, I found that the net is torch.save-able right after loading, but after setting net.qconfig it is no longer torch.save-able:

torch.save(net,"./net_before_qconfig")
net.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.save(net,"./net_after_qconfig")

As shown above, torch.save(net, "./net_before_qconfig") works, but after setting net.qconfig the net is no longer torch.save-able… ;((

It fails with the same complaint:

    torch.save(net,"./net_after_qconfig")
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 370, in save
    _legacy_save(obj, opened_file, pickle_module, pickle_protocol)
  File "/home/paul/pytorch/lib/python3.6/site-packages/torch/serialization.py", line 443, in _legacy_save
    pickler.dump(obj)
AttributeError: Can't pickle local object '_with_args.<locals>._PartialWrapper'

Is this a QAT problem? Can torch.save not save the entire model after QAT? I need the entire model saved to port it to a microprocessor for speed-up testing. If this is a QAT limitation, what would be the workaround to still save the entire quantized model (not just the state_dict)? Thank you again for your help!
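
For context on the error class itself, a minimal sketch in plain Python. It does not reproduce the QAT failure; it only shows that pickle cannot serialize an instance of a class defined inside a function, which is what the "local object" wording in the message points to. `make_wrapper` and `LocalWrapper` are made-up names.

```python
import functools
import pickle

def make_wrapper():
    class LocalWrapper:  # defined inside a function, so it is a "local object"
        pass
    return LocalWrapper()

try:
    pickle.dumps(make_wrapper())
    local_error = None
except (AttributeError, pickle.PicklingError) as exc:  # type depends on Python version
    local_error = exc

# By contrast, a functools.partial over an importable callable pickles fine.
partial_bytes = pickle.dumps(functools.partial(int, base=2))
print(type(local_error).__name__)
```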

ptrblck · March 25, 2021, 12:52am

Based on your latest debugging, the issue might be related to QAT.
I'm not deeply familiar with QAT or with whether it supports using torch.save on the model directly.
However, I generally would not recommend saving the model directly, as it can break in various ways.
The better way would be to save the state_dict (and additional configs etc.). When reloading, you would then recreate the model object and use load_state_dict.
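
A sketch of what such a checkpoint could look like, using a toy model. `build_model` and the keys of the `checkpoint` dict are hypothetical; a quantized model would additionally need the same quantization preparation steps applied before load_state_dict, which this sketch does not cover.

```python
import io
import torch
import torch.nn as nn

def build_model(config):
    """Hypothetical factory; a real one would call the model constructor."""
    return nn.Linear(config["in_features"], config["out_features"])

config = {"in_features": 4, "out_features": 2}
model = build_model(config)

# Store the weights together with the plain-Python settings needed to rebuild.
checkpoint = {"state_dict": model.state_dict(), "config": config}
buffer = io.BytesIO()  # stands in for a checkpoint file
torch.save(checkpoint, buffer)
buffer.seek(0)

loaded = torch.load(buffer, map_location="cpu")
restored = build_model(loaded["config"])        # recreate the model object
restored.load_state_dict(loaded["state_dict"])  # then load the weights
```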

n40x1 · June 17, 2021, 11:14pm

I am using PyTorch's swa_utils, which internally calls deepcopy.
However, I get the following error:

> Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

This seems to be caused by the weight dropout scheme I am using, where the original weight matrix is moved into _raw. The code for weight dropout is here:

import torch
from torch.nn import Parameter
import torch.nn.functional as F 

class WeightDrop(object):
    def __init__(self, name, dropout):
        self.name = name
        self.dropout = dropout

    def compute_weight(self, module):
        return F.dropout(
            getattr(module, self.name + "_raw"),
            p        = self.dropout,
            training = module.training,
            inplace  = False
        )

    @staticmethod
    def apply(module, name, dropout):
        for k, hook in module._forward_pre_hooks.items():
            if isinstance(hook, WeightDrop) and hook.name == name:
                raise RuntimeError(f"Cannot register two weight_dropout hooks with name '{name}'")
        fn = WeightDrop(name, dropout)
        weight = getattr(module, name)

        del module._parameters[name]

        ## creating _raw parameter
        module.register_parameter(name + "_raw", Parameter(weight.data))
        setattr(module, name, fn.compute_weight(module))

        module.register_forward_pre_hook(fn)
        return fn

    def remove(self, module):
        weight = module._parameters[self.name + "_raw"]
        delattr(module, self.name)
        del module._parameters[self.name + "_raw"]
        module.register_parameter(self.name, Parameter(weight.data))

    def __call__(self, module, inputs):
        if self.name in module._parameters:
            del module._parameters[self.name]
        setattr(module, self.name, self.compute_weight(module))


def weight_drop(module, name, dropout):
    WeightDrop.apply(module, name, dropout)
    return module

def apply_weight_drop(module, name, dropout):
    wdrop = weight_drop(module, name, dropout)
    fp = module.flatten_parameters

    def decorator(*args, **kwargs):
        device = getattr(module, name + "_raw").device
        setattr(module, name, getattr(module, name).to(device))
        return fp(*args, **kwargs)

    module.flatten_parameters = decorator
    return wdrop

Here is the code for applying deep copy on a GRU:

import copy
gru = torch.nn.GRU(10, 10)
gru_wd = apply_weight_drop(gru, "weight_hh_l0", 0.2)
gru_wd_copy = copy.deepcopy(gru_wd)
> RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

Any ideas how to fix this error?
Thank you!
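
A possible explanation, offered as a sketch rather than a confirmed diagnosis: the derived weight set via setattr is the output of F.dropout on a Parameter, so it is a non-leaf tensor, and deepcopy refuses those. The example below mimics the scheme on an nn.Linear (not the GRU from the post) and then swaps in a detached tensor before copying. It assumes the forward pre-hook recomputes the derived weight on the next forward pass, so detaching it loses nothing.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

layer = nn.Linear(3, 3)
# Mimic the weight-dropout scheme: keep the raw Parameter, expose a derived tensor.
layer.weight_raw = nn.Parameter(layer.weight.data)
del layer._parameters["weight"]
layer.weight = F.dropout(layer.weight_raw, p=0.2, training=True)  # non-leaf tensor

try:
    copy.deepcopy(layer)
    copy_error = None
except RuntimeError as exc:
    copy_error = exc

# Possible workaround: swap in a detached tensor (a leaf) before copying.
layer.weight = layer.weight.detach()
layer_copy = copy.deepcopy(layer)
print(layer.weight_raw.is_leaf, copy_error is not None)
```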

Nikola_Andro · November 17, 2022, 9:26pm

Any updates on this question?

I checked by doing the following:

model = nn.Linear(1, 1)
model.to('cuda:0')
model_copy = copy.deepcopy(model)
print(next(model_copy.parameters()).is_cuda)

Output:

True

So the copy should be on CUDA as well.
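
To also check the second half of the question (whether both models share the same memory), a small sketch that compares data pointers and modifies only the original. It falls back to the CPU when no GPU is available; the variable names are just for illustration.

```python
import copy
import torch
import torch.nn as nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = nn.Linear(1, 1).to(device)
model_copy = copy.deepcopy(model)

same_device = model_copy.weight.device == model.weight.device
# Different data pointers mean the two weights live in separate memory.
separate_memory = model_copy.weight.data_ptr() != model.weight.data_ptr()

with torch.no_grad():
    model.weight.fill_(10.0)  # modify the original only
copy_untouched = not torch.equal(model.weight, model_copy.weight)
print(same_device, separate_memory)
```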