Data_ptr when moving across devices

This question wraps together several questions I have about how data_ptr is expected to work.

I discovered today that moving the same tensor to the same device twice leads to different .data_ptrs. This makes sense, since each call could be copying the tensor over separately.

import torch

x = torch.tensor([1]).cuda()
x.cpu().data_ptr() == x.cpu().data_ptr()
# False: each .cpu() call allocates and copies into new CPU storage

My issue is that this complicates matters when we expect the data_ptr to be the same, for instance in models that share weights across modules, such as tied word embeddings in the input and output layers. In these cases, the input and output embeddings have two entries in the model state_dict, but both point to the same data_ptr. However, if we use the typical recipe for moving a state_dict across devices:

new_state_dict = {k: v.to(device) for k, v in state_dict.items()}

The input and output embeddings will now have different data_ptrs (in addition to taking up twice the memory).
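To make the failure mode concrete, here is a minimal illustration with a hypothetical tied state_dict (the key names are made up, and it assumes a CUDA device is available):

import torch

# Hypothetical tied state_dict: both keys reference the same tensor object.
weight = torch.zeros(4, 4)
state_dict = {"embed_in.weight": weight, "embed_out.weight": weight}
assert state_dict["embed_in.weight"].data_ptr() == state_dict["embed_out.weight"].data_ptr()

# Naive per-tensor move: each .to() call copies independently, so the tie is lost.
moved = {k: v.to("cuda") for k, v in state_dict.items()}
assert moved["embed_in.weight"].data_ptr() != moved["embed_out.weight"].data_ptr()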

My question is: is there a built-in method that handles this correctly, such that the moved dictionary of tensors correctly retains tensors with the same data_ptrs?

How did you apply the weight sharing? Could you share the code for it?
If you just store the parameter once and apply the functional API call using this parameter, there should be no problem, I think.
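Something along these lines is what I mean (just a sketch, the module and shapes are made up): only one parameter is registered, so the state_dict has a single entry and there is nothing to keep tied after a move.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedLinear(nn.Module):
    def __init__(self):
        super().__init__()
        # Register the shared weight once ...
        self.weight = nn.Parameter(torch.randn(2, 2))

    def forward(self, x):
        # ... and reuse it in both projections via the functional API.
        h = F.linear(x, self.weight)
        return F.linear(h, self.weight)

assert list(TiedLinear().state_dict().keys()) == ["weight"]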

Here’s an artificial example:

import torch
import torch.nn as nn

class ExampleModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 2, bias=False)
        self.layer3 = nn.Linear(2, 2, bias=False)
        self.layer3 = self.layer1
        # Same result if you do
        #     self.layer3.weight = self.layer1.weight
    
# Moving model to device: works!
example_module = ExampleModule().cuda()
state_dict = example_module.cpu().state_dict()
assert state_dict["layer1.weight"].data_ptr() == state_dict["layer3.weight"].data_ptr()  
# passes

# Moving individual tensors: doesn't work
example_module = ExampleModule().cuda()
state_dict = {k: v.cpu() for k, v in example_module.state_dict().items()}
assert state_dict["layer1.weight"].data_ptr() == state_dict["layer3.weight"].data_ptr()  
# fails

In the first case, moving the whole model to a device handles this correctly. But there are other use cases where you don’t want to move the model itself (e.g. if you’re just saving a checkpoint of the state_dict). In cases where you need to manually move the weights to a different device, the data_ptrs end up being different.

Thanks for the code example.
I think the observed failure is due to the creation of a new dict, which will trigger a copy, won’t it?

a = {'a': 1, 'b': 2}
a_ref = a
a_copy = {k: v for k, v in a.items()}

a_copy['a'] = 3
print(a, a_ref, a_copy)
> {'a': 1, 'b': 2} {'a': 1, 'b': 2} {'a': 3, 'b': 2}

a_ref['a'] = 4
print(a, a_ref, a_copy)
> {'a': 4, 'b': 2} {'a': 4, 'b': 2} {'a': 3, 'b': 2}

In your first example, you are comparing entries of the same internal state_dict, while the second code snippet creates copies of all values.

I think this issue is somewhat unrelated to copying/references: in both of my examples, I am comparing two tensors within the same dictionary.

The question I’m asking is: what is a way to make a copy such that tensors which shared a data_ptr in the original still share one in the copy?

For what it’s worth, I know that copy.deepcopy handles this correctly. However, I do not know of a solution that works when I need to move the state_dict across devices (other than moving the whole model to a different device first).
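For reference, this is the deepcopy behavior I mean (a quick sketch, key names made up):

import copy
import torch

weight = torch.zeros(4, 4)
state_dict = {"embed_in.weight": weight, "embed_out.weight": weight}

# deepcopy memoizes objects it has already copied, so both keys end up
# pointing at the same copied tensor and the data_ptrs still match.
copied = copy.deepcopy(state_dict)
assert copied["embed_in.weight"].data_ptr() == copied["embed_out.weight"].data_ptr()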

Bumping this. (Another way to frame this question is: What is the right and recommended way to move state_dicts across devices?)
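For concreteness, this is the kind of helper I’m currently hand-rolling as a workaround (the function is my own, not a PyTorch API, and it only covers the simple case where tied entries are literally the same tensor); I’m hoping there is a built-in equivalent:

import torch

def move_state_dict(state_dict, device):
    # Move each unique storage once, keyed by data_ptr, so tensors that were
    # tied in the original stay tied in the moved copy. Note: this only handles
    # entries that are the same tensor object; views with different shapes or
    # offsets would need more care.
    moved_by_ptr = {}
    new_state_dict = {}
    for k, v in state_dict.items():
        ptr = v.data_ptr()
        if ptr not in moved_by_ptr:
            moved_by_ptr[ptr] = v.to(device)
        new_state_dict[k] = moved_by_ptr[ptr]
    return new_state_dict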