This question wraps together several questions I have about how data_ptr is expected to work.
I discovered today that moving the same tensor to the same device twice leads to different .data_ptrs. This makes sense, since PyTorch could be copying the data over separately each time.
x = torch.tensor([1]).cuda()
x.cpu().data_ptr() == x.cpu().data_ptr()
# False
My issue is that this complicates matters when we expect the data_ptrs to be the same. For instance, consider models that share weights across modules, such as tied word embeddings in the input and output layers. In these cases, the input and output embeddings have two entries in the model state_dict, but both have the same data_ptr. However, if we use the typical recipe for moving state_dicts across devices:
new_state_dict = {k: v.to(device) for k, v in state_dict.items()}
The input and output embeddings will now have different data_ptrs (in addition to taking up twice the memory).
My question is: is there a built-in method that handles this correctly, such that the moved dictionary of tensors correctly retains tensors with the same data_ptrs?
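The best workaround I can think of is to memoize on each source tensor's data_ptr while moving. This is just my own sketch (the helper name is made up), and it wouldn't catch views at different offsets into the same storage, since those have different data_ptrs:

```python
import torch

def move_shared_state_dict(state_dict, device):
    # Hypothetical helper: memoize on each source tensor's data_ptr so
    # entries that share storage still share storage after the move.
    # Caveat: views at different offsets into the same storage have
    # different data_ptrs and would not be deduplicated here.
    cache = {}
    moved = {}
    for name, tensor in state_dict.items():
        ptr = tensor.data_ptr()
        if ptr not in cache:
            # copy=True forces a fresh allocation even when the tensor
            # is already on the target device.
            cache[ptr] = tensor.to(device, copy=True)
        moved[name] = cache[ptr]
    return moved

# Tied weights survive the move (CPU-only so it runs anywhere):
w = torch.randn(2, 2)
state_dict = {"layer1.weight": w.detach(), "layer3.weight": w.detach()}
moved = move_shared_state_dict(state_dict, "cpu")
assert moved["layer1.weight"].data_ptr() == moved["layer3.weight"].data_ptr()
```

But I'd much rather use a built-in than maintain something like this myself.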
How did you apply the weight sharing? Could you share the code for it?
If you just store the parameter once and apply it via the functional API, there should be no problem, I think.
import torch
import torch.nn as nn

class ExampleModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Linear(2, 2, bias=False)
        self.layer2 = nn.Linear(2, 2, bias=False)
        self.layer3 = nn.Linear(2, 2, bias=False)
        self.layer3 = self.layer1
        # Same result if you do
        # self.layer3.weight = self.layer1.weight
# Moving model to device: works!
example_module = ExampleModule().cuda()
state_dict = example_module.cpu().state_dict()
assert state_dict["layer1.weight"].data_ptr() == state_dict["layer3.weight"].data_ptr()
# passes
# Moving individual tensors: doesn't work
example_module = ExampleModule().cuda()
state_dict = {k: v.cpu() for k, v in example_module.state_dict().items()}
assert state_dict["layer1.weight"].data_ptr() == state_dict["layer3.weight"].data_ptr()
# fails
In the first case, moving the whole model to a device correctly handles this. But there are other use-cases where you don’t want to move the model (e.g. if you’re just storing a checkpoint of the state_dict). In cases where you need to manually move the weights to a different device, the data_ptrs end up being different.
I think this issue is somewhat unrelated to copying/references. In my examples, I am always comparing within a dictionary.
The question I’m asking is: how can I make a copy such that tensors that shared a data_ptr in the original still share a data_ptr with each other within the copy?
For what it’s worth, I know that copy.deepcopy handles this correctly. However, I do not know of a solution that works when I need to move the state_dict across devices (other than moving the whole model to a different device first).
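To illustrate the copy.deepcopy behavior, here is a small CPU-only check. As I understand it, deepcopy memoizes on the underlying storage, so even distinct tensor objects that share storage (which is what state_dict produces for tied weights) come out of the copy still sharing storage:

```python
import copy
import torch

w = torch.randn(2, 2)
# Two distinct tensor objects sharing one storage, as state_dict()
# produces for tied weights:
sd = {"layer1.weight": w.detach(), "layer3.weight": w.detach()}
assert sd["layer1.weight"].data_ptr() == sd["layer3.weight"].data_ptr()

copied = copy.deepcopy(sd)
# Sharing survives the copy...
assert copied["layer1.weight"].data_ptr() == copied["layer3.weight"].data_ptr()
# ...and the copy really is fresh memory:
assert copied["layer1.weight"].data_ptr() != w.data_ptr()
```

What deepcopy can't do, of course, is land the copy on a different device, which is the part I'm missing.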