CUDA parameter not included in state_dict

I found that if I create a new parameter and then move it to the GPU, it is not included in the model’s state_dict:

In [18]: class A(nn.Module): 
    ...:     def __init__(self): 
    ...:         super(A, self).__init__() 
    ...:         self.linear = nn.Linear(10, 5) 
    ...:         tensor = torch.randn(10) 
    ...:         self.extra_param = nn.Parameter(tensor).cuda() 
    ...:

In [19]: a = A()

In [20]: a.state_dict()
Out[20]: 
OrderedDict([('linear.weight',
              tensor([[-0.0893,  0.0630, -0.2128, -0.0140,  0.3072,  0.2019,  0.0481,  0.2880,
                        0.0624, -0.0043],
                      [ 0.2454,  0.1217,  0.2595,  0.3157,  0.2656,  0.2769, -0.1300, -0.1529,
                        0.0543,  0.0360],
                      [-0.0441,  0.2719,  0.2384, -0.0806, -0.2584,  0.0320,  0.0625,  0.2518,
                        0.3118, -0.0236],
                      [-0.2696,  0.0842, -0.0695,  0.1962, -0.1226,  0.2223, -0.1517,  0.0508,
                        0.1038, -0.1919],
                      [ 0.0244, -0.1074, -0.2447, -0.3143,  0.2876,  0.1700,  0.2029,  0.0547,
                       -0.1917, -0.0090]])),
             ('linear.bias',
              tensor([-0.0291, -0.2407, -0.1699,  0.2179,  0.0099]))])

But if I first move the data tensor to GPU and then create the parameter, it is included:

In [21]: class A(nn.Module): 
    ...:     def __init__(self): 
    ...:         super(A, self).__init__() 
    ...:         self.linear = nn.Linear(10, 5) 
    ...:         tensor = torch.randn(10).cuda() 
    ...:         self.extra_param = nn.Parameter(tensor) 
    ...:

In [22]: a = A()

In [28]: a.state_dict()
Out[28]: 
OrderedDict([('extra_param',
              tensor([-1.9553, -1.1765,  0.7294,  0.4493, -0.5090,  0.6166,  1.1763, -1.3857,
                       1.5061,  2.3040], device='cuda:0')),
             ('linear.weight',
              tensor([[-0.2709, -0.0524, -0.1386, -0.0545, -0.2749,  0.2852, -0.2150,  0.2570,
                       -0.0642, -0.0965],
                      [-0.1773, -0.1679, -0.2528,  0.0592, -0.0932, -0.2876,  0.1598,  0.0908,
                        0.0042,  0.1190],
                      [ 0.1387,  0.1521, -0.2607, -0.0160,  0.1109,  0.0388,  0.1286,  0.0189,
                       -0.2227, -0.2227],
                      [-0.2643, -0.1320,  0.2406, -0.2356, -0.0214,  0.1120,  0.1150,  0.1287,
                        0.2106,  0.2788],
                      [-0.1570,  0.0586,  0.0764, -0.0502,  0.1969, -0.0535,  0.2546,  0.0415,
                        0.1470,  0.0871]])),
             ('linear.bias',
              tensor([-0.1364,  0.0380,  0.0421,  0.2488,  0.3090]))])

Is there a reason for that? Why isn’t it included in the state_dict in the first case?

Hi,

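The thing is that .cuda() is an out-of-place operation: it returns a new tensor instead of modifying the Parameter it is called on. The result is therefore no longer a Parameter, just a plain Tensor. nn.Module only registers attributes that are Parameter instances, so a plain Tensor is not included in the state_dict.
You can use tensor = torch.randn(10, device="cuda") to create the tensor directly on the GPU, and then wrap it in nn.Parameter, to avoid such problems.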
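To make the explanation above concrete, here is a minimal sketch that can run without a GPU. It uses .double() purely as a stand-in for .cuda(), on the assumption that any out-of-place conversion of a Parameter behaves the same way with respect to registration. The class name Broken is hypothetical and only for illustration.

```python
import torch
from torch import nn

param = nn.Parameter(torch.randn(10))

# .double() stands in for .cuda() here (assumption: both are out-of-place
# conversions that return a new tensor rather than modifying the Parameter).
converted = param.double()

print(type(param).__name__)      # Parameter
print(type(converted).__name__)  # Tensor
print(converted.is_leaf)         # False: it is the output of an operation


class Broken(nn.Module):  # hypothetical name, mirrors the first class A
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)
        # The attribute ends up holding a plain Tensor, so nn.Module
        # does not register it as a parameter.
        self.extra_param = nn.Parameter(torch.randn(10)).double()


keys = list(Broken().state_dict().keys())
print(keys)  # ['linear.weight', 'linear.bias']

# The same check with the real .cuda() call, only when a GPU is present.
if torch.cuda.is_available():
    print(isinstance(nn.Parameter(torch.randn(10)).cuda(), nn.Parameter))
```

Because the converted tensor is a non-leaf, it would also be missing from a.parameters(), so an optimizer built from the module would never update it.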

You are probably better off using:

class A(nn.Module):
    def __init__(self):
        super(A, self).__init__()
        self.linear = nn.Linear(10, 5)
        tensor = torch.randn(10)
        # Move the data first, then wrap it, so the attribute stays a Parameter.
        self.extra_param = nn.Parameter(tensor.to(self.linear.weight.device))

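This way the parameter is created on the same device as the rest of the module, so it works on both CPU and GPU. Since it is still a registered Parameter, a later call such as a.cuda() moves it together with the other weights.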
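As a rough sketch of the pattern this thread converges on: register the parameter as a plain nn.Parameter in __init__ and move the whole module afterwards. The device selection via torch.cuda.is_available() is an assumption added here so the snippet also runs on CPU-only machines.

```python
import torch
from torch import nn


class A(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)
        # Wrap the tensor last, with no conversion applied to the Parameter.
        self.extra_param = nn.Parameter(torch.randn(10))


# Assumption: fall back to CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Module.to() moves every registered parameter and keeps it registered.
a = A().to(device)

keys = list(a.state_dict().keys())
print(keys)  # ['extra_param', 'linear.weight', 'linear.bias']
print(isinstance(a.extra_param, nn.Parameter))  # True
print(a.extra_param.device.type == device.type)  # True
```

Keeping device placement out of __init__ means the same class works unchanged on CPU and GPU, and the extra parameter is saved and loaded with the rest of the state_dict.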