I found that if I create a new parameter and then move it to the GPU, it is not included in the model’s state dict:
In [18]: class A(nn.Module):
    ...:     def __init__(self):
    ...:         super(A, self).__init__()
    ...:         self.linear = nn.Linear(10, 5)
    ...:         tensor = torch.randn(10)
    ...:         self.extra_param = nn.Parameter(tensor).cuda()
    ...:
In [19]: a = A()
In [20]: a.state_dict()
Out[20]:
OrderedDict([('linear.weight',
              tensor([[-0.0893,  0.0630, -0.2128, -0.0140,  0.3072,  0.2019,  0.0481,  0.2880,
                        0.0624, -0.0043],
                      [ 0.2454,  0.1217,  0.2595,  0.3157,  0.2656,  0.2769, -0.1300, -0.1529,
                        0.0543,  0.0360],
                      [-0.0441,  0.2719,  0.2384, -0.0806, -0.2584,  0.0320,  0.0625,  0.2518,
                        0.3118, -0.0236],
                      [-0.2696,  0.0842, -0.0695,  0.1962, -0.1226,  0.2223, -0.1517,  0.0508,
                        0.1038, -0.1919],
                      [ 0.0244, -0.1074, -0.2447, -0.3143,  0.2876,  0.1700,  0.2029,  0.0547,
                       -0.1917, -0.0090]])),
             ('linear.bias',
              tensor([-0.0291, -0.2407, -0.1699,  0.2179,  0.0099]))])
But if I first move the data tensor to the GPU and then create the parameter, it is included:
In [21]: class A(nn.Module):
    ...:     def __init__(self):
    ...:         super(A, self).__init__()
    ...:         self.linear = nn.Linear(10, 5)
    ...:         tensor = torch.randn(10).cuda()
    ...:         self.extra_param = nn.Parameter(tensor)
    ...:
In [22]: a = A()
In [28]: a.state_dict()
Out[28]:
OrderedDict([('extra_param',
              tensor([-1.9553, -1.1765,  0.7294,  0.4493, -0.5090,  0.6166,  1.1763, -1.3857,
                       1.5061,  2.3040], device='cuda:0')),
             ('linear.weight',
              tensor([[-0.2709, -0.0524, -0.1386, -0.0545, -0.2749,  0.2852, -0.2150,  0.2570,
                       -0.0642, -0.0965],
                      [-0.1773, -0.1679, -0.2528,  0.0592, -0.0932, -0.2876,  0.1598,  0.0908,
                        0.0042,  0.1190],
                      [ 0.1387,  0.1521, -0.2607, -0.0160,  0.1109,  0.0388,  0.1286,  0.0189,
                       -0.2227, -0.2227],
                      [-0.2643, -0.1320,  0.2406, -0.2356, -0.0214,  0.1120,  0.1150,  0.1287,
                        0.2106,  0.2788],
                      [-0.1570,  0.0586,  0.0764, -0.0502,  0.1969, -0.0535,  0.2546,  0.0415,
                        0.1470,  0.0871]])),
             ('linear.bias',
              tensor([-0.1364,  0.0380,  0.0421,  0.2488,  0.3090]))])
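Poking at this a bit, my guess is that it comes down to the return type of `.cuda()`: calling a tensor method on an `nn.Parameter` seems to return a plain `torch.Tensor`, not a `Parameter`, and `nn.Module` only registers `Parameter` instances. A minimal check of that idea (using `.double()` as a CPU-safe stand-in for `.cuda()`, so it runs without a GPU):

```python
import torch
import torch.nn as nn

# Tensor methods on a Parameter (.cuda(), .double(), .to(), ...)
# appear to return a plain torch.Tensor, not a new Parameter:
p = nn.Parameter(torch.randn(10))
print(isinstance(p, nn.Parameter))           # True
print(isinstance(p.double(), nn.Parameter))  # False: plain Tensor

# nn.Module.__setattr__ only registers nn.Parameter instances, so
# assigning the plain Tensor stores it as an ordinary attribute
# and it never shows up in _parameters / state_dict:
class B(nn.Module):
    def __init__(self):
        super().__init__()
        # same pattern as the first snippet, with .double() in place of .cuda()
        self.extra_param = nn.Parameter(torch.randn(10)).double()

print('extra_param' in B().state_dict())  # False
```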
Is there a reason for that? Why isn’t it included in the state_dict in the first case?