named_parameters: parameter missing after .to(device)


I am running into an issue with self.named_parameters() that seems very strange to me.
Long story short:

I am trying to create the following layer:
self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd)).to(self.device)

After creating it, I build a param_dict for the optimizer setup with this function:

def get_param_dict(self):
    return {pn: p for pn, p in self.named_parameters()}

The strange behavior is that, because of the move to self.device, pos_emb no longer shows up in the named_parameters dictionary:

self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd)).to(self.device)
--> dict_keys([])

If I remove the .to(self.device) part named_parameters behaves as expected:

self.pos_emb = nn.Parameter(torch.zeros(1, config.block_size, config.n_embd))
--> dict_keys(['pos_emb'])

Why is this the case, and how can I fix it?
Somehow it only affects this nn.Parameter. The other layers are listed correctly with or without the device move…
I need a working get_param_dict function for optimizer configuration.

I already tried to train my model without the movement to self.device for the pos_emb layer but then the training fails because the tensor is on the wrong device (obviously).

Thanks a lot for any kind of hint or solution! :smiley:

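To autograd, .to is a computation, so what you assigned to self.pos_emb is not a parameter but an intermediate (non-leaf) tensor, and the module does not register it. Either create the tensor on the target device first and wrap it in nn.Parameter last, or drop the .to call on the parameter and move the whole module with model.to(device) after construction.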
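A minimal sketch of the difference, assuming a CPU-only machine may be running it: the `Broken` and `Fixed` class names and the small sizes are made up for illustration, and the float64 conversion stands in for a real device move (a `.to` call that does not actually change anything returns the same object, so the problem would not show up on a single device).

```python
import torch
import torch.nn as nn

# Falls back to CPU when no GPU is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Broken(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, block_size, n_embd):
        super().__init__()
        # The conversion is applied to the Parameter, so autograd records it
        # and returns a plain non-leaf tensor. The module stores that tensor
        # as an ordinary attribute, not as a registered parameter.
        # (float64 is an assumption standing in for a CPU-to-GPU move.)
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd)).to(torch.float64)


class Fixed(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, block_size, n_embd):
        super().__init__()
        # No operation is applied after wrapping, so the Parameter is registered.
        self.pos_emb = nn.Parameter(torch.zeros(1, block_size, n_embd))

    def get_param_dict(self):
        return {pn: p for pn, p in self.named_parameters()}


broken = Broken(block_size=4, n_embd=8)
# Module.to moves every registered parameter and keeps it registered.
fixed = Fixed(block_size=4, n_embd=8).to(device)

broken_keys = [pn for pn, _ in broken.named_parameters()]
fixed_keys = list(fixed.get_param_dict().keys())
print(broken_keys)  # []
print(fixed_keys)   # ['pos_emb']
```

If the parameter has to live on the device from the start, passing `device=...` to `torch.zeros` before wrapping in `nn.Parameter` should work as well, since the wrap is then still the last step.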