Autograd function for .to(device) at initialization

I assume you’ve called the to() operation on your nn.Parameter, not on the internal tensor?
If so, you would create a non-leaf variable, since the result is produced by the to() operation, which is differentiable (so that gradients can flow between different devices).
Try to call the to() operation on the tensor before wrapping it in an nn.Parameter.
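
Here is a minimal sketch of the difference (the "cuda" device string is an assumption; any actual device move reproduces the effect):

```python
import torch
import torch.nn as nn

# Calling to() on the nn.Parameter itself returns a non-leaf tensor,
# because it is the output of a differentiable operation.
param = nn.Parameter(torch.randn(3, 3))
moved = param.to("cuda")  # assumes a CUDA device is available
print(moved.is_leaf)      # False -> .grad will not be populated here

# Move the plain tensor first, then wrap it in nn.Parameter;
# the resulting parameter stays a leaf and can be updated by an optimizer.
param = nn.Parameter(torch.randn(3, 3).to("cuda"))
print(param.is_leaf)      # True
```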

Have a look at this post for some more examples.
