You could use copy_ instead of fill_, or assign a new nn.Parameter.
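A minimal sketch of both approaches, assuming a plain nn.Linear layer as the target:

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 2)

# Option 1: copy a tensor of matching shape into the existing parameter.
# In-place ops on a leaf tensor that requires grad must run inside no_grad.
with torch.no_grad():
    layer.weight.copy_(torch.ones_like(layer.weight))

# Option 2: replace the parameter wholesale with a new nn.Parameter.
# The new parameter requires grad by default and is registered on the module.
layer.weight = nn.Parameter(torch.zeros(2, 4))
```

Note that option 2 creates a brand-new parameter object, so any optimizer already holding a reference to the old parameter would need to be recreated.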
Also, using the .data attribute is discouraged, since it bypasses autograd's checks. Wrap the assignment in a torch.no_grad() block instead:
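For example (using fill_ on a hypothetical nn.Linear module, as an illustration):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# Discouraged: model.weight.data.fill_(1.)
# Preferred: run the in-place assignment inside no_grad, so autograd
# does not record the operation on the computation graph.
with torch.no_grad():
    model.weight.fill_(1.)
```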
In that case, the intermediate activations needed to calculate the gradients during the backward pass will be stored and will consume memory, which is unnecessary during inference/testing.
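A small sketch of the difference, using an assumed nn.Linear model for illustration:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
x = torch.randn(1, 10)

# Without no_grad, the forward pass records the graph (and keeps
# intermediate activations alive) so that backward() could be called.
out_train = model(x)

# With no_grad, no graph is built: activations that exist only to
# serve a backward pass are not stored, saving memory at inference.
with torch.no_grad():
    out_eval = model(x)
```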