I tie the weights of two modules with module1.weight = module2.weight and put both parameter sets into one optimizer: optimizer = torch.optim.SGD(list(module1.parameters()) + list(module2.parameters()), lr=learning_rate). After loss.backward() and optimizer.step(), I find the shared weight is updated by -2 * learning_rate * accumulated_gradient. I understand that loss.backward() accumulates the gradients from both modules into the shared tensor, but does it make sense to then apply that accumulated gradient twice? In other words, if some modules in a network share weights, do those modules effectively get a learning rate several times larger?
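The double update comes from the shared tensor appearing twice in the optimizer's parameter list, so step() applies -lr * grad once per occurrence. A minimal repro sketch (shapes and values are illustrative, not from the original post):

```python
import torch
import torch.nn as nn

# Tie two Linear weights and list the shared parameter twice in one SGD
# optimizer, reproducing the -2 * lr * grad update described above.
m1 = nn.Linear(4, 4, bias=False)
m2 = nn.Linear(4, 4, bias=False)
m2.weight = m1.weight  # both modules now hold the same tensor

# Both parameter lists contain the same object, so it appears twice.
params = list(m1.parameters()) + list(m2.parameters())
assert params[0] is params[1]

opt = torch.optim.SGD(params, lr=0.1)
x = torch.randn(2, 4)

opt.zero_grad()
(m1(x) + m2(x)).sum().backward()  # both uses accumulate into one .grad
grad = m1.weight.grad.clone()
before = m1.weight.detach().clone()
opt.step()

# step() subtracts lr * grad once per occurrence of the parameter,
# so the net change is -2 * lr * accumulated_gradient.
assert torch.allclose(m1.weight.detach(), before - 2 * 0.1 * grad)
```

Listing the shared parameter only once (e.g. deduplicating the list) gives the usual single update.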

You can use the functional interface instead, so the shared weight exists only once:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLinear(nn.Module):
    def __init__(self):
        super().__init__()
        # F.linear expects weight of shape (out_features, in_features),
        # so a (20, 30) weight maps 30 inputs to 20 outputs and the
        # bias must have 20 elements.
        self.weight = nn.Parameter(torch.randn(20, 30))
        self.bias = nn.Parameter(torch.randn(20))

    def forward(self, input1, input2):
        out1 = F.linear(input1, self.weight, self.bias)
        out2 = F.linear(input2, self.weight, self.bias)
        return out1, out2
```
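With this approach the optimizer sees the shared parameter exactly once, so the gradients from both uses accumulate but the update is applied only once. A self-contained usage sketch (the shapes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One shared weight drives two functional calls; the optimizer holds
# that parameter exactly once.
weight = nn.Parameter(torch.randn(3, 5))  # (out_features, in_features)
bias = nn.Parameter(torch.randn(3))
opt = torch.optim.SGD([weight, bias], lr=0.1)

x1, x2 = torch.randn(2, 5), torch.randn(2, 5)

opt.zero_grad()
loss = (F.linear(x1, weight, bias) + F.linear(x2, weight, bias)).sum()
loss.backward()  # gradients from both uses accumulate into weight.grad

before = weight.detach().clone()
grad = weight.grad.clone()
opt.step()  # a single update: -learning_rate * accumulated_gradient

assert torch.allclose(weight.detach(), before - 0.1 * grad)
```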