I tied the weights of two modules with `module1.weight = module2.weight`, and put both modules' parameters into one optimizer: `optimizer = torch.optim.SGD(list(module1.parameters()) + list(module2.parameters()), lr=learning_rate)`. I find that after `loss.backward()` and `optimizer.step()`, the shared weight is updated by `-2 * learning_rate * accumulated_gradient`. I understand that `loss.backward()` accumulates the gradients from both modules into the shared tensor. But does it make sense to apply the accumulated gradient twice? That would mean that if we make some modules in a network share weights, those modules effectively get a several-times-larger learning rate?
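A minimal sketch of the accumulation part of the question (module names and shapes are made up for illustration): two `Linear` modules share one weight tensor, and a single backward pass sums the gradient contributions from both uses into that tensor.

```python
import torch

# Two linear modules sharing one weight tensor.
m1 = torch.nn.Linear(3, 3, bias=False)
m2 = torch.nn.Linear(3, 3, bias=False)
m2.weight = m1.weight  # tie the weights: both modules now hold the same Parameter

# Register the shared parameter only once with the optimizer.
opt = torch.optim.SGD(m1.parameters(), lr=0.1)

x = torch.ones(1, 3)
loss = m1(x).sum() + m2(x).sum()
loss.backward()

# With x all ones, each use of the weight contributes a gradient of 1.0 per
# entry, so the accumulated gradient on the shared tensor is all 2.0.
print(m1.weight.grad)
```

Note that the shared parameter is passed to the optimizer only once here; listing the same tensor in the parameter list twice would make `step()` apply the update twice on top of the already-summed gradient.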
You can use the functional interface, so that both forward passes reuse the same `Parameter` objects:
```python
import torch.nn.functional as F

# in __init__:
weight = nn.Parameter(torch.randn(20, 30))  # shape (out_features=20, in_features=30)
bias = nn.Parameter(torch.randn(20))        # bias must match out_features

# and later in forward(input1, input2):
out1 = F.linear(input1, weight, bias)
out2 = F.linear(input2, weight, bias)
```
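Put into a complete module, the pattern looks like this (a sketch; the class name `SharedLinear` and the batch size are made up, and the shapes follow `F.linear`'s `(out_features, in_features)` weight convention):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLinear(nn.Module):
    """Apply the same affine transform (one weight, one bias) to two inputs."""
    def __init__(self):
        super().__init__()
        # F.linear expects a weight of shape (out_features, in_features).
        self.weight = nn.Parameter(torch.randn(20, 30))
        self.bias = nn.Parameter(torch.randn(20))

    def forward(self, input1, input2):
        out1 = F.linear(input1, self.weight, self.bias)
        out2 = F.linear(input2, self.weight, self.bias)
        return out1, out2

model = SharedLinear()
o1, o2 = model(torch.randn(4, 30), torch.randn(4, 30))
print(o1.shape, o2.shape)  # torch.Size([4, 20]) torch.Size([4, 20])
```

Because the module registers exactly one weight and one bias, `model.parameters()` yields each shared tensor once, so passing it to an optimizer avoids the duplicate-parameter update described in the question.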