How does the optimizer recognize the parameters when setting different learning rates for different layers?

There are many ways to set different learning rates for different layers.
By changing the "params" entry passed to the optimizer, we can get what we want.
I would like to know how this works under the hood.
According to this tutorial:
How to use an optimizer
it seems that we pass the tensors and specify the learning rate for those parameters, so I wrote some code to test it:

    large_lr_layers = nn.Sequential(*list(model.children())[:-4]).parameters()  
    small_lr_layers = nn.Sequential(*list(model.children())[-4:]).parameters()  

    optimizer = optim.SGD([{"params": large_lr_layers, "lr": 0.1},
                           {"params": torch.randn((3, 64, 64))}], lr=1000000000000000., momentum=0.85)

In the last line I pass a randomly created tensor to the optimizer and then start training, and the training process is not ruined by it. How does PyTorch know that the randomly created tensor should be ignored? Thanks


Because the optimizer is model-agnostic. It creates parameter groups, one group per dictionary you pass. I don’t know the details, but it simply updates the parameters passed in each dictionary using the gradients those tensors hold. You can pass the parameters of several models to a single optimizer and it will work. You can also pass random tensors, since the gradients of those tensors are None.
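
A minimal sketch of this behavior, assuming a recent PyTorch version. The tiny nn.Linear model and the name stray are stand-ins invented for illustration; the point is only that a tensor whose .grad is None is left untouched, even with an absurd learning rate:

```python
import torch
from torch import nn, optim

torch.manual_seed(0)
model = nn.Linear(4, 2)            # hypothetical stand-in for a real model
stray = torch.randn(3, 3)          # never part of any graph, so .grad stays None
stray_before = stray.clone()
weight_before = model.weight.detach().clone()

optimizer = optim.SGD(
    [{"params": model.parameters(), "lr": 0.1},
     {"params": [stray]}],         # this group falls back to the default lr below
    lr=1e15, momentum=0.85)

loss = model(torch.randn(8, 4)).pow(2).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()

stray_unchanged = torch.equal(stray, stray_before)
weight_changed = not torch.equal(model.weight.detach(), weight_before)
print(stray.grad, stray_unchanged, weight_changed)
```

The model weights move because backward() filled their .grad, while the stray tensor is skipped.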

Yeah, I know the rule you mentioned: it updates the parameters in each dictionary with the corresponding learning rate. But I still wonder how exactly it tells which parameters have a gradient and which return None. I guess it has something to do with grad_fn? @albanD would you please give us more detailed information? Thanks

You don’t “tell” it anything. I assume the optimizer simply iterates over those parameters. Any model parameter (and any tensor) has the attribute .grad. So you can always iterate agnostically over the passed parameters and tensors and update each one with its corresponding gradient.

Just look at the source code of SGD.step:

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

            for p in group['params']:
                # Tensors that never received a gradient are skipped here
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    d_p.add_(weight_decay, p.data)
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(1 - dampening, d_p)
                    if nesterov:
                        d_p = d_p.add(momentum, buf)
                    else:
                        d_p = buf

                p.data.add_(-group['lr'], d_p)

        return loss

This depends on the optimizer, but most of them use the logic described above by @JuanFMontesinos.
They iterate over the groups and over the params in each group, and skip any param whose param.grad is None.
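
That loop can be sketched by hand as follows. This is a simplified illustration (plain SGD, no momentum or weight decay), not the library implementation, and manual_sgd_step is a hypothetical helper name:

```python
import torch

def manual_sgd_step(param_groups):
    """Plain SGD update over optimizer-style groups; returns how many tensors were updated."""
    updated = 0
    with torch.no_grad():
        for group in param_groups:
            for p in group["params"]:
                if p.grad is None:          # nothing was backpropagated into p
                    continue
                p.add_(p.grad, alpha=-group["lr"])
                updated += 1
    return updated

w = torch.ones(3, requires_grad=True)
unused = torch.ones(3, requires_grad=True)   # never used in the loss
(w * 2).sum().backward()                     # fills w.grad with 2s, leaves unused.grad as None

n_updated = manual_sgd_step([
    {"params": [w], "lr": 0.1},
    {"params": [unused], "lr": 100.0},
])
print(n_updated, w, unused)
```

Only w is updated; the second group is walked over but its tensor is skipped because its .grad is None.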

@albanD @JuanFMontesinos Yeah, I should’ve looked at the source code!
One more question:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.bn1  = nn.BatchNorm2d(32)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.bn2  = nn.BatchNorm2d(64)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.bn1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

for idx,param in enumerate(model.parameters()):
    block = idx//3
    lr =**block)
optimizer = optim.SGD(layer_collector,momentum=0.85,weight_decay=1e-5)
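
A runnable sketch of the same idea: one parameter group per parameter, with the learning rate shrinking for each block of three parameters. The stand-in model, base_lr and decay are assumptions for illustration, not the values used in the post:

```python
from torch import nn, optim

# Hypothetical stand-in model with six parameters (weight and bias of three layers)
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.Linear(8, 2))
base_lr, decay = 0.1, 0.5          # assumed values, chosen only for the example

layer_collector = []
for idx, param in enumerate(model.parameters()):
    block = idx // 3               # parameters 0-2 form block 0, 3-5 form block 1
    layer_collector.append({"params": [param], "lr": base_lr * decay ** block})

optimizer = optim.SGD(layer_collector, lr=base_lr, momentum=0.85, weight_decay=1e-5)
group_lrs = [group["lr"] for group in optimizer.param_groups]
print(group_lrs)
```

Each dictionary becomes one entry in optimizer.param_groups, and options that a dictionary does not set (here momentum and weight_decay) are filled in from the optimizer's keyword arguments.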

The code above is the model and how I set up the optimizer.
Then I tried this:

for group_param in optimizer.param_groups:
    for param in group_param["params"]:
        for each_set in param:
            print(each_set)

The result gives something like :

tensor([-0.0048,  0.0155, -0.0403, -0.0363,  0.0766,  0.0209, -0.0341,  0.0499,
         0.0590,  0.0267, -0.0719, -0.0199, -0.0585, -0.0632,  0.0872,  0.0512,
        -0.0528, -0.0671, -0.0278, -0.0009,  0.0570,  0.0187,  0.0411, -0.0460,
        -0.0667, -0.0144, -0.0228, -0.0181, -0.0388,  0.0235,  0.0766, -0.0179,
         0.0664,  0.0092, -0.0698, -0.0850,  0.0578,  0.0419, -0.0034,  0.0127,
         0.0263, -0.0060, -0.0120, -0.0356, -0.0429,  0.0169,  0.0010, -0.0227,
        -0.0736, -0.0817,  0.0567,  0.0305,  0.0839, -0.0474, -0.0297,  0.0020,
        -0.0025,  0.0800, -0.0758, -0.0859,  0.0695,  0.0839, -0.0218,  0.0392,
        -0.0795, -0.0634,  0.0108, -0.0785, -0.0103,  0.0450, -0.0252, -0.0361,
        -0.0161,  0.0106,  0.0832, -0.0706, -0.0487, -0.0190,  0.0364, -0.0272,
        -0.0750,  0.0834,  0.0687, -0.0866, -0.0550, -0.0435, -0.0807, -0.0501,
         0.0329,  0.0694,  0.0330,  0.0613, -0.0204, -0.0302,  0.0559,  0.0098,
        -0.0497, -0.0153, -0.0212,  0.0377,  0.0154,  0.0512, -0.0469, -0.0496,
         0.0218, -0.0187,  0.0294, -0.0480,  0.0234,  0.0040, -0.0665, -0.0831,
        -0.0799, -0.0845,  0.0759,  0.0807,  0.0492,  0.0455, -0.0201, -0.0427,
         0.0037, -0.0168,  0.0463,  0.0782,  0.0066,  0.0759, -0.0852,  0.0401],
       grad_fn=<SelectBackward>)

What does grad_fn mean here? Another attempt was:

for group_param in optimizer.param_groups:
    for param in group_param["params"]:
        print(param.grad_fn)

It just returns None. Why does this happen? Thanks!

You did an extra for loop. param is already a single parameter from the list stored in group_param["params"].
When you iterate over a tensor:

for subset in tensor:
  # Stuff

It is the same as:

for idx in range(tensor.size(0)):
    subset = tensor.select(0, idx)
    # Stuff

And since your Tensor requires gradient here, each subset you get is the result of the differentiable select operation, which is why it has a grad_fn.

For the parameters, the grad_fn will always be None and their .grad field will contain a Tensor that contains their gradient after you call .backward().
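
A short check of both statements, assuming a recent PyTorch version. The conv layer is a stand-in with the same shape as conv1 in the model above, and the exact grad_fn class name on the slices can differ between versions, so the sketch only checks whether one is set:

```python
import torch
from torch import nn

conv = nn.Conv2d(1, 32, 3)
weight = conv.weight                              # shape (32, 1, 3, 3)

param_grad_fn = weight.grad_fn                    # None: the parameter was created directly, not computed
row_grad_fns = [row.grad_fn for row in weight]    # every slice is the output of a differentiable op
grad_before = weight.grad                         # None until backward() has run

conv(torch.randn(2, 1, 8, 8)).sum().backward()
grad_after_shape = tuple(weight.grad.shape)       # .grad now has the same shape as the parameter
print(param_grad_fn, row_grad_fns[0], grad_before, grad_after_shape)
```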

For the parameters, the grad_fn will always be None

Is this the reason that I get None for the last lines of code?

And another interesting thing I found:

 large_lr_layers = nn.Sequential(*list(model.children())[:-4]).parameters()
 small_lr_layers = nn.Sequential(*list(model.children())[-4:]).parameters()
 for param in large_lr_layers:
 optimizer = optim.Adadelta([{"params": large_lr_layers,"lr":0.01},
                             {"params": small_lr_layers}],lr =
 for param in large_lr_layers:

After creating the optimizer, the second loop prints nothing at all. Why does this happen?

Is this the reason that I get None for the last lines of code?

Yes. All the leaf Tensors will have None as a grad_fn. This is expected.

Why does this happen?

Because .parameters() returns a Python iterator, so you can only iterate through it once. If you want to go through it multiple times, use the iterator to create a list that you can go through later on: large_lr_layers = list(large_lr_layers).
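
A small sketch of the exhaustion effect, using a stand-in nn.Linear model (an assumption for illustration). It also shows that handing the iterator to an optimizer consumes it, since the optimizer stores the parameters in its own list:

```python
from torch import nn, optim

model = nn.Linear(4, 2)                      # hypothetical stand-in model

params_iter = model.parameters()             # a one-shot iterator
first_pass = [tuple(p.shape) for p in params_iter]
second_pass = [tuple(p.shape) for p in params_iter]   # iterator is already exhausted

gen = model.parameters()
optimizer = optim.SGD([{"params": gen, "lr": 0.01}], lr=0.1)
left_after_optimizer = list(gen)             # the optimizer consumed the iterator

params_list = list(model.parameters())       # a list can be traversed as often as needed
list_passes = [len(list(params_list)), len(list(params_list))]
print(first_pass, second_pass, left_after_optimizer, list_passes)
```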

This got me a bit confused.

for group_param in optimizer.param_groups:
    for param in group_param["params"]:
        for each_set in param:
            print(each_set)

The first iteration over group_param["params"] returns tensor A with size (32,1,3,3), which doesn’t have a grad_fn, and when I iterate through tensor A, each element, with size (1,3,3), has a grad_fn.
But isn’t tensor A just 32 sets of 3*3 filters for the convolution layer? Why is tensor A not a leaf Tensor while A[idx], where idx is 0-31, is a leaf tensor? What’s the big picture here? Thanks

It is the opposite, A is a leaf (A.is_leaf==True) and A[idx] is not (A[idx].is_leaf==False).

The thing is that A is the Tensor that contains all the parameters for which the gradients need to be computed.
If you do A[idx], then you access a subset of A in a differentiable manner. So you get a new Tensor that looks at a subset of A. It is not a leaf anymore, as it was generated by a differentiable function (select in this case).
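
This can be checked directly. In the sketch below A is a random stand-in for the conv weight (an assumption for illustration), and the gradient of a loss computed on A[5] flows back into A.grad only at index 5:

```python
import torch

A = torch.randn(32, 1, 3, 3, requires_grad=True)   # stand-in for the conv weight
sub = A[5]                                          # differentiable view of one filter

a_is_leaf, sub_is_leaf = A.is_leaf, sub.is_leaf     # leaf vs. result of an operation
sub.sum().backward()                                # gradient is accumulated on the leaf A

touched = A.grad[5].eq(1).all().item()              # d(sum)/d(A[5]) is 1 everywhere
untouched = (A.grad[:5].eq(0).all() and A.grad[6:].eq(0).all()).item()
print(a_is_leaf, sub_is_leaf, touched, untouched)
```

The gradient ends up on A, the leaf, which is why the optimizer works with A and never with its slices.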


Okay, I really appreciate your explanation. It helps me a lot!