SGD compatibility with other frameworks

I am trying to write an SGD optimizer that behaves more like the one in other frameworks (Caffe in particular).

This is what I currently have:

import torch

class CaffeSGD(torch.optim.SGD):
    def __init__(self, *args, **kwargs):
        super(CaffeSGD, self).__init__(*args, **kwargs)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

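            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    # L2 weight decay, added out of place so p.grad is left untouched
                    d_p = d_p.add(p.data, alpha=weight_decay)
                # NOTE: the lines lost from this post are reconstructed from torch.optim.SGD.
                # Assumption: the Caffe-style change is to fold lr into the update before
                # momentum (v = momentum * v + lr * grad; p -= v), as discussed further down.
                d_p = d_p * group['lr']
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                        buf.mul_(momentum).add_(d_p)
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)
                    else:
                        d_p = buf

                # lr is already inside d_p, so the step is a plain subtraction
                p.data.sub_(d_p)
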
        return loss

It looks like it is working, but I wanted to check if this is the right approach.

I found the reason SGD is different here.

Great. Have you tried your SGD and compared it with Caffe? Are the results the same?

Yes, it is the same.
And if you have lr_mult, just add those parameters to a different optimization group.

So, could you tell me what the difference is between SGD in PyTorch and in Caffe?
If in PyTorch I use

optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay= 0.0005, momentum =0.9)

then the Caffe-equivalent version would be the following. Am I right?

optimizer = CaffeSGD(model.parameters(), lr=0.01, weight_decay= 0.0005, momentum =0.9)

Caffe also has this notion of lr_mult in the prototxt. If it is not specified it defaults to 1, and you do not need to do anything else. In other words, it also depends on the particular prototxt.

You mean lr_mult, as in this layer?

layer {
  name: "conv1a"
  type: "Convolution"
  bottom: "data"
  top: "conv1a"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  convolution_param {
    num_output: 64
    kernel_size: 3
    pad: 1
    stride: 1
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}

For that, how do you set lr_mult to 2 and to 1 using your CaffeSGD?

You should find which parameters need to be multiplied by 2. Then, instead of passing model.parameters() to CaffeSGD, pass param_groups. You also need a custom scheduler that applies the lr_mult to the different groups.

This creates 3 groups: two with different weight decay (for bias and non-bias params), and one with a different lr for the bias param of conv1a:

        decay, no_decay, lr2 = [], [], []
        initial_lr = 0.0001
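        for name, param in model.named_parameters():
            if not param.requires_grad:
                continue  # frozen parameters are not optimized
            if "conv1a" in name and name.endswith(".bias"):
                lr2.append(param)
            elif "scale" in name:
                # assumed: scale parameters keep weight decay (this body was lost from the post)
                decay.append(param)
            elif len(param.shape) == 1 or name.endswith(".bias"):
                no_decay.append(param)
            else:
                decay.append(param)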

        param_groups = [{'params': no_decay, 'weight_decay': 0., 'initial_lr': initial_lr, 'lr_mult': 1.},
                        {'params': decay, 'weight_decay': 0.0005, 'initial_lr': initial_lr, 'lr_mult': 1.},
                        {'params': lr2, 'weight_decay': 0., 'initial_lr': initial_lr * 2., 'lr_mult': 2.}]

        optimizer = CaffeSGD(param_groups, lr=lr, momentum=momentum, weight_decay=weight_decay)

As you can see in the code above, CaffeSGD only uses params and weight_decay from each group, so I use lr_mult in a custom scheduler to apply a different lr to each group.

class CustomScheduler(MultiStepLR):
    def step(self, iterations=None):
        new_lr = self.get_lr()
        for i, param_group in enumerate(self.optimizer.param_groups):
            param_group['lr'] = param_group['lr_mult'] * new_lr
        super(CustomScheduler, self).step(iterations=iterations)
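
For what it's worth, here is a possible alternative (a sketch, not the code used for the parity check in this thread). If each group's multiplier is folded into that group's own lr when the optimizer is built, the stock MultiStepLR should preserve the ratio, because it multiplies every group's lr by gamma at each milestone. The toy parameters, base_lr, milestones and gamma below are made-up values, and plain torch.optim.SGD stands in for CaffeSGD.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

base_lr = 0.01  # hypothetical base learning rate

# Two toy parameters standing in for a weight and a bias.
weight = torch.nn.Parameter(torch.ones(3))
bias = torch.nn.Parameter(torch.ones(3))

# Fold the Caffe-style lr_mult into each group's own lr.
optimizer = torch.optim.SGD(
    [
        {"params": [weight], "lr": base_lr * 1.0},  # lr_mult: 1
        {"params": [bias], "lr": base_lr * 2.0},    # lr_mult: 2
    ],
    lr=base_lr,
    momentum=0.9,
)

# Every group's lr is multiplied by gamma once the milestone is reached.
scheduler = MultiStepLR(optimizer, milestones=[2], gamma=0.1)

lr_history = []
for epoch in range(4):
    optimizer.zero_grad()
    loss = (weight ** 2).sum() + (bias ** 2).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()
    lr_history.append([group["lr"] for group in optimizer.param_groups])
```

Each entry of lr_history holds the two group learning rates after one epoch; the second group should stay at twice the first both before and after the decay.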

Thanks so much @dashesy. For your condition

if "conv1a" in name and name.endswith(".bias"):

What if I want to apply it to every convolution layer instead? For example, in DenseNet I want to use this learning strategy for all nn.Conv2d layers. How should I modify your condition?
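
One possible way to do this (a sketch, not something confirmed in this thread) is to select by module type rather than by name: walk model.modules(), pick the bias of every nn.Conv2d, and put everything else in a second list. The helper name split_conv_bias and the small nn.Sequential model are hypothetical and only serve the illustration.

```python
import torch.nn as nn


def split_conv_bias(model):
    """Return (conv_bias_params, other_params) for the trainable parameters."""
    conv_bias, others = [], []
    conv_bias_ids = set()
    # Select by module type instead of matching on the parameter name.
    for module in model.modules():
        if isinstance(module, nn.Conv2d) and module.bias is not None:
            if module.bias.requires_grad:
                conv_bias.append(module.bias)
                conv_bias_ids.add(id(module.bias))
    # Everything that is trainable and not a Conv2d bias goes to the other list.
    for param in model.parameters():
        if param.requires_grad and id(param) not in conv_bias_ids:
            others.append(param)
    return conv_bias, others


# Hypothetical toy model: two convolutions with bias, one without.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.BatchNorm2d(8),
    nn.Conv2d(8, 4, 3, bias=False),
    nn.Conv2d(4, 2, 1),
)
conv_bias, others = split_conv_bias(model)

# Same group layout as in the post above, with lr_mult carried per group.
param_groups = [
    {"params": others, "lr_mult": 1.0},
    {"params": conv_bias, "lr_mult": 2.0},
]
```

If the convolutions in a model are built with bias=False, the first list simply comes back empty, so it is worth checking its length before creating the optimizer.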

According to the only difference:

Isn’t it a bit overkill to rewrite the whole SGD? Wouldn’t it be sufficient to scale the momentum term by the learning rate to get the equivalent when needed sometimes?

They are actually equivalent as long as the learning rate does not change, given that the initialisation is simply v = 0.
The only difference appears when the learning rate changes. In that case, you would need to rescale the momentum buffer so that

v = old_lr / new_lr * v

to get equivalent behaviour (if I am not mistaken).
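
A small numeric check of this claim, written as a sketch on a single scalar weight with no weight decay and no dampening. The function names, the gradient sequence and the learning-rate schedules are made up for the illustration. pytorch_style applies w -= lr * v with v = momentum * v + g, while caffe_style applies w -= v with v = momentum * v + lr * g.

```python
def pytorch_style(grads, lrs, momentum, rescale_on_lr_change=False):
    """PyTorch-style momentum: v = momentum * v + g, then w -= lr * v."""
    w, v, prev_lr = 0.0, 0.0, None
    trajectory = []
    for g, lr in zip(grads, lrs):
        if rescale_on_lr_change and prev_lr is not None and lr != prev_lr:
            v *= prev_lr / lr  # v = old_lr / new_lr * v
        v = momentum * v + g
        w -= lr * v
        prev_lr = lr
        trajectory.append(w)
    return trajectory


def caffe_style(grads, lrs, momentum):
    """Caffe-style momentum: v = momentum * v + lr * g, then w -= v."""
    w, v = 0.0, 0.0
    trajectory = []
    for g, lr in zip(grads, lrs):
        v = momentum * v + lr * g
        w -= v
        trajectory.append(w)
    return trajectory


def close(a, b, tol=1e-9):
    """True when two trajectories agree element-wise within tol."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


grads = [1.0, -0.5, 2.0, 0.25, -1.5, 0.75]
constant_lr = [0.1] * 6
stepped_lr = [0.1, 0.1, 0.1, 0.01, 0.01, 0.01]

same_at_constant_lr = close(pytorch_style(grads, constant_lr, 0.9),
                            caffe_style(grads, constant_lr, 0.9))
same_after_lr_drop = close(pytorch_style(grads, stepped_lr, 0.9),
                           caffe_style(grads, stepped_lr, 0.9))
same_with_rescale = close(pytorch_style(grads, stepped_lr, 0.9, rescale_on_lr_change=True),
                          caffe_style(grads, stepped_lr, 0.9))
```

With these inputs the two rules agree at a constant learning rate, diverge once the learning rate drops, and agree again when the buffer is rescaled at the moment of the change.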

I wanted to verify parity, so I had to have the same SGD as Caffe. Otherwise PyTorch’s SGD is just as good.

Perhaps, but that also needs a new class anyway, because you cannot just pass a modified lr to get that behavior.