I am trying to write an SGD variant that behaves more like the one in other frameworks (e.g. Caffe).
This is what I currently have:
import torch


class CaffeSGD(torch.optim.SGD):
    def __init__(self, *args, **kwargs):
        super(CaffeSGD, self).__init__(*args, **kwargs)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    # in-place on the gradient; fine as long as grads are
                    # recomputed every iteration
                    d_p.add_(p.data, alpha=weight_decay)
                # Caffe folds the lr into the update *before* momentum,
                # unlike stock torch.optim.SGD which applies it after
                d_p.mul_(group['lr'])
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        # first step: no dampening, so v = lr * grad
                        buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                        buf.mul_(momentum).add_(d_p)
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)
                    else:
                        d_p = buf
                p.data.sub_(d_p)
        return loss
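As a quick sanity check (a minimal sketch with arbitrary values), a first step should match the Caffe rule v = momentum * v + lr * (grad + weight_decay * p), p = p - v:

import torch

p = torch.nn.Parameter(torch.ones(3))
opt = CaffeSGD([p], lr=0.1, momentum=0.9, weight_decay=0.0005)
p.grad = torch.full((3,), 0.5)
opt.step()
# v = 0.1 * (0.5 + 0.0005 * 1.0) = 0.05005, so each entry is 1 - 0.05005 = 0.94995
print(p.data)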
It looks like it is working, but I wanted to check if this is the right approach.
Yes.
Caffe also has the notion of lr_mult in the prototxt. If it is not specified it defaults to 1, and you do not need to do anything else. In other words, it also depends on the particular prototxt.
You should find which parameters need their lr multiplied by 2. Then, instead of passing model.parameters() to CaffeSGD, use param_groups. You also need a custom scheduler that applies each group's lr_mult.
This creates 3 groups: two different weight decays (for bias and non-bias params), and a different lr for the bias param of conv1a:
decay, no_decay, lr2 = [], [], []
initial_lr = 0.0001

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "conv1a" in name and name.endswith(".bias"):
        lr2.append(param)
    elif "scale" in name:
        decay.append(param)
    elif len(param.shape) == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

param_groups = [{'params': no_decay, 'weight_decay': 0., 'initial_lr': initial_lr, 'lr_mult': 1.},
                {'params': decay, 'weight_decay': 0.0005, 'initial_lr': initial_lr, 'lr_mult': 1.},
                {'params': lr2, 'weight_decay': 0., 'initial_lr': initial_lr * 2., 'lr_mult': 2.}]
# momentum value here is just an example; each group above sets its own weight_decay
optimizer = CaffeSGD(param_groups, lr=initial_lr, momentum=0.9)
As you can see in the code above, CaffeSGD only uses params and weight_decay from each group, so I use lr_mult to also apply a different lr per group in a custom scheduler.
from torch.optim.lr_scheduler import MultiStepLR


class CustomScheduler(MultiStepLR):
    def step(self, epoch=None):
        # let MultiStepLR advance and compute the scheduled lr first
        super(CustomScheduler, self).step(epoch)
        # get_lr() returns one lr per param group; apply each group's lr_mult on top
        for param_group, new_lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = param_group['lr_mult'] * new_lr
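For example (milestones, gamma, and the training-loop names are placeholders):

scheduler = CustomScheduler(optimizer, milestones=[40000, 60000], gamma=0.1)

for iteration in range(max_iterations):
    scheduler.step()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)  # hypothetical forward/loss
    loss.backward()
    optimizer.step()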
What if every layer I care about is a convolution? For example, in DenseNet I want to apply this learning strategy to every nn.Conv2d. How should I modify your condition?
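One way (a sketch, assuming you want the bias of every nn.Conv2d in the lr2 group; conv_param_names is a helper set introduced here, not part of the code above): collect the parameter names of all Conv2d modules first, then test membership instead of matching "conv1a":

import torch.nn as nn

# names of all parameters that live inside a Conv2d module
conv_param_names = set()
for mod_name, mod in model.named_modules():
    if isinstance(mod, nn.Conv2d):
        for p_name, _ in mod.named_parameters(recurse=False):
            conv_param_names.add(mod_name + '.' + p_name)

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if name in conv_param_names and name.endswith(".bias"):
        lr2.append(param)
    # ... rest of the grouping as before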
Isn’t it a bit overkill to rewrite the whole SGD? Wouldn’t it sometimes be sufficient to scale the momentum term by the learning rate to get an equivalent update?
They are actually equivalent when the learning rate does not change, given that the buffer is initialized as v = 0.
The only difference arises when the learning rate changes. In that case, you would need to re-initialize the momentum buffer at the moment of the change so that the two updates keep matching (for stock SGD, rescale the buffer by lr_old / lr_new).
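Concretely, a minimal sketch of that rescaling for stock torch.optim.SGD (rescale_momentum is a hypothetical helper, not part of PyTorch):

def rescale_momentum(optimizer, lr_old, lr_new):
    # Stock SGD keeps v in gradient units (p -= lr * v), while Caffe keeps
    # v in lr units (p -= v). Multiplying the buffer by lr_old / lr_new at
    # an lr change keeps the two trajectories identical.
    for group in optimizer.param_groups:
        for p in group['params']:
            state = optimizer.state[p]
            if 'momentum_buffer' in state:
                state['momentum_buffer'].mul_(lr_old / lr_new)

Call it right after you assign the new lr to the param groups.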