I am trying to write an SGD variant that behaves more like the one in other frameworks (e.g. Caffe).
This is what I currently have:
import torch


class CaffeSGD(torch.optim.SGD):
    def __init__(self, *args, **kwargs):
        super(CaffeSGD, self).__init__(*args, **kwargs)

    def step(self, closure=None):
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    # in-place on the gradient; fine as long as grads are
                    # recomputed every iteration
                    d_p.add_(p.data, alpha=weight_decay)
                # Caffe folds the lr into the update *before* momentum,
                # unlike stock torch.optim.SGD which applies it after
                d_p.mul_(group['lr'])
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        # first step: no dampening, so v = lr * grad
                        buf = param_state['momentum_buffer'] = torch.zeros_like(p.data)
                        buf.mul_(momentum).add_(d_p)
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)
                    else:
                        d_p = buf
                p.data.sub_(d_p)
        return loss
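As a quick sanity check (a minimal sketch with arbitrary values), a first step should match the Caffe rule v = momentum * v + lr * (grad + weight_decay * p), p = p - v:

import torch

p = torch.nn.Parameter(torch.ones(3))
opt = CaffeSGD([p], lr=0.1, momentum=0.9, weight_decay=0.0005)
p.grad = torch.full((3,), 0.5)
opt.step()
# v = 0.1 * (0.5 + 0.0005 * 1.0) = 0.05005, so each entry is 1 - 0.05005 = 0.94995
print(p.data)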
It looks like it is working, but I wanted to check if this is the right approach.
Yes.
Caffe also has the notion of lr_mult in the prototxt. If it is not specified it defaults to 1, and you do not need to do anything else. In other words, it also depends on the particular prototxt.
You should find which parameters need their lr multiplied by 2. Then, instead of passing model.parameters() to CaffeSGD, use param_groups. You also need a custom scheduler that applies each group's lr_mult.
This creates 3 groups: two different weight decays (for bias and non-bias params), and a different lr for the bias param of conv1a:
decay, no_decay, lr2 = [], [], []
initial_lr = 0.0001

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if "conv1a" in name and name.endswith(".bias"):
        lr2.append(param)
    elif "scale" in name:
        decay.append(param)
    elif len(param.shape) == 1 or name.endswith(".bias"):
        no_decay.append(param)
    else:
        decay.append(param)

param_groups = [{'params': no_decay, 'weight_decay': 0., 'initial_lr': initial_lr, 'lr_mult': 1.},
                {'params': decay, 'weight_decay': 0.0005, 'initial_lr': initial_lr, 'lr_mult': 1.},
                {'params': lr2, 'weight_decay': 0., 'initial_lr': initial_lr * 2., 'lr_mult': 2.}]
# momentum value here is just an example; each group above sets its own weight_decay
optimizer = CaffeSGD(param_groups, lr=initial_lr, momentum=0.9)
As you can see in the code above, CaffeSGD only uses params and weight_decay from each group, so I use lr_mult to also apply a different lr per group in a custom scheduler.
from torch.optim.lr_scheduler import MultiStepLR


class CustomScheduler(MultiStepLR):
    def step(self, epoch=None):
        # let MultiStepLR advance and compute the scheduled lr first
        super(CustomScheduler, self).step(epoch)
        # get_lr() returns one lr per param group; apply each group's lr_mult on top
        for param_group, new_lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = param_group['lr_mult'] * new_lr
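For example (milestones, gamma, and the training-loop names are placeholders):

scheduler = CustomScheduler(optimizer, milestones=[40000, 60000], gamma=0.1)

for iteration in range(max_iterations):
    scheduler.step()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)  # hypothetical forward/loss
    loss.backward()
    optimizer.step()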
What if every layer I care about is a convolution? For example, in DenseNet I want to apply this learning strategy to every nn.Conv2d. How should I modify your condition?
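One way (a sketch, assuming you want the bias of every nn.Conv2d in the lr2 group; conv_param_names is a helper set introduced here, not part of the code above): collect the parameter names of all Conv2d modules first, then test membership instead of matching "conv1a":

import torch.nn as nn

# names of all parameters that live inside a Conv2d module
conv_param_names = set()
for mod_name, mod in model.named_modules():
    if isinstance(mod, nn.Conv2d):
        for p_name, _ in mod.named_parameters(recurse=False):
            conv_param_names.add(mod_name + '.' + p_name)

for name, param in model.named_parameters():
    if not param.requires_grad:
        continue
    if name in conv_param_names and name.endswith(".bias"):
        lr2.append(param)
    # ... rest of the grouping as before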
Isn’t it a bit overkill to rewrite the whole SGD? Wouldn’t it sometimes be sufficient to scale the momentum term by the learning rate to get an equivalent update?
They are actually equivalent when the learning rate does not change, given that the buffer is initialized as v = 0.
The only difference arises when the learning rate changes. In that case, you would need to re-initialize the momentum buffer at the moment of the change so that the two updates keep matching (for stock SGD, rescale the buffer by lr_old / lr_new).
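Concretely, a minimal sketch of that rescaling for stock torch.optim.SGD (rescale_momentum is a hypothetical helper, not part of PyTorch):

def rescale_momentum(optimizer, lr_old, lr_new):
    # Stock SGD keeps v in gradient units (p -= lr * v), while Caffe keeps
    # v in lr units (p -= v). Multiplying the buffer by lr_old / lr_new at
    # an lr change keeps the two trajectories identical.
    for group in optimizer.param_groups:
        for p in group['params']:
            state = optimizer.state[p]
            if 'momentum_buffer' in state:
                state['momentum_buffer'].mul_(lr_old / lr_new)

Call it right after you assign the new lr to the param groups.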