Losses not matching when training using hand-written SGD and torch.optim.SGD

Hi there!

I am working on implementing my own version of SGD, but when I compare the losses against torch.optim.SGD at each iteration, they stop matching after the first couple of steps. Could anyone please help? Thanks in advance.

The code used is below:

import torch as t
from torch.autograd import Variable as V
from copy import deepcopy

x = V(t.randn(100, 4))
y = V(t.randn(100))

model_1 = t.nn.Sequential(t.nn.Linear(4, 8), t.nn.Linear(8, 4), t.nn.Linear(4, 2), t.nn.Linear(2, 1))
model_2 = deepcopy(model_1)

loss_1 = t.nn.MSELoss()
loss_2 = deepcopy(loss_1)

opt = t.optim.SGD(model_2.parameters(), lr=0.001)

for i in range(10):
	print('Not using OPTIM: %f\tUsing OPTIM: %f' % (loss_1(model_1(x), y).data[0], loss_2(model_2(x), y).data[0]))

	# hand-written SGD update for model_1
	loss_1.zero_grad()
	loss_1(model_1(x), y).backward()
	for param in model_1.parameters():
		param.data = param.data - 0.001*param.grad.data

	# torch.optim.SGD update for model_2
	opt.zero_grad()
	loss_2(model_2(x), y).backward()
	opt.step()

The output I got in one such run is as follows:

Not using OPTIM: 1.185485	Using OPTIM: 1.185485
Not using OPTIM: 1.183839	Using OPTIM: 1.183839
Not using OPTIM: 1.180592	Using OPTIM: 1.182216
Not using OPTIM: 1.175832	Using OPTIM: 1.180614
Not using OPTIM: 1.169687	Using OPTIM: 1.179034
Not using OPTIM: 1.162323	Using OPTIM: 1.177475
Not using OPTIM: 1.153939	Using OPTIM: 1.175938
Not using OPTIM: 1.144764	Using OPTIM: 1.174421
Not using OPTIM: 1.135046	Using OPTIM: 1.172924
Not using OPTIM: 1.125052	Using OPTIM: 1.171448

The source code of optim.SGD might be of inspiration here; it is quite simple to follow.
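
In case it helps, with momentum, dampening, and weight decay all left at zero, what opt.step() does per parameter boils down to the same in-place update as the hand-written loop. A rough sketch of that core step (illustrative only, not the optimizer's actual internals):

lr = 0.001
for param in model_2.parameters():
	if param.grad is None:
		continue
	param.data.add_(-lr * param.grad.data)  # p <- p - lr * grad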

Instead of loss_1.zero_grad(), use model_1.zero_grad(). loss_1 is an nn.Module with no parameters, so calling zero_grad() on it does nothing; model_1's gradients therefore keep accumulating across iterations, while opt.zero_grad() correctly clears model_2's. I got identical results between the hand-written SGD and torch.optim.SGD once I made that change.
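
For completeness, the model_1 update inside the training loop then becomes the following (everything else in the original script unchanged):

# zero the gradients stored on the model's parameters, then take a plain SGD step
model_1.zero_grad()
loss_1(model_1(x), y).backward()
for param in model_1.parameters():
	param.data = param.data - 0.001*param.grad.data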