Losses not matching when training using hand-written SGD and torch.optim.SGD

Hi there!

I am working on implementing my own version of SGD, but when I compare the losses at each iteration against torch.optim.SGD, they do not match. Could anyone please help? Thanks in advance.

The code used is below:

import torch as t
from torch.autograd import Variable as V
from copy import deepcopy

x = V(t.randn(100, 4))
y = V(t.randn(100))

model_1 = t.nn.Sequential(t.nn.Linear(4, 8), t.nn.Linear(8, 4), t.nn.Linear(4, 2), t.nn.Linear(2, 1))
model_2 = deepcopy(model_1)

loss_1	= t.nn.MSELoss()
loss_2 	= deepcopy(loss_1)

opt	= t.optim.SGD(model_2.parameters(), lr=0.001)

for i in range(0, 10):
	print('Not using OPTIM: %f\tUsing OPTIM: %f' % (loss_1(model_1(x), y).data[0], loss_2(model_2(x), y).data[0]))

	# hand-written SGD on model_1
	loss_1.zero_grad()
	loss_1(model_1(x), y).backward()
	for param in model_1.parameters():
		param.data = param.data - 0.001*param.grad.data

	# torch.optim.SGD on model_2
	opt.zero_grad()
	loss_2(model_2(x), y).backward()
	opt.step()

The output that I got in one such instance is as follows:

Not using OPTIM: 1.185485	Using OPTIM: 1.185485
Not using OPTIM: 1.183839	Using OPTIM: 1.183839
Not using OPTIM: 1.180592	Using OPTIM: 1.182216
Not using OPTIM: 1.175832	Using OPTIM: 1.180614
Not using OPTIM: 1.169687	Using OPTIM: 1.179034
Not using OPTIM: 1.162323	Using OPTIM: 1.177475
Not using OPTIM: 1.153939	Using OPTIM: 1.175938
Not using OPTIM: 1.144764	Using OPTIM: 1.174421
Not using OPTIM: 1.135046	Using OPTIM: 1.172924
Not using OPTIM: 1.125052	Using OPTIM: 1.171448

The source code of optim.SGD might be a useful reference here; it is quite simple to understand.

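Instead of loss_1.zero_grad(), use model_1.zero_grad(). loss_1 is an MSELoss module with no parameters of its own, so calling zero_grad() on it clears nothing, and the gradients of model_1 keep accumulating from one iteration to the next. I got identical results between hand-written SGD and torch.optim.SGD once I made that change.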
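
For anyone landing here later, this is a minimal sketch of the corrected comparison, assuming a recent PyTorch release (plain tensors instead of Variable, .item() instead of .data[0]). The fixed seed, the single shared loss_fn, and the (100, 1) shape for y are my own choices rather than part of the original post; the reshaped target simply avoids broadcasting against the (100, 1) model output.

```python
import copy
import torch

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable

x = torch.randn(100, 4)
y = torch.randn(100, 1)  # assumption: (100, 1) to match the model output shape

model_1 = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.Linear(8, 4),
    torch.nn.Linear(4, 2), torch.nn.Linear(2, 1),
)
model_2 = copy.deepcopy(model_1)  # identical starting weights

loss_fn = torch.nn.MSELoss()
lr = 0.001
opt = torch.optim.SGD(model_2.parameters(), lr=lr)

losses_manual, losses_optim = [], []
for i in range(10):
    # Hand-written SGD: clear the gradients held by the *model*, not the loss.
    model_1.zero_grad()
    loss_a = loss_fn(model_1(x), y)
    loss_a.backward()
    with torch.no_grad():
        for param in model_1.parameters():
            param -= lr * param.grad

    # Same step on the copy via torch.optim.SGD.
    opt.zero_grad()
    loss_b = loss_fn(model_2(x), y)
    loss_b.backward()
    opt.step()

    losses_manual.append(loss_a.item())
    losses_optim.append(loss_b.item())
    print('Not using OPTIM: %f\tUsing OPTIM: %f' % (losses_manual[-1], losses_optim[-1]))
```

With the gradients of model_1 cleared on every iteration, the two columns should agree to within floating-point rounding. Removing the model_1.zero_grad() line brings back the drift shown in the question.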