Difference between 2 methods for accumulating gradients

Is there any difference between these 2 methods for accumulating gradients?

  1. Accumulate with the averaged loss
accum_loss = 0

for _ in range(10):
    out = model(x)
    loss = get_loss(out, y)
    accum_loss += loss                      # keeps every iteration's graph alive

optimizer.zero_grad()
accum_loss /= 10                            # average the summed loss over the 10 iterations
accum_loss.backward(retain_graph=True)
optimizer.step()
  2. Accumulate with autograd’s backward function
optimizer.zero_grad()
for _ in range(10):
    out = model(x)
    loss = get_loss(out, y)
    loss.backward()                         # accumulates into .grad and frees this iteration's graph

optimizer.step()

Help me please…

These are the differences I could see between the 2 methods of accumulating gradients:

  1. Method 1 uses more memory, because it keeps the computation graphs of all 10 iterations alive until the single backward() call. Method 2 is memory-efficient: each iteration's graph is freed as soon as its loss.backward() returns.

  2. In Method 1 you divide the accumulated loss, and therefore the resulting gradients, by 10 (i.e., you average over the 10 iterations). In Method 2 no averaging takes place; the per-iteration gradients are simply summed, so you would scale each loss by 1/10 inside the loop to make the two equivalent (see the sketch and the check below).
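
For reference, a minimal sketch of that scaled variant of Method 2, assuming model, x, y, get_loss, and optimizer are defined exactly as in your snippets:

optimizer.zero_grad()
for _ in range(10):
    out = model(x)
    loss = get_loss(out, y) / 10            # scale so the summed gradients equal the average
    loss.backward()                         # this iteration's graph is freed here
optimizer.step()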

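If you want to convince yourself, here is a small self-contained check that the two approaches produce the same gradients once the loss is scaled. It uses a toy nn.Linear model, random data, and MSELoss purely as stand-ins, since your actual model and get_loss aren’t shown:

import torch
import torch.nn as nn

# Toy stand-ins (assumptions) for the model, data, and loss in the question.
torch.manual_seed(0)
model = nn.Linear(4, 1)
x = torch.randn(8, 4)
y = torch.randn(8, 1)
get_loss = nn.MSELoss()

# Method 1: sum the losses, average once, single backward over all retained graphs.
model.zero_grad()
accum_loss = 0
for _ in range(10):
    accum_loss += get_loss(model(x), y)
(accum_loss / 10).backward()
grads_method1 = [p.grad.clone() for p in model.parameters()]

# Method 2 with per-iteration scaling: backward inside the loop, graph freed each time.
model.zero_grad()
for _ in range(10):
    (get_loss(model(x), y) / 10).backward()
grads_method2 = [p.grad.clone() for p in model.parameters()]

print(all(torch.allclose(g1, g2) for g1, g2 in zip(grads_method1, grads_method2)))  # True

The gradients match; the remaining difference is peak memory, since Method 1 holds all 10 graphs until its single backward() call.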

Thank you very much!