How to implement gradient accumulation in PyTorch (i.e. iter_size in Caffe prototxt)

Hengck · April 30, 2017, 1:41pm

How can I accumulate gradients during gradient descent in PyTorch (i.e. iter_size in Caffe prototxt)?
Currently, my code is:

     for iter, (images, labels, indices) in enumerate(train_loader, 0):
 
            optimizer.zero_grad() 
            outputs = net(Variable(images.cuda()))
            loss    = criterion(outputs, Variable(labels.cuda()))
            loss.backward()
            optimizer.step()

Should I do this instead?

     data_iter = iter(train_loader)

     for it in range(N):

            optimizer.zero_grad()
            loss = 0

            for i in range(M):
                   images, labels, indices = next(data_iter)
                   outputs = net(Variable(images.cuda()))
                   loss    += criterion(outputs, Variable(labels.cuda()))
                   loss.backward()

            optimizer.step()
            loss = loss/M

smth · April 30, 2017, 2:54pm

Yes, you are on point. Not zeroing the grad will keep accumulating the gradient.

zhuyi490 · June 2, 2017, 11:01pm

@Hengck @smth Hi, I have a quick question. As mentioned here,

loss += criterion(outputs, Variable(labels.cuda()))

this will build the graph again and again inside the loop, which may increase memory usage. So should I just write

loss = criterion(outputs, Variable(labels.cuda()))

This will also accumulate the gradients, right? I am confused about which one to use, “=” or “+=”. I just want the effect of “iter_size” in Caffe so that I can train large models. Thanks.

Hengck · September 15, 2017, 3:49pm

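Here is the corrected code:

     # DataLoader has no .next(); take an iterator once and call next() on it.
     # Assumes the loader yields at least num_iters * M batches.
     data_iter = iter(train_loader)

     for i in range(num_iters):

            optimizer.zero_grad()
            batch_loss_value = 0

            for m in range(M):
                   images, labels, indices = next(data_iter)
                   outputs = net(Variable(images.cuda()))
                   loss    = criterion(outputs, Variable(labels.cuda()))
                   loss.backward()   # gradients add up in .grad across the M calls

                   batch_loss_value += loss.item()   # plain Python float, for logging only

            optimizer.step()
            batch_loss_value = batch_loss_value/M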

Zijun_Wei · September 19, 2017, 4:48pm

A follow-up question: how does the result obtained this way differ from feeding one batch to M GPUs?
Is it only a matter of speed?

will_soon · October 21, 2017, 2:12pm

Have you tried it? Does it perform better than a smaller batch_size?

kenfehling · March 2, 2018, 9:23am

What do you do then with batch_loss_value, call backward() on it?

riccardosamperna · August 20, 2018, 4:45pm

Hey Heng,

I am having a similar problem.

I am trying to build a recurrent neural network that accumulates the gradient over each sequence and performs backpropagation through time. Do you think I can achieve that with the code you posted? If you prefer I can show you what I have so far in my training.

Mata_Fu · September 17, 2018, 4:33pm

Hi, have you solved this problem?

saransh_karira · October 26, 2018, 10:33am

Also be sure to either scale down the gradients before the update or decrease the learning rate, since the accumulated gradient is a sum over the M backward passes.
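To make the two options concrete, here is a minimal pure-Python sketch on a made-up 1-D least-squares problem (the data, `w0`, `lr`, and the `grad` helper are all illustrative assumptions, not from the thread). It only covers a single plain SGD step; with weight decay or adaptive optimizers such as Adam the two options are generally not interchangeable.

```python
# Toy problem: per-micro-batch loss_k(w) = (w * x_k - y_k) ** 2.
micro_batches = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0)]  # made-up (x, y) pairs
M = len(micro_batches)
w0 = 0.5    # current weight (hypothetical)
lr = 0.01   # learning rate (hypothetical)


def grad(w, x, y):
    """Derivative of (w * x - y) ** 2 with respect to w."""
    return 2.0 * x * (w * x - y)


# Repeated backward() calls without zero_grad() leave the SUM in .grad.
summed = sum(grad(w0, x, y) for x, y in micro_batches)

# Option 1: scale the accumulated gradient down by M, keep the learning rate.
w_scaled_grad = w0 - lr * (summed / M)

# Option 2: keep the summed gradient, divide the learning rate by M.
w_scaled_lr = w0 - (lr / M) * summed

# Doing neither takes a step M times larger than intended.
w_unscaled = w0 - lr * summed
```

Both options land on the same weight, while the unscaled update moves M times as far from `w0`.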

Ray_Wong · May 15, 2019, 5:06am

Hello, Riccardo. Have you solved the problem? I have a similar one. If you got it working by following the code example here, please let me know. Thanks.

singleroc · June 4, 2019, 8:03am

One issue with iter_size is BatchNorm. A bigger batch size (for example bs=128, iter_size=1) doesn’t give the same result as bs=64 with iter_size=2, because in training mode BatchNorm computes its statistics over each forward batch separately.
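A small pure-Python sketch of why this happens, assuming a single feature with made-up activation values and a hypothetical `batch_norm` helper that mimics training-mode normalization (batch mean and biased variance, no learned scale or shift):

```python
from statistics import fmean, pvariance

# Made-up activations of one feature for 8 samples.
activations = [0.1, 0.4, 0.2, 0.3, 2.0, 2.5, 1.8, 2.2]


def batch_norm(values, eps=1e-5):
    """Normalize using the mean and biased variance of this batch only."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


# bs=8, iter_size=1: statistics come from all 8 samples at once.
full = batch_norm(activations)

# bs=4, iter_size=2: each micro-batch is normalized with its own statistics.
split = batch_norm(activations[:4]) + batch_norm(activations[4:])

# Largest per-sample difference between the two normalized outputs.
max_diff = max(abs(a - b) for a, b in zip(full, split))
```

The same samples get different normalized values depending on how they are grouped, so the losses and gradients differ too, no matter how the accumulated gradients are scaled afterwards.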

peter_wills · July 24, 2019, 6:27pm

I’m guessing that he plots it.

DTCancri · September 18, 2019, 10:30pm

I’m having the same issue. I use this method to increase batch_size by 100.

I plotted the training loss with 3 different settings (batch_size 200, batch_size 1 * 200 multiplier, batch_size 10 * 20 multiplier) in the following picture:

As you can see, these 3 loss curves are almost equivalent. But I don’t know why the test accuracy is much worse when I use this large batch multiplier. Thanks!

MaxT123 · May 11, 2020, 5:18pm

Hi Heng,

I am dealing with a similar problem and your post helps. Thanks!
There is one thing I would like to point out: you probably want to divide the gradient by M if you intend to average the gradients over M iterations.
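One common way to do this is to divide each micro-batch loss by M before calling backward(). The sketch below checks that idea on a toy setup; the `nn.Linear` model, `nn.MSELoss`, random data, and the sizes are assumptions chosen for illustration. The match relies on equal-sized micro-batches, a mean-reduced loss, and no batch-dependent layers such as BatchNorm.

```python
import torch
from torch import nn

torch.manual_seed(0)

M = 4          # micro-batches accumulated per optimizer step
micro_bs = 8   # samples per micro-batch

# Toy data and model, purely illustrative.
x = torch.randn(M * micro_bs, 10)
y = torch.randn(M * micro_bs, 1)
net = nn.Linear(10, 1)
criterion = nn.MSELoss()  # default reduction is the mean over the batch

# Reference: gradient from one large batch of M * micro_bs samples.
net.zero_grad()
criterion(net(x), y).backward()
full_grad = net.weight.grad.clone()

# Accumulated: M micro-batches, each loss divided by M before backward().
net.zero_grad()
for xb, yb in zip(x.chunk(M), y.chunk(M)):
    loss = criterion(net(xb), yb) / M
    loss.backward()  # adds this micro-batch's gradient into .grad
accum_grad = net.weight.grad.clone()

# True when both gradients agree up to floating-point error.
grads_match = torch.allclose(full_grad, accum_grad, atol=1e-6)
```

In a real training loop, optimizer.step() and optimizer.zero_grad() would follow the inner loop, as in the code earlier in this thread.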