How to implement accumulated gradients in PyTorch (i.e. iter_size in Caffe prototxt)

How can I accumulate gradients during gradient descent in PyTorch (i.e. iter_size in a Caffe prototxt)?
Currently, my code is:

     for iter, (images, labels, indices) in enumerate(train_loader, 0):
         optimizer.zero_grad()
         outputs = net(Variable(images.cuda()))
         loss = criterion(outputs, Variable(labels.cuda()))
         loss.backward()
         optimizer.step()

Do I do it like this?

     train_iter = iter(train_loader)

     for it in range(N):
         optimizer.zero_grad()
         loss = 0

         for i in range(M):
             (images, labels, indices) = next(train_iter)
             outputs = net(Variable(images.cuda()))
             loss += criterion(outputs, Variable(labels.cuda()))
             loss.backward()

         optimizer.step()
         loss = loss / M

Yes, you are on point. Not zeroing the grad will keep accumulating the gradients in the parameters' .grad buffers.
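For reference, here is a minimal sketch of that pattern (just a sketch: `accum_steps` is a placeholder name, and `net`, `criterion`, `optimizer` and `train_loader` are assumed to exist as in your snippet; the Variable wrapper is not needed in recent PyTorch versions):

     accum_steps = 4   # how many mini-batches to accumulate before one update

     optimizer.zero_grad()
     for step, (images, labels, indices) in enumerate(train_loader):
         outputs = net(images.cuda())
         loss = criterion(outputs, labels.cuda())
         loss.backward()                      # gradients keep adding up in the .grad buffers
         if (step + 1) % accum_steps == 0:
             optimizer.step()                 # update with the accumulated gradients
             optimizer.zero_grad()            # start accumulating the next group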


@Hengck @smth Hi, I have a quick question. As mentioned here,

loss += criterion(outputs, Variable(labels.cuda()))

this keeps building up the graph inside the loop, which increases memory usage. So should I just write

loss = criterion(outputs, Variable(labels.cuda()))

This will also accumulate the gradients, right? I am confused about which one to use, "=" or "+=". I just want to get the effect of "iter_size" in Caffe so I can train large models. Thanks.

Here is the corrected code. Note that the loss is reassigned with "=" and backward() is called on every micro-batch, so each graph is freed right away; only the plain number is accumulated for logging:

     train_iter = iter(train_loader)

     for i in range(num_iters):
         optimizer.zero_grad()
         batch_loss_value = 0

         for m in range(M):
             (images, labels, indices) = next(train_iter)
             outputs = net(Variable(images.cuda()))
             loss = criterion(outputs, Variable(labels.cuda()))
             loss.backward()                    # gradients accumulate in the .grad buffers

             batch_loss_value += loss.item()    # keep only the Python number for logging

         optimizer.step()                       # single update with the accumulated gradients
         batch_loss_value = batch_loss_value / M

A follow-up question: how is the result of this approach different from splitting the batch across M GPUs?
Is it only a matter of speed?

Have you tried it? Does it perform better than a smaller batch_size?

What do you do then with batch_loss_value? Do you call backward() on it?

Hey Heng,

I am having a similar problem.

I am trying to build a recurrent neural network that accumulates the gradient over each sequence and performs backpropagation through time. Do you think I can achieve that with the code you posted? If you prefer, I can show you what I have so far in my training loop.

Hi, have you solved this problem?

Also be sure to either scale down the gradients before the update or decrease the learning rate.
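For the gradient-scaling option, something along these lines should work (just a sketch; it assumes the M backward passes have already run and that `net`, `optimizer` and `M` are the ones from the code above):

     import torch

     # Scale the accumulated gradients by 1/M so the update corresponds to the
     # average over the M micro-batches rather than their sum.
     with torch.no_grad():
         for p in net.parameters():
             if p.grad is not None:
                 p.grad.div_(M)

     optimizer.step()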

Hello, riccard. Did you solve the problem? I have a similar problem. If you got it working by following the code example here, please let me know. Thanks.

One issue with iter_size is BatchNorm. A bigger batch size (for example bs=128, iter_size=1) doesn't give the same result as bs=64 with iter_size=2, because the BatchNorm statistics are still computed over each smaller forward pass.
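A quick, self-contained way to see this (an illustrative sketch, not from the original post):

     import torch
     import torch.nn as nn

     bn = nn.BatchNorm1d(8)          # training mode by default
     x = torch.randn(128, 8)

     full = bn(x)                                     # statistics computed over all 128 samples
     halves = torch.cat([bn(x[:64]), bn(x[64:])])     # statistics computed over each 64-sample chunk
     print(torch.allclose(full, halves))              # prints False: the normalizations differ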

I’m guessing that he plots it.

I’m having the same issue. I use this method to increase the effective batch_size by a factor of 100.

I plotted the training loss with 3 different settings (batch_size 200, batch_size 1 * 200 multiplier, batch_size 10 * 20 multiplier) in the following picture:

As you see, these 3 loss curves are almost equivalent. But I don’t know why the test accuracy is much worse when I use this large batch multiplier. Thanks!

Hi Heng,

I am dealing with a similar problem and your post helps. Thanks!
There is one thing I would like to point out: you probably want to divide the gradient by M if you intended to average the gradients over M iterations.
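For example, something like this should do it (a sketch reusing the names from the corrected code above; train_iter is assumed to be iter(train_loader)):

     for m in range(M):
         (images, labels, indices) = next(train_iter)
         outputs = net(images.cuda())
         loss = criterion(outputs, labels.cuda()) / M   # divide here so the summed gradients become the average
         loss.backward()

     optimizer.step()
     optimizer.zero_grad()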