How to implement accumulated gradients in PyTorch (i.e. iter_size in Caffe prototxt)

(Heng Cher Keng) #1

How can I accumulate gradients over several mini-batches during gradient descent in PyTorch (i.e. the equivalent of iter_size in a Caffe prototxt)?
Currently, my code is:

    for it, (images, labels, indices) in enumerate(train_loader, 0):
        optimizer.zero_grad()
        outputs = net(Variable(images.cuda()))
        loss    = criterion(outputs, Variable(labels.cuda()))
        loss.backward()
        optimizer.step()

Do I do it like this?

    train_iter = iter(train_loader)

    for n in range(N):
        optimizer.zero_grad()
        loss = 0

        for i in range(M):
            images, labels, indices = next(train_iter)
            outputs = net(Variable(images.cuda()))
            loss   += criterion(outputs, Variable(labels.cuda()))
            loss.backward()

        optimizer.step()
        loss = loss/M
5 Likes
#2

Yes, you are on point. Not zeroing the grad will keep accumulating the gradients in each parameter's .grad buffer.
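
Roughly, the pattern looks like this (a minimal sketch, assuming net, criterion, optimizer and train_loader are defined as in your post; accum_steps is a hypothetical name playing the role of Caffe's iter_size):

    accum_steps = 4   # equivalent of Caffe's iter_size (example value)

    optimizer.zero_grad()
    for step, (images, labels, indices) in enumerate(train_loader):
        outputs = net(Variable(images.cuda()))
        # divide so the accumulated gradient matches the average over one large batch
        loss = criterion(outputs, Variable(labels.cuda())) / accum_steps
        loss.backward()                       # gradients are summed into each parameter's .grad

        if (step + 1) % accum_steps == 0:     # update once every accum_steps mini-batches
            optimizer.step()
            optimizer.zero_grad()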

2 Likes
#3

@Hengck @smth Hi, I have a quick question. As mentioned here,

loss += criterion(outputs, Variable(labels.cuda()))

this will build the graph again and again inside the loop, which may increase memory usage. So should I just write

loss = criterion(outputs, Variable(labels.cuda()))

This will also accumulate the gradients, right? I am confused about which one to use, “=” or “+=”. I just want to have the effect of “iter_size” in Caffe to train large models. Thanks.
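
In other words, the two variants I am comparing look roughly like this (assuming train_iter = iter(train_loader) and the same net and criterion as above):

    # Variant A: "=" -- call backward() on each mini-batch loss;
    # each graph is freed right away, gradients still sum into .grad
    for i in range(M):
        images, labels, indices = next(train_iter)
        outputs = net(Variable(images.cuda()))
        loss = criterion(outputs, Variable(labels.cuda()))
        loss.backward()

    # Variant B: "+=" -- keep summing the losses, call backward() once;
    # all M graphs stay in memory until the single backward()
    loss = 0
    for i in range(M):
        images, labels, indices = next(train_iter)
        outputs = net(Variable(images.cuda()))
        loss += criterion(outputs, Variable(labels.cuda()))
    loss.backward()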

(Heng Cher Keng) #4

Here is the corrected code

    train_iter = iter(train_loader)

    for i in range(num_iters):
        optimizer.zero_grad()
        batch_loss_value = 0

        for m in range(M):
            images, labels, indices = next(train_iter)
            outputs = net(Variable(images.cuda()))
            loss    = criterion(outputs, Variable(labels.cuda()))
            loss.backward()                      # gradients are summed into .grad over the M mini-batches

            batch_loss_value += loss.item()      # plain float, kept only for logging

        optimizer.step()                         # one update with the accumulated gradients
        batch_loss_value = batch_loss_value / M  # average loss over the M mini-batches, for reporting
6 Likes
(Zijun Wei) #5

A follow-up question: how does the result from this approach differ from feeding the mini-batches to M GPUs?
Is it only a matter of speed?

(Will) #6

Have you tried it? Does it perform better than a smaller batch_size?

(Ken Fehling) #7

What do you then do with batch_loss_value? Call backward() on it?

#8

Hey Heng,

I am having a similar problem.

I am trying to build a recurrent neural network that accumulates the gradient over each sequence and performs backpropagation through time. Do you think I can achieve that with the code you posted? If you prefer, I can show you what I have so far in my training loop.

(Mata Fu) #9

Hi, have you solved this problem?

(Saransh Karira) #10

Also be sure to either scale down the gradients (or the losses) before the update or decrease the learning rate, since backward() sums the gradients over the accumulated mini-batches.
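
For example, one minimal way to do the scaling, assuming the same net, criterion, optimizer, M and train_iter as in the corrected code above, is to divide each loss by M before calling backward(), so the accumulated gradient is an average rather than a sum:

    optimizer.zero_grad()
    for m in range(M):
        images, labels, indices = next(train_iter)
        outputs = net(Variable(images.cuda()))
        # scale the loss so the summed gradients equal the average over M mini-batches
        loss = criterion(outputs, Variable(labels.cuda())) / M
        loss.backward()
    optimizer.step()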

#11

Hello, riccard. Did you solve the problem? I have a similar one. If you got it working by following the code example here, please let me know. Thanks.