how to can i accumulate gradient during gradient descent in pytorch (i.e. iter_size in caffe prototxt).
Currently, my code is:
for iter, (images, labels, indices) in enumerate(train_loader, 0):
optimizer.zero_grad()
outputs = net(Variable(images.cuda()))
loss = criterion(outputs, Variable(labels.cuda()))
loss.backward()
optimizer.step()
Do i do this?
for iter in range(N):
optimizer.zero_grad()
loss = 0
for i in range(M):
(images, labels, indices)=train_loader.next():
outputs = net(Variable(images.cuda()))
loss += criterion(outputs, Variable(labels.cuda()))
loss.backward()
optimizer.step()
loss = loss/M
loss += criterion(outputs, Variable(labels.cuda()))
this will build the graph again and again inside the loop, which may increase memory usage. So should I just write
loss = criterion(outputs, Variable(labels.cuda()))
This will also accumulate the gradients, right? I am confusing about which one to use, “=” or “+=”? I just want to have the effect of “iter_size” in Caffe to train large models. Thanks.
for i in range(num_iters):
optimizer.zero_grad()
batch_loss_value = 0
for m in range(M):
(images, labels, indices) = train_loader.next():
outputs = net(Variable(images.cuda()))
loss = criterion(outputs, Variable(labels.cuda()))
loss.backward()
batch_loss_value += loss.cpu().numpy()[0]
optimizer.step()
batch_loss_value = batch_loss_value/M
I am trying to build a recurrent neural network that accumulates the gradient over each sequence and performs backpropagation through time. Do you think I can achieve that with the code you posted? If you prefer I can show you what I have so far in my training.
I’m having the same issue. I use this method to increase batch_size by 100.
I plotted the training loss with 3 different settings (batch_size 200, batch_size 1 * 200 multiplier, batch_size 10 * 20 multiplier) in the following picture:
As you see, these 3 loss curves are almost equivalent. But I don’t know why the test accuracy is much worse when I use this large batch multiplier. Thanks!
I am dealing with a similar problem and you post helps. Thanks!
There is one thing I would like to point out: You probably want to divide gradient by M, if you intended to average gradients for M iterations