How do you average the related variables holding the gradients? loss / len(minibatch)?
Yes. Divide it by the number of iterations in the for loop.
Is there a performance penalty to running backward() multiple times vs. just using a bigger batch (in situations where it's an option)?
Yes, it’s always going to be slower, but it’s a tradeoff between performance and memory usage. Try to do as few iterations as you can (you can split each batch into smaller sub-batches, so that they nearly fill up the memory).
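To make the sub-batch idea concrete, here is a minimal sketch, not a definitive recipe. It assumes a toy nn.Linear model, an MSE loss that averages over samples, and equal-sized sub-batches; the names net, crit and num_chunks are illustrative only. Each sub-batch loss is divided by the number of iterations, and backward() is called once per sub-batch so only one small graph is alive at a time.

```python
import torch
from torch import nn

torch.manual_seed(0)
net = nn.Linear(4, 1)            # toy model, an assumption for this sketch
crit = nn.MSELoss()              # averages over the samples it is given
data, target = torch.randn(8, 4), torch.randn(8, 1)

# Reference: a single backward pass over the full batch.
net.zero_grad()
crit(net(data), target).backward()
full_grad = net.weight.grad.clone()

# Accumulation: one backward pass per sub-batch. Each loss is divided by
# the number of iterations so the accumulated gradient is the average.
num_chunks = 4
net.zero_grad()
for sub_data, sub_target in zip(data.chunk(num_chunks), target.chunk(num_chunks)):
    loss = crit(net(sub_data), sub_target) / num_chunks
    loss.backward()              # adds into .grad; this sub-batch's graph is freed
accum_grad = net.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-6))
```

Under these assumptions the two gradients agree up to floating-point error, so the weight update is the same as with the bigger batch. Layers whose output depends on the batch itself, such as batch normalization, would not behave identically.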
But using the accumulated gradient method doesn’t affect the performance (accuracy) of the model, right?
@apaszke, @albanD, I also tried to achieve this. As you have said, doing backward() for each sample is slow compared to accumulating the loss, doing one average and then doing backward. Here is my code
num_epoch = 10
real_batchsize = 100  # I want to update weight every `real_batchsize`
for epoch in range(num_epoch):
    total_loss = Variable(torch.zeros(1).cuda(), requires_grad=True)
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data.cuda()), Variable(target.cuda())
        output = net(data)
        loss = crit(output, target)
        total_loss = total_loss + loss
        if batch_idx % real_batchsize == 0:
            ave_loss = total_loss/real_batchsize
            ave_loss.backward()
            optimizer.step()
            total_loss.data.zero_()
            optimizer.zero_grad()
The above code will produce an error:
RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.
I have looked up this issue, but it is still not very clear to me. It just feels like we need a new total_loss after each weight update, so I replaced the line
    total_loss.data.zero_()
by
    total_loss = Variable(torch.zeros(1).cuda(), requires_grad=True)
Now it seems to work. But I am not 100% sure that I have done it right. Can you give any advice on how to do it properly?
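For anyone hitting the same error, here is a small sketch of what seems to be going on, using current tensor syntax instead of Variable. The parameter w and the helper step_loss are hypothetical stand-ins for the network and its loss. Zeroing the data of the running total resets its value but leaves it attached to the graph that the first backward() already freed, whereas rebinding the name starts a fresh graph.

```python
import torch

w = torch.ones(3, requires_grad=True)   # stand-in for the network weights

def step_loss():
    # exp() saves its result for the backward pass, so its buffers
    # are freed once backward() has run through it.
    return w.exp().sum()

total = torch.zeros(1)
total = total + step_loss()
total.backward()                         # frees the buffers of this graph

total.data.zero_()                       # value is reset, history is NOT
total = total + step_loss()
try:
    total.backward()                     # walks back into the freed graph
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

# Rebinding the name drops the old history, so backward() works again.
w.grad = None
total = 0 + step_loss()
total.backward()
fresh_grad = w.grad.clone()

print(second_backward_failed)
```

So replacing the in-place zeroing with a newly created total after each update, as done above, is the right direction.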
I think a simpler way to do this would be:
num_epoch = 10
real_batchsize = 100  # I want to update weight every `real_batchsize`
for epoch in range(num_epoch):
    total_loss = 0
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data.cuda()), Variable(target.cuda())
        output = net(data)
        loss = crit(output, target)
        total_loss = total_loss + loss
        if batch_idx % real_batchsize == 0:
            ave_loss = total_loss/real_batchsize
            optimizer.zero_grad()
            ave_loss.backward()
            optimizer.step()
            total_loss = 0
Can we directly add an int value to a torch Variable? Will the average loss and the backward gradient calculation be on the GPU?
Yes, you can add a Python number to a Variable (or a Tensor) and the output is going to be a Variable (or a Tensor) of the same type as the input (so if it was on the GPU, the output will be on the GPU).
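A quick way to check this yourself is sketched below. It runs on the CPU so it works anywhere; the tensor t is just an illustrative value.

```python
import torch

t = torch.ones(3, dtype=torch.float32, requires_grad=True)

total = 0                  # plain Python int
total = total + t.sum()    # the result is a tensor, not a Python number

print(type(total), total.dtype, total.device, total.requires_grad)
```

The result keeps the dtype and device of the tensor operand and stays connected to the autograd graph.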
Just a note (which you may be aware of):
Doing this is not the same as @albanD's first answer.
When you call ave_loss.backward() you propagate errors with respect to your (correct) loss, but these errors are functions of what the activations are when the .backward method was called. Since you’ve thrown away all but the last 10 samples, you are making the assumption that the first 90 samples were the same as the last 10.
Regarding this tradeoff, do you save time/memory by using retain_graph=True in that situation?
For example, my current code looks like this:
x = tensor(x0, requires_grad=True)
loss = 0
for i in range(inputs.numel()):  # For my apps, it's between 5 and 50.
    rec = f(x,i)
    loss += loss_func(inputs[i], rec)
loss.backward()
g = x.grad
My current problem is that the computational graph takes too much memory because the function f does a lot of computation. So a solution would be to do as @albanD suggested:
x = tensor(x0, requires_grad=True)
for i in range(inputs.numel()):
    rec = f(x,i)
    loss = loss_func(inputs[i], rec)  # a fresh loss each iteration, not `+=`
    loss.backward()  # frees this iteration's graph; gradients accumulate in x.grad
g = x.grad
But I feel like the computational graph for each iteration of that loop is the same, it’s just the numbers on which we apply it that change. So maybe we could reuse the previous iteration’s graph (by specifying retain_graph=True). Could that save some time? If not, what would happen (in terms of time/memory loss/gain)?
x = tensor(x0, requires_grad=True)
for i in range(inputs.numel()):
    rec = f(x,i)
    loss = loss_func(inputs[i], rec)
    loss.backward(retain_graph=True)  # retain_graph is an argument of backward(), not of tensor()
g = x.grad
Hi,
This is the expected use case: the graph structure is mostly the same, only the values change. The whole framework is built to make this use case efficient.
And you cannot reuse the graph as the graph is associated with the values of each Tensor and so if the values change, you need to recreate it (which is cheap).
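As a small illustration of the point, the sketch below compares summing the losses and calling backward() once with calling backward() inside the loop. The functions f and loss_func here are hypothetical stand-ins, since the real ones are not shown in the question. A new graph is built on every call to f, and in the second variant it is freed again right after each backward().

```python
import torch

def f(x, i):
    # hypothetical stand-in for the expensive function in the question
    return torch.sin(x * (i + 1))

def loss_func(target, rec):
    # hypothetical stand-in for the loss
    return ((rec - target) ** 2).sum()

inputs = torch.tensor([0.1, 0.2, 0.3])
x0 = [1.0, 2.0]

# (a) Sum the losses, then one backward(): every graph stays alive until the end.
x_a = torch.tensor(x0, requires_grad=True)
loss = 0
for i in range(inputs.numel()):
    loss = loss + loss_func(inputs[i], f(x_a, i))
loss.backward()

# (b) backward() inside the loop: each graph is built, used, and freed in turn,
# while the gradients accumulate in x_b.grad.
x_b = torch.tensor(x0, requires_grad=True)
for i in range(inputs.numel()):
    loss_func(inputs[i], f(x_b, i)).backward()

print(torch.allclose(x_a.grad, x_b.grad, atol=1e-6))
```

Both variants give the same gradient up to floating-point error; the difference is only in how many graphs are held in memory at once.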
Hi @albanD,
I have a question about the params in optimizer.state_dict() and the weights in model.state_dict().
From the source code, I found that the grad is computed during backward() and the weights are updated during optimizer.step().
I tried printing model.state_dict() and optimizer.state_dict() before and after backward() and optimizer.step(), respectively.
If I save the state_dict entry with
    state_dict = model.state_dict()[key]
and the values are the same before and after backward(), does that mean that assigning a tensor shares memory instead of copying it?
Another question: what does params in optimizer.state_dict() mean? It does not change before or after backward() and optimizer.step(). Is it an address that points to the weight?
Thanks in advance
.backward() will just populate the .grad fields of the parameters. These gradients are not saved in the state dicts, so nothing will change there.
After opt.step(), the values of the parameters will be changed in place. So if you want to see the difference before and after, you need to clone the original Tensor.
optimizer.state_dict() is dependent on the optimizer itself. It will contain whatever is needed for this optimizer to continue working as if it was not stopped (saving things like momentum terms or statistics).
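Here is a minimal sketch of the clone-versus-alias behaviour described above, assuming a toy nn.Linear model and plain SGD (both are illustrative choices). It also prints the params entry of the optimizer state dict; in the PyTorch versions I am aware of these are integer indices identifying the parameters rather than the weights themselves, but treat that as an observation, not a guarantee.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(2, 1)                                  # toy model for this sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

alias = model.state_dict()["weight"]             # shares memory with the parameter
snapshot = model.state_dict()["weight"].clone()  # independent copy

loss = model(torch.ones(1, 2)).sum()
loss.backward()                                  # only fills in .grad
unchanged_after_backward = torch.equal(alias, snapshot)

optimizer.step()                                 # updates the weights in place
alias_tracks_weight = torch.equal(alias, model.weight.detach())
changed_after_step = not torch.equal(snapshot, model.weight.detach())

param_ids = optimizer.state_dict()["param_groups"][0]["params"]
print(unchanged_after_backward, alias_tracks_weight, changed_after_step, param_ids)
```

The alias always equals the current weights, which is why comparing it "before and after" shows no difference; only the cloned snapshot preserves the old values.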
Oh I see, I guess I was confused by the name retain_graph. I’ve searched a bit and see that it was called retain_variables before. So I guess if I use retain_graph=True while putting loss.backward() inside the for loop, it defeats the purpose of saving memory because it will keep in memory the temporary tensors needed for the previous gradient, right?
Yes it will retain them until you actually destroy the computational graph. This will increase the peak memory usage during the backward.
Sorry for replying to such an old post. I ran into a problem recently:
In PyTorch distributed code, how can I keep the gradient graph (autograd.grad) while using dist.all_reduce() or dist.all_gather(), so that I don't have to compute the gradient manually and then call backward?
Hi,
You most likely want to open a new topic for this and add the distributed tag.
I don’t know if these constructs are differentiable or not tbh.
@albanD Hi,
I got the same error, and I have looked at some examples. However, I still don’t know how to solve my problem.
Here is the code:
for iter, input in enumerate(train_loader):
    template = input['template']  # read input
    search = input['search']
    label_cls = input['out_label']
    reg_label = input['reg_label']
    reg_weight = input['reg_weight']
    cfg_cnn = [(2, 16, 2, 0, 3),
               (16, 32, 2, 0, 3),
               (32, 64, 2, 0, 3),
               (64, 128, 1, 1, 3),
               (128, 256, 1, 1, 3)]
    cfg_kernel = [127, 63, 31, 31, 31]
    cfg_kernel_first = [63, 31, 15, 15, 15]
    c1_m = c1_s = torch.zeros(1, cfg_cnn[0][1], cfg_kernel[0], cfg_kernel[0]).to(device)
    c2_m = c2_s = torch.zeros(1, cfg_cnn[1][1], cfg_kernel[1], cfg_kernel[1]).to(device)
    c3_m = c3_s = torch.zeros(1, cfg_cnn[2][1], cfg_kernel[2], cfg_kernel[2]).to(device)
    trans_snn = [c1_m, c1_s, c2_m, c2_s, c3_m, c3_s]  # use this list
    for i in range(search.shape[-1]):
        cls_loss_ori, cls_loss_align, reg_loss, trans_snn = model(template.squeeze(-1), \
            search[:,:,:,:,i], trans_snn,\
            label_cls[:,:,:,i], \
            reg_target=reg_label[:,:,:,:,i], reg_weight=reg_weight[:,:,:,i])
        .......
        loss = cls_loss_ori + cls_loss_align + reg_loss
        optimizer.zero_grad()
        loss.backward()
I think this code fails because I keep updating the value of the variable trans_snn inside the loop. However, I have no idea how to solve it by renaming trans_snn. Looking forward to your help. Thank you very much!
If I move trans_snn = [c1_m, c1_s, c2_m, c2_s, c3_m, c3_s] into the loop, the error does not happen. However, I need the updated trans_snn.
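One common option for this kind of carried-over state, sketched here under assumptions, is to keep the updated values but detach them from the previous step's graph before the next iteration. The parameter w and the tanh update below are hypothetical stand-ins for the real model, and state plays the role of trans_snn. Note that this changes what is computed: gradients no longer flow back through earlier steps, which may or may not be acceptable for the model in question.

```python
import torch

def run(detach_state):
    w = torch.tensor(0.5, requires_grad=True)   # hypothetical model parameter
    state = torch.zeros(1)                       # plays the role of trans_snn
    for step in range(3):
        state = torch.tanh(w * state + 1.0)      # hypothetical state update
        loss = (state ** 2).sum()
        loss.backward()                          # frees this step's graph
        if detach_state:
            state = state.detach()               # keep the values, drop the history
    return w.grad

try:
    run(detach_state=False)
    failed_without_detach = False
except RuntimeError:
    failed_without_detach = True

grad_with_detach = run(detach_state=True)
print(failed_without_detach, grad_with_detach)
```

Without the detach, the second backward() tries to walk back through the first step's already freed graph, which is the same error as above. If the gradients do need to flow across steps, the alternative would be to sum the losses over the inner loop and call backward() once, at the cost of keeping all the graphs in memory.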