cont_loss_weight = torch.autograd.grad(outputs=cont_loss, inputs=weight, retain_graph=True)
is getting executed, but it neither shows any error nor returns anything, because the following line, print("Shape:", cont_loss_weight.shape), doesn’t get printed.
That line would print at least “Shape:” regardless of the result of the previous line, so these lines are simply never run.
Also, autograd.grad always returns a tuple, even if you pass a single Tensor as inputs.
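To make the tuple point concrete, here is a minimal standalone example (the tensor and loss are illustrative, not from the code in this thread):

```python
import torch

# autograd.grad always returns a tuple, even for a single input Tensor,
# so index into (or unpack) the result before using it.
w = torch.randn(3, 3, requires_grad=True)
loss = (w ** 2).sum()

grads = torch.autograd.grad(outputs=loss, inputs=w, retain_graph=True)
print(type(grads))     # <class 'tuple'>
print(grads[0].shape)  # torch.Size([3, 3])

# Or unpack directly:
(grad_w,) = torch.autograd.grad(loss, w)
```

So `cont_loss_weight.shape` fails (tuples have no `.shape`), while `cont_loss_weight[0].shape` works.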
As mentioned before, I don’t think the custom Function approach is the simplest here, especially if you’re not already familiar with the mathematical definition and the specific constructs.
Sorry, I’m not following. Why doesn’t it reach the line print("Shape:", cont_loss_weight.shape) and print the shape? I don’t know why autograd is running the same loop again and again (I suspect something similar to this thread), because the same loss gets printed, as you can see in the output I was getting.
The output I expected when I print("Shape:", cont_loss_weight[0].shape) is the following, because it should return a tuple of size 1 containing a tensor of shape [128, 96, 5, 5]:
Loss: tensor([37.218], device='cuda:0', grad_fn=<AddBackward0>) True
Shape: ([128,96,5,5])
Loss: tensor([different loss value for next batch], device='cuda:0', grad_fn=<AddBackward0>) True
Shape: ([128,96,5,5])
Loss: tensor([different loss value for next to next batch], device='cuda:0', grad_fn=<AddBackward0>) True
.
.
But I don’t think this is the case here. Because you see
Static forward: Batch comes as input, output from convolution returned
Static backward:
We have the saved tensors weight, bias, and output. cont_loss is calculated from the output of forward, and within the static backward only the backpropagation (cont_loss w.r.t. weight and bias) happens. And for one batch, the static backward of this layer will be called only once (hence, within backward, the backpropagation also happens once per batch).
Right?
Isn’t the cont_loss that you compute during the backward computed based on the forward’s output?
If so, when you try to get gradients w.r.t. the weights, it will backprop through this custom Function again, trying to get gradients for the weights given gradients from the output. And so you will recurse infinitely, right?
I had that doubt; that’s why I tried to do the forward like in this code.
I am getting correct shapes for cont_loss_weight and cont_loss_bias, but I doubt whether that’s the correct way! What are your thoughts?
This code is a bit confusing to me…
But again, it seems to just be calling into the autograd. Why not just use the autograd end to end? I really think you should drop the custom Function and compute what you need using the autograd!
But the thing is, I have to include the gradient of cont_loss somehow into the backward, i.e. grad_weight += cont_loss_weight. I am not sure what you mean by this.
out, F = model(inp)
# here F is whatever information you need from the model to compute the contrastive loss
loss = criterion(out, label)
cont_loss = compute_cont_loss(out, F)

model.zero_grad()
loss.backward(retain_graph=True)  # retain the graph because you do two backwards
cont_loss.backward(inputs=list(model.special_layer.parameters()))
opt.step()
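For reference, here is a self-contained version of that pattern; the tiny model, `special_layer`, and the placeholder contrastive loss are all illustrative stand-ins, not the actual model from this thread. Tensor.backward(inputs=...) needs a recent PyTorch (it was the nightly-only feature at the time).

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
special_layer = model[2]  # plays the role of model.special_layer
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

inp = torch.randn(16, 4)
label = torch.randint(0, 2, (16,))

out = model(inp)
loss = criterion(out, label)
cont_loss = out.pow(2).mean()  # placeholder for the real contrastive loss

model.zero_grad()
loss.backward(retain_graph=True)  # two backwards, so retain the graph
# only special_layer's parameters receive gradients from cont_loss:
cont_loss.backward(inputs=list(special_layer.parameters()))
opt.step()
```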
If you don’t have the latest nightly, you can use autograd.grad() like you do in your code and manually add these gradients to the params by doing something like:
cont_params = list(model.special_layer.parameters())
grads = autograd.grad(cont_loss, inputs=cont_params)
for p, g in zip(cont_params, grads):
    p.grad += g
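The same manual-accumulation variant as a runnable sketch (again, the model and the placeholder losses are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
special_layer = model[2]  # stands in for model.special_layer
criterion = nn.CrossEntropyLoss()

inp, label = torch.randn(16, 4), torch.randint(0, 2, (16,))
out = model(inp)
loss = criterion(out, label)
cont_loss = out.pow(2).mean()  # placeholder contrastive loss

model.zero_grad()
loss.backward(retain_graph=True)  # fills .grad for all parameters

cont_params = list(special_layer.parameters())
grads = torch.autograd.grad(cont_loss, inputs=cont_params)
for p, g in zip(cont_params, grads):
    p.grad += g  # add the cont_loss gradients in place
```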
Hi, based on what I understood, I am doing this to make sure that the gradients of conv1-3 and conv5 get populated and updated according to the cross-entropy loss, and those of conv4 according to both CE and cont_loss.
But I am getting:
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.
Here’s the code.
optimizer = torch.optim.Adam(**parameters of conv 1-3, 5**, lr)
optimizer_ = torch.optim.Adam(**parameters of conv 4**, lr)
# Training
.
.
loss = ct(output, labels)
Cont_loss = ct2(conv4_out, labels)
loss.backward(retain_graph=True)
cont_params = list(conv4.parameters())
grads = autograd.grad(Cont_loss, inputs= cont_params)
for p, g in zip(cont_params, grads):
    p.grad += g
optimizer_.step()
optimizer.step()
optimizer_.zero_grad()
optimizer.zero_grad()
As you can see, for the cont_loss calculation I am giving the output of the 4th layer as input. And for calculating this output, obviously the parameters of the 4th layer are used.
So, ultimately, the parameters are used in computing the loss.
But for this use case, bigger batch sizes yield good performance as per earlier research work. So, what should I do to make sure that the GPU doesn’t go out of memory when bigger batch sizes are used?
From the names of the variables, it does look like you do. But the error is still there. Are you sure you don’t have extra Parameters on that Module that are not actually used during the forward?
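A minimal reproduction of that situation (the module is illustrative): a Parameter that is registered on the Module but never used in forward triggers exactly this error, and allow_unused=True makes autograd.grad return None for it instead.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Parameter(torch.randn(4))  # never touched in forward

    def forward(self, x):
        return self.used(x)

net = Net()
loss = net(torch.randn(2, 4)).sum()

raised = False
try:
    torch.autograd.grad(loss, list(net.parameters()), retain_graph=True)
except RuntimeError as e:
    raised = True
    print(e)  # "One of the differentiated Tensors appears to not have been used..."

# With allow_unused=True, the unused Parameter simply gets None back:
grads = torch.autograd.grad(loss, list(net.parameters()), allow_unused=True)
print([g is None for g in grads])
```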
So, what should I do to make sure that the GPU doesn’t go out of memory when bigger batch sizes are used?
You can check on this forum for solutions.
You can do things like splitting the batch and doing several forward/backward before doing the optimizer step.
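A sketch of that batch-splitting (gradient accumulation) idea, with an illustrative model: gradients from several micro-batches accumulate in .grad before a single optimizer step, so the effective batch size stays large while peak memory stays small.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

big_inp = torch.randn(32, 10)
big_lbl = torch.randint(0, 2, (32,))

opt.zero_grad()
for inp, lbl in zip(big_inp.chunk(4), big_lbl.chunk(4)):
    loss = criterion(model(inp), lbl) / 4  # scale so grads match the full batch
    loss.backward()                        # gradients accumulate in .grad
opt.step()                                 # one step for the whole "big" batch
```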
Or you can use tools like torch.utils.checkpoint to try and reduce the memory usage at the cost of more computations.
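A minimal sketch of the checkpointing option (module and sizes are illustrative): activations inside the checkpointed segment are not stored during forward, they are recomputed during backward, trading compute for memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
x = torch.randn(64, 256, requires_grad=True)

out = checkpoint(block, x)  # same result as block(x), lower peak memory
out.sum().backward()
print(x.grad.shape)         # torch.Size([64, 256])
```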