As a running example: I want to run SGD for T iterations, and after that continue training, but perhaps apply some operation Op(mdl) to the model first before continuing. How do I make sure Op is not recorded as an operation to differentiate through?
Would something like the following work?

```python
y = Op(y)  # do NOT include in computation graph
y.zeroes()  # hopefully this is the line that does the trick
```

The real pseudocode:

```python
mdl.zero_grads()  # hopefully this is the line that does the trick
```
For something trivial like Op(z) = z + val: if val were a function of the model weights (say, Gaussian noise scaled by the norm of the current weights), I was concerned that autograd would take the derivative through val instead of treating val as a constant. If it's a constant, then the derivative is zero and we wouldn't have to worry, but how can I make sure it actually works the way I want? i.e., that Op is not included in the computation graph where derivatives are computed.
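A minimal sketch of one way to get this guarantee, using the `torch.no_grad()` context from modern PyTorch (this Variable-era thread predates it; back then `.data` played the same role). Operations run under `no_grad()` are not recorded, so `val` is a constant as far as autograd is concerned even though it is computed from `w`:

```python
import torch

w = torch.randn(3, requires_grad=True)

with torch.no_grad():
    val = 0.1 * w.norm(2)   # depends on w, but autograd does not record this
    w.add_(val)             # in-place update of the leaf tensor

loss = (w ** 2).sum()
loss.backward()
# the gradient comes only from loss; the no_grad block contributed nothing
assert torch.allclose(w.grad, 2 * w.detach())
```

The `assert` passes precisely because the perturbation left no trace in the graph: the gradient of `(w ** 2).sum()` with respect to `w` is exactly `2 * w`, with no extra term from `val`.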
```python
for i, (input, label) in enumerate(data_loader):
    input = Variable(input)
    label = Variable(label)
    output = mdl(input)
    loss = criterion(output, label)
    # Every once in a while, modify the model directly on the tensors
    if i % 100 == 0:
        mdl.some_layer.weight.data = some_function(mdl.some_layer.weight.data, output.data)
```
I guess acting directly on the parameter tensors does not affect the computation graph (which would not be affected anyway, since the computation has already been performed, and at the next iteration you create a new Variable).
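A small sketch to check this claim: after writing to a parameter's `.data`, the next backward pass sees only the ordinary forward computation, so the "surgery" done when `i % 100 == 0` cannot show up in it. (The `nn.Linear` setup here is illustrative, not the thread's actual model.)

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 1)
x = torch.randn(2, 4)

layer(x).sum().backward()      # the graph from this forward pass is consumed here

# direct tensor surgery, analogous to the thread's some_function(...)
layer.weight.data.mul_(0.5)

layer.zero_grad()
layer(x).sum().backward()      # a brand-new graph was built by this forward pass
# the new gradient reflects only the forward computation:
# d/dW of sum over the batch of (x @ W.T + b) is just the summed inputs
assert torch.allclose(layer.weight.grad, x.sum(0, keepdim=True))
```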
I am using a hand-designed loss, so there is no .step(); everything is just operations written in PyTorch. Not sure how that changes things.
According to my pseudocode, it seems we had a similar solution? My worry was that I was not sure exactly what mdl.zero_grad() does (besides setting W.grad to zero). Does it also make sure that Op(mdl) does not get included in the computation graph?
I don’t have an optimizer. Everything is coded manually in PyTorch, perhaps even with a custom optimizer.
To answer your questions directly: I honestly don’t know whether your suggestion works the way I want (hence my question); it looks essentially the same as the pseudocode I suggested. My issue is: when i % 100 == 0 is true, how do you know that the operations done by some_function won’t be taken into account on the next iteration when loss.backward() is called? The reason I suspect it will affect the backward pass is that it’s modifying the model (or at least it should be, in the cases I find problematic). If I am wrong, then your example might be enough! Do you think it solves that issue, i.e. that the operation is not included as part of the computation graph?
Let’s take the simplest example I can think of. We train a mdl for T iterations. After that, we perturb the model with some noise that depends on the norm of the weights/params. Thus, the noise is a function of W. However, I don’t actually want the framework/engine to take a derivative with respect to the added noise; I just want to train with some algorithm, then perturb the model, then train again for another T iterations. In pseudocode:
```python
mdl = mdl + GaussNoise(0, 0.25*mdl.W.norm(2))  # definition of Op(mdl)
mdl.zero_grads()  # hopefully this line keeps Op(mdl) out of the backward pass
```
Does it make sense in the context of this example, at least?
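For concreteness, here is one way the Op(mdl) above could be sketched: perturb every parameter with Gaussian noise whose standard deviation is 0.25 * ||W||_2, acting only on `.data` so autograd records nothing (`perturb_` is a hypothetical helper name, and `nn.Linear` just stands in for the real model):

```python
import torch
import torch.nn as nn

def perturb_(mdl, scale=0.25):
    for W in mdl.parameters():
        std = scale * W.data.norm(2)              # noise scale from current weights
        W.data.add_(torch.randn_like(W.data) * std)  # in-place, invisible to autograd

mdl = nn.Linear(5, 3)
# train_for_T_iterations(mdl) would run here, then:
perturb_(mdl)
assert mdl.weight.grad_fn is None  # parameters are still graph leaves
```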
As far as I understand, what Simone said is correct: the computational graph only records operations between Variables, and you create a new graph every iteration when you instantiate new Variables at the beginning of the training iteration, so it should be fine. Also, if you operate directly on the weights’ tensors (weight.data), the operation is not included in the graph.
Taking your last example, I think you should change it to use the W.data property if you want to be sure. Also, I don’t think mdl.zero_grads() is doing anything where it is, because you haven’t called backward() yet to compute any gradients. However, you should make sure the train_for_T_iterations function does call zero_grads() before calling backward(); I believe you are already doing that, otherwise you would always be accumulating the past iterations’ gradients.
Actually, do you mind spelling out one detail that is particularly confusing to me? When we loop and do SGD, when exactly are new computation graphs created? I think that’s the part that confuses me, and it even made me post the following question:
However, it seems the answer to it is just “acting on mdl.W.data does not add anything to the computation graph.” Then when is the computational graph formed?
Actually, to make it clear, this is the code I plan to run:
```python
for i in range(nb_iter):
    # Forward pass: compute predicted Y using operations on Variables
    batch_xs, batch_ys = get_batch2(data.X_train, data.Y_train, M, dtype)  # [M, D], [M, 1]
    ## FORWARD PASS
    y_pred = mdl.forward(batch_xs)
    batch_loss = (1/M)*(y_pred - batch_ys).pow(2).sum()
    ## BACKWARD PASS
    batch_loss.backward()  # use autograd to compute the backward pass; now W has gradients
    ## SGD update
    for W in mdl.parameters():
        delta = eta*W.grad.data
        W.data.copy_(W.data - delta)  # W - eta*g
    ## Manually zero the gradients after updating weights
    mdl.zero_grad()
    ## DO OP that should NOT be part of the computation graph
    if i % 100 == 0:
        pass  # Op(mdl) goes here, acting on .data
```
In the above code, when exactly is a new computation graph created?
Am I right to assume you are getting your data wrapped in Variables when you call get_batch2?
As far as I understand, a new computational graph is constructed when you instantiate new Variables (leaf variables) and do operations on them. Every time you reuse the variable names you get new objects, and hence a new computational graph is constructed to record the operations you execute.
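A tiny sketch that makes this visible: the graph is (re)built by the forward pass itself. Each iteration's loss owns a fresh `grad_fn`, `backward()` then walks (and by default frees) that graph, and `.data`-level operations never appear in any of them:

```python
import torch

w = torch.randn(2, requires_grad=True)
for i in range(3):
    x = torch.randn(2)                 # fresh leaf input each iteration
    loss = (w * x).sum()               # the forward pass records a new graph here
    assert loss.grad_fn is not None    # built by the two ops just above
    loss.backward()                    # consumes this iteration's graph
    w.grad.data.zero_()                # .data ops are invisible to autograd
```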