How can one modify a model during training without affecting the computation tree that does backprop?

Brando_Miranda · November 30, 2017, 4:32pm

As a running example, I want to run SGD for T iterations, then apply some operation Op(mdl) to the model, and then continue training. How do I make sure Op is not registered as an operation to differentiate through?

Would something like the following work:

y=3*x
y.backward()
y = Op(y) # do NOT include in computation graph
y.zeroes() #hopefully this is the line that does the trick
y=4*x
y.backward()

The real pseudocode:

train_for_T_iterations(mdl)
Op(mdl)
mdl.zero_grads() #hopefully this is the line that does the trick
train_for_T_iterations(mdl)

For something trivial like Op(z) = z + val, if val is a function of the model weights (say, Gaussian noise scaled by the norm of the current weights), I am concerned that autograd would differentiate through val instead of treating it as a constant. If it’s a constant, I guess its derivative is zero and we wouldn’t have to worry, but how can I make sure it actually works the way I want, i.e. that Op is not included in the computation graph used to compute derivatives?

simopal6 · November 30, 2017, 4:47pm

Do you mean something like this?

for i, (input, label) in enumerate(data_loader):
    # Train
    input = Variable(input)
    label = Variable(label)
    output = mdl(input)
    loss = criterion(output, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every once in a while, modify model
    if i % 100 == 0:
        mdl.some_layer.weight.data = some_function(mdl.some_layer.weight.data, output.data)

I guess, acting directly on the parameter tensors does not affect the computation graph (which would not be affected anyway, as the computation has been performed and at the next iteration you would create a new Variable).
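As a rough illustration of the point above, here is a minimal sketch, assuming PyTorch 0.4 or later (where Variable and Tensor are merged, so no Variable wrapping is needed). The layer, the data, and the rescaling that stands in for some_function are all hypothetical. It modifies the weights through .data between two iterations and checks that the parameter keeps no history.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 1)
x = torch.randn(4, 3)

# One ordinary training step: forward, backward, manual SGD update.
loss = layer(x).pow(2).mean()
loss.backward()
layer.weight.data -= 0.1 * layer.weight.grad
layer.zero_grad()

# Modify the weights between iterations by acting on the underlying tensor.
# Hypothetical stand-in for some_function: rescale the weights.
layer.weight.data = 0.5 * layer.weight.data

# The parameter is still a leaf with no recorded history.
weight_is_leaf = layer.weight.is_leaf
weight_grad_fn = layer.weight.grad_fn

# The next forward pass builds a fresh graph from the current weight values.
out = layer(x)
loss = out.pow(2).mean()
loss.backward()
```

After the second backward(), layer.weight.grad is the gradient of the loss at the rescaled weights; the rescaling itself contributes nothing to it.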

simopal6 · November 30, 2017, 4:49pm

Or maybe you want to modify the model parameters between two backpropagations for the same input sample?

Brando_Miranda · November 30, 2017, 5:28pm

I am using a loss designed by hand, so there is no .step(); the updates are just operations written in PyTorch. I am not sure how that changes things.

According to my pseudocode it seems we had a similar solution? My worry was that I was not sure what mdl.zero_grad() did exactly (besides putting W.grad to zero). Does it also make sure that Op(mdl) does not get included in the computation graph?
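To see what zero_grad() does in isolation, here is a minimal sketch, assuming PyTorch 0.4 or later; the tiny Linear model and the sum loss are hypothetical and chosen so that the gradient does not depend on the weights. It suggests that zero_grad() only resets the stored .grad values (to zero or to None, depending on the PyTorch version) and leaves the weights untouched; it has no say over which operations get recorded.

```python
import torch

torch.manual_seed(0)
mdl = torch.nn.Linear(2, 1)
x = torch.randn(5, 2)

def grad_after_backward():
    """Run one forward/backward pass and return a copy of the weight gradient."""
    mdl(x).sum().backward()
    return mdl.weight.grad.clone()

g1 = grad_after_backward()   # gradient from a single pass
g2 = grad_after_backward()   # no zero_grad in between: gradients accumulate

weights_before = mdl.weight.detach().clone()
mdl.zero_grad()              # resets .grad only
weights_after = mdl.weight.detach().clone()

g3 = grad_after_backward()   # a fresh, non-accumulated gradient again
```

Here g2 is twice g1 because the second backward pass adds to the stored gradient, while g3 matches g1 because zero_grad() cleared it first.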

simopal6 · November 30, 2017, 5:32pm

step() should be called on the optimizer, not on the loss. It’s fine if you defined the loss yourself, as long as it’s the outcome of operations between Variables, you can call backward() on it.

Can you specify what does not satisfy your needs in the code I wrote? The thing is I’m not really sure what you intend to do…

Brando_Miranda · November 30, 2017, 7:00pm

I don’t have an optimizer. Everything is coded manually using pytorch, perhaps even using a custom optimizer.

To answer your questions directly, I honestly don’t know whether your suggestion works the way I want (hence my question); it looks essentially the same as the pseudocode I suggested. My issue is: when i % 100 == 0 is true, how do you know that the operations done by some_function won’t be taken into account on the next iteration when loss.backward() is called? The reason I suspect it will affect the backward pass is that it’s modifying the model (or at least it should be, in the cases I find problematic). If I am wrong, then your example might be enough! Do you think it solves that issue, i.e. that it does not include the operation as part of the computation graph?

Let’s take the simplest example I can think of. We train a mdl for T iterations. After that, we perturb the model with some noise that depends on the norm of the weights/params. Thus, the noise would be a function of W. However, I don’t actually want the framework/engine to take a derivative with respect to the added noise; I just want to train with some algorithm, then perturb the model, and then train again for another T iterations. In pseudocode:

train_for_T_iterations(mdl)
mdl = mdl + GaussNoise(0,0.25*mdl.W.norm(2)) # definition of Op(mdl)
mdl.zero_grads() #hopefully this is the line that keeps Op(mdl) out of the backward pass
train_for_T_iterations(mdl)

Does it make sense in the context of this example at least?

fabiocapsouza · November 30, 2017, 8:13pm

As far as I understand, what Simone said is correct: the computational graph only considers operations between Variables, and you are creating a new graph every iteration, since you instantiate new Variables at the beginning of each training iteration, so it should be fine. Also, if you operate directly on the weights’ tensors (weight.data), the operation should not be included in the graph.

Taking your last example, I think you should change it to use W.data property if you want to be sure. Also, I don’t think mdl.zero_grads() is doing anything where it is, because you haven’t called backward() yet to calculate the gradients anywhere. However, you should make sure train_for_T_iterations function does call zero_grads() before calling backward(), but I believe you are already doing it, otherwise you would be always considering the past iterations’ gradients.
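One way the perturbation could be written, as a minimal sketch assuming PyTorch 0.4 or later: torch.no_grad() is used here in place of .data, and the second argument of GaussNoise is assumed to be a standard deviation, which the pseudocode does not specify. The helper name perturb_ and the Linear model are hypothetical.

```python
import torch

def perturb_(mdl, scale=0.25):
    """Add Gaussian noise to every parameter in place, with standard deviation
    scale * ||W||_2, without recording the operation for autograd."""
    with torch.no_grad():
        for W in mdl.parameters():
            std = scale * W.norm(2)            # depends on W, but is not tracked here
            W.add_(torch.randn_like(W) * std)  # in-place update of the leaf tensor

torch.manual_seed(0)
mdl = torch.nn.Linear(3, 1)
before = [W.detach().clone() for W in mdl.parameters()]
perturb_(mdl)
```

After the call, the parameters hold new values but are still leaf tensors that require gradients, so the next training iteration starts from the perturbed weights with a clean graph.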

Brando_Miranda · November 30, 2017, 8:23pm

Actually, do you mind spelling out one detail that is particularly confusing to me? When we loop and do SGD, when exactly are new computation graphs created? I think that’s the part that is confusing to me and even made me post the following question:

However, it seems that the answer to it is just “acting on mdl.W.data does not add things to the computation graph”. Then when is the computational graph formed?

Brando_Miranda · November 30, 2017, 8:25pm

I think that’s the part I don’t get.

Actually, to make it clear this is the code I plan to run:

for i in range(nb_iter):
    # Forward pass: compute predicted Y using operations on Variables
    batch_xs, batch_ys = get_batch2(data.X_train,data.Y_train,M,dtype) # [M, D], [M, 1]
    ## FORWARD PASS
    y_pred = mdl.forward(batch_xs)
    ## LOSS
    batch_loss = (1.0/M)*(y_pred - batch_ys).pow(2).sum()
    ## BACKWARD PASS
    batch_loss.backward() # Use autograd to compute the backward pass. Now w will have gradients
    ## SGD update
    for W in mdl.parameters():
        delta = eta*W.grad.data
        W.data.copy_(W.data - delta) # in-place SGD step: W <- W - eta*g
    ## Manually zero the gradients after updating weights
    mdl.zero_grad()
    ## DO OP that should NOT be part of the computation graph
    if i % 100 == 0:
        Op(mdl)

In the above code, when exactly is a new computation graph created?

fabiocapsouza · November 30, 2017, 11:59pm

Am I right to assume you are getting your data wrapped in Variables when you call get_batch2?

As far as I understand, a new computational graph is constructed when you instantiate new Variables (leaf variables) and do operations on them. Every time the loop body runs and you reuse the variable names, you get new objects and, hence, a new computational graph is constructed to record the operations you are executing.
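This can be checked with a minimal sketch, assuming PyTorch 0.4 or later; the tensor w, the random inputs, and the loss are hypothetical stand-ins for a model and its data. Each iteration's loss gets its own grad_fn, which is the root of that iteration's graph, while the weight stays a leaf with no history.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)   # leaf tensor, plays the role of a model weight

grad_fns = []
for i in range(3):
    x = torch.randn(3)                   # new input data each iteration
    loss = (w * x).sum().pow(2)          # operations on w are recorded here
    grad_fns.append(loss.grad_fn)        # root node of this iteration's graph
    loss.backward()                      # walks the graph built in this iteration
    with torch.no_grad():
        w -= 0.1 * w.grad                # manual SGD step, not recorded
    w.grad.zero_()
```

The three entries of grad_fns are distinct objects, one per forward pass, and w.grad_fn stays None throughout.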

Take a look at the gif about autograd on this page, I think it’s helpful: http://pytorch.org/about/

Brando_Miranda · December 3, 2017, 11:40pm

In the end, I think the answer is just to operate on the Tensors instead of the Variables, because only ops that act on Variables are added to the computation graph.
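The difference is visible in a small scalar case. This is a minimal sketch, assuming PyTorch 0.4 or later, where detach() plays the role that operating on the raw Tensor played in this thread; the function val = 0.25*|w| is a hypothetical stand-in for a weight-dependent perturbation in Op(z) = z + val.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

# val depends on w and stays in the graph, so autograd differentiates through it.
val = 0.25 * w.abs()
(w + val).backward()
grad_tracked = w.grad.item()    # derivative of w + 0.25*|w| at w = 2

w.grad = None                   # clear the stored gradient before the second case

# Detaching val cuts it out of the graph, so it is treated as a constant.
val = (0.25 * w.abs()).detach()
(w + val).backward()
grad_constant = w.grad.item()   # derivative of w + const
```

With val tracked, the gradient is 1.25; with val detached, it is 1.0, which is the "treat val as a constant" behavior asked about in the original question.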