Autograd isn't functioning when a network's parameters are taken from other networks

I have 3 networks with the same architecture: A, B & C.
The weights of C are set as a convex combination of the weights of A & B, as shown below.

for a_param, b_param, c_param in zip(a.parameters(), b.parameters(),
                                               c.parameters()):
    c_param.data = weight * a_param.data + (1 - weight) * b_param.data

On doing a forward pass through C and calling backward as shown below, gradients for A & B aren’t being calculated.

y = c.forward(x)
optimiser.zero_grad()
y.backward()
optimiser.step()

Can anyone tell which step is blocking the gradient flow?

EDIT - After changing the code according to @SimonW 's answer

for a_param, b_param, c_param in zip(a.parameters(), b.parameters(),
                                               c.parameters()):
    c_param.data = weight * a_param + (1 - weight) * b_param

Note that if the data of c isn’t updated, the new values aren’t reflected in the actual forward pass, hence .data is retained on the left-hand side.
This still doesn’t work. What’s causing the block now?

Using .data means that you don’t want autograd though. Read first half of https://pytorch.org/blog/pytorch-0_4_0-migration-guide/
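To see the effect concretely, here is a minimal sketch that uses plain tensors in place of the networks' parameters. The names a_param, b_param and weight mirror the question; the values are made up for illustration. The combination built from .data carries no graph, while the one built from the tensors themselves does.

```python
import torch

a_param = torch.ones(3, requires_grad=True)
b_param = torch.zeros(3, requires_grad=True)
weight = 0.25  # illustrative mixing coefficient

# Built from .data: the result is cut off from autograd.
detached = weight * a_param.data + (1 - weight) * b_param.data
# Built from the tensors themselves: the result remembers a_param and b_param.
tracked = weight * a_param + (1 - weight) * b_param

print(detached.requires_grad)  # False
print(tracked.requires_grad)   # True

tracked.sum().backward()
print(a_param.grad)  # every entry equals weight
print(b_param.grad)  # every entry equals 1 - weight
```

Assigning to c_param.data on the left-hand side has the same effect: only the raw values are stored, so the link back to A and B is lost.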

Is there some function that allows backprop when accessing the parameters directly?
One workaround is to access the protected members _parameters and _modules directly

For example

c._modules['linear_layers'][0]._parameters['weight'] = weight * a._modules['linear_layers'][0]._parameters['weight'] + (1-weight) * b._modules['linear_layers'][0]._parameters['weight']

Do I understand correctly what you want to do:
You have 3 nets A, B and C, each with its own set of parameters.
The parameters of C are given as a combination of the ones from A and B.
Which are the parameters you actually want to learn? The ones from A and B (the ones from C should not be learnt as they are given by A and B)? What do you give to your optimizer?

I want to learn the parameters of A & B.
I am trying to implement a Phase Functioned Neural Net as described in Appendix B of this paper.
I am passing parameters of A and B to the optimiser.
I don’t care about the parameters of C since I use it as a temporary network (I probably don’t need the C network at all).
C shouldn’t be a network (or at least shouldn’t change in place), since then 2 forward passes would be a problem.

I am looking for a simple way to make a forward pass using a linear combination of the parameters of A & B, and then a backward pass to update the parameters of both these networks.

Does the following work?

for a_param, b_param, c_param in zip(a.parameters(), b.parameters(),
                                               c.parameters()):
    # Don't use data of A and B params for the gradient to flow back to them
    # Use .copy_ here to change the value of the params of C, don't use .data to keep gradient flow
    c_param.copy_(weight * a_param + (1 - weight) * b_param)

I get the following error

RuntimeError: a leaf Variable that requires grad has been used in an in-place operation.

I should probably initialise the parameters of C with requires_grad=False.

In this case it would help if nn.Linear had a requires_grad option in its interface.
It could be passed to the Parameter() call while making the weights and biases.

@albanD @SimonW. Does this seem useful ? Can we implement it? Will it cause any problems to the existing code-base?

@albanD Your solution with in-place copy worked correctly after setting requires_grad=False for Linear layers of C

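Yes, you need to set the requires_grad field of network C’s parameters to False. copy_ is an in-place operation, and autograd refuses in-place operations on leaf tensors that require grad. Once the copy is done, C’s parameters are non-leaf tensors in the graph, so gradients flow back to A and B.
Glad it works!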
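Putting the pieces of this thread together, a minimal sketch of the working version could look like the following. It uses a single nn.Linear per network as a stand-in for the real architecture, and the layer sizes, weight value and loss are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a, b, c = nn.Linear(3, 2), nn.Linear(3, 2), nn.Linear(3, 2)
weight = 0.3  # illustrative mixing coefficient

# C's parameters must not require grad, otherwise the in-place copy_ raises
# "a leaf Variable that requires grad has been used in an in-place operation".
for p in c.parameters():
    p.requires_grad_(False)

for a_param, b_param, c_param in zip(a.parameters(), b.parameters(),
                                     c.parameters()):
    # No .data anywhere, so the copy is recorded by autograd.
    c_param.copy_(weight * a_param + (1 - weight) * b_param)

x = torch.randn(4, 3)
y = c(x).sum()
y.backward()

print(a.weight.grad is not None, b.weight.grad is not None)  # True True
```

This covers a single forward and backward pass only; the in-place copy has to be repeated before every pass, which is the limitation discussed further down the thread.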

I don’t think adding a requires_grad argument to the nn.Module creation would be useful as this is not the intended use for nn.Module. They are supposed to be learnable components only.

It would work just like Parameter, which defaults to requires_grad=True but still provides an option to change it. This way the same network class (architecture) can be reused as required in the above scenario.

Or is there any other way to create non-learnable modules without having to rewrite their forward passes?

In nn.Modules, non-learnable tensors are registered as buffers, which you do with self.register_buffer('foo', my_tensor). Any learnable parameter should be an instance of nn.Parameter (basically a Tensor that requires grad by default and is registered automatically when you do self.foo = my_parameter). So yes, you can have non-learnable parameters easily. The thing is that the nn.Linear() layer is a learnable thing. If you go through the process of setting its weights by hand, you can also set requires_grad to False in that process without any extra complexity.
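A small sketch of the parameter versus buffer distinction, using a hypothetical Scaled module invented for this illustration:

```python
import torch
import torch.nn as nn

class Scaled(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Learnable: shows up in parameters() and is handed to the optimiser.
        self.gain = nn.Parameter(torch.ones(3))
        # Not learnable: a buffer is saved and moved with the module,
        # but is never returned by parameters().
        self.register_buffer("offset", torch.zeros(3))

    def forward(self, x):
        return self.gain * x + self.offset

m = Scaled()
param_names = [name for name, _ in m.named_parameters()]
buffer_names = [name for name, _ in m.named_buffers()]
print(param_names)              # ['gain']
print(buffer_names)             # ['offset']
print(sorted(m.state_dict()))   # ['gain', 'offset']
```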

But it still does allow setting requires_grad to False. (link)
Do you know why this is provided then?

This seems right, it’s negligible overhead, but still, setting it during initialisation wouldn’t hurt.

Wouldn’t this be useful for hypernetworks, networks that generate weights for other networks?

I don’t think there is a strong reason here. Historically, it used to have the exact same API as Tensors. But now the Tensor is much more complex so that does not apply anymore.

You could use the .apply() method on network C just after creating it, with a function that sets all parameters to not require grad. Or loop through its parameters and set it by hand.
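Both options are only a few lines. In this sketch the nn.Sequential networks and the freeze helper are made up for illustration:

```python
import torch.nn as nn

c = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

def freeze(module):
    # .apply() already visits every submodule, so only touch the
    # parameters that belong directly to this module.
    for p in module.parameters(recurse=False):
        p.requires_grad_(False)

# Option 1: .apply() with a function.
c.apply(freeze)
print(all(not p.requires_grad for p in c.parameters()))  # True

# Option 2: loop over the parameters by hand.
d = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for p in d.parameters():
    p.requires_grad_(False)
print(all(not p.requires_grad for p in d.parameters()))  # True
```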

But then why not allow for an option to set it while the parameters get initialised? I don’t think it would cause a problem for the regular use as trainable parameters. So is there any reason why we shouldn’t have it? I am supporting it since it makes defining non-trainable nets cleaner.

I guess we can have @smth 's input on this?

Perhaps this looks to me like a use case for the nn.functional APIs. I mean, you can write the C network fully in terms of functional APIs. It is a bit more effort, but you would avoid having to juggle requires_grad.

P. S. I hope I understood the question correctly. Pardon me if I’m way too off from the original problem:)

In your case, c_param is not really a parameter, but an intermediate result. You shouldn’t really be using the Parameter anyway. Using the functional interface is preferred. At least, you can assign them as attributes to the correct modules in c.
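As a rough sketch of the functional route for a single linear layer: the blended weights are computed as ordinary intermediate tensors and passed to F.linear, so no network C is needed. The helper name blended_forward, the layer sizes, the loss and the optimiser settings are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
a = nn.Linear(4, 2)
b = nn.Linear(4, 2)
weight = 0.6  # illustrative mixing coefficient

def blended_forward(x, a, b, weight):
    # The combined weights are intermediate results, not parameters.
    w = weight * a.weight + (1 - weight) * b.weight
    bias = weight * a.bias + (1 - weight) * b.bias
    return F.linear(x, w, bias)

x = torch.randn(3, 4)
optimiser = torch.optim.SGD(list(a.parameters()) + list(b.parameters()), lr=0.1)

# Repeated passes are fine because nothing is modified in place.
for _ in range(2):
    optimiser.zero_grad()
    loss = blended_forward(x, a, b, weight).pow(2).mean()
    loss.backward()
    optimiser.step()
```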

Exactly. Wouldn’t I have to write another class for it that uses the functional API?

Check my code right now, using requires_grad.
I am able to reuse the same class LinearNetwork for both basis networks (A & B) and also for the main network (C).

Suppose instead of LinearNetwork, I wanted to generate weights for this PFNN itself using another network, then I would need to rewrite the whole class using functional API, wouldn’t I?

I feel requires_grad prevents this redundancy.

Yeah, but to do a forward pass easily, I could set them as parameters of the net and call its forward method. Either this, or I have to rewrite the network using the functional API.
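For readers on newer PyTorch releases (2.0 or later, which postdate this thread), torch.func.functional_call offers a third option: it runs an existing module's forward method with a supplied dictionary of tensors in place of its own parameters. The sketch below assumes that version, and the LinearNetwork class here is a hypothetical stand-in, not the class from my code.

```python
import torch
import torch.nn as nn
from torch.func import functional_call  # assumes PyTorch 2.0 or later

class LinearNetwork(nn.Module):  # hypothetical stand-in for the real class
    def __init__(self):
        super().__init__()
        self.linear_layers = nn.ModuleList([nn.Linear(4, 8), nn.Linear(8, 2)])

    def forward(self, x):
        x = torch.relu(self.linear_layers[0](x))
        return self.linear_layers[1](x)

torch.manual_seed(0)
a, b, c = LinearNetwork(), LinearNetwork(), LinearNetwork()
weight = 0.3  # illustrative mixing coefficient

a_params = dict(a.named_parameters())
b_params = dict(b.named_parameters())
# Blend the parameters name by name; the results stay in the autograd graph.
blended = {name: weight * a_params[name] + (1 - weight) * b_params[name]
           for name in a_params}

x = torch.randn(5, 4)
# c only supplies the forward method; its own parameters are left untouched.
y = functional_call(c, blended, (x,))
y.sum().backward()
```

Because c is never modified, the same class is reused as-is and several forward passes with different weight values can be made without any in-place copying.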

I was wrong in using a network for C.
@SimonW’s point is correct. Making more than 1 forward pass won’t be possible if I use an in-place copy, and that limits the usability of this net.
Sorry for all this trouble.

No worries. Doing this sort of stuff is tricky!