Imagine I have a neural network `f(x; w)`, where `x` is my input and `w` is its weight vector. The weight vector `w` itself depends on some parameter `u`, i.e. `w = g(u)`, where `g(.)` is some function. I want to optimize over `u`. What is the most convenient way to do this?
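To make the setup concrete, here is a minimal sketch (a toy linear `f` and a toy `g`, not my real model) showing the gradient chain I want: autograd differentiates through `w = g(u)` as long as `w` stays a plain tensor in the graph.

```python
import torch
import torch.nn.functional as F

u = torch.randn(4, 4, requires_grad=True)
w = torch.sin(u) / torch.norm(u)   # w = g(u), stays in the autograd graph

x = torch.randn(2, 4)
loss = F.linear(x, w).sum()        # f(x; w) as a pure function of w
loss.backward()                    # d(loss)/du is computed through g
print(u.grad.shape)                # torch.Size([4, 4])
```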

To give a concrete example, let `w = sin(u) / ||u||` (where `||u||` is the norm of `u`), and let the model architecture `f(x; w)` be ResNet-18. How can we optimize over `u` in this case? As far as I understand, simply making the model take the additional parameter `u` at initialization, computing `w`, and assigning `w` to the layers' parameters will not work. The problem is that each layer wraps its weights in `nn.Parameter()` under the hood, and `nn.Parameter()` discards the history of computation (we computed `w = torch.sin(u) / torch.norm(u)` and copied the result into the layers' parameters), so gradients for `u` will never be computed. Currently, I see the following two solutions:

- Before each iteration, compute `w` from `u` and update each layer's parameters. Then, once the gradient with respect to `w` has been computed, update `u` manually (feasible only if the gradient `dw/du` is not too difficult to derive by hand).
- Rewrite ResNet-18 completely from scratch so that it takes the computed `w` as input, uses no `nn.Parameter()` at all, and calls everything through `torch.nn.functional`. This way the gradient with respect to `u` is computed automatically (so it's like TensorFlow before version 2.0).
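For reference, here is a minimal sketch of the second approach on a toy one-layer model (a stand-in for ResNet-18, under the assumption that every layer can be expressed via `torch.nn.functional`):

```python
import torch
import torch.nn.functional as F

def f(x, w):
    # Purely functional "model": no nn.Parameter anywhere, so the
    # computation history of w is preserved. A real ResNet-18 would
    # take a dict of weight tensors and call F.conv2d, F.batch_norm, etc.
    return F.linear(x, w)

u = torch.randn(8, 4, requires_grad=True)

for step in range(3):
    w = torch.sin(u) / torch.norm(u)     # recompute w = g(u) each iteration
    loss = f(torch.randn(2, 4), w).sum()
    loss.backward()                      # u.grad is filled in automatically
    with torch.no_grad():
        u -= 0.1 * u.grad                # plain SGD step on u
        u.grad = None
```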

Both of these solutions are quite tedious. Are there any better alternatives?