I have a model f parametrized by w, input data x, and labelled outputs y. I'm searching for a synthetic input/output pair (x0, y0) and a step size e such that, letting w' = w - e * d(loss(f(x0, w), y0))/dw, the loss loss(f(w', x), y) is minimized. To find these by gradient descent, I need to compute the gradient of loss(f(w', x), y) with respect to x0, y0, and e.
When f is a relatively simple neural network I can write out the update by hand fairly easily, but I would like to use torch to compute these gradients directly. Is it possible to differentiate through the taking of a gradient like this? I'm new to Torch, and I understand this might be possible by passing create_graph=True when taking the inner gradient. However, I've struggled a bit to make it work.
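To make the setup concrete, here is a minimal toy sketch of what I mean, with a scalar model f(x, w) = w * x and squared-error loss (all the numbers are made up for illustration). The key piece is create_graph=True, which keeps the gradient computation itself in the autograd graph:

```python
import torch

# Toy model f(x, w) = w * x with squared-error loss; values are arbitrary.
w = torch.tensor(2.0, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(7.0)   # "real" data

# Synthetic pair and step size: the quantities we want meta-gradients for.
x0 = torch.tensor(1.0, requires_grad=True)
y0 = torch.tensor(0.0, requires_grad=True)
e = torch.tensor(0.1, requires_grad=True)

inner_loss = (w * x0 - y0) ** 2
# create_graph=True records the graph of the gradient computation,
# so the outer loss can later be differentiated through the update.
(g,) = torch.autograd.grad(inner_loss, w, create_graph=True)

w_prime = w - e * g                            # one SGD step
outer_loss = (w_prime * x - y) ** 2
outer_loss.backward()

# x0.grad, y0.grad, and e.grad now hold the meta-gradients.
```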
Concretely, if f is an nn.Sequential model stacking various layers, how would one write the parameter update so that I can then take a gradient with respect to x0 and e? Updating the parameters of the various torch.nn layers seems straightforward with an off-the-shelf optimizer, but this requires something a bit less standard. Is it still possible?
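In case it helps to make the question concrete, here is roughly what I have in mind for the nn.Sequential case, using torch.func.functional_call (available in recent torch versions) to run the model with out-of-place updated parameters; the architecture and tensor shapes below are just placeholders:

```python
import torch
from torch import nn
from torch.func import functional_call

# Placeholder architecture and shapes; the real f would differ.
f = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
params = dict(f.named_parameters())

x, y = torch.randn(32, 4), torch.randn(32, 1)  # whole dataset
x0 = torch.randn(1, 4, requires_grad=True)     # synthetic input
y0 = torch.randn(1, 1, requires_grad=True)     # synthetic label
e = torch.tensor(0.01, requires_grad=True)     # learnable step size

# Inner loss on the synthetic pair, and its gradient w.r.t. the
# parameters; create_graph=True keeps this step differentiable.
inner_loss = loss_fn(functional_call(f, params, (x0,)), y0)
grads = torch.autograd.grad(inner_loss, list(params.values()),
                            create_graph=True)

# One out-of-place SGD step: w' = w - e * dL/dw for each tensor,
# so the updated parameters stay connected to the graph.
new_params = {name: p - e * g
              for (name, p), g in zip(params.items(), grads)}

# Outer loss of the updated model on the real data, then backprop
# all the way to x0, y0, and e.
outer_loss = loss_fn(functional_call(f, new_params, (x,)), y)
outer_loss.backward()
```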
The general idea is to greedily construct a synthetic “curriculum” which can be used to train a network from scratch quickly. For instance, it might be possible to devise a curriculum on a small network and use it to quickly initialize a much larger one. It helps to think of x and y as the whole dataset, and of x0 and y0 as a synthetic mini-batch, typically with only one sample.