I have a model f parametrized by w, input data x, and labelled output y. I'm searching for a synthetic input/output pair (x0, y0) and a step size e such that, letting

w' = w - e * d(loss(f(x0, w), y0))/dw,

loss(f(w', x), y) is minimized. To find these through gradient descent, I need to compute the gradient of loss(f(w', x), y) with respect to (x0, y0) and e.
Since f is a relatively simple neural network, I could compute the update by hand fairly easily, but I would like to have torch compute those gradients for me. Is it possible to differentiate through the taking of a gradient like this? I'm new to Torch; I understand this might be possible using create_graph=True, but I've struggled a bit to make it work.
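In case it helps to make the question concrete, here is a minimal sketch of what I mean, on a toy linear model rather than my real f (the model, shapes, and squared loss here are assumptions purely for illustration). The key part is passing create_graph=True to torch.autograd.grad so that the inner gradient g itself stays differentiable:

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)      # model parameters
x = torch.randn(8, 3)                       # real inputs
y = torch.randn(8)                          # real labels

x0 = torch.randn(1, 3, requires_grad=True)  # synthetic input (to be learned)
y0 = torch.randn(1, requires_grad=True)     # synthetic label (to be learned)
e = torch.tensor(0.1, requires_grad=True)   # learnable step size

def loss_fn(w, x, y):
    # toy squared loss for a linear model f(x, w) = x @ w
    return ((x @ w - y) ** 2).mean()

inner = loss_fn(w, x0, y0)
# create_graph=True records the gradient computation itself in the graph,
# so g remains differentiable with respect to x0 and y0
(g,) = torch.autograd.grad(inner, w, create_graph=True)
w_prime = w - e * g                         # one differentiable SGD step

outer = loss_fn(w_prime, x, y)
grads = torch.autograd.grad(outer, (x0, y0, e))
```

After this, grads holds the gradients of the outer loss with respect to x0, y0 and e, which is exactly what I'd feed into the outer gradient-descent loop.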
Since f is an nn.Sequential model stacking various layers, how would one write the parameter update so that I can then take a gradient with respect to e? Updating the parameters of the various layers is straightforward when using an off-the-shelf optimizer, but this requires something a bit less standard. Is it still possible?
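For reference, here is the kind of thing I have been trying, based on torch.func.functional_call from recent PyTorch versions (the small Sequential model and shapes are placeholders, not my real setup). The idea is to build w' as ordinary tensors instead of calling optimizer.step(), which mutates the parameters in place and is not differentiable:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

torch.manual_seed(0)
f = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()

x, y = torch.randn(8, 3), torch.randn(8, 1)      # real data
x0 = torch.randn(1, 3, requires_grad=True)       # synthetic input
y0 = torch.randn(1, 1, requires_grad=True)       # synthetic label
e = torch.tensor(0.1, requires_grad=True)        # learnable step size

params = dict(f.named_parameters())

# inner loss on the synthetic pair, evaluated with the current parameters
inner = loss_fn(functional_call(f, params, (x0,)), y0)
grads = torch.autograd.grad(inner, list(params.values()), create_graph=True)

# w' = w - e * grad, kept as a dict of tensors so functional_call can use it
updated = {name: p - e * g for (name, p), g in zip(params.items(), grads)}

# outer loss on the real data, evaluated with the updated parameters
outer = loss_fn(functional_call(f, updated, (x,)), y)
gx0, gy0, ge = torch.autograd.grad(outer, (x0, y0, e))
```

This sidesteps the optimizer entirely: functional_call runs the module with an arbitrary dict of parameter tensors, so the update stays inside the autograd graph.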
The general idea is to greedily construct a synthetic "curriculum" that can be used to train a network from scratch quickly. For instance, it might be possible to devise a curriculum on a small network and use it to quickly initialize a much larger one. It helps to think of (x, y) as the whole dataset and of (x0, y0) as a synthetic mini-batch, typically with only one sample.