How are gradient updates made with DCGs? I’m interested in the nitty-gritty implementation details under PyTorch’s hood, as well as a mathematical explanation. I’m also really hoping they’ll explain that in detail in the PyTorch white paper (hopefully to be released this month?).
The fact that we use a DCG does not change at all how the gradient updates are made.
In PyTorch, the classic way to handle gradient updates is to use the optimizers in the `torch.optim` package.
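As a minimal sketch of the usual loop (the layer sizes, learning rate, and data here are arbitrary placeholders):

```python
import torch

# A tiny linear model; sizes and lr are arbitrary for illustration.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)
target = torch.randn(8, 2)

optimizer.zero_grad()  # clear gradients left over from the previous step
loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()        # backprop: fills p.grad for every parameter
optimizer.step()       # update: p <- p - lr * p.grad (for plain SGD)
```

Nothing in this loop cares whether the graph that `loss.backward()` walks was static or built on the fly.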
Even the way the gradients are computed does not change compared to a static-graph system: we use the chain rule for every operation (also called backpropagation).
The only difference from static-graph systems is that instead of first building a graph of what computations should be performed and then performing them, the graph is built while the computations are performed. But the computations and backward passes represented by both graphs are exactly the same and will give you exactly the same gradients.
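To make this concrete, here is a small sketch (the function and values are made up for illustration) where ordinary Python control flow changes the graph's depth at runtime, and backward still just applies the chain rule over whatever graph was recorded:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

def f(x, n_steps):
    # The graph is recorded as this code runs, so its depth
    # depends on a runtime value (n_steps).
    y = x
    for _ in range(n_steps):
        y = y * x
    return y

y = f(x, 2)   # y = x**3 for this call
y.backward()  # chain rule over the graph that was just built
# x.grad now holds dy/dx = 3 * x**2 = 27
```

Calling `f` with a different `n_steps` next time simply records a different graph; the gradients for each call are exact for that call's computation.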
So even if the shape of the graph dramatically changes, like changing the shape of arbitrary hidden layers, the derivatives are computed the same way and the sudden change in dimensions doesn’t suddenly disrupt convergence? Interesting.
The thing is that if you change the number of hidden units in your network, you need to change the layers before and after so that they handle this new size, and thus their weight matrices will have to change. For the rest of the graph, nothing changes and all derivatives will be computed the same way.
That being said, when changing hidden layers, you change your optimization objective. So if you change it too often, it may be hard for your optimization to converge properly, since the point it is trying to converge to moves every time you change the structure of your model.
This is really, really interesting. A few follow-up questions:

- Is this problem similar to the credit assignment problem with BP/BPTT, or completely different?
- Are Tree-LSTMs a special case of this?
- Could you name a concrete example of when you would want to change the shape of the hidden layers, besides neuroevolution or neural architecture search (although if I remember correctly, those completely retrain from scratch each time they grow the network)?
- Is this problem somehow related to transfer learning? Would changing the hidden layer repeatedly be a bit like doing transfer learning repeatedly during training?
- If a hidden layer changes, is the new layer initialized with the parameters from the previous iteration, with the additional weights randomly initialized if the layer grew (or some of the parameters removed if it shrunk), or is the new layer randomly initialized?
I think you’re overthinking this, really. PyTorch is procedural rather than declarative. So at every step, every single step, you run each operation yourself. Like:
```python
h1 = input @ W1 + b1
h2 = h1 @ W2 + b2
out = h2 @ W3 + b3
```
When you do this, you can freely choose what you’re going to calculate: the value of W1, b1, input, etc. is whatever it happens to be at the time each line is executed. They’re just normal torch tensors, more or less, with a couple of extra properties tagged on, but basically you can do what you like. If you switch one weight variable for another, the contents and dimensions of that variable are whatever you have assigned them to be previously.
Whenever you choose, you can take some gradOutput, and pass it to the output of your network:
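For example (a sketch using a single layer of the snippet above; the choice of all-ones for the gradOutput is arbitrary):

```python
import torch

W = torch.randn(3, 2, requires_grad=True)
input = torch.randn(5, 3)
out = input @ W                     # forward pass, graph recorded as it runs

grad_output = torch.ones_like(out)  # the gradOutput you choose to feed in
out.backward(grad_output)           # propagates back, filling W.grad
```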
It’ll then propagate backwards and increment the gradients for your various W variables (and essentially all the variables involved, more or less).
Then, you’ll ask the optimizer to do something with these gradients (I think), but conceptually the optimizer will basically subtract learningRate * W.grad from each W (that’s plain SGD; fancier optimizers do more).
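As a sketch, that plain-SGD update done by hand looks like this (note the subtraction: we step against the gradient; the loss and learning rate are made up):

```python
import torch

learning_rate = 0.1
W = torch.randn(3, 3, requires_grad=True)
loss = (W ** 2).sum()
loss.backward()                  # W.grad now holds dloss/dW = 2 * W

with torch.no_grad():            # update weights without recording a graph
    W -= learning_rate * W.grad  # what optimizer.step() does for plain SGD
W.grad.zero_()                   # clear accumulated gradients for the next step
```

In practice you would use `torch.optim.SGD` instead of writing this loop yourself, but this is the conceptual operation.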
Changing tack a bit: there is a ton of original research, and research opportunities, around growing networks and so on. However, PyTorch itself is not opinionated on this subject. It gives you the ability to calculate stuff, and run gradients through that stuff, but it won’t magically grow your network, or transfer learning from one set of weights to another, and so on. You’ll need to write code to handle this. Or maybe write a library, so other people can re-use the same code.
Another question: say I wanted to collect the gradients of a DCG and keep a single list of tensors reflecting all the parameters in the network as it trains. How would I go about that? Should I make a new weight matrix and bias vector each time a layer changes shape, or extend the matrix and bias at the existing layer to accommodate the new dimensions? Similarly, if the layer shrinks, should I create a new matrix, or only modify a subset of those parameters in my running list of tensors? Thanks.