Hey,
I’m trying to reproduce something along the lines of the recent ICLR 2018 paper “Lifelong Learning with Dynamically Expandable Networks”: https://arxiv.org/abs/1708.01547, where the idea is to add units in-place to a trained network while continuing to train it.
I have a decent amount of what I think is functional code. However, there are two big questions I have left:
1.) When I resize the weights of a convolutional layer to, let’s say, add “x” filters, I have to apply resize_() twice (note that I left out the bias for simplicity, which also needs to be resized):
a) in the current layer i’s output dimension:
layer[i].weight.data.resize_(layer[i].weight.data.size(0) + x, layer[i].weight.data.size(1), layer[i].weight.data.size(2), layer[i].weight.data.size(3))
b) in the subsequent layer (i+1)’s input dimension:
layer[i+1].weight.data.resize_(layer[i+1].weight.data.size(0), layer[i+1].weight.data.size(1) + x, layer[i+1].weight.data.size(2), layer[i+1].weight.data.size(3))
Followed by some initialization of whatever has just been added.
Now executing a) doesn’t seem to be difficult, and I believe it can be done in-place as is. Given the underlying storage (if I understand correctly, a flattened 1-D, row-major C array), this resize_() operation simply appends to the array. The previous information should therefore be preserved, and the newly added units can just be initialized by slicing.
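To make a) concrete, here is a simplified toy version of what I do (the layer sizes, names and the init scheme are just placeholders, and I construct the conv without a bias to keep it short):

```python
import torch
import torch.nn as nn

# Toy version of a): grow the output channels of a conv layer in-place.
conv1 = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
x = 4  # number of filters to add

old_out = conv1.weight.data.size(0)
conv1.weight.data.resize_(old_out + x, *conv1.weight.data.size()[1:])
conv1.out_channels += x  # keep the module metadata consistent

# The first `old_out` filters keep their values (the row-major storage grows
# at the end), so only the appended slice needs to be initialized.
conv1.weight.data[old_out:].normal_(0, 0.01)

out = conv1(torch.randn(1, 16, 8, 8))  # now returns 36 feature maps
```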
Doing operation b), however, seems difficult, because the arrangement of the existing values in the flat storage is not preserved when the second dimension grows.
Currently I am dealing with this by making an entire clone() of the weight, resizing, and then copying the corresponding slice back. Needless to say, this is pretty bad in terms of memory, and, if the copy has to go through the CPU, also in terms of time. Is there a more efficient way of doing this?
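For reference, my current workaround for b) roughly looks like this (continuing the toy snippet above, again with the bias omitted):

```python
# Toy version of b), i.e. growing the input channels of the next layer,
# with the clone / copy-back workaround described above:
conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1, bias=False)

old_in = conv2.weight.data.size(1)
old_weight = conv2.weight.data.clone()  # full copy -- the part I'd like to avoid

conv2.weight.data.resize_(conv2.weight.data.size(0), old_in + x,
                          *conv2.weight.data.size()[2:])
conv2.in_channels += x

conv2.weight.data.normal_(0, 0.01)               # initialize everything ...
conv2.weight.data[:, :old_in].copy_(old_weight)  # ... then restore the old input slice
```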
2.) The operations done in 1.) don’t seem to automatically update the gradient shapes or tell autograd about the change.
I therefore apply the same resizing operations to the respective dimensions of layer[i].weight.grad.data to make sure the backward pass matches the forward pass.
Now this alone doesn’t seem to be enough during training; it seems I also need to create a new instance of my optimizer every time I change some dimensionality in the graph, so that its parameter references are updated.
The above code is therefore always followed by a new instance of optimizer = torch.optim.SGD(...). I guess I could also resize the parameters held by the optimizer instead, if that matters.
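Continuing the toy snippet, this part roughly looks like the following (the model composition and the SGD hyperparameters are placeholders for my actual setup):

```python
# Match the gradient buffers to the new weight shapes and rebuild the optimizer.
model = nn.Sequential(conv1, nn.ReLU(), conv2)

for p in model.parameters():
    if p.grad is not None:
        # the old gradient values are meaningless after resizing, so just
        # match the shape of the (possibly resized) parameter and zero everything
        p.grad.data.resize_as_(p.data).zero_()

# any stale optimizer state (e.g. momentum buffers with the old shapes)
# is dropped by simply creating a fresh instance
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
```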
Empirically, this gradient resizing plus a new optimizer instance seems to work, but could anyone tell me whether there are any pitfalls I am not thinking of? Is this implementation correct? Alternatively, is there another way of doing this?
I think these questions are somewhat in-depth, and it is hard for me to verify beyond empirical observation whether such an implementation is correct. So I would really appreciate any feedback!
PS: In case anyone is interested, I implemented the resizing in a forward_pre_hook, but I don’t think that affects the questions.
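For completeness, the hook is roughly shaped like this (the body is just a placeholder for my own expansion logic):

```python
def expand_pre_hook(module, inputs):
    # placeholder: decide whether `module` should grow and, if so, apply the
    # resize_ / re-init steps from above before the forward pass runs
    pass

handle = conv1.register_forward_pre_hook(expand_pre_hook)
```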