If there are more than two optimizers, we will end up with many separate opt.step() calls.
Maybe it would be good to write a wrapper for optimizers that updates different model parameters with different optimizers, just as we already assign different learning rates and other hyperparameters to different model parameters within a single optimizer.
For example, for faster SkipGram word2vec training it is better to use sparse embeddings, and sparse embeddings have to be updated by a sparse-aware optimizer such as SparseAdam (or plain SGD, which also accepts sparse gradients), while the other parameters are updated by a common optimizer, so two optimizers have to work together.
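Concretely, the kind of split I have in mind looks roughly like this (just a sketch; the SkipGram class, the sizes, and the optimizer settings are made-up placeholders):

import torch
import torch.nn as nn
import torch.optim as optim

class SkipGram(nn.Module):
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        # sparse=True makes the embedding produce sparse gradients
        self.emb = nn.Embedding(vocab_size, dim, sparse=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, center):
        return self.out(self.emb(center))

model = SkipGram()
# the sparse embedding needs a sparse-aware optimizer, the rest can use a common one
opt_sparse = optim.SparseAdam(model.emb.parameters(), lr=1e-3)
opt_dense = optim.Adam(model.out.parameters(), lr=1e-3)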
You can store them in a list; the code will still look fairly simple. It would be some work, but I don’t think this is a common use case, so we are probably not going to implement anything for multiple optimizers in this regard, at least not in the near future.
class MultipleOptimizer(object):
    def __init__(self, *op):
        self.optimizers = op

    def zero_grad(self):
        # clear the gradients held by every wrapped optimizer
        for op in self.optimizers:
            op.zero_grad()

    def step(self):
        # advance every wrapped optimizer by one step
        for op in self.optimizers:
            op.step()
opt = MultipleOptimizer(optimizer1(params1, lr=lr1),
                        optimizer2(params2, lr=lr2))

opt.zero_grad()
loss.backward()
opt.step()
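For instance, with the two optimizers from the word2vec sketch above (again, the names are illustrative), one training step could look like:

import torch.nn.functional as F

opt = MultipleOptimizer(opt_sparse, opt_dense)

center = torch.randint(0, 10000, (32,))   # a fake batch of center words
target = torch.randint(0, 10000, (32,))   # and their context words
loss = F.cross_entropy(model(center), target)

opt.zero_grad()
loss.backward()
opt.step()

The wrapper just forwards zero_grad() and step() to every optimizer it holds, so the training step looks the same as with a single optimizer.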
Let’s say I want to train a deep net for classification of 10 classes.
Let’s say the model is a few convolution layers followed by fully connected layers.
Then, in order to push learning signal into the first layers, I can create an exit point after, let’s say, the first 2 convolution layers.
This will drive gradient information directly into them.
Let’s assume we describe a net with 2 main blocks:
The main layers.
The final layer, which takes the output of the previous block and whose output size depends on the task.
For instance, for a 10-class classification problem we can do whatever we want in the first block, and the second block is a fully connected layer with an output size of 10.
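For example, a minimal sketch of that decomposition (the layer sizes here are made up):

import torch.nn as nn

block_main = nn.Sequential(                      # first block: whatever we want
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
block_final = nn.Linear(32, 10)                  # second block: output size 10
net = nn.Sequential(block_main, block_final)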
Now imagine we have block A1 and block B1.
The net is:
input -> A1 -> B1
Now we have A2 as well and we build a net like:
input -> A1 -> A2 -> B2
But in this case the gradients don’t affect A1 as much as we want.
So we build:
input -> A1 -> A2 -> B2
          |--> B1
Now we have exit points at B1 and B2.
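A rough sketch of this branched net (again with made-up layer sizes) could be:

import torch.nn as nn

class BranchedNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.a1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.a2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.b1 = nn.Linear(16, num_classes)   # exit point fed by A1
        self.b2 = nn.Linear(32, num_classes)   # exit point fed by A2

    def forward(self, x):
        h1 = self.a1(x)
        h2 = self.a2(h1)
        out1 = self.b1(self.pool(h1))          # input -> A1 -> B1
        out2 = self.b2(self.pool(h2))          # input -> A1 -> A2 -> B2
        return out1, out2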
In training, we first do a step for the net input -> A1 -> B1.
Then a step for input -> A1 -> A2 -> B2.
I don’t mind calculating twice.
I just want to understand how to do it.
I prefer the most readable code over a “tricky” one that saves some computation.
The problem I face is how to define the optimizers, since an optimizer is defined over a net’s parameters, yet both exits use the same net.
The first one puts all the gradients from both loss1 and loss2 on A1 together. The second one keeps them separate, so each optimizer can operate on gradients from a different “source”.
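To make the two options concrete, here is a sketch of what each could look like, assuming the BranchedNet sketched above; the SGD choice and which parameters each optimizer owns in option 2 are assumptions on my part:

import torch
import torch.nn as nn
import torch.optim as optim

net = BranchedNet()
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 3, 32, 32)                  # a fake batch
y = torch.randint(0, 10, (8,))

# Option 1: one backward of the summed loss, so gradients from loss1 and loss2
# accumulate together on A1 (a single optimizer here; the MultipleOptimizer
# wrapper from the first answer would work the same way).
opt = optim.SGD(net.parameters(), lr=0.1)
opt.zero_grad()
out1, out2 = net(x)
(criterion(out1, y) + criterion(out2, y)).backward()
opt.step()

# Option 2: one optimizer per exit and separate backward/step passes,
# so each optimizer only ever sees gradients coming from its own loss.
opt1 = optim.SGD(list(net.a1.parameters()) + list(net.b1.parameters()), lr=0.1)
opt2 = optim.SGD(list(net.a1.parameters()) + list(net.a2.parameters())
                 + list(net.b2.parameters()), lr=0.1)

opt1.zero_grad()
out1, _ = net(x)
criterion(out1, y).backward()                  # grads only from input -> A1 -> B1
opt1.step()

opt2.zero_grad()
_, out2 = net(x)
criterion(out2, y).backward()                  # grads only from input -> A1 -> A2 -> B2
opt2.step()

Option 2 runs the forward pass twice, which matches the “I don’t mind calculating twice” constraint above.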