Two optimizers for one model

What is not beautiful about this?


If there are more than two optimizers, we will have many opt.steps :smiley:
Maybe it would be good to write a wrapper for optimizers that updates different model parameters with different optimizers, the same way a single optimizer can already use different learning rates and other settings for different parameter groups.
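
For reference, the single-optimizer case mentioned here relies on parameter groups. A minimal sketch, assuming a hypothetical two-layer model; the layer sizes and hyperparameters are made up for illustration:

```python
import torch
from torch import nn

# Hypothetical two-part model, used only for illustration.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

# One optimizer, two parameter groups with their own hyperparameters.
optimizer = torch.optim.SGD(
    [
        {"params": model[0].parameters(), "lr": 1e-2},
        {"params": model[2].parameters(), "lr": 1e-3, "momentum": 0.9},
    ],
    lr=1e-1,  # default for any group that does not set its own lr
)

group_lrs = [group["lr"] for group in optimizer.param_groups]
```

All groups still share one optimizer class, which is why this does not cover mixing different optimizer types.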

What would you do for training a network with 2 exit points?

For example, for faster SkipGram word2vec training it is better to use sparse embeddings, and sparse embeddings must be updated by optimizers that support sparse gradients (such as SparseAdam or SGD), while the other parameters are updated by ordinary optimizers. So there are two optimizers that have to work together.
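
A rough sketch of that setup, assuming a hypothetical TinySkipGram model (the class name, sizes and learning rates are invented for illustration): the sparse embedding goes to SparseAdam and the dense output layer goes to Adam.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinySkipGram(nn.Module):
    """Hypothetical minimal model: sparse embedding plus a dense output layer."""
    def __init__(self, vocab_size=50, dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, sparse=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, center):
        return self.out(self.embed(center))

model = TinySkipGram()
# Sparse gradients need an optimizer that supports them.
sparse_opt = torch.optim.SparseAdam(list(model.embed.parameters()), lr=1e-2)
dense_opt = torch.optim.Adam(model.out.parameters(), lr=1e-3)

center = torch.tensor([1, 3, 5])
context = torch.tensor([2, 4, 6])
embed_before = model.embed.weight.detach().clone()

sparse_opt.zero_grad()
dense_opt.zero_grad()
loss = nn.functional.cross_entropy(model(center), context)
loss.backward()
grad_is_sparse = model.embed.weight.grad.is_sparse
# Two step calls, one per optimizer.
sparse_opt.step()
dense_opt.step()
```

Only the embedding rows that appeared in the batch are touched by the sparse update.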

You can store them in a list. The code will still look fairly simple. Supporting this directly would take some work, and I don’t think this is a common use case, so we are probably not going to implement anything for multiple optimizers, at least in the near future.

Use the basic knowledge of software engineering.

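class MultipleOptimizer(object):
    def __init__(self, *op):
        self.optimizers = op

    def zero_grad(self):
        # Clear the gradients held by every wrapped optimizer.
        for op in self.optimizers:
            op.zero_grad()

    def step(self):
        # Apply one update with every wrapped optimizer.
        for op in self.optimizers:
            op.step()

opt = MultipleOptimizer(optimizer1(params1, lr=lr1),
                        optimizer2(params2, lr=lr2))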


How would you implement 2 exit points?

What do you mean by exit point?

Let’s say I want to train a deep net for classification of 10 classes.
Let’s say the model is few Convolution Layers and then Fully Connected Layers.

Then, in order to push data to the first layers, I can create an exit point for, let’s say, the first 2 convolution layers.
This will drive information to them.

Thank You.

Sorry, I still can’t understand. Could you explain what an exit point is in context of deep learning?

I will try again.

Let’s assume we describe a net with 2 main blocks:

  1. The main layers.
  2. The final layer which takes output of the previous block and its output size is according to the task.

For instance, for a 10-class classification problem we can do whatever we want in the first block, and the second block is a fully connected layer with an output size of 10.

Now imagine we have block A1 and block B1.
The net is:

input -> A1 -> B1

Now we have A2 as well and we build a net like:

input -> A1 -> A2 -> B2

But in this case the gradients don’t affect A1 as much as we want.

So we build:

input -> A1 -> A2 -> B2
             |--> B1

Now we have exit points at B1 and B2.
In training we first do a step of the net input -> A1 -> B1.
Then a step for input -> A1 -> A2 -> B2.

How can we do that in PyTorch?

Do you mean like this?

def forward(self, input):
  # Shared block: both exits branch off the output of A1.
  t = self.A1(input)
  res1 = self.B1(t)
  res2 = self.B2(self.A2(t))
  return res1, res2

Then in the training script:

res1, res2 = net(input)
loss1 = criterion(res1, target)
loss2 = criterion(res2, target)
loss = loss1 + loss2
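
Put together, the idea above could look roughly like this. It is only a sketch: TwoExitNet is a hypothetical name, and linear layers stand in for the convolution blocks to keep it short.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TwoExitNet(nn.Module):
    """input -> A1 -> A2 -> B2, with an early exit A1 -> B1."""
    def __init__(self, in_dim=16, hidden=32, num_classes=10):
        super().__init__()
        self.A1 = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.A2 = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
        self.B1 = nn.Linear(hidden, num_classes)
        self.B2 = nn.Linear(hidden, num_classes)

    def forward(self, x):
        t = self.A1(x)  # shared by both exits
        return self.B1(t), self.B2(self.A2(t))

net = TwoExitNet()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Random stand-in data: 8 samples, 10 classes.
x = torch.randn(8, 16)
target = torch.randint(0, 10, (8,))

optimizer.zero_grad()
res1, res2 = net(x)
loss = criterion(res1, target) + criterion(res2, target)
loss.backward()  # A1 receives gradients from both exits
optimizer.step()
```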

First, you taught me that forward can return 2 outputs, which I wasn’t aware of.

My intuition, based on what you showed me, would be:

res1, res2 = net(input)
loss1 = criterion(res1, target)
loss2 = criterion(res2, target)

Namely, I would like to have 2 optimizers.
One updates the A1 block according to the output of B1, and the other updates A1 and A2 according to B2.

Does it make any sense?

Oh, I see that you want to use two optimizers for two paths. The simplest way is to run the forward pass twice, calling backward and step after each pass.
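
A minimal sketch of that simple version, assuming small linear blocks and made-up learning rates. Each optimizer is built from the parameters on its own path, so A1 appears in both:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the blocks in the diagram.
A1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
A2 = nn.Sequential(nn.Linear(32, 32), nn.ReLU())
B1 = nn.Linear(32, 10)
B2 = nn.Linear(32, 10)
criterion = nn.CrossEntropyLoss()

# Each optimizer gets exactly the parameters on its path; A1 is shared.
opt1 = torch.optim.SGD(list(A1.parameters()) + list(B1.parameters()), lr=0.1)
opt2 = torch.optim.SGD(
    list(A1.parameters()) + list(A2.parameters()) + list(B2.parameters()),
    lr=0.01,
)

x = torch.randn(8, 16)
target = torch.randint(0, 10, (8,))
a2_before = A2[0].weight.detach().clone()

# Step 1: input -> A1 -> B1
opt1.zero_grad()
loss1 = criterion(B1(A1(x)), target)
loss1.backward()
opt1.step()
a2_after_step1 = A2[0].weight.detach().clone()

# Step 2: input -> A1 -> A2 -> B2 (A1 is recomputed with its updated weights)
opt2.zero_grad()  # also clears the loss1 gradients left on A1
loss2 = criterion(B2(A2(A1(x))), target)
loss2.backward()
opt2.step()
```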

It’s kinda tricky if you don’t want to calculate A1(input) twice. You would need to do something like:

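temp = A1(input)
# Cut the graph so that loss1's gradient stops at temp_d instead of flowing into A1.
temp_d = temp.detach()
temp_d.requires_grad = True
res1 = B1(temp_d)
loss1 = criterion(res1, target)
# Fills the gradients of B1's parameters and temp_d.grad.
loss1.backward()
# Push loss1's gradient through A1; keep the graph for the second path.
temp.backward(temp_d.grad, retain_graph=True)

res2 = B2(A2(temp))
loss2 = criterion(res2, target)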

I don’t mind calculating twice.
I just want to understand how to do it.
I prefer the most readable code over “Tricky” one to save some computations.

The problem I face is how to define the optimizers: an optimizer is defined on a net’s parameters, yet both optimizers would use the same net.

@SimonW, By the way, could you explain the difference between what you posted at first and the second?

Hi @SimonW.
Any chance you could address my last post?

Thank You.

The first one puts all gradients from both loss1 and loss2 on A1 together. The second one separates them so each optimizer can operate on gradients from a different “source”.
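
A small check of the first point, as a sketch with made-up layer sizes: the gradient that the summed loss puts on A1 is the sum of the gradients from each loss taken separately.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical small blocks, only for illustration.
A1 = nn.Linear(4, 6)
A2 = nn.Linear(6, 6)
B1 = nn.Linear(6, 3)
B2 = nn.Linear(6, 3)
criterion = nn.CrossEntropyLoss()
x = torch.randn(5, 4)
target = torch.randint(0, 3, (5,))

def losses():
    t = torch.relu(A1(x))
    return criterion(B1(t), target), criterion(B2(torch.relu(A2(t))), target)

# Gradient on A1's weight from each loss separately.
loss1, loss2 = losses()
(g1,) = torch.autograd.grad(loss1, A1.weight, retain_graph=True)
(g2,) = torch.autograd.grad(loss2, A1.weight)

# Gradient on A1's weight from the summed loss.
loss1, loss2 = losses()
(g_sum,) = torch.autograd.grad(loss1 + loss2, A1.weight)
```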

Don’t forget to put self inside of __init__(), like __init__(self, *op).

opt.zero_grad() should be after backward

Also, you won’t be able to pass this into a scheduler if you are using one.
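
One possible workaround, sketched under the assumption that the optimizers are kept in a plain list: the schedulers in torch.optim.lr_scheduler check that they are given an Optimizer instance, so create one scheduler per real optimizer instead of passing the wrapper. The layers, StepLR and its settings are arbitrary choices for illustration.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR

# Hypothetical layers, each with its own optimizer.
layer1 = nn.Linear(4, 4)
layer2 = nn.Linear(4, 2)

optimizers = [
    torch.optim.SGD(layer1.parameters(), lr=0.1),
    torch.optim.Adam(layer2.parameters(), lr=0.01),
]
# One scheduler per real optimizer.
schedulers = [StepLR(opt, step_size=1, gamma=0.5) for opt in optimizers]

x = torch.randn(3, 4)
loss = layer2(torch.relu(layer1(x))).sum()
loss.backward()
for opt in optimizers:
    opt.step()
for sched in schedulers:
    sched.step()  # halves each learning rate

lrs = [opt.param_groups[0]["lr"] for opt in optimizers]
```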