If there are more than two optimizers, we will end up with many separate opt.step() calls.
Maybe it would be good to write a wrapper for optimizers that updates different model parameters with different optimizers, just as we already assign different learning rates and other hyperparameters to different model parameters within a single optimizer.
For example, for faster SkipGram word2vec training it is better to use sparse embeddings, and sparse embeddings have to be updated by a sparse-aware optimizer such as SparseAdam (or plain SGD, which also accepts sparse gradients), while the other parameters are updated by a common optimizer, so two optimizers have to work together.
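Concretely, the kind of split I have in mind looks roughly like this (just a sketch; the SkipGram class, the sizes, and the optimizer settings are made-up placeholders):

import torch
import torch.nn as nn
import torch.optim as optim

class SkipGram(nn.Module):
    def __init__(self, vocab_size=10000, dim=128):
        super().__init__()
        # sparse=True makes the embedding produce sparse gradients
        self.emb = nn.Embedding(vocab_size, dim, sparse=True)
        self.out = nn.Linear(dim, vocab_size)

    def forward(self, center):
        return self.out(self.emb(center))

model = SkipGram()
# the sparse embedding needs a sparse-aware optimizer, the rest can use a common one
opt_sparse = optim.SparseAdam(model.emb.parameters(), lr=1e-3)
opt_dense = optim.Adam(model.out.parameters(), lr=1e-3)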
You can store them in a list; the code will still look fairly simple. It would be some work, but I don’t think this is a common use case, so we are probably not going to implement anything for multiple optimizers in this regard, at least not in the near future.
class MultipleOptimizer(object):
    def __init__(self, *op):
        self.optimizers = op

    def zero_grad(self):
        # clear the gradients held by every wrapped optimizer
        for op in self.optimizers:
            op.zero_grad()

    def step(self):
        # advance every wrapped optimizer by one step
        for op in self.optimizers:
            op.step()
opt = MultipleOptimizer(optimizer1(params1, lr=lr1),
                        optimizer2(params2, lr=lr2))

opt.zero_grad()
loss.backward()
opt.step()
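For instance, with the two optimizers from the word2vec sketch above (again, the names are illustrative), one training step could look like:

import torch.nn.functional as F

opt = MultipleOptimizer(opt_sparse, opt_dense)

center = torch.randint(0, 10000, (32,))   # a fake batch of center words
target = torch.randint(0, 10000, (32,))   # and their context words
loss = F.cross_entropy(model(center), target)

opt.zero_grad()
loss.backward()
opt.step()

The wrapper just forwards zero_grad() and step() to every optimizer it holds, so the training step looks the same as with a single optimizer.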
Let’s say I want to train a deep net for classification of 10 classes.
Let’s say the model is a few convolution layers followed by fully connected layers.
Then, in order to push learning signal into the first layers, I can create an exit point after, let’s say, the first 2 convolution layers.
This will drive gradient information directly into them.
Let’s assume we describe a net with 2 main blocks:
The main layers.
The final layer, which takes the output of the previous block and whose output size depends on the task.
For instance, for a 10-class classification problem we can do whatever we want in the first block, and the second block is a fully connected layer with an output size of 10.
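For example, a minimal sketch of that decomposition (the layer sizes here are made up):

import torch.nn as nn

block_main = nn.Sequential(                      # first block: whatever we want
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
block_final = nn.Linear(32, 10)                  # second block: output size 10
net = nn.Sequential(block_main, block_final)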
Now imagine we have block A1 and block B1.
The net is:
input -> A1 -> B1
Now we have A2 as well and we build a net like:
input -> A1 -> A2 -> B2
But in this case the gradients don’t affect A1 as much as we want.
So we build:
input -> A1 -> A2 -> B2
          |--> B1
Now we have exit points at B1 and B2.
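A rough sketch of this branched net (again with made-up layer sizes) could be:

import torch.nn as nn

class BranchedNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.a1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.a2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.b1 = nn.Linear(16, num_classes)   # exit point fed by A1
        self.b2 = nn.Linear(32, num_classes)   # exit point fed by A2

    def forward(self, x):
        h1 = self.a1(x)
        h2 = self.a2(h1)
        out1 = self.b1(self.pool(h1))          # input -> A1 -> B1
        out2 = self.b2(self.pool(h2))          # input -> A1 -> A2 -> B2
        return out1, out2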
In training, we first do a step for the net input -> A1 -> B1.
Then a step for input -> A1 -> A2 -> B2.
I don’t mind calculating twice.
I just want to understand how to do it.
I prefer the most readable code over a “tricky” one that saves some computation.
The problem I face is how to define the optimizers, since an optimizer is defined over a net’s parameters, yet both exits use the same net.
The first one puts all the gradients from both loss1 and loss2 on A1 together. The second one keeps them separate, so each optimizer can operate on gradients from a different “source”.
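To make the two options concrete, here is a sketch of what each could look like, assuming the BranchedNet sketched above; the SGD choice and which parameters each optimizer owns in option 2 are assumptions on my part:

import torch
import torch.nn as nn
import torch.optim as optim

net = BranchedNet()
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 3, 32, 32)                  # a fake batch
y = torch.randint(0, 10, (8,))

# Option 1: one backward of the summed loss, so gradients from loss1 and loss2
# accumulate together on A1 (a single optimizer here; the MultipleOptimizer
# wrapper from the first answer would work the same way).
opt = optim.SGD(net.parameters(), lr=0.1)
opt.zero_grad()
out1, out2 = net(x)
(criterion(out1, y) + criterion(out2, y)).backward()
opt.step()

# Option 2: one optimizer per exit and separate backward/step passes,
# so each optimizer only ever sees gradients coming from its own loss.
opt1 = optim.SGD(list(net.a1.parameters()) + list(net.b1.parameters()), lr=0.1)
opt2 = optim.SGD(list(net.a1.parameters()) + list(net.a2.parameters())
                 + list(net.b2.parameters()), lr=0.1)

opt1.zero_grad()
out1, _ = net(x)
criterion(out1, y).backward()                  # grads only from input -> A1 -> B1
opt1.step()

opt2.zero_grad()
_, out2 = net(x)
criterion(out2, y).backward()                  # grads only from input -> A1 -> A2 -> B2
opt2.step()

Option 2 runs the forward pass twice, which matches the “I don’t mind calculating twice” constraint above.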