How to train specific layers with different optimizers

I am using an autoencoder:

import torch.nn as nn

class autoencoder(nn.Module):
   def __init__(self):
       super(autoencoder, self).__init__()
       #self.channel = channel
       self.encoder = nn.Sequential(
           nn.Conv2d(in_channels = 3, out_channels = 32, kernel_size = 3, stride = 1), # 26
           nn.ReLU(),
           nn.MaxPool2d(2, stride = 2), # 13
           nn.Conv2d(in_channels = 32, out_channels = 16, kernel_size = 3, stride = 1), # 11
           nn.ReLU(),
           nn.MaxPool2d(2, stride = 2), # 5
           nn.Conv2d(in_channels = 16, out_channels = 8, kernel_size = 3, stride = 1), # 3
           nn.ReLU(),
       ) # 8x3x3
   
       self.decoder = nn.Sequential(
           nn.ConvTranspose2d(in_channels = 8, out_channels = 16, kernel_size = 3, stride = 1), #5
           nn.ReLU(),
           nn.ConvTranspose2d(in_channels = 16, out_channels = 32, kernel_size = 3, stride = 1), #7
           nn.ReLU(),
           nn.ConvTranspose2d(in_channels = 32, out_channels = 32, kernel_size = 3, stride = 2), #15
           nn.ReLU(),
           nn.ConvTranspose2d(in_channels= 32, out_channels = 3, kernel_size = 3, stride = 2,padding=2, output_padding=1), #28
           nn.Tanh()
       )


       self.classifier = nn.Sequential(
           # nn.AdaptiveAvgPool2d((1,1)),
           nn.Flatten(),
           nn.Linear(8*3*3,8),
           nn.ReLU(),
           nn.Linear(8,10)
       )

   def forward(self, x):
       x= self.encoder(x)
       out = self.classifier(x)
       # feature = self.flatten(x)
       x = self.decoder(x)
       
       return  x,out

Now I want to update the weights of the autoencoder (encoder and decoder), but not the classifier, using an Adam optimizer and the L2 loss between the reconstructed images and the original images.
I also want to train only the classifier, again with Adam but with a different loss, i.e. CrossEntropyLoss. How can I train the encoder, decoder, and classifier separately in one go?

I want to do something like this:

for epoch in range(10):
    print(f"Epoch {epoch+1}")
    
    for images in tqdm(trainloader):
        optimizer1.zero_grad()
        optimizer2.zero_grad()
        
        images,Ys = images
        images = images.cuda()
        Ys = Ys.cuda()
        x, out = model(images)
        
        loss = loss_fn1(x, images) # l2 norm
        loss1 = loss_fn2(out, Ys) # cross entropy
        total_loss += loss.item()    # .item() logs the number without keeping the graph
        total_loss1 += loss1.item()
        
        # print('Loss b/w images: ',loss.item())
        # print('Loss b/w labels: ',loss1.item())
        # break
        loss.backward() # l2 norm
        loss1.backward() # cross entropy

        optimizer1.step()
        optimizer2.step()
    

But it raises an error, which I fixed by setting retain_graph=True. That was not feasible, though, since it takes all my GPU memory. Is there an alternative?

Hi GSauce!

As written, I don’t believe that this does what you want. loss.backward()
accumulates gradients, so your two .backward() calls will accumulate
gradients from both of your loss functions into the parameters of both of
your optimizers.

If I understand your use case, you want something like:

        loss = loss_fn1(x, images) # l2 norm
        loss1 = loss_fn2(out, Ys) # cross entropy

        optimizer1.zero_grad()
        loss.backward(retain_graph=True)  # l2 norm
        optimizer1.step()   # only takes a step with loss gradients

        optimizer2.zero_grad()   # zeroes out gradients from loss
        loss1.backward() # cross entropy
        optimizer2.step()   # only takes a step with loss1 gradients

Yes, as a general rule if you call backward() more than once for the same
forward pass, you will need to use retain_graph = True for all but the
last call.
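
A minimal sketch of that rule on a toy graph (the tensor names are made up for illustration and are not taken from the model above): two losses share an intermediate result, the first backward() call keeps the graph alive with retain_graph=True, and a second attempt without it fails on the second call.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# Two losses that share the intermediate tensor h (hypothetical toy example).
h = w * w
loss_a = h.sum()        # d/dw = 2w = 4
loss_b = (3 * h).sum()  # d/dw = 6w = 12

loss_a.backward(retain_graph=True)  # keep the shared graph for the next call
loss_b.backward()                   # last call, so the graph may be freed
accumulated_grad = w.grad.item()    # gradients from both calls were summed

# Same setup again, but without retain_graph on the first call.
h = w * w
loss_a = h.sum()
loss_b = (3 * h).sum()
loss_a.backward()
try:
    loss_b.backward()
    second_backward_failed = False
except RuntimeError:
    second_backward_failed = True

print(accumulated_grad, second_backward_failed)
```

Note that w.grad holds the sum of both gradients after the first pair of calls, which is the accumulation behaviour described above.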

And yes, retaining the graph does keep the graph in memory.

This performance-tuning guide does mention some things you can do
to reduce memory usage.
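
Separately from retain_graph, a common source of growing memory in training loops is accumulating the loss tensor itself for logging, since the running total then stays attached to the autograd history. A small sketch, with a toy tensor standing in for a real model, of accumulating the plain Python number via .item() instead:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

running_tensor = 0.0  # accumulates tensors, so it stays attached to autograd
running_float = 0.0   # accumulates plain Python floats

for step in range(3):
    loss = (w ** 2).sum()                    # 1 + 4 = 5 on every step
    running_tensor = running_tensor + loss   # keeps a reference to each graph
    running_float += loss.item()             # keeps only the number

print(running_tensor.requires_grad)
print(running_float)
```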

Also, the following thread discusses a use case where some parts of
a shared computation graph could be released “early” while keeping other
parts that were necessary for a subsequent .backward() call. It may
or may not be relevant to your use case, but it might be worth a look.
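
A possible variation on the two-step snippet above, offered as a sketch rather than a tested recipe for this model: run both backward passes before either optimizer step, and restrict the second pass to the classifier parameters with the inputs argument of backward() (available in recent PyTorch releases). This may also avoid a RuntimeError about in-place modification that, depending on the PyTorch version, can appear when optimizer1.step() changes the encoder weights before loss1.backward() runs. The tiny Linear layers are stand-ins, not the model from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the three sub-networks (illustrative sizes only).
encoder = nn.Linear(4, 3)
decoder = nn.Linear(3, 4)
classifier = nn.Linear(3, 2)

ae_params = list(encoder.parameters()) + list(decoder.parameters())
clf_params = list(classifier.parameters())
optimizer1 = torch.optim.Adam(ae_params, lr=1e-3)
optimizer2 = torch.optim.Adam(clf_params, lr=3e-4)

x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

z = encoder(x)
loss = nn.functional.mse_loss(decoder(z), x)           # reconstruction loss
loss1 = nn.functional.cross_entropy(classifier(z), y)  # classification loss

optimizer1.zero_grad()
optimizer2.zero_grad()

# Reconstruction gradients flow into the encoder and decoder.
loss.backward()
enc_grad_before = encoder.weight.grad.clone()

# Classification gradients are accumulated only into the classifier
# parameters, so the encoder part of the graph is not traversed again.
loss1.backward(inputs=clf_params)
encoder_grad_unchanged = torch.equal(enc_grad_before, encoder.weight.grad)

optimizer1.step()
optimizer2.step()
print(encoder_grad_unchanged)
```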

Best.

K. Frank

Thank you for replying. Okay, understood. I tried one more thing:

l = loss + loss1
l.backward()
optim1.step()
optim2.step()

model = autoencoder().cuda()

parameters = list()
parameters.extend(model.decoder.parameters())
parameters.extend(model.encoder.parameters())

loss_fn1 = nn.MSELoss()
loss_fn2 = nn.CrossEntropyLoss()
optimizer1 = optim.Adam(parameters, lr=lr)
optimizer2 = optim.Adam(model.parameters(), lr=3e-4)

    for images in tqdm(trainloader):
        optimizer1.zero_grad()
        optimizer2.zero_grad()
        
        images,Ys = images
        images = images.cuda()
        Ys = Ys.cuda()
        x, out = model(images)
        
        loss = loss_fn1(x, images) # l2 norm
        loss1 = loss_fn2(out, Ys) # cross entropy
        
        l = loss+loss1
        
        l.backward() 
        total_loss += l.item()    # .item() logs the number without keeping the graph
        total_loss1 += loss1.item()
        optimizer1.step()
        optimizer2.step()

This is roughly what I did. But I had one doubt: will .backward() generate a single graph even though there are two loss functions?

Yes, there will be one backward (and one forward) graph for the tensor l.

Also note that what you want to do here isn’t exactly what this code of yours does.

Here, l.backward() will calculate gradients for the model parameters based on both loss and loss1, as opposed to what you stated in the first post.
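
A small sketch of that effect, using tiny Linear layers as stand-ins for the encoder, decoder and classifier (the sizes and names are illustrative assumptions): the encoder gradient obtained from the summed loss differs from the one obtained from the reconstruction loss alone, because the classification loss also backpropagates through the encoder.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the encoder, decoder and classifier (illustrative only).
encoder = nn.Linear(4, 3)
decoder = nn.Linear(3, 4)
classifier = nn.Linear(3, 2)

x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

def encoder_grad(use_both_losses):
    """Return the encoder weight gradient for one forward/backward pass."""
    encoder.zero_grad()
    z = encoder(x)
    loss = nn.functional.mse_loss(decoder(z), x)
    if use_both_losses:
        loss = loss + nn.functional.cross_entropy(classifier(z), y)
    loss.backward()
    return encoder.weight.grad.clone()

grad_recon_only = encoder_grad(use_both_losses=False)
grad_combined = encoder_grad(use_both_losses=True)

# The combined loss pushes the classification gradient into the encoder too.
grads_match = torch.allclose(grad_recon_only, grad_combined)
print(grads_match)
```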

Thank you for your response.
Exactly, that is what I was thinking. I tried doing that using retain_graph=True at the expense of my GPU memory, and I wasn’t able to train due to lack of memory, which is why I tried this one out after googling.

What I want to achieve is for the entire encoder and decoder to update their weights based on the MSELoss calculated by comparing every pixel of the input image with the reconstructed image.
But I want to simultaneously train the classifier based on CrossEntropyLoss, and I am using a different optimizer to train this.

NOTE: the second optimizer should not change the weights of the encoder or decoder; it should only update the classifier.

So this is what I want to achieve, but I still couldn’t find an efficient way to do it.

In this case, change optimizer2 as follows:

optimizer1 = optim.Adam(parameters, lr=lr)  # encoder and decoder parameters
optimizer2 = optim.Adam(model.classifier.parameters(), lr=3e-4)  # parameters from only the classifier layer
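
One way to sanity-check such a split, sketched on a tiny stand-in model (TinyModel, its layer sizes, the learning rates and the param_ids helper are all hypothetical): compare the parameter sets held by the two optimizers and confirm that they do not overlap and together cover the whole model.

```python
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in for the autoencoder above; layer sizes are arbitrary.
class TinyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 3)
        self.decoder = nn.Linear(3, 4)
        self.classifier = nn.Linear(3, 2)

model = TinyModel()

ae_params = list(model.encoder.parameters()) + list(model.decoder.parameters())
optimizer1 = optim.Adam(ae_params, lr=1e-3)  # lr values are placeholders
optimizer2 = optim.Adam(model.classifier.parameters(), lr=3e-4)

def param_ids(optimizer):
    """Collect the identities of all tensors an optimizer will update."""
    return {id(p) for group in optimizer.param_groups for p in group["params"]}

shared = param_ids(optimizer1) & param_ids(optimizer2)
covered = param_ids(optimizer1) | param_ids(optimizer2)
all_ids = {id(p) for p in model.parameters()}

print(len(shared), covered == all_ids)
```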

Still, keep in mind that l.backward() will calculate the gradients based on both loss and loss1.
The only straightforward solution I could think of is what @KFrank has coded in their post. Did you check out the tuning guide they linked for your memory issues?
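
Another option that may fit this goal, sketched under the assumption that the classification loss should never influence the encoder: detach the encoder output before it enters the classifier. The two losses can then be summed and backpropagated in a single pass without retain_graph, and each optimizer only sees gradients from its own loss. TinyAutoencoder and its sizes are illustrative stand-ins, not the model from this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyAutoencoder(nn.Module):
    """Small stand-in for the autoencoder in this thread (sizes are arbitrary)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(8, 16), nn.Tanh())
        self.classifier = nn.Linear(8, 10)

    def forward(self, x):
        z = self.encoder(x)
        # detach() stops the classification loss from reaching the encoder.
        out = self.classifier(z.detach())
        return self.decoder(z), out

model = TinyAutoencoder()
ae_params = list(model.encoder.parameters()) + list(model.decoder.parameters())
optimizer1 = torch.optim.Adam(ae_params, lr=1e-3)
optimizer2 = torch.optim.Adam(model.classifier.parameters(), lr=3e-4)

images = torch.rand(4, 16)
labels = torch.randint(0, 10, (4,))

optimizer1.zero_grad()
optimizer2.zero_grad()
recon, out = model(images)
loss = nn.functional.mse_loss(recon, images)      # reconstruction loss
loss1 = nn.functional.cross_entropy(out, labels)  # classification loss
(loss + loss1).backward()  # single backward pass, no retain_graph needed
optimizer1.step()
optimizer2.step()
```

The trade-off of this design is that the encoder features are shaped by the reconstruction loss only, which matches the requirement stated above but means the classifier cannot steer the encoder.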