Using Two Optimizers for Encoder and Decoder Respectively vs. Using a Single Optimizer for Both

I have a theoretical question about using separate optimizer instances for the encoder and the decoder versus using a single optimizer for the complete autoencoder network.

I initially thought the two were equivalent, but when I ran an experiment I found that the loss with multiple optimizers tends to stay lower.

Code for the two versions is as follows:

def get_optimizer(config, net):
    if config.OPTIMIZER == 'Adam' or config.OPTIMIZER == 'adam':
        optimizer = optim.Adam(filter(lambda p: p.requires_grad, net.parameters()),
                               lr=config.BASE_LR, betas=(0.9, 0.999),
                               weight_decay=config.WEIGHT_DECAY)
    elif config.OPTIMIZER == 'SGD' or config.OPTIMIZER == 'sgd':
        optimizer = optim.SGD(net.parameters(), lr=config.BASE_LR,
                              momentum=config.LEARNING_MOMENTUM,
                              weight_decay=config.WEIGHT_DECAY)
    else:
        raise NotImplementedError
    return optimizer

For multiple optimizers:

def create_optimizers(config, enc, dec):
    net_encoder, net_decoder = enc, dec
    if config.OPTIMIZER == 'SGD' or config.OPTIMIZER == 'sgd':
        optimizer_encoder = optim.SGD(
            group_weight(net_encoder),
            lr=config.BASE_LR,
            momentum=config.LEARNING_MOMENTUM,
            weight_decay=config.WEIGHT_DECAY)
        optimizer_decoder = optim.SGD(
            group_weight(net_decoder),
            lr=config.BASE_LR,
            momentum=config.LEARNING_MOMENTUM,
            weight_decay=config.WEIGHT_DECAY)
    elif config.OPTIMIZER == 'Adam' or config.OPTIMIZER == 'adam':
        optimizer_encoder = optim.Adam(
            group_weight(net_encoder),
            lr=config.BASE_LR,
            betas=(0.9, 0.999),
            weight_decay=config.WEIGHT_DECAY)
        optimizer_decoder = optim.Adam(
            group_weight(net_decoder),
            lr=config.BASE_LR,
            betas=(0.9, 0.999),
            weight_decay=config.WEIGHT_DECAY)
    else:
        raise NotImplementedError
    return (optimizer_encoder, optimizer_decoder)

The loss curves are as follows:

Can somebody explain this, or is it just a coincidence?


I think it’s just a coincidence, unless you have some bug: for example, you missed some parameters, or you are wrongly calling zero_grad and backward.

For example, in your code above you are only passing the parameters that are “trainable” to the optimizer. Gradients keep accumulating on the parameters that were left out, since they are not affected by optim.zero_grad(), which only clears the parameters the optimizer was given. Those left-out parameters end up exposed to accumulated gradients, and I doubt that’s intentional.
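Here is a quick toy sketch (made-up two-layer model, just to illustrate the point) showing that parameters left out of the optimizer keep accumulating gradients across iterations, because optimizer.zero_grad() only clears the parameters it was given:

import torch
import torch.nn as nn
import torch.optim as optim

# Toy model: pretend the first layer was meant to be excluded, but it still
# has requires_grad=True, and only the second layer is given to the optimizer.
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 1))
opt = optim.SGD(model[1].parameters(), lr=0.1)

x = torch.randn(8, 4)
for _ in range(2):
    opt.zero_grad()                      # clears only model[1]'s grads
    loss = model(x).pow(2).mean()
    loss.backward()                      # accumulates into model[0]'s grads too
    opt.step()

# model[0]'s gradients were never cleared, so they now hold the sum over both
# iterations; model[1]'s gradients come from the last backward only.
print(model[0].weight.grad.abs().mean(), model[1].weight.grad.abs().mean())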

If a parameter is not trainable, will its accumulated gradient affect the training or testing phase in any manner?

I would say it won’t in most cases, unless you are working with techniques that involve the gradients directly (like gradient alignment or similar).

Can you check the order in which you are calling the optimizers’ step and zero_grad? I think that’s probably the issue.

This is the call:

if self.config.MULTI_OPTIM:
    net.zero_grad()
    loss.backward()
    for opt in optimizer:
        clip_gradient(opt, self.config.GRADIENT_CLIP_NORM)
        opt.step()
else:
    optimizer.zero_grad()
    loss.backward()
    clip_gradient(optimizer, self.config.GRADIENT_CLIP_NORM)
    optimizer.step()

I am training a simple autoencoder.
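Conceptually, the multi-optimizer step boils down to something like this minimal, self-contained sketch (toy encoder/decoder standing in for my actual network; either way of clearing the gradients is equivalent here, assuming the two optimizers together cover every trainable parameter):

import torch
import torch.nn as nn
import torch.optim as optim

enc, dec = nn.Linear(4, 2), nn.Linear(2, 4)   # toy stand-ins for the real networks
optimizers = [optim.SGD(enc.parameters(), lr=0.01),
              optim.SGD(dec.parameters(), lr=0.01)]

x = torch.randn(8, 4)
loss = (dec(enc(x)) - x).pow(2).mean()        # simple reconstruction loss

# Either clear gradients through the modules (clears every parameter they hold) ...
enc.zero_grad()
dec.zero_grad()
# ... or through each optimizer, which clears exactly the parameters it owns:
# for opt in optimizers:
#     opt.zero_grad()

loss.backward()
for opt in optimizers:                        # (gradient clipping left out for brevity)
    opt.step()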

It’s a bit strange; it should be exactly the same (statistically speaking).
I would check group_weight. Perhaps you can rewrite create_optimizers in terms of get_optimizer, to be sure you are doing the same thing in both cases.
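Something along these lines, for example (just a sketch that reuses the get_optimizer from your first snippet, so both code paths construct the optimizers identically):

def create_optimizers(config, enc, dec):
    # Build each optimizer exactly the way the single-optimizer path does, so any
    # difference in behaviour cannot come from how the optimizers are constructed.
    return get_optimizer(config, enc), get_optimizer(config, dec)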

Are you keeping the same validation set in both cases?

I mean, the optimizer does nothing but iterate through its parameters and set new values. This is SGD’s step, for example:

    def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad
                if weight_decay != 0:
                    d_p = d_p.add(p, alpha=weight_decay)
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(d_p, alpha=1 - dampening)
                    if nesterov:
                        d_p = d_p.add(buf, alpha=momentum)
                    else:
                        d_p = buf

                p.add_(d_p, alpha=-group['lr'])

        return loss

So as you can see it’s nothing but a for loop.

If the optimizer used other parameters’ values to calculate the update for one parameter, then splitting the parameters across optimizers would affect the training process.
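As a quick sanity check along these lines, here is a toy sketch (made-up encoder/decoder sizes) showing that, since the SGD step above only ever looks at each parameter and its own gradient/momentum buffer, one optimizer over all parameters and two optimizers split between encoder and decoder produce identical updates:

import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
enc, dec = nn.Linear(4, 2), nn.Linear(2, 4)
enc2, dec2 = copy.deepcopy(enc), copy.deepcopy(dec)

# One optimizer over everything vs. one optimizer per sub-network.
single = optim.SGD(list(enc.parameters()) + list(dec.parameters()),
                   lr=0.1, momentum=0.9)
multi = [optim.SGD(enc2.parameters(), lr=0.1, momentum=0.9),
         optim.SGD(dec2.parameters(), lr=0.1, momentum=0.9)]

x = torch.randn(8, 4)
for _ in range(3):
    single.zero_grad()
    (dec(enc(x)) - x).pow(2).mean().backward()
    single.step()

    for opt in multi:
        opt.zero_grad()
    (dec2(enc2(x)) - x).pow(2).mean().backward()
    for opt in multi:
        opt.step()

# Both runs should stay identical.
print(torch.equal(enc.weight, enc2.weight), torch.equal(dec.weight, dec2.weight))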