What is the purpose/advantage of having two optimizers for a single loss function?

I’m looking at a seq2seq model as described in this sample project: https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation.ipynb

To train this network, two SGD optimizers are initialized (one for the decoder, one for the encoder). In the train function (cf. prompt 16), the loss is computed at the bottom of the full stack (i.e. after encode + decode), and then both optimizers are stepped.

Is this simply because there’s no way to initialize an optimizer with a union of different modules, or is there something more going on here that I’m not aware of?

Namely: if the encoder and decoder networks were held in a ModuleList() property (i.e. both were inside a single model object), would a single optimizer be sufficient/equivalent?
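For context, here is a minimal sketch of what I mean by a single optimizer over both networks. The two `nn.Linear` stand-ins are just placeholders for the encoder/decoder, not the tutorial's actual RNN modules:

```python
import itertools

import torch
import torch.nn as nn

# Placeholder stand-ins for the encoder and decoder networks
encoder = nn.Linear(8, 16)
decoder = nn.Linear(16, 8)

# Option 1: one optimizer over the union of both modules' parameters
optimizer = torch.optim.SGD(
    itertools.chain(encoder.parameters(), decoder.parameters()),
    lr=0.01,
)

# Option 2: wrap both in a container module, so .parameters()
# yields everything at once
model = nn.ModuleList([encoder, decoder])
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```

Both variants put every parameter of both networks into a single parameter group, so one `optimizer.step()` updates the whole stack.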


I did that to make it clear what is happening in the model (two separate networks being optimized) - it also makes it easy to use different optimizers per network. No technical reason though.
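To illustrate the "different optimizers per network" point: even a single optimizer can treat the two networks differently via parameter groups. A sketch, again with hypothetical `nn.Linear` stand-ins rather than the tutorial's modules:

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the encoder and decoder networks
encoder = nn.Linear(8, 16)
decoder = nn.Linear(16, 8)

# One optimizer, but a separate parameter group (and lr) per network
optimizer = torch.optim.SGD(
    [
        {"params": encoder.parameters(), "lr": 0.01},
        {"params": decoder.parameters(), "lr": 0.001},
    ],
    lr=0.01,  # default for groups that don't override it
)

# A single step then updates both networks at once
loss = decoder(encoder(torch.randn(4, 8))).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Using two separate optimizer objects (as the notebook does) is equivalent to this when the hyperparameters match; the parameter-group form just keeps it to one `step()` call.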

I am applying seq2seq models to generate conversation responses, and I am noticing that the NLL loss is lower when the encoder and decoder have separate optimizers. Would you suggest any ways to debug this scenario?