Call `step()` and `zero_grad()` for each layer?

Normally, `optimizer.step()` and `optimizer.zero_grad()` are called after the backward pass has finished for the entire model. This means that all of the gradients (on top of the optimizer state) are held in memory at the same time.

My question is: if we have a sequential stack of modules, is it possible to have a separate optimizer for each module and call `step()` and `zero_grad(set_to_none=True)` immediately after the backward pass is done for each of them?

It seems like this should decrease peak memory usage, since all of the gradients won't need to be held in memory at once.

In other words, why not perform the weight update and free the memory taken by the gradients as soon as possible?

You can take a look at this tutorial.
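For reference, here is a minimal sketch of the technique that tutorial demonstrates: fusing the optimizer step into the backward pass by giving each parameter its own optimizer and stepping it from a `register_post_accumulate_grad_hook` (available in PyTorch 2.1+). The model, layer sizes, and hyperparameters below are placeholders.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
)

# One optimizer per parameter, so each parameter can be updated independently.
optimizer_dict = {p: torch.optim.Adam([p], lr=1e-3) for p in model.parameters()}

def optimizer_hook(param):
    # Runs as soon as param's gradient has been accumulated during backward.
    optimizer_dict[param].step()
    optimizer_dict[param].zero_grad(set_to_none=True)  # frees the gradient right away

for p in model.parameters():
    p.register_post_accumulate_grad_hook(optimizer_hook)

# Training step: backward() now performs the updates itself, so no global
# optimizer.step() / optimizer.zero_grad() calls are needed afterwards.
out = model(torch.randn(32, 1024))
out.sum().backward()
```

Because each gradient is consumed and set to `None` inside the hook, peak gradient memory during backward drops from "all parameters at once" to roughly one parameter's gradient at a time. Note that the optimizer's persistent state (e.g., Adam's running moments) still exists for every parameter throughout training; what this approach frees early is the gradient memory.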