The Adam optimizer uses the gradient computed over a complete batch to maintain exponential moving averages of the gradient's first and second moments (element-wise mean and squared-gradient estimates). However, it is unclear how Adam can be used with pipeline parallelism, such as GPipe or the 1F1B schedule. Descriptions of pipeline parallelism typically use plain SGD, where each node updates the weight partition it owns, which only requires the gradients corresponding to those locally owned weights.
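For reference, here is a minimal sketch of the standard Adam update (Kingma & Ba) applied to one stage's local weight partition. The function and variable names (`adam_step`, `params`, `grads`, `m`, `v`, `t`) are illustrative, not taken from any particular framework; only the update rule itself is the standard algorithm.

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single stage's weight partition.

    The moment estimates m and v have the same shape as params, and every
    operation is element-wise, so the update reads only the gradients of the
    parameters held by this stage.
    """
    m = beta1 * m + (1 - beta1) * grads          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grads ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```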