According to my understanding, one or more CPUs could serve as a Parameter Server that stores the model parameters. In the forward step, each GPU could load a batch of data plus part of the model parameters (say the first 10 layers'), and do the computation. Then it discards the first 10 layers' parameters, loads layers 11-20, and computes again. We repeat this until we get the final output of forward(). The benefit is that the model size is no longer limited by GPU memory. My mental picture is roughly like the sketch below. I guess typical distributed training frameworks are implemented this way, right?
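To make that concrete, here is a rough PyTorch-style sketch of what I imagine (the names `layer_blocks` and `forward_with_swapping` are just placeholders I made up, not any framework's API):

```python
import torch
import torch.nn as nn

# Sketch: all parameters live on the CPU ("parameter server" side);
# only one block of layers is resident on the GPU at a time.
layer_blocks = [
    nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(10)])
    for _ in range(3)  # e.g. 3 blocks of 10 layers each, kept on CPU
]

def forward_with_swapping(x, blocks, device):
    x = x.to(device)
    for block in blocks:
        block.to(device)   # "mount" this block's parameters onto the GPU
        x = block(x)       # compute through these 10 layers
        block.to("cpu")    # "abandon" them to free GPU memory for the next block
    return x

device = "cuda" if torch.cuda.is_available() else "cpu"
batch = torch.randn(8, 4096)
out = forward_with_swapping(batch, layer_blocks, device)
```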
If my understanding is incorrect, what is the best way to train a model whose parameter size exceeds GPU memory?