Hi there,
I usually work with smaller models, but now I have the following problem and am unsure how to continue. I have a very large pre-trained model that I want to run. The model is too large to fit on a single A100 together with the gradients for even a single sample. Luckily I have multiple A100s, so I want to split the model across two GPUs. However, there are two difficulties:
- The model is not sequential, since it has a very long skip connection. The architecture is also quite complex, with many sublayers of very different sizes.
- I cannot modify the underlying code of the model: I only downloaded a pre-trained version, I don’t have the resources to train it again from scratch (training took roughly 5k A100-hours), and there is not even publicly available code.
I would like to know whether (and how) I can still execute a forward pass across my GPUs. I am considering:
- Pipeline parallelism. But:
- Does that work even though there are skip connections? I think not, since it says explicitly in the docs ( Pipeline Parallelism — PyTorch 2.9 documentation ) that PipelineStage does not work with skip connections.
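To show what I mean, here is the kind of workaround I have been experimenting with on a toy model (SkipNet below is a made-up stand-in for the real architecture, not the actual model): instead of pipeline stages, plain model parallelism where submodules are moved to different devices and forward hooks shuttle tensors across the boundary, so nothing in the model’s own code has to change. The sketch falls back to CPU when two GPUs aren’t present:

```python
import torch
import torch.nn as nn

# Toy stand-in for a non-sequential net with a long skip connection:
# the encoder output is reused by the decoder after the middle blocks.
class SkipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.middle = nn.Sequential(*(nn.Linear(16, 16) for _ in range(4)))
        self.decoder = nn.Linear(32, 8)

    def forward(self, x):
        h = self.encoder(x)
        y = self.middle(h)
        return self.decoder(torch.cat([y, h], dim=-1))  # long skip

# Two devices; falls back to CPU-only so the sketch runs anywhere.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

model = SkipNet().to(dev0)
model.middle.to(dev1)  # park the big middle part on the second GPU

# Hooks move tensors across the device boundary without editing forward():
# inputs to `middle` hop to dev1, its output hops back to dev0, so the
# torch.cat with the skip tensor (which never left dev0) just works.
model.middle.register_forward_pre_hook(
    lambda mod, args: tuple(a.to(dev1) if torch.is_tensor(a) else a for a in args)
)
model.middle.register_forward_hook(
    lambda mod, args, out: out.to(dev0)
)

with torch.no_grad():
    out = model(torch.randn(2, 16, device=dev0))
print(tuple(out.shape))  # (2, 8)
```

The obvious downside is that only one GPU computes at a time, but for a forward pass that would be acceptable to me; I am mostly unsure whether this kind of hook placement generalizes to a model whose submodule boundaries I can only inspect, not edit.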
- FSDP2
- Do I need the definition of the model to do that?
- It looks quite complicated. Maybe there is an easier way?
For clarification: I am trying to use this model: Accurate medium-range global weather forecasting with 3D neural networks | Nature