Hi there,
I usually work with smaller models, but now I have the following problem and am unsure how to continue. I have a very large pre-trained model that I want to run. The model is too large to fit on a single A100 together with the gradients for even a single sample. Luckily I have multiple A100s, so I want to split the model across two GPUs. However, there are two difficulties:
- The model is not sequential, since it has a very long skip connection. The architecture is also quite complex, with many sublayers of very different sizes.
- I cannot modify the underlying code of the model: I only downloaded a pre-trained version, I don’t have the resources to train it again from scratch (training took roughly 5k A100-hours), and there is not even publicly available code.
I would like to know whether (and how) I can still execute a forward pass across my GPUs. I am considering:
- Pipeline parallelism. But:
- Does that work even though there are skip connections? I think not, since it says explicitly in the docs ( Pipeline Parallelism — PyTorch 2.9 documentation ) that PipelineStage does not work with skip connections.
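To show what I mean, here is the kind of workaround I have been experimenting with on a toy model (SkipNet below is a made-up stand-in for the real architecture, not the actual model): instead of pipeline stages, plain model parallelism where submodules are moved to different devices and forward hooks shuttle tensors across the boundary, so nothing in the model’s own code has to change. The sketch falls back to CPU when two GPUs aren’t present:

```python
import torch
import torch.nn as nn

# Toy stand-in for a non-sequential net with a long skip connection:
# the encoder output is reused by the decoder after the middle blocks.
class SkipNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(16, 16)
        self.middle = nn.Sequential(*(nn.Linear(16, 16) for _ in range(4)))
        self.decoder = nn.Linear(32, 8)

    def forward(self, x):
        h = self.encoder(x)
        y = self.middle(h)
        return self.decoder(torch.cat([y, h], dim=-1))  # long skip

# Two devices; falls back to CPU-only so the sketch runs anywhere.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

model = SkipNet().to(dev0)
model.middle.to(dev1)  # park the big middle part on the second GPU

# Hooks move tensors across the device boundary without editing forward():
# inputs to `middle` hop to dev1, its output hops back to dev0, so the
# torch.cat with the skip tensor (which never left dev0) just works.
model.middle.register_forward_pre_hook(
    lambda mod, args: tuple(a.to(dev1) if torch.is_tensor(a) else a for a in args)
)
model.middle.register_forward_hook(
    lambda mod, args, out: out.to(dev0)
)

with torch.no_grad():
    out = model(torch.randn(2, 16, device=dev0))
print(tuple(out.shape))  # (2, 8)
```

The obvious downside is that only one GPU computes at a time, but for a forward pass that would be acceptable to me; I am mostly unsure whether this kind of hook placement generalizes to a model whose submodule boundaries I can only inspect, not edit.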
- FSDP2
- Do I need the definition of the model to do that?
- It looks quite complicated. Maybe there is an easier way?
For clarification: I am trying to use this model: Accurate medium-range global weather forecasting with 3D neural networks | Nature