Inference on a large image using model parallelism?

Hello.
Let's say you have a very large image to run inference on, and a single GPU can load the model but throws an OOM error for such a large input. In this situation, what do you think about splitting the model's layers across multiple GPUs/machines and running the forward pass accordingly?

For example, take an image X and a model B with layer 1, layer 2, and layer 3. Single-GPU inference with this model on image X runs out of memory. So, how about we place these three layers (layer 1, layer 2, and layer 3) on three different GPUs and pass the data through them sequentially: X goes through layer 1 on GPU 1, the output of GPU 1 goes to GPU 2, and the output of GPU 2 goes to GPU 3. (Remember, the activations for the full image can't fit on a single GPU.) Something like the sketch below.
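
A minimal sketch of what I mean, with made-up layer shapes and input size, and cuda:0/1/2 standing in for GPU 1/2/3:

```python
import torch
import torch.nn as nn

# Naive model parallelism: each stage lives on its own GPU and the
# activations are moved between devices with .to(). Layer shapes and
# the input size are made up for illustration.
class ShardedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Conv2d(3, 16, 3, padding=1).to("cuda:0")
        self.layer2 = nn.Conv2d(16, 32, 3, padding=1).to("cuda:1")
        self.layer3 = nn.Conv2d(32, 3, 3, padding=1).to("cuda:2")

    def forward(self, x):
        x = self.layer1(x.to("cuda:0"))  # stage 1 on GPU 1
        x = self.layer2(x.to("cuda:1"))  # stage 2 on GPU 2
        x = self.layer3(x.to("cuda:2"))  # stage 3 on GPU 3
        return x

model = ShardedModel()
with torch.no_grad():
    out = model(torch.randn(1, 3, 4096, 4096))  # "large" input
```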

Your approach is known as model sharding (naive model parallelism) and would work.
However, since only a single GPU is active at any point in time with this approach, there are more advanced techniques such as pipeline parallelism, which splits the input into micro-batches so the work on different GPUs can overlap.
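
Here's a rough sketch of the idea, assuming the input can be split along the batch dimension into micro-batches (the device ids, shapes, and chunk count are made up, and it reuses the three-stage placement from your example):

```python
import torch

# GPipe-style micro-batch pipelining over stages placed on separate GPUs.
# On each "tick" every in-flight micro-batch advances one stage; since
# CUDA kernel launches are asynchronous, stages running on different
# GPUs can overlap instead of leaving all but one GPU idle.
def pipelined_forward(stages, devices, x, n_chunks=4):
    chunks = list(x.chunk(n_chunks))   # micro-batches along dim 0
    n_stages = len(stages)
    in_flight = [None] * n_stages      # output of stage s awaiting stage s+1
    outputs = []
    for _ in range(n_chunks + n_stages - 1):
        # Advance back-to-front so a stage's input isn't overwritten
        # before it has been consumed.
        for s in reversed(range(n_stages)):
            if s == 0:
                src = chunks.pop(0) if chunks else None
            else:
                src, in_flight[s - 1] = in_flight[s - 1], None
            if src is None:
                continue
            out = stages[s](src.to(devices[s]))
            if s == n_stages - 1:
                outputs.append(out)
            else:
                in_flight[s] = out
    return torch.cat(outputs)

# Hypothetical usage with the three sharded layers from your sketch:
stages = [model.layer1, model.layer2, model.layer3]
devices = ["cuda:0", "cuda:1", "cuda:2"]
with torch.no_grad():
    y = pipelined_forward(stages, devices, torch.randn(8, 3, 1024, 1024))
```

Recent PyTorch releases also ship utilities that implement this scheduling for you (e.g. torch.distributed.pipelining), so you don't have to hand-roll the loop above.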