Hello.
Let’s say you have a very large image to run inference on: a single GPU can load the model, but it throws OOM on the large input. In such a situation, what do you think about splitting the model’s layers across multiple GPUs / machines and running the forward pass accordingly?
For example, take an image X and a model B with layer 1, layer 2, and layer 3. Single-GPU inference with this model on image X gives OOM. So how about placing these 3 layers (layer 1, layer 2, and layer 3) on 3 different GPUs, and then sequentially passing X through layer 1 (GPU:1), the output of GPU:1 to GPU:2, and the output of GPU:2 to GPU:3? Remember, the image can’t fit on one GPU together with the whole model.
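To make the idea concrete, here is a minimal sketch of that layer-per-GPU split in PyTorch. All layer names and sizes are illustrative (the original doesn't specify the model), and the code falls back to CPU when fewer than 3 GPUs are present, so it only demonstrates the mechanics of moving activations between devices, not the memory savings:

```python
import torch
import torch.nn as nn

# Use as many CUDA devices as are available; fall back to CPU otherwise.
n_gpus = torch.cuda.device_count()
devs = [torch.device(f"cuda:{i}") for i in range(n_gpus)] or [torch.device("cpu")]

def d(i):
    # Device for the i-th pipeline stage (wraps around if we have fewer GPUs).
    return devs[i % len(devs)]

# Hypothetical 3-layer model B, one layer per device.
layer1 = nn.Conv2d(3, 16, 3, padding=1).to(d(0))
layer2 = nn.Conv2d(16, 16, 3, padding=1).to(d(1))
layer3 = nn.Conv2d(16, 1, 3, padding=1).to(d(2))

@torch.no_grad()
def forward(x):
    # Move the activation to each stage's device before applying its layer,
    # i.e. GPU:1 -> GPU:2 -> GPU:3 in the scenario above.
    x = layer1(x.to(d(0)))
    x = layer2(x.to(d(1)))
    x = layer3(x.to(d(2)))
    return x

x = torch.randn(1, 3, 64, 64)  # small stand-in for the large image X
y = forward(x)
```

Note that with this naive split only one GPU is busy at a time, and the full activation for the image still has to fit on each single GPU as it passes through; the split only removes the need to hold all three layers on one device.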