Tensor parallelism in image models like Unet

Hello!
Let me describe my problem.

I have a PC with 2 Tesla P40 GPUs and a U-Net model that segments images for me: given an image of arbitrary size, it returns a binary mask.

However, I have 4K images that do not fit on a single GPU; I get a memory error just from processing the first layer.

I have tried PyTorch’s FSDP module, but with 2 processes it runs the segmentation twice and still gives me a memory error on the 4K images. So I assume that model parallelism alone is not enough, since a single layer (the first one) does not fit on a single GPU.
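Roughly, my FSDP attempt looked like the sketch below (MyUNet and load_4k_image are placeholders for my actual model and data loading); I launched it with `torchrun --nproc_per_node=2`:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # MyUNet is a placeholder for my trained U-Net; FSDP shards the
    # parameters across the two GPUs, but not the activations.
    model = FSDP(MyUNet().cuda(rank))
    model.eval()

    # Each rank loads the same 4K image, so the segmentation runs once
    # per process, and the first layer's activations already overflow memory.
    image = load_4k_image().cuda(rank)  # placeholder loader
    with torch.no_grad():
        mask = model(image)

if __name__ == "__main__":
    main()
```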

I was thinking about tensor parallelism, with references like:
1- GitHub - NVIDIA/Megatron-LM: Ongoing research training transformer models at scale (but my understanding is that it only covers text LLMs, not image models)
2- GitHub - tunib-ai/parallelformers: Parallelformers: An Efficient Model Parallelization Toolkit for Deployment (which also appears to cover only NLP models)
3- model-parallelism/B_unet_model_sharding.py at main · garg-aayush/model-parallelism · GitHub (they apply model parallelism here, but I don’t see tensor parallelism applied to images)

I have also tried Hugging Face’s accelerate library, but it gives me incorrect output (see issue: Incorrect output when using accelerate in a pytorch Unet model · Issue #2849 · huggingface/accelerate · GitHub).

Do you know of any existing implementation of tensor parallelism for image models? I would also like to know whether a model already trained on a single GPU needs to be retrained to apply this parallelism, and if so, how that would be done.

Hi Cyn!

I don’t have anything to say about “tensor parallelism.” However …

If you hew to the careful design detailed in the original U-Net paper, you
can use the “tiling strategy” described in that paper to process images of
arbitrary size.

You could presumably distribute the tiles across your two gpus, but,
following the basic use case of pytorch’s DistributedDataParallel,
you could instead distribute batch elements across your gpus (with the
tiling for any given image occurring on a single gpu).
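
For concreteness, here is a rough sketch of the tiling idea (not the
exact recipe from the paper; it assumes your network returns a
single-channel mask of the same spatial size as its input, and
`tile` / `overlap` are numbers you would tune):

```python
import torch
import torch.nn.functional as F

def tiled_inference(model, image, tile=512, overlap=64):
    """image: (1, C, H, W) tensor; returns a (1, 1, H, W) mask."""
    _, _, H, W = image.shape
    stride = tile - 2 * overlap
    # Mirror-pad so that every pixel falls in the centre of some tile.
    padded = F.pad(image, (overlap, overlap + stride,
                           overlap, overlap + stride), mode="reflect")
    out = torch.zeros_like(image[:, :1])

    for y in range(0, H, stride):
        for x in range(0, W, stride):
            crop = padded[:, :, y:y + tile, x:x + tile]
            with torch.no_grad():
                pred = model(crop)
            # Keep only the centre of the prediction; the overlapping
            # border exists just to give the tile proper context.
            cy, cx = min(stride, H - y), min(stride, W - x)
            out[:, :, y:y + cy, x:x + cx] = pred[:, :, overlap:overlap + cy,
                                                 overlap:overlap + cx]
    return out
```

Each gpu could then call tiled_inference() on its own subset of images,
following the usual DistributedDataParallel setup.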

Best.

K. Frank

Maybe reduce the batch size?

Thanks for the question!

To make sure I understand your question, let me rephrase: you have a trained image model and would like to run inference on 4,000 images with it, and feeding all 4,000 images into the GPU at once causes an out-of-memory error.

If that’s the case, there might be no need to apply FSDP/TP to the model, because the model itself fits in GPU memory. Can you simply split the 4,000 images into groups and only feed a subset of the images into the model at a time? E.g. can you feed 1 image into the model and obtain a correct result? If so, you can binary-search the largest group size that fits in memory. You can still parallelize inference by running on two GPUs simultaneously, each handling 2,000 images.
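
For example, something like this sketch (load_images, MyUNet, and the batch size of 8 are placeholders for your own code and for whatever group size the binary search gives you):

```python
import torch

def run_on_device(model, images, device, batch_size=8):
    """Run inference on one GPU over its share of the images."""
    model = model.to(device).eval()
    masks = []
    with torch.no_grad():
        for i in range(0, len(images), batch_size):
            batch = images[i:i + batch_size].to(device)
            masks.append(model(batch).cpu())
    return torch.cat(masks)

images = load_images()   # placeholder: (N, C, H, W) tensor, N = 4000 in your case
half = len(images) // 2
# In practice, launch these two halves in separate processes (or threads)
# so the two GPUs actually run at the same time.
masks_a = run_on_device(MyUNet(), images[:half], "cuda:0")
masks_b = run_on_device(MyUNet(), images[half:], "cuda:1")
masks = torch.cat([masks_a, masks_b])
```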