Let me describe my problem.
I have a PC with 2 Tesla P40 GPUs and a U-Net model that segments images for me: given an input image (which can be resized), it returns a binary mask.
However, I have 4k images, which do not fit on a single GPU: I get an out-of-memory error already while processing the first layer.
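For a rough sense of scale, this is how I estimate the size of one full-resolution activation. The numbers are illustrative assumptions (64 output channels as in the classic U-Net, float32, 3840x2160 input), not measurements from my model, and `activation_bytes` is just a helper name I made up.

```python
def activation_bytes(batch: int, channels: int, height: int, width: int,
                     bytes_per_element: int = 4) -> int:
    """Size in bytes of one dense activation tensor of shape (N, C, H, W)."""
    return batch * channels * height * width * bytes_per_element


# Assumed values: batch 1, 64 channels, 3840x2160 image, float32 (4 bytes).
size = activation_bytes(1, 64, 2160, 3840)
print(f"{size / 2**30:.2f} GiB per full-resolution activation")  # prints 1.98 GiB ...
```

A single tensor of that size is about 2 GiB, and several of them (plus the convolution workspace and, when gradients are enabled, everything saved for the backward pass) are alive at the same time in the first block.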
I have tried PyTorch's FSDP module. However, when I run it with 2 processes, it performs the segmentation twice and still gives me a memory error on the 4k images. I therefore assume that model parallelism alone is not enough, since a single layer (the first) does not fit on a single GPU.
I was thinking about tensor parallelism, with references like:
1- GitHub - NVIDIA/Megatron-LM: Ongoing research training transformer models at scale (but as far as I can tell it only targets text LLMs, not images)
2- GitHub - tunib-ai/parallelformers: Parallelformers: An Efficient Model Parallelization Toolkit for Deployment (which, I have the impression, also only covers NLP models)
3- model-parallelism/B_unet_model_sharding.py at main · garg-aayush/model-parallelism · GitHub (I see that model parallelism is applied here, but I don't see tensor parallelism applied to images)
I have also tried HuggingFace's accelerate library, but it gives me incorrect output (see issue: Incorrect output when using accelerate in a pytorch Unet model · Issue #2849 · huggingface/accelerate · GitHub).
Do you know of any existing implementation of tensor parallelism for image models? I would also like to know whether a model already trained on a single GPU needs to be retrained to apply this parallelism, and how that would be done.
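To make concrete what I mean by tensor parallelism for images, here is a minimal sketch of the idea as I understand it. It is my own assumption, not something taken from any of the libraries above: split the image spatially, give each half a few extra "halo" rows from its neighbour, run the same convolution weights on both halves, and stitch the results. `conv_in_two_halves` is a hypothetical helper name, it only handles a single stride-1 `nn.Conv2d` with zero "same" padding, and it runs on CPU here; on my machine each half would go to its own GPU.

```python
import torch
import torch.nn as nn


def conv_in_two_halves(conv: nn.Conv2d, x: torch.Tensor) -> torch.Tensor:
    """Apply a stride-1, zero 'same'-padded conv to the top and bottom halves separately."""
    halo = conv.padding[0]  # rows each half needs from its neighbour
    mid = x.shape[2] // 2   # split along the height dimension

    # Each half carries `halo` extra rows across the cut.
    top = x[:, :, : mid + halo, :]
    bottom = x[:, :, mid - halo :, :]

    # On two GPUs, `top` would live on "cuda:0" and `bottom` on "cuda:1",
    # each with its own copy of the (unchanged) conv weights.
    out_top = conv(top)[:, :, :mid, :]         # drop rows computed past the cut
    out_bottom = conv(bottom)[:, :, halo:, :]  # drop rows computed before the cut

    return torch.cat([out_top, out_bottom], dim=2)


torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)  # stand-in for a first U-Net layer
x = torch.randn(1, 3, 32, 48)                     # small stand-in for a 4k image

with torch.no_grad():
    full = conv(x)
    split = conv_in_two_halves(conv, x)

print(torch.allclose(full, split, atol=1e-5))
```

If this is right, the weights of a single convolution can be reused without retraining. What I don't know is how to carry this through a whole U-Net, where pooling, upsampling and skip connections would need either a much larger halo or an exchange of border rows between the GPUs at every layer.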