How to load a large model with multiple GPU cards?

This might be a simple question, but it bugged me the whole afternoon.

I was trying to use a pretrained M2M-100 12B model for a language processing task (the model file is 44 GB). I have 8 Tesla V100 GPU cards, each with 32 GB of memory. The program OOMed at:

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100-12B-avg-5-ckpt")

Error being:

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 31.75 GiB total capacity; 30.49 GiB already allocated; 177.75 MiB free; 30.50 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I know the problem is that a single GPU card's memory is not big enough to hold the whole model, but how can I leverage the memory of all 8 cards to load the model and run predictions/generation? There must be some way to do this; otherwise, for models that are really huge, we would eventually be unable to find any single GPU card with enough memory to load them. I would really appreciate it if someone could point me in the right direction. Thanks in advance!


You could load the model on the CPU first (using your RAM) and then push parts of it to specific GPUs to shard the model. With this naive model-sharding approach you would also need to change the forward pass, pushing the intermediate activations to the GPU that holds the next part of the model. I would therefore expect to find some model-sharding / pipeline-parallel scripts in the repository (unless the authors trained the model on e.g. an A100 80GB).
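The idea above can be sketched with a toy model. This is only an illustration, not the actual M2M-100 architecture: the two `nn.Linear` stages, the dimensions, and the `ShardedNet` name are all made up. Each stage is moved to its own device, and the forward pass moves the intermediate activations to match; it falls back to CPU when fewer than two GPUs are visible, so it runs anywhere.

```python
import torch
import torch.nn as nn

# Pick two devices; fall back to CPU if we don't have at least 2 GPUs.
have_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0") if have_gpus else torch.device("cpu")
dev1 = torch.device("cuda:1") if have_gpus else torch.device("cpu")

class ShardedNet(nn.Module):
    """Toy two-stage model with each stage on its own device."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(16, 32).to(dev0)  # first half of the model
        self.stage2 = nn.Linear(32, 8).to(dev1)   # second half

    def forward(self, x):
        x = self.stage1(x.to(dev0))
        # Move the intermediate activations to the device of the next stage.
        x = self.stage2(x.to(dev1))
        return x

net = ShardedNet()
out = net(torch.randn(4, 16))
print(out.shape)  # torch.Size([4, 8])
```

For a real pretrained checkpoint you would load the state dict on CPU first and assign submodules to devices the same way, instead of constructing fresh layers.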

Thank you for your kind response; it seems like a complicated issue. One thing I keep wondering about is, for instance, if I want to use Facebook's OPT model or OpenAI's GPT-3 model, which are pretrained and huge (several hundred GB in size), how can I find a single GPU card with that much memory to do question answering or sentence prediction? Is the only way to use CPU RAM or a single GPU card with huge memory?

Really appreciate your help!

These large models usually use a parallelism approach, such as model parallelism, tensor parallelism, pipeline parallelism, etc., e.g. via Megatron or DeepSpeed, and come with scripts to load them onto compute clusters.
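To give a flavor of the pipeline-parallel idea mentioned above: the batch is split into micro-batches that are streamed through the model stages, so that in a real multi-GPU deployment both stages can work concurrently. This is a minimal CPU-runnable sketch with made-up toy layers and sizes, not what Megatron or DeepSpeed actually implement:

```python
import torch
import torch.nn as nn

# Two toy pipeline stages; in a real setup each would live on its own GPU.
stage1 = nn.Linear(16, 32)
stage2 = nn.Linear(32, 8)

def pipeline_forward(x, n_micro=4):
    outs = []
    for mb in x.chunk(n_micro):   # split the batch into micro-batches
        h = stage1(mb)            # would run on GPU 0
        outs.append(stage2(h))    # would run on GPU 1 while GPU 0 takes the next micro-batch
    return torch.cat(outs)

y = pipeline_forward(torch.randn(8, 16))
print(y.shape)  # torch.Size([8, 8])
```

The real frameworks add the scheduling that overlaps micro-batches across stages, plus tensor parallelism within layers; this loop only shows the data flow.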

Thanks, I’ll look them up and see whether they can solve my problem.

Thank you!

@Jiheng_Yang you might also want to check out Deferred Module Initialization in torchdistX:
