I am trying to load the Hugging Face model microsoft/Phi-3-medium-4k-instruct onto my GPU. The model (26 GB) is bigger than my VRAM (24 GB), but I do have 2 GPUs. Can I load the model using both GPUs (splitting it across them) without running into this issue?
The code I am using is the following:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed
import time
import pdb
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct", use_fast=True)
# local_rank comes from the launcher (torchrun / deepspeed set LOCAL_RANK for each process)
local_rank = int(os.environ.get("LOCAL_RANK", 0))
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-medium-4k-instruct", torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map=local_rank)
pdb.set_trace()
Hey @anantguptadbl, I am not sure how this will help me. The issue, I guess, is that the model is bigger than the memory capacity of my first GPU, and by default it seems I am only using one GPU.
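
From reading the transformers docs, I think what I actually need is device_map="auto" (which requires accelerate to be installed), so the weights get sharded across both GPUs instead of all landing on GPU 0. A rough sketch of what I plan to try (untested on my side, parameter values are just guesses):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "microsoft/Phi-3-medium-4k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# device_map="auto" asks accelerate to split the fp16 weights across the
# available GPUs (and only spill to CPU if they still don't fit)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

# optionally cap per-GPU usage so there is headroom for activations, e.g.
# pass max_memory={0: "22GiB", 1: "22GiB"} to from_pretrained

prompt = "Hello, can you hear me?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If I understand correctly, this is plain model parallelism: the layers are spread over the two cards and each forward pass runs through them sequentially, so it solves the memory problem rather than speeding anything up.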