Running out of GPU memory when loading a huggingface model

I am trying to load the Hugging Face model microsoft/Phi-3-medium-4k-instruct onto my GPU. The model (26 GB) is larger than my VRAM (24 GB), but I do have 2 GPUs. Can I split the model across both GPUs so I don't run out of memory?

The code I am using is the following:

import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed
import time
import pdb

# local_rank comes from the distributed launcher; defaults to 0 when run standalone
local_rank = int(os.environ.get("LOCAL_RANK", 0))

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-medium-4k-instruct", use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-medium-4k-instruct",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map=local_rank,  # pins the entire model to a single GPU
)
pdb.set_trace()

@chrathans

Can you try creating the device_map at the layer level?

Example (the module names below are from a GPT-2-style model; each entry pins a submodule to a GPU index, "cpu", or "disk"):

device_map = {
    "transformer.wte": "cpu",
    "transformer.wpe": 0,
    "transformer.drop": "cpu",
    "transformer.h.0": "disk"
}
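To sketch what that would look like for your model: the map has to use the module names of the actual architecture, which for Phi-3 should be `model.embed_tokens`, `model.layers.0` … `model.layers.39`, `model.norm`, and `lm_head` (verify with `print(model)` or `model.hf_device_map` on your end; the 20/20 split below is just an illustrative guess):

```python
import torch
from transformers import AutoModelForCausalLM

# Hand-written map splitting Phi-3-medium's 40 decoder layers across two GPUs.
# Values are a GPU index, "cpu", or "disk".
device_map = {
    "model.embed_tokens": 0,
    **{f"model.layers.{i}": 0 for i in range(20)},    # first half on GPU 0
    **{f"model.layers.{i}": 1 for i in range(20, 40)},  # second half on GPU 1
    "model.norm": 1,
    "lm_head": 1,
}

# Only attempt the load when two GPUs are actually present.
if torch.cuda.device_count() >= 2:
    model = AutoModelForCausalLM.from_pretrained(
        "microsoft/Phi-3-medium-4k-instruct",
        torch_dtype=torch.float16,
        device_map=device_map,
    )
```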

Will try from my end as well

Hey @anantguptadbl, I am not sure how this will help me. The issue, I think, is that the model is bigger than the memory capacity of my first GPU, and by default it seems I am only using one GPU.
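As a quick sanity check, this is what I run to confirm how many GPUs PyTorch actually sees and how much memory each one has:

```python
import torch

# List every CUDA device visible to this process and its total memory.
n = torch.cuda.device_count()
print(f"{n} CUDA device(s) visible")
for i in range(n):
    props = torch.cuda.get_device_properties(i)
    print(f"cuda:{i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```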

Haven’t explored it myself but maybe relevant: Big Model Inference
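If I understand those docs correctly, the short version is to pass `device_map="auto"` so Accelerate shards the layers across every available GPU (spilling to CPU RAM only if they still don't fit). A minimal sketch, assuming `transformers` and `accelerate` are installed:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-medium-4k-instruct"

if torch.cuda.is_available():
    tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
    # "auto" lets Accelerate place layers across all visible GPUs,
    # instead of pinning the whole model to one device index.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        device_map="auto",
    )
    # Shows which device each submodule actually landed on.
    print(model.hf_device_map)
```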

Ok that indeed worked for me!