I am using the nvidia/NV-Embed-v2 model with 7.85 billion parameters to generate embeddings. I’m loading the model with FP16 precision using the following code:
```python
import torch
from transformers import AutoTokenizer, AutoModel

torch.cuda.empty_cache()
device = torch.device('cuda')

# Load the model, move it to the GPU, then cast it to FP16
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True).to(device)
model.to(torch.float16)
```
I expect the model to take approximately 15.7 GB of memory (calculated as 7.85 × 2 GB, i.e. two bytes per parameter in FP16). However, when I measure the memory usage after running this code, without performing any inference, the reported memory consumption is about 25 GB.
Can someone please explain why the memory usage is higher than expected? Is this extra memory usage due to model overhead (e.g., optimizer states, activation storage) or something else?
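A quick sketch for checking the raw weight footprint directly (parameters and buffers only; this does not include the CUDA context, the caching allocator's reserved-but-unused memory, or activations), reusing the `model` object from the snippet above:

```python
# Estimate the weight footprint of the already-loaded model.
param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
print(f"weights + buffers: {(param_bytes + buffer_bytes) / 1024**3:.2f} GiB")
```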
I don’t know whether your memory estimate is correct.
However, you are loading the model in the default precision (float32) onto your GPU first and only then casting all registered parameters and buffers to float16. I also don’t know how you are measuring the used memory, but note that the peak memory will of course be higher if the original model used a wider precision.
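If you want to avoid the float32 peak altogether, a minimal sketch (assuming a transformers version where `from_pretrained` accepts the `torch_dtype` argument; the exact argument name may differ between releases) would be to cast the weights while loading:

```python
import torch
from transformers import AutoModel

# Load the weights directly in float16 so a full float32 copy
# never needs to reside on the GPU.
model = AutoModel.from_pretrained(
    'nvidia/NV-Embed-v2',
    trust_remote_code=True,
    torch_dtype=torch.float16,
).to('cuda')
```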
I am using nvidia-smi and nvtop to monitor GPU memory usage. Are these tools appropriate for measuring GPU memory usage, or are there other methods or tools that might provide a more accurate or comprehensive readout for this specific case?
Yes, nvidia-smi reports the total memory used on the device (including the CUDA context and PyTorch's cached allocator memory, not just live tensors). If you want to check the allocated and reserved memory in your script, you could use torch.cuda.memory_allocated(), .memory_reserved(), or .memory_summary().
The previous point still holds: if you load the model onto the GPU in float32 first and only cast the parameters and buffers to float16 afterwards, the peak memory usage will reflect the float32 copy of the model.
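For completeness, a small sketch of how you could read those counters yourself (all standard torch.cuda calls); the peak value will still include the float32 spike if the cast happens after the model is already on the GPU:

```python
import torch

# Compare nvidia-smi's total against PyTorch's own allocator counters.
gib = 1024**3
print(f"allocated:      {torch.cuda.memory_allocated() / gib:.2f} GiB")
print(f"reserved:       {torch.cuda.memory_reserved() / gib:.2f} GiB")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / gib:.2f} GiB")
print(torch.cuda.memory_summary())  # detailed allocator breakdown
```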