When I load a model for inference, it always consumes a lot of GPU memory, even when the model itself is small. For example, the model size is 70 MB (encoder + decoder + attention, with ResNet-50 as the encoder backbone), but it occupies approximately 1 GB of GPU memory.
Is there any way to reduce the GPU memory used when loading the model for inference?
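I’m afraid there is not much we can do. Most of that memory is likely taken by the CUDA context that gets created the first time the GPU is used, not by the model weights.
You can check this other topic on the same subject: "Moving a tiny model to cuda causes a 2Gb host memory allocation"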
How about using float16 instead of float32? That conversion halves the memory needed for the weights.
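To make the float16 suggestion concrete, here is a minimal sketch that compares the bytes held by a model's parameters before and after `model.half()`. The small `nn.Sequential` model and the `tensor_bytes` helper are hypothetical stand-ins, not the encoder/decoder from the question, and the example runs on CPU so it only shows the weight size, not total GPU usage.

```python
import torch
import torch.nn as nn


def tensor_bytes(model: nn.Module) -> int:
    """Bytes held by the model's parameters (hypothetical helper)."""
    return sum(p.numel() * p.element_size() for p in model.parameters())


# Hypothetical stand-in for the real model from the question.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

fp32_bytes = tensor_bytes(model)
model.half()  # casts floating-point parameters and buffers to float16 in place
fp16_bytes = tensor_bytes(model)

print(f"float32 weights: {fp32_bytes} bytes")
print(f"float16 weights: {fp16_bytes} bytes")
```

Note that inputs then need to be float16 as well (for example `x.half()`), and that this only shrinks the tensors themselves. Any fixed overhead outside the weights is not affected by the conversion.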
Thanks all for the support.
With float16: I have used Apex (mixed precision) for training, and the model size is reduced to 50 MB, but when I load the model onto the GPU it still consumes 950 MB.
Looks like I need to look further into the CUDA context for this issue.
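One way to check how much of the usage comes from the model itself is to compare the weight size with what PyTorch's caching allocator reports. The sketch below is only an illustration: `report_gpu_memory` is a hypothetical helper and the `nn.Linear` layer is a stand-in for the real model. It skips the GPU numbers when no CUDA device is available.

```python
import torch
import torch.nn as nn


def report_gpu_memory(model: nn.Module) -> dict:
    """Compare weight size with what PyTorch's allocator holds on the GPU."""
    weight_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    stats = {"weight_bytes": weight_bytes}
    if torch.cuda.is_available():
        model.cuda()  # the first CUDA call also creates the CUDA context
        torch.cuda.synchronize()
        # Memory occupied by live tensors on the current device.
        stats["allocated_bytes"] = torch.cuda.memory_allocated()
        # Memory held by the caching allocator (allocated plus cached blocks).
        stats["reserved_bytes"] = torch.cuda.memory_reserved()
    return stats


model = nn.Linear(1024, 1024)  # hypothetical stand-in for the real model
stats = report_gpu_memory(model)
for name, value in stats.items():
    print(f"{name}: {value / 2**20:.1f} MiB")
```

If nvidia-smi shows far more for the process than `reserved_bytes`, the difference is roughly the CUDA context, which PyTorch does not track as tensor memory and which stays about the same regardless of model size or dtype.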