When I load the model for inference, it always consumes a lot of GPU memory even though the model itself is small. For example, the model is 70 MB (encoder + decoder + attention, with ResNet-50 as the backbone for the encoder), but it occupies approximately 1 GB of GPU memory.
Is there any way to reduce GPU memory usage when loading the model for inference?
Thanks all for your support.
With float16: I have used Apex (mixed precision) for training, and the model size is reduced to 50 MB, but when I load the model onto the GPU it still consumes 950 MB.
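One thing worth checking is that half precision really does halve the memory the weights themselves take. A minimal sketch (using a hypothetical stand-in `nn.Sequential` model, not the actual encoder/decoder) that counts parameter bytes before and after `.half()`:

```python
import torch
import torch.nn as nn

# Hypothetical small model standing in for the encoder/decoder;
# the point is only how .half() affects parameter memory.
model = nn.Sequential(nn.Linear(1024, 1024), nn.Linear(1024, 1024))

def param_bytes(m: nn.Module) -> int:
    # Total bytes occupied by the model's parameters.
    return sum(p.numel() * p.element_size() for p in m.parameters())

fp32_bytes = param_bytes(model)
model = model.half()  # convert all weights to float16
fp16_bytes = param_bytes(model)

print(f"fp32: {fp32_bytes / 2**20:.2f} MB, fp16: {fp16_bytes / 2**20:.2f} MB")
```

If the weights drop from ~100 MB to ~50 MB but the reported GPU usage barely moves, the remainder is not the weights.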
Looks like I need to read more about the CUDA context for this issue.
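A way to see this is to compare what PyTorch's tensors actually hold against what nvidia-smi reports: the gap is mostly the CUDA context plus loaded cuDNN/cuBLAS kernels, which can easily be several hundred MB and does not shrink with the model. A rough sketch (the `nn.Linear` model here is just an illustrative placeholder):

```python
import torch

if torch.cuda.is_available():
    model = torch.nn.Linear(1024, 1024).cuda()
    allocated = torch.cuda.memory_allocated()  # bytes held by live tensors
    reserved = torch.cuda.memory_reserved()    # bytes held by PyTorch's caching allocator
    print(f"tensors: {allocated / 2**20:.1f} MB, allocator: {reserved / 2**20:.1f} MB")
    # nvidia-smi shows allocator + CUDA context, so its number will be larger;
    # the difference is fixed per-process overhead you cannot remove by
    # making the model smaller.
else:
    print("No GPU available; run on a CUDA machine to see the breakdown.")
```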