Inference memory consumption is higher than expected

I am trying to run inference with a GPT-style model (i.e., a model where the transformer block repeats n times).
My model has 48 layers and a hidden size of 4096, so the model parameters occupy ~20 GB (the model is in fp16). I've set the model to eval mode and I'm running this on an A100 80GB GPU. I've used the torch.no_grad decorator over the forward pass of the main model block that contains all the layers. There is no KV cache in this implementation.
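
Roughly, the setup looks like this (a minimal sketch, not my exact code; the head count and the sequence length below are assumed values, not from my actual config):

```python
import torch
import torch.nn as nn

class GPTStack(nn.Module):
    # Simplified stand-in for the real model: the transformer block repeated n times.
    # num_heads=32 and the 2048 sequence length below are assumptions for illustration.
    def __init__(self, num_layers=48, hidden_size=4096, num_heads=32):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_size,
                nhead=num_heads,
                dim_feedforward=4 * hidden_size,
                batch_first=True,
            )
            for _ in range(num_layers)
        ])

    @torch.no_grad()  # no_grad decorator over the forward pass of the main block
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # no KV cache anywhere
        return x

model = GPTStack().half().eval().cuda()  # fp16 weights (~20 GB), eval mode, A100 80GB

batch_size, seq_len, hidden = 100, 2048, 4096
x = torch.randn(batch_size, seq_len, hidden, dtype=torch.float16, device="cuda")
out = model(x)  # OOMs for batch sizes much above ~100
```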

Theoretically, I should be able to fit a batch size of more than 1000, but I can only fit about 100. Why is this happening?
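
For context, here is the back-of-the-envelope math behind the ">1000" figure (the sequence length of 2048 is just an assumed value for illustration, and I only counted hidden-state-sized activations):

```python
# Back-of-the-envelope estimate behind the ">1000" figure.
# seq_len = 2048 is an assumed example value, not from my actual config.
gpu_mem_gb = 80
weights_gb = 20                               # 48 layers, hidden 4096, fp16
free_gb = gpu_mem_gb - weights_gb             # ~60 GB left over for activations

seq_len, hidden, bytes_per_fp16 = 2048, 4096, 2
hidden_state_gb = seq_len * hidden * bytes_per_fp16 / 1e9   # ~0.017 GB per sample

# Counting only one hidden-state-sized activation per sample:
print(free_gb / hidden_state_gb)              # ~3500 samples
```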
I've tried using torch.cuda.empty_cache() and deleting variables that aren't necessary, but that doesn't alleviate the problem.
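
The cleanup attempt looks roughly like this (sketch, with `model` and `x` as above):

```python
import torch

with torch.no_grad():
    out = model(x)

del out                           # drop tensors that are no longer needed
torch.cuda.empty_cache()          # releases unused cached blocks back to the driver,
                                  # but doesn't reduce memory held by live tensors

print(torch.cuda.max_memory_allocated() / 1e9, "GB peak")
```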