BERT model consuming too much memory

I am trying to train a BERT model, but one forward pass consumes about 11 GB of GPU memory. I am using the transformers library. Is the library not optimized enough, or what?
Here’s the code:

from transformers import BertConfig, BertForSequenceClassification
import torch

config = BertConfig()
model = BertForSequenceClassification(config)

model.cuda(5)
model.forward(input_ids=torch.ones([15, 512], dtype=torch.int64).cuda(5),
              attention_mask=torch.ones([15, 512], dtype=torch.int64).cuda(5),
              labels=torch.ones([15], dtype=torch.int64).cuda(5))

I’ve only included the relevant code here, but I can provide other snippets if needed.

What memory usage do you expect and how are you measuring it?
Adding up the tensor shapes of all parameters and buffers, as well as the intermediate activations and gradients, would give you an approximate estimate of the memory footprint.
Also, you could use the memory profiler to check where memory is allocated.
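A minimal sketch of the parameter-size part of that estimate (using a small toy module as a stand-in here; the same loop works unchanged on a `BertForSequenceClassification` instance):

```python
import torch.nn as nn

def param_bytes(m: nn.Module) -> int:
    # numel() is the number of values in each parameter tensor,
    # element_size() is the bytes per value (4 for float32).
    return sum(p.numel() * p.element_size() for p in m.parameters())

# Toy stand-in model just to demonstrate the calculation
model = nn.Sequential(nn.Linear(512, 768), nn.Linear(768, 2))
print(f"parameters: {param_bytes(model) / 1024**2:.2f} MiB")
```

Note this only covers the weights; gradients roughly double it, optimizer state (e.g. Adam’s two moment buffers) adds more, and activations scale with batch size and sequence length.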

I am measuring the GPU memory consumption with nvidia-smi, checking how much memory is allocated to the particular PID. Every time I execute the code attached above (literally every time I run inference), another 11 GB is added to the memory allocated to the process. That memory is not freed until I restart the kernel.

I haven’t profiled the memory consumption of this model, but do you have an estimate of the expected memory usage, and why do you think it’s too much?
Also note that nvidia-smi shows the memory used by the CUDA context plus the allocated memory and the cached memory (which PyTorch can reuse without new allocations).
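To see that distinction in practice, you could compare PyTorch’s own counters with what nvidia-smi reports (a rough sketch; it only does anything on a machine with a CUDA device):

```python
import torch

if torch.cuda.is_available():
    # Allocate a tensor roughly the size of one BERT hidden-state activation
    # (batch 15, sequence 512, hidden 768)
    x = torch.randn(15, 512, 768, device="cuda")

    print(f"allocated: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")  # held by live tensors
    print(f"reserved:  {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")   # PyTorch's cache

    del x
    # The freed memory stays in PyTorch's caching allocator, so nvidia-smi
    # still shows it as used. empty_cache() returns the cached (not live)
    # memory to the driver, which is when nvidia-smi's number drops.
    torch.cuda.empty_cache()
    print(f"after empty_cache, reserved: {torch.cuda.memory_reserved() / 1024**2:.1f} MiB")
```

`torch.cuda.memory_summary()` gives a more detailed breakdown if you need to see where the allocations come from.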