Weird memory allocation issue: PyTorch tries to allocate 95 GiB instead of 15 GB

Hello All,
I have a model (let us call it NN_Model) that is made of two submodules, say A and B. The input first goes through A, then through B, and finally through a couple of conv layers (two, to be precise).

output = A(input)
output = B(output)
output = conv(output)
return output
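
In skeleton form it looks roughly like this (just a sketch: A, B, and the channel count of the final convs are placeholders here, not the real definitions):

import torch.nn as nn

class NN_Model(nn.Module):
    # Skeleton only -- A and B stand in for the real submodules, and the
    # channel count of the two final conv layers is a placeholder.
    def __init__(self, A: nn.Module, B: nn.Module, channels: int = 64):
        super().__init__()
        self.A = A
        self.B = B
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        out = self.A(x)
        out = self.B(out)
        out = self.conv(out)
        return out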

When I create an instance of submodule A and run its forward pass, it takes around 7.4 GB on the GPU. Similarly, submodule B alone takes around 5 GB. The final convolutional layers do not even take 700 MB on their own.
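
For reference, the per-submodule peaks can be measured roughly like this (just a sketch; the helper and the input shape below are placeholders, not my real setup):

import torch

def report_peak(module, inp, name):
    # Reset the peak-memory counter, run one forward pass, and print the
    # maximum GPU memory PyTorch had allocated during that pass (the peak
    # also includes whatever was already allocated before the call).
    torch.cuda.reset_peak_memory_stats()
    out = module(inp)
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{name}: peak {peak_gib:.2f} GiB, output shape {tuple(out.shape)}")
    return out

x = torch.randn(1, 3, 224, 224, device="cuda")  # placeholder input shape
out_a = report_peak(A, x, "A")
out_b = report_peak(B, out_a, "B")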

However, when I try to run the forward pass of NN_Model, PyTorch tries to allocate 95 GiB on the GPU and fails with the following error message:
RuntimeError: CUDA out of memory. Tried to allocate 95.37 GiB (GPU 0; 7.80 GiB total capacity; 4.25 GiB already allocated; 2.59 GiB free; 4.31 GiB reserved in total by PyTorch)
I am not able to understand this sudden surge in required memory. I expected a total of around 15 GB, give or take a GB, but 95 GiB is a lot.
Submodule A is a ResNet50 from torchvision, and submodule B contains 4 softmaxes as its learnable parameters and nothing else. Its forward function mainly consists of bmm calls, which I assumed should not add to the memory usage.

Could anyone please help me figure out the possible reasons for the above aberration?

Edit - I noticed that the 95 GiB allocation is attempted inside the forward of B (which works perfectly when run separately). What could be going wrong?

TIA

Could you post the code of NN_Model so that we could try to reproduce this issue, please?
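In the meantime, one quick check you could do: print the operand shapes right before each torch.bmm inside B's forward while it runs as part of NN_Model, and compare them with the shapes from your standalone test of B. The output of a single bmm can already be huge; a rough way to estimate its size (the shapes here are made up):

import torch

def bmm_output_gib(a, b):
    # torch.bmm on shapes (batch, n, m) x (batch, m, p) allocates a
    # (batch, n, p) output; this returns that output's size in GiB.
    batch, n, _ = a.shape
    _, _, p = b.shape
    return batch * n * p * a.element_size() / 1024**3

lhs = torch.empty(16, 64 * 64, 256)   # e.g. a 64x64 feature map flattened to 4096 positions
rhs = torch.empty(16, 256, 64 * 64)
print(f"{bmm_output_gib(lhs, rhs):.2f} GiB")   # 1.00 GiB just for this one output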