PyTorch 1.6.0 seems to leak memory in conv2d

I’m using PyTorch 1.6.0, mmdetection 2.2.0, and Mask R-CNN for inference. In ResNet the first conv2d seems to consume about 300 MB of GPU memory and not release it afterwards. I’m monitoring GPU usage with "nvidia-smi -lms 100" and the Windows Task Manager. Is there something wrong with how I’m monitoring GPU usage, or does PyTorch really leak GPU memory in conv2d?

There are a few things to keep in mind:

  • PyTorch will keep GPU memory around after tensors have been deallocated (the CUDA memory cache), and nvidia-smi sees that as still in use. You can check allocated / cached memory and release the cache with the functions in torch.cuda (see the sketch after this list).
  • You would want to wrap things in with torch.no_grad() to instruct PyTorch to not keep the information around for backward.
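For example, a minimal sketch of that kind of check (the toy conv and input shape are just placeholders, not your actual model):

import torch

# placeholder model and input, just to produce some allocations
model = torch.nn.Conv2d(3, 64, 3, padding=1).cuda()
x = torch.randn(1, 3, 224, 224, device='cuda')

with torch.no_grad():                    # don't keep intermediates for backward
    y = model(x)

print(torch.cuda.memory_allocated())     # bytes currently held by live tensors
print(torch.cuda.memory_reserved())      # bytes held by the caching allocator
torch.cuda.empty_cache()                 # return unused cached blocks to the driver
print(torch.cuda.memory_reserved())      # reserved memory after releasing the cache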

While bugs in PyTorch, including memory leaks, can occur, it seems unlikely that ordinary use of Conv2d has one, because it would affect a great many people.

Best regards

Thomas

Thank you for your answer; I think your point about nvidia-smi is right.
But I have another question, could you please explain it to me?
We want to deploy more than one PyTorch model on a single GPU, so we care most about peak usage. I have monitored GPU usage with both nvidia-smi and torch.cuda.memory_summary(). For resnet_maskrcnn, nvidia-smi reports 3095 MB and torch.cuda.max_memory_reserved() returns 1778384896 bytes (1696 MB). For mobilenetV3_maskrcnn, nvidia-smi reports 5964 MB and torch.cuda.max_memory_reserved() returns 4640997376 bytes (4426 MB). Both numbers are larger than for resnet_maskrcnn, yet we can deploy three mobilenetV3_maskrcnn instances on a 1050 Ti but only two resnet_maskrcnn instances. Could you explain why, or is there something wrong with my profiling method?
I have summarized the results in a table:

Model                  nvidia-smi   torch.cuda.max_memory_reserved()
resnet_maskrcnn        3095 MB      1778384896 bytes (1696 MB)
mobilenetV3_maskrcnn   5964 MB      4640997376 bytes (4426 MB)
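For reference, a minimal sketch of how such peak numbers can be collected (model and img here are placeholders for the actual mmdetection model and input):

import torch

torch.cuda.reset_peak_memory_stats()        # start peak tracking from the current state

with torch.no_grad():
    result = model(img)                     # placeholder for the real inference call

print(torch.cuda.max_memory_allocated())    # peak bytes held by live tensors
print(torch.cuda.max_memory_reserved())     # peak bytes held by the caching allocator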

You might have allocations outside the PyTorch memory allocator. I think the CUDA context itself doesn’t show up as a PyTorch allocation, if I interpret @ptrblck’s comment correctly.
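One way to see that gap is to compare what the driver reports with what the allocator reports, roughly like this (the nvidia-smi query flags are standard; treat the rest as an illustration):

import subprocess
import torch

torch.ones(1, device='cuda')    # force CUDA context creation

# memory the driver sees as used on GPU 0 (CUDA context, cuDNN/cuBLAS handles,
# plus anything other processes have allocated on that GPU)
used_mib = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits', '-i', '0']
).decode().strip()

# memory held by PyTorch's caching allocator only
reserved_mib = torch.cuda.memory_reserved() / 1024 ** 2

print(f'driver: {used_mib} MiB used, PyTorch allocator: {reserved_mib:.0f} MiB reserved')
# the difference is roughly the CUDA context and other non-allocator memory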

Best regards

Thomas

I agree that the CUDA context consumes some GPU memory; I think that’s why the nvidia-smi result differs from the torch.cuda.max_memory_reserved() result. But that still doesn’t explain why mobilenetV3_maskrcnn uses more GPU memory at peak, yet three of them can be deployed concurrently on a single GPU, while only two resnet50_maskrcnn models fit. Or is there something wrong in my understanding?

I did a very simple experiment today; the code is as follows:

import os
import torch

with torch.no_grad():
    print(torch.cuda.memory_reserved())
    print(torch.cuda.memory_allocated())
    print(os.system('nvidia-smi'))
    inputs = torch.randn(1, 3, 512, 512).cuda()
    print(os.system('nvidia-smi'))
    conv = torch.nn.Conv2d(3, 64, (7, 7), stride=2, padding=3, bias=False).cuda()
    inputs = conv(inputs)
    print(torch.cuda.memory_reserved())
    print(torch.cuda.memory_allocated())
    torch.cuda.empty_cache()
    print(torch.cuda.memory_reserved())
    print(torch.cuda.memory_allocated())
The output is as follows (some irrelevant lines removed):
0 ---- print(torch.cuda.memory_reserved())
0 ---- print(torch.cuda.memory_allocated())

487MiB / 8192MiB ---- print(os.system('nvidia-smi'))
This is before "torch.randn(1, 3, 512, 512).cuda()". At this point PyTorch (and therefore CUDA) is probably not initialized yet, so I think this 487 MiB is all used by Windows.

938MiB / 8192MiB ---- print(os.system('nvidia-smi'))
This is after "torch.randn(1, 3, 512, 512).cuda()". At this point PyTorch is initialized and "inputs" consumes only a little GPU memory, so roughly 450 MB (938 - 487) is the CUDA context. So far everything can be explained.

23068672 ---- print(torch.cuda.memory_reserved())
17863680 ---- print(torch.cuda.memory_allocated())
After inputs = conv(inputs), a single conv seems to consume about 2 GB of GPU memory. Is this normal? And after the conv I thought it should drop back to 938 MiB, which is why I suspected a memory leak in PyTorch.

23068672 ---- print(torch.cuda.memory_reserved())
17863680 ---- print(torch.cuda.memory_allocated())
After torch.cuda.empty_cache(), nothing changes, so I don’t understand why PyTorch holds on to so much GPU memory.

That’s ~23MB.

Note that there are additional lazy initializations for things like cuBLAS, cuRAND, and cuDNN.
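A sketch of how to see this: rerun the same kind of experiment as above, but call nvidia-smi again after the first conv. The driver-side number jumps by far more than the few MiB the allocator reports, because cuDNN and friends get loaded at that point (exact numbers depend on GPU and driver):

import os
import torch

x = torch.randn(1, 3, 512, 512, device='cuda')
os.system('nvidia-smi')                 # CUDA context only

conv = torch.nn.Conv2d(3, 64, (7, 7), stride=2, padding=3, bias=False).cuda()
with torch.no_grad():
    y = conv(x)                         # first conv triggers cuDNN initialization
torch.cuda.synchronize()
os.system('nvidia-smi')                 # driver-side usage jumps here, but...

print(torch.cuda.memory_reserved())     # ...the allocator itself holds only a few MiB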

You are right, my fault :smile:

I did an experiment with resnet50_maskrcnn; here is the code:

# inside the detector's forward pass, with the model in eval mode
print(torch.cuda.memory_reserved())
print(torch.cuda.memory_allocated())
x = self.extract_feat(img)   # backbone feature extraction
print(torch.cuda.memory_reserved())
print(torch.cuda.memory_allocated())

The model has had eval() called on it, so model.training is False.
201326592 (about 192 MB) ---- print(torch.cuda.memory_reserved())
193334272 (about 184.377 MB) ---- print(torch.cuda.memory_allocated())
Before the inference PyTorch reserved about 192 MB; that I can understand.

x = self.extract_feat(img)
This is the ResNet backbone; x consumes about 87306240 bytes (83.262 MB).

1304428544 (about 1244 MB) ---- print(torch.cuda.memory_reserved())
1267727360 (about 1208.999 MB) ---- print(torch.cuda.memory_allocated())
My question is: why does PyTorch hold so much GPU memory after the backbone inference? Only x is left on the GPU and it consumes only about 83.262 MB, so what is the remaining ~1 GB (1244 - 83.262 MB) of GPU memory used for?

Probably due to pending garbage collection (on the Python side).
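A quick way to test that hypothesis (a sketch; whether the numbers drop depends on what is still referenced):

import gc
import torch

gc.collect()                  # run Python garbage collection explicitly
torch.cuda.empty_cache()      # then release cached blocks that are no longer used

print(torch.cuda.memory_reserved())    # compare with the ~1244 MB seen above
print(torch.cuda.memory_allocated())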

Maybe you are right, but then how do you explain that a 1050 Ti can load three mobilenetV3_maskrcnn models but only two resnet50_maskrcnn models?

No idea, sorry. Check topics like this one. Fragmentation may be an issue if you’re stress testing.
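As a starting point for checking fragmentation, torch.cuda.memory_summary() prints a per-pool breakdown of allocated vs. reserved memory; a sketch:

import torch

# a large gap between reserved and allocated memory, spread over many blocks,
# is a hint that the allocator's cache is fragmented
print(torch.cuda.memory_summary(device=0, abbreviated=True))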