TorchVision ResNet model: GPU memory increases by 300 MB after every forward pass


(Burhan) #1

Hi,

I’m working on a Faster R-CNN model written in PyTorch that uses the ResNet backbone from the torchvision API. My program usually crashes after ~2000 iterations (sometimes fewer, sometimes more). I tried to profile the memory allocations using https://gist.github.com/MInner/8968b3b120c95d3f50b8a22a74bf66bc. What I’ve noticed is that during a forward pass almost 300 MB is added while the following lines execute. This is a sample of the output from my log.

model_tubelet forward:71 :9824.0 Mb features = self.features(image)
backbone.resnet forward:75 :10044.0Mb residual = x
backbone.resnet forward:77 :10044.0Mb out = self.conv1(x)
backbone.resnet forward:78 :10044.0Mb out = self.bn1(out)
backbone.resnet forward:79 :10044.0Mb out = self.relu(out)
backbone.resnet forward:81 :10044.0Mb out = self.conv2(out)
backbone.resnet forward:82 :10044.0Mb out = self.bn2(out)
backbone.resnet forward:83 :10044.0Mb out = self.relu(out)
backbone.resnet forward:85 :10044.0Mb out = self.conv3(out)
backbone.resnet forward:86 :10044.0Mb out = self.bn3(out)
backbone.resnet forward:88 :10044.0Mb if self.downsample is not None:
backbone.resnet forward:89 :10044.0Mb residual = self.downsample(x)
backbone.resnet forward:91 :10154.0Mb out += residual
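
To make the jumps in a log like this easier to spot, the lines can be diffed against each other. The following is only a minimal sketch: the regular expression assumes every line follows the "module function:line :memoryMb code" layout seen in the excerpt, and the name memory_jumps is made up for illustration. It reports the line at which a changed reading is first observed, which is not necessarily the statement that caused the allocation.

```python
import re

# Assumed layout: "<module> <function>:<line> :<memory>Mb <source code>"
LINE_RE = re.compile(r"^(\S+) (\w+):(\d+)\s+:\s*([\d.]+)\s*Mb\s+(.*)$")


def memory_jumps(log_text):
    """Return (delta_mb, module, line_number, code) for every line whose
    reported memory differs from the previous parsed line."""
    jumps = []
    previous_mb = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip lines that do not look like profiler output
        module, _func, lineno, mb, code = match.groups()
        mb = float(mb)
        if previous_mb is not None and mb != previous_mb:
            jumps.append((mb - previous_mb, module, int(lineno), code))
        previous_mb = mb
    return jumps


if __name__ == "__main__":
    sample = "\n".join([
        "model_tubelet forward:71 :9824.0 Mb features = self.features(image)",
        "backbone.resnet forward:75 :10044.0Mb residual = x",
        "backbone.resnet forward:89 :10044.0Mb residual = self.downsample(x)",
        "backbone.resnet forward:91 :10154.0Mb out += residual",
    ])
    for delta, module, lineno, code in memory_jumps(sample):
        print("%+.1f MB at %s:%d  %s" % (delta, module, lineno, code))
```

For the excerpt above, the two changes are +220 MB (first seen at resnet line 75) and +110 MB (first seen at line 91), 330 MB in total.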

I don’t see this behavior on every forward pass, only on some of them, intermittently. I also tried using del in my main training loop to delete the input variables and the output loss values after the optimizer step, but that only delays the crash. I can share the code if needed.
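
Since the growth is intermittent, it may help to record one memory reading per training iteration and list only the iterations where it went up. This is a rough sketch, not taken from the actual training code: allocated_mb and growth_steps are hypothetical helper names, and the 1 MB threshold is an arbitrary assumption. torch.cuda.memory_allocated() reports the memory occupied by tensors, not the memory held by the caching allocator, so its values can be lower than what nvidia-smi shows.

```python
def allocated_mb():
    """Currently allocated GPU tensor memory in MB (needs PyTorch and a CUDA device)."""
    import torch  # imported lazily so growth_steps() can be used without PyTorch
    return torch.cuda.memory_allocated() / 1024 ** 2


def growth_steps(readings, threshold_mb=1.0):
    """Return (iteration, delta_mb) for each reading that exceeds the
    previous one by more than threshold_mb."""
    steps = []
    for i in range(1, len(readings)):
        delta = readings[i] - readings[i - 1]
        if delta > threshold_mb:
            steps.append((i, delta))
    return steps


# Hypothetical use inside the training loop:
#     readings = []
#     for batch in loader:
#         ...forward, backward, optimizer.step(), del of inputs and losses...
#         readings.append(allocated_mb())
#     print(growth_steps(readings))
```

If the listed iterations line up with particular inputs (for example unusually large images or many proposals), that would narrow down where to look.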

EDIT:

PyTorch Version: 0.4.0
Python Version: 3.5.2
TorchVision: 0.2.1