Running out of memory while running a model

I am trying to run a small neural network on the CPU and am finding that the memory used by my script grows without limit. Since my script does little besides call the network, the problem appears to be a memory leak within PyTorch.

The problem does not occur if I run the model on the GPU.

I’ve also posted this to the PyTorch GitHub, but I was hoping someone here might be able to help.

Here’s the script I’m using. The maxrss value it prints keeps increasing until the machine runs out of memory, at which point it crashes.

import torch
import torch.nn.functional as F
import torch.nn as nn

import resource
import gc

class Net(nn.Module):
    def __init__(self, in_size, h_size, out_size):
        super(Net, self).__init__()

        self.lin1 = nn.Linear(in_size, h_size)
        self.lin2 = nn.Linear(h_size, out_size)
        self.nonlin = F.relu

    def forward(self, x):
        h = self.nonlin(self.lin1(x))
        h = self.nonlin(self.lin2(h))
        return h

def main():
    in_size = 100
    h_size = 256
    out_size = 512

    net = Net(in_size, h_size, out_size)

    while True:
        gc.collect()
        print('maxrss = {}'.format(
            resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6))
        for i in range(1000):
            net(torch.randn(in_size))

if __name__ == '__main__':
    main()

This is the Dockerfile for building my environment:

FROM nvidia/cuda:10.0-cudnn7-devel-ubuntu18.04

RUN apt-get update -y && \
    apt-get upgrade -y

RUN apt-get clean autoclean -y && \
    apt-get autoremove -y

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get install -y libsm6 libxext6 wget curl unzip \
    apt-utils git python-virtualenv python3-dev python3-pip software-properties-common python3-opencv

RUN apt-get clean autoclean -y && \
    apt-get autoremove -y

RUN apt-get install python3-pip cmake zlib1g-dev python3-tk python-opencv -y
RUN apt-get install libboost-all-dev -y
RUN apt-get install libblas-dev liblapack-dev libatlas-base-dev gfortran libosmesa6-dev -y
RUN apt-get install libsdl-dev libsdl-image1.2-dev libsdl-mixer1.2-dev libsdl-ttf2.0-dev libsmpeg-dev libportmidi-dev libavformat-dev libswscale-dev -y
RUN apt-get install dpkg-dev build-essential python3-dev libjpeg-dev libtiff-dev libsdl1.2-dev libnotify-dev freeglut3 freeglut3-dev libsm-dev libgtk2.0-dev libgtk-3-dev libwebkitgtk-dev libgtk-3-dev libwebkitgtk-3.0-dev libgstreamer-plugins-base1.0-dev -y
RUN apt-get install libsdl2-dev swig cmake -y
RUN apt-get install iputils-ping mongodb-clients vim -y
RUN apt-get install -y libsm6 libxext6 libxrender-dev vim python-opencv ffmpeg tmux
RUN apt-get install -y libosmesa6-dev libgl1-mesa-glx libglfw3 wget
RUN apt-get install -y cmake libopenmpi-dev python3-dev zlib1g-dev patchelf ssh rsync
RUN apt-get install -y git

# Apt cleaning

RUN apt-get clean autoclean && \
    apt-get autoremove -y

# Set environment variables

ENV PATH $PATH:/usr/local/cuda/bin
ENV LD_LIBRARY_PATH /usr/local/cuda/lib64:/usr/local/cuda/lib:/usr/local/lib:$LD_LIBRARY_PATH

RUN apt-get update -q \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y \
    curl \
    git \
    libgl1-mesa-dev \
    libgl1-mesa-glx \
    libglew-dev \
    libosmesa6-dev \
    software-properties-common \
    net-tools \
    unzip \
    vim \
    virtualenv \
    wget \
    xpra \
    xserver-xorg-dev \
    sudo \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

RUN apt-get update -q
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y swig

RUN DEBIAN_FRONTEND=noninteractive add-apt-repository --yes ppa:deadsnakes/ppa && apt-get update
RUN DEBIAN_FRONTEND=noninteractive apt-get install --yes python3.5-dev python3.5 python3-pip
RUN virtualenv --python=python3.5 env

RUN rm /usr/bin/python
RUN ln -s /env/bin/python3.5 /usr/bin/python
RUN ln -s /env/bin/pip3.5 /usr/bin/pip
RUN ln -s /env/bin/pytest /usr/bin/pytest

RUN curl -o /usr/local/bin/patchelf https://s3-us-west-2.amazonaws.com/openai-sci-artifacts/manual-builds/patchelf_0.9_amd64.elf \
    && chmod +x /usr/local/bin/patchelf

ENV LANG C.UTF-8

ARG GID
ARG UID
ARG UNAME
RUN groupadd -g ${GID} -o ${UNAME}
RUN useradd -m -u ${UID} -g $GID -o -s /bin/bash ${UNAME}

COPY docker/vendor/Xdummy /usr/local/bin/Xdummy
RUN chmod +x /usr/local/bin/Xdummy

# Workaround for https://bugs.launchpad.net/ubuntu/+source/nvidia-graphics-drivers-375/+bug/1674677

COPY docker/vendor/10_nvidia.json /usr/share/glvnd/egl_vendor.d/10_nvidia.json

RUN pip install cloudpickle==0.5.2
RUN pip install cached-property==1.3.1

# For atari-py

RUN apt-get update -y
RUN apt-get install -y zlib1g-dev swig cmake
RUN pip install --upgrade pip
RUN pip install gym[all]==0.10.5
RUN pip install gitpython==2.1.7
RUN pip install gtimer==1.0.0b5
RUN pip install pygame
RUN pip install awscli==1.11.179
RUN pip install boto3==1.4.8
RUN pip install dominate==2.3.1
RUN pip install ray==0.2.2
RUN pip install path.py==10.3.1

RUN pip install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi
RUN pip install torch==1.3.1 torchvision
RUN pip install joblib==0.9.4

RUN apt-get update && apt-get install -y ffmpeg tmux

RUN apt-get install -y nvidia-390

RUN pip install opencv-python==3.4.0.12
RUN pip install sk-video==1.1.10
RUN pip install ipdb pyarrow gitpython objgraph pympler

USER $UNAME

Can you try running your model inside torch.no_grad()? By default, the autograd engine is on, so gradients will be computed. That is why your RAM usage keeps accumulating and eventually runs out of memory over consecutive forward passes.
Something like this:

for i in range(1000):
    with torch.no_grad():
        net(torch.randn(in_size))

I don’t believe gradient accumulation can cause the problem. It does take some extra memory to maintain the gradients and intermediate values, but this memory should be reused in the next iteration of the model.

Just to make sure, I repeated the test with your change and see that the memory usage still increases.
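
For what it’s worth, this is roughly how I check whether Python-side tensor objects are piling up between iterations (a minimal sketch; the count_live_tensors helper is my own debugging addition, not part of the original script):

import gc

import torch

def count_live_tensors():
    # Count tensors still reachable by the garbage collector. If this stays
    # flat while maxrss keeps climbing, the growth is more likely in native
    # (C/C++) allocations than in leaked Python objects.
    return sum(1 for obj in gc.get_objects() if torch.is_tensor(obj))

# Called once per outer iteration, right after gc.collect():
# print('live tensors = {}'.format(count_live_tensors()))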

Right. I have faced something similar to this, but while doing inference on GPU.
Link to my post: Gpu memory gets accumulated during consecutive forward passes
Haven’t found any solution yet.

Your issue looks similar. The only element that could cause a memory leak is the outputs variable, if you are holding on to it. In my test, I avoid storing the result of the net(...) call so that it can be freed immediately.
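
To illustrate the difference, here is a minimal sketch of the two patterns (the nn.Linear stand-in and the list name are just illustrative):

import torch
import torch.nn as nn

net = nn.Linear(100, 512)

# Holding on to every output keeps each result (and, with autograd enabled,
# its graph) reachable, so memory can grow without bound.
outputs = []
for i in range(1000):
    outputs.append(net(torch.randn(100)))

# What my test does instead: the result is never bound to a name, so it can
# be freed as soon as the call returns.
for i in range(1000):
    net(torch.randn(100))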

Also, I don’t encounter a memory problem when I use CUDA.
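
For completeness, the GPU variant of the loop looks roughly like this (a sketch, assuming a CUDA device is visible inside the container and reusing the Net class and sizes from the script above):

import torch

device = torch.device('cuda')
net = Net(in_size, h_size, out_size).to(device)

for i in range(1000):
    # Same forward pass as before, but with the model and inputs on the GPU.
    net(torch.randn(in_size, device=device))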