CPU Memory Leak

Hello,

I’m trying to experiment with different configurations of the A3C code posted on GitHub at the following link:
https://github.com/MorvanZhou/pytorch-A3C

All my tests concern the script “discrete_A3C.py”:
https://github.com/MorvanZhou/pytorch-A3C/blob/master/discrete_A3C.py

Python Version: 3.6.9
Torch Version: 1.4.0

If I keep everything as in the original code, the memory usage is stable and does not increase over time. This can be seen by plotting the memory usage graph with the command:

mprof run --multiprocess discrete_A3C.py
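
For reference, the same trend can also be logged from inside each worker without mprof, e.g. with psutil (an extra dependency, not used by the repo; this is only a diagnostic sketch):

    import os
    import psutil  # assumed installed separately: pip install psutil

    _process = psutil.Process(os.getpid())

    def log_rss(step):
        # Resident set size of the current worker process, in MiB.
        rss_mib = _process.memory_info().rss / (1024 ** 2)
        print("step %d: RSS = %.1f MiB" % (step, rss_mib))

Calling this periodically from the worker loop gives per-process numbers comparable to the mprof curves.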

However, I tried changing the NN architecture to the following:

    def __init__(self, s_dim, a_dim):
        super(Net, self).__init__()
        self.s_dim = s_dim
        self.a_dim = a_dim

        self.fc1 = nn.Linear(s_dim, 128)
        self.fc2 = nn.Linear(128, 128)

        self.pi1 = nn.Linear(128, 64)
        self.pi2 = nn.Linear(64, a_dim)  # policy logits for each action

        self.v1 = nn.Linear(128, 128)
        self.v2 = nn.Linear(128, 1)
        set_init([self.pi1, self.pi2, self.fc1, self.fc2, self.v1, self.v2])
        self.distribution = torch.distributions.Categorical

    def forward(self, x):
        x = F.relu6(self.fc1(x))
        x = F.relu6(self.fc2(x))

        pi1 = F.relu6(self.pi1(x))
        logits = self.pi2(pi1)

        v1 = F.relu6(self.v1(x))
        values = self.v2(v1)
        return logits, values
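
(For context, a quick shape check of the modified forward pass; this assumes the Net class above is in scope, e.g. run from the same file, and uses the CartPole-v0 dimensions s_dim=4, a_dim=2:)

    import torch

    net = Net(s_dim=4, a_dim=2)               # CartPole-v0: 4 state dims, 2 actions
    logits, values = net(torch.randn(8, 4))   # batch of 8 fake states
    print(logits.shape, values.shape)         # torch.Size([8, 2]) torch.Size([8, 1])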

Re-running the same command as before, I now see a linear increase in memory usage over time.

Please see the image below: the figure on the left is from running the original code, and the figure on the right is from the modified architecture.

I have tried to debug this myself over the past week but cannot come up with any clues. Any insights would be greatly appreciated.

Hi,

This looks very similar to https://github.com/pytorch/pytorch/issues/32284.
Does it actually run out of memory (i.e. your process gets killed by the OS)? Or does it stabilize when it gets to full memory?

Yes, it does run out of memory eventually.

I read the linked issue, and it does seem similar. However, in my case I’m not using any MaxPool2d, and the leak appears after only an architecture change, without using any new functions.

Interesting.
Could you give both architectures side by side?
Also, does running import gc; gc.collect() help?
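
For example, a common diagnostic is to combine gc.collect() with a count of the tensors the garbage collector can still see (a sketch, not code from the repo):

    import gc
    import torch

    def count_live_tensors():
        # Force a collection, then count tensor objects still tracked by the GC.
        gc.collect()
        n = 0
        for obj in gc.get_objects():
            try:
                if torch.is_tensor(obj):
                    n += 1
            except Exception:  # some objects raise on inspection
                pass
        return n

If this count stays flat while the RSS keeps growing, the growth is happening below the Python object level, and gc.collect() will not reclaim it.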

I tried to use gc.collect() but it does not help at all.

Below are the two architectures side by side; this is the only modification to the whole code. The one on the left is the original, where the problem does not exist, and the one on the right is the modified version, which causes the issue:

What is F.relu6 here?

Otherwise this should not change anything.

F.relu6 is torch.nn.functional.relu6

documentation can be found here:
https://pytorch.org/docs/stable/nn.functional.html
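
Concretely, relu6(x) is min(max(0, x), 6), i.e. ReLU clamped at 6:

    import torch
    import torch.nn.functional as F

    x = torch.tensor([-2.0, 0.5, 3.0, 9.0])
    print(F.relu6(x))  # tensor([0.0000, 0.5000, 3.0000, 6.0000])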

Ha, I didn’t know this was a thing :open_mouth:

Trying locally with these two nets does not lead to any issue.
Do you have a small (40/50 lines) code sample I can run locally that will reproduce this?
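
In the meantime, here is a gym-free sketch in the spirit of that request: it drives the modified architecture with random states and actions and mimics the repo’s pull/push update across worker processes. It is only an approximation of discrete_A3C.py (plain Adam instead of SharedAdam, no environment, hypothetical dimensions 4/2), and I have not confirmed it shows the same growth:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.multiprocessing as mp


    class Net(nn.Module):
        def __init__(self, s_dim=4, a_dim=2):
            super(Net, self).__init__()
            self.fc1 = nn.Linear(s_dim, 128)
            self.fc2 = nn.Linear(128, 128)
            self.pi1 = nn.Linear(128, 64)
            self.pi2 = nn.Linear(64, a_dim)
            self.v1 = nn.Linear(128, 128)
            self.v2 = nn.Linear(128, 1)

        def forward(self, x):
            x = F.relu6(self.fc1(x))
            x = F.relu6(self.fc2(x))
            logits = self.pi2(F.relu6(self.pi1(x)))
            values = self.v2(F.relu6(self.v1(x)))
            return logits, values


    def worker(gnet):
        lnet = Net()
        opt = torch.optim.Adam(gnet.parameters(), lr=1e-4)
        for step in range(200000):
            lnet.load_state_dict(gnet.state_dict())        # pull global weights
            s = torch.randn(32, 4)                         # fake states
            a = torch.randint(0, 2, (32,))                 # fake actions
            logits, values = lnet(s)
            td = (torch.randn(32, 1) - values).squeeze(1)  # fake TD error
            c_loss = td.pow(2)
            m = torch.distributions.Categorical(F.softmax(logits, dim=1))
            a_loss = -m.log_prob(a) * td.detach()
            loss = (c_loss + a_loss).mean()
            opt.zero_grad()
            loss.backward()
            for lp, gp in zip(lnet.parameters(), gnet.parameters()):
                gp._grad = lp.grad                         # push local grads
            opt.step()


    if __name__ == "__main__":
        gnet = Net()
        gnet.share_memory()
        workers = [mp.Process(target=worker, args=(gnet,)) for _ in range(4)]
        [w.start() for w in workers]
        [w.join() for w in workers]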

Are you using CentOS? I did a quick test on CentOS and there seems to be no issue there, but the issue happens on Ubuntu 18.04.

You can find the whole code in the GitHub repo linked in the description; just change the two functions I wrote above and launch the script discrete_A3C.py.

Yes, I checked on CentOS.

You can find the whole code in the GitHub repo

Unfortunately I cannot install gym and run the multiprocess training here locally (due to some restrictions on the installs on the machine I use) :confused:

OK, I will double-check whether the issue can be reproduced on CentOS and get back to you.
I also found some interesting results after some tinkering that need confirmation.

I will be back tomorrow. Thanks, mate.

The issue does exist on CentOS 8.

Can you retry the test on your side with the following packages installed?

cloudpickle==1.2.2
cycler==0.10.0
future==0.18.2
gym==0.15.4
joblib==0.14.1
kiwisolver==1.1.0
matplotlib==3.1.2
numpy==1.18.1
opencv-python==4.1.2.30
pandas==1.0.0
Pillow==7.0.0
pkg-resources==0.0.0
pyglet==1.3.2
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
scikit-learn==0.22.1
scipy==1.4.1
six==1.14.0
sklearn==0.0
torch==1.4.0
torchvision==0.5.0
xlrd==1.2.0

I suspect it may be something in my environment that is causing this issue.
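
For reference, a quick way to confirm that the key versions match on both machines, run from the same interpreter that launches the script:

    import platform
    import pkg_resources

    print("python", platform.python_version())
    for pkg in ("numpy", "gym", "torch", "torchvision"):
        # Report the installed version of each relevant package.
        print(pkg, pkg_resources.get_distribution(pkg).version)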