CPU Memory Leak

Hello,

I’m trying to experiment with different configurations of the A3C code posted on GitHub at the following link:
https://github.com/MorvanZhou/pytorch-A3C

All my tests concern the script “discrete_A3C.py”:
https://github.com/MorvanZhou/pytorch-A3C/blob/master/discrete_A3C.py

Python Version: 3.6.9
Torch Version: 1.4.0

If I keep everything as in the original code, the memory usage is stable and does not increase over time. This can be seen by plotting the memory usage graph with the command:

mprof run --multiprocess discrete_A3C.py
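
For reference, the same trend can also be logged from inside each worker without mprof, e.g. with psutil (an extra dependency, not used by the repo; this is only a diagnostic sketch):

    import os
    import psutil  # assumed installed separately: pip install psutil

    _process = psutil.Process(os.getpid())

    def log_rss(step):
        # Resident set size of the current worker process, in MiB.
        rss_mib = _process.memory_info().rss / (1024 ** 2)
        print("step %d: RSS = %.1f MiB" % (step, rss_mib))

Calling this periodically from the worker loop gives per-process numbers comparable to the mprof curves.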

However, I tried changing the NN architecture to the following:

    def __init__(self, s_dim, a_dim):
        super(Net, self).__init__()
        self.s_dim = s_dim
        self.a_dim = a_dim

        self.fc1 = nn.Linear(s_dim, 128)
        self.fc2 = nn.Linear(128, 128)

        self.pi1 = nn.Linear(128, 64)
        self.pi2 = nn.Linear(64, a_dim)  # policy logits for each action

        self.v1 = nn.Linear(128, 128)
        self.v2 = nn.Linear(128, 1)
        set_init([self.pi1, self.pi2, self.fc1, self.fc2, self.v1, self.v2])
        self.distribution = torch.distributions.Categorical

    def forward(self, x):
        x = F.relu6(self.fc1(x))
        x = F.relu6(self.fc2(x))

        pi1 = F.relu6(self.pi1(x))
        logits = self.pi2(pi1)

        v1 = F.relu6(self.v1(x))
        values = self.v2(v1)
        return logits, values
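
(For context, a quick shape check of the modified forward pass; this assumes the Net class above is in scope, e.g. run from the same file, and uses the CartPole-v0 dimensions s_dim=4, a_dim=2:)

    import torch

    net = Net(s_dim=4, a_dim=2)               # CartPole-v0: 4 state dims, 2 actions
    logits, values = net(torch.randn(8, 4))   # batch of 8 fake states
    print(logits.shape, values.shape)         # torch.Size([8, 2]) torch.Size([8, 1])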

Re-running the same command as before, I now see a linear increase in memory usage over time.

Please see the image below: the figure on the left is from running the original code, and the figure on the right is from the modified architecture.

I have tried to debug this myself over the past week but cannot come up with any clues. Any insights would be greatly appreciated.

Hi,

This looks very similar to https://github.com/pytorch/pytorch/issues/32284.
Does it actually run out of memory (i.e. your process gets killed by the OS)? Or does it stabilize when it gets to full memory?

Yes, it does run out of memory eventually.

I read the linked issue, and it does seem similar. However, in my case I’m not using any MaxPool2d, and the leak appears after only an architecture change, without using any new functions.

Interesting.
Could you give both architectures side by side?
Also, does running import gc; gc.collect() help?
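
For example, a common diagnostic is to combine gc.collect() with a count of the tensors the garbage collector can still see (a sketch, not code from the repo):

    import gc
    import torch

    def count_live_tensors():
        # Force a collection, then count tensor objects still tracked by the GC.
        gc.collect()
        n = 0
        for obj in gc.get_objects():
            try:
                if torch.is_tensor(obj):
                    n += 1
            except Exception:  # some objects raise on inspection
                pass
        return n

If this count stays flat while the RSS keeps growing, the growth is happening below the Python object level, and gc.collect() will not reclaim it.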

I tried to use gc.collect() but it does not help at all.

Below are the two architectures side by side; this is the only modification to the whole code. The one on the left is the original, where the problem does not exist, and the one on the right is the modified version, which causes the issue:

What is F.relu6 here?

Otherwise this should not change anything.

F.relu6 is torch.nn.functional.relu6

documentation can be found here:
https://pytorch.org/docs/stable/nn.functional.html
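
Concretely, relu6(x) is min(max(0, x), 6), i.e. ReLU clamped at 6:

    import torch
    import torch.nn.functional as F

    x = torch.tensor([-2.0, 0.5, 3.0, 9.0])
    print(F.relu6(x))  # tensor([0.0000, 0.5000, 3.0000, 6.0000])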

Ha, I didn’t know this was a thing :open_mouth:

Trying locally with these two nets does not lead to any issue.
Do you have a small (40/50 lines) code sample I can run locally that will reproduce this?
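
In the meantime, here is a gym-free sketch in the spirit of that request: it drives the modified architecture with random states and actions and mimics the repo’s pull/push update across worker processes. It is only an approximation of discrete_A3C.py (plain Adam instead of SharedAdam, no environment, hypothetical dimensions 4/2), and I have not confirmed it shows the same growth:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.multiprocessing as mp


    class Net(nn.Module):
        def __init__(self, s_dim=4, a_dim=2):
            super(Net, self).__init__()
            self.fc1 = nn.Linear(s_dim, 128)
            self.fc2 = nn.Linear(128, 128)
            self.pi1 = nn.Linear(128, 64)
            self.pi2 = nn.Linear(64, a_dim)
            self.v1 = nn.Linear(128, 128)
            self.v2 = nn.Linear(128, 1)

        def forward(self, x):
            x = F.relu6(self.fc1(x))
            x = F.relu6(self.fc2(x))
            logits = self.pi2(F.relu6(self.pi1(x)))
            values = self.v2(F.relu6(self.v1(x)))
            return logits, values


    def worker(gnet):
        lnet = Net()
        opt = torch.optim.Adam(gnet.parameters(), lr=1e-4)
        for step in range(200000):
            lnet.load_state_dict(gnet.state_dict())        # pull global weights
            s = torch.randn(32, 4)                         # fake states
            a = torch.randint(0, 2, (32,))                 # fake actions
            logits, values = lnet(s)
            td = (torch.randn(32, 1) - values).squeeze(1)  # fake TD error
            c_loss = td.pow(2)
            m = torch.distributions.Categorical(F.softmax(logits, dim=1))
            a_loss = -m.log_prob(a) * td.detach()
            loss = (c_loss + a_loss).mean()
            opt.zero_grad()
            loss.backward()
            for lp, gp in zip(lnet.parameters(), gnet.parameters()):
                gp._grad = lp.grad                         # push local grads
            opt.step()


    if __name__ == "__main__":
        gnet = Net()
        gnet.share_memory()
        workers = [mp.Process(target=worker, args=(gnet,)) for _ in range(4)]
        [w.start() for w in workers]
        [w.join() for w in workers]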

Are you using CentOS? I did a quick test on CentOS and there seems to be no issue there, but the issue happens on Ubuntu 18.04.

You can find the whole code in the GitHub repo linked in the description; just change the two functions I wrote above and launch the script discrete_A3C.py.

Yes, I checked on CentOS.

You can find the whole code in the GitHub repo

Unfortunately I cannot install gym and run the multiprocess training here locally (due to some restrictions on the installs on the machine I use) :confused:

OK, I will double-check whether the issue can be reproduced on CentOS and get back to you.
I also found some interesting results after some tinkering that need confirmation.

I will be back tomorrow. Thanks, mate.

The issue does exist on CentOS 8.

Can you retry the test on your side with the following packages installed?

cloudpickle==1.2.2
cycler==0.10.0
future==0.18.2
gym==0.15.4
joblib==0.14.1
kiwisolver==1.1.0
matplotlib==3.1.2
numpy==1.18.1
opencv-python==4.1.2.30
pandas==1.0.0
Pillow==7.0.0
pkg-resources==0.0.0
pyglet==1.3.2
pyparsing==2.4.6
python-dateutil==2.8.1
pytz==2019.3
scikit-learn==0.22.1
scipy==1.4.1
six==1.14.0
sklearn==0.0
torch==1.4.0
torchvision==0.5.0
xlrd==1.2.0

I suspect it may be something in my environment that is causing this issue.
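
For reference, a quick way to confirm that the key versions match on both machines, run from the same interpreter that launches the script:

    import platform
    import pkg_resources

    print("python", platform.python_version())
    for pkg in ("numpy", "gym", "torch", "torchvision"):
        # Report the installed version of each relevant package.
        print(pkg, pkg_resources.get_distribution(pkg).version)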