Storing torch tensors for DQN: memory issue

Hello,

I am running into a RAM memory-leak problem when I save my frames as PyTorch tensors for a simple DQN implementation (inspired by link). Here is a quick example without the learning loop, trying to isolate my issue:

import resource

import argparse
import gym
import numpy as np
from itertools import count
from collections import namedtuple
import os 

import torch
import random
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.autograd as autograd
from torch.autograd import Variable
import torchvision.transforms as T
import cv2
import pickle
import glob
import time
import subprocess

# Class
class ReplayMemory(object):
    '''
    A simple class to wrap around the concept of memory
    this helps for managing how much data is used. 
    '''
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
        
    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None) 
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
    
    def __len__(self):
        return len(self.memory)

# Functions
def ProcessState(state,torchOutput=True):
    img = cv2.cvtColor(state, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (imageShape[1],imageShape[0])).astype('float32')
    if torchOutput:
        img = torch.from_numpy(img)
    img /= 255
    img -= 0.5 
    img *= 2
    return img

# Variables
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))

imageShape = (110,80)
env = gym.make('PongDeterministic-v3')
action = 0 
memory = ReplayMemory(32)

# Example with pytorch
for i_episode in range(25):
    break  # note: this skips the PyTorch loop; it was toggled between runs to produce the two outputs below
    print('Pytorch: Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    obser = env.reset()
    obser = ProcessState(obser)

    state = torch.ones((3,imageShape[0],imageShape[1]))    
    state = torch.cat((state,obser.view(1,imageShape[0],imageShape[1])),0)

    for t in range(10000): 
        obser, reward, done, _ = env.step(0)

        # process the new observation
        obser = ProcessState(obser)

        state = torch.cat((state,obser.view(1,imageShape[0],imageShape[1])),0)
        
        memory.push(state[:-1], action, state[1:], reward, done)

        state = state[1:]

        if done:
            break
# quit()
# memory = ReplayMemory(32)
# Numpy
for i_episode in range(25):
    print('Numpy: Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    obser = env.reset()
    obser = ProcessState(obser,False)

    state = np.zeros((3,imageShape[0],imageShape[1]))
    state = np.concatenate([state, obser.reshape((1,imageShape[0],imageShape[1]))])

    for t in range(10000): 
        obser, reward, done, _ = env.step(0)

        # process the new observation
        obser = ProcessState(obser,False)

        # state = torch.cat((state,obser.view(1,imageShape[0],imageShape[1])),0)
        state = np.concatenate([state, obser.reshape((1,imageShape[0],imageShape[1]))])
        
        memory.push(state[:-1], action, state[1:], reward, done)
        state = state[1:]

        if done:
            break

Here is the output I get when running the first loop (using PyTorch) vs. the second one, which stores numpy arrays.

jtremblay@office:~/code/Personal-git/dqn$ python memory_issue.py 
[2017-03-06 12:38:30,254] Making new env: PongDeterministic-v3
Pytorch: Memory usage: 113432 (kb)
Pytorch: Memory usage: 226380 (kb)
Pytorch: Memory usage: 323796 (kb)
Pytorch: Memory usage: 410124 (kb)
Pytorch: Memory usage: 490116 (kb)
Pytorch: Memory usage: 565884 (kb)
Pytorch: Memory usage: 637428 (kb)
Pytorch: Memory usage: 704220 (kb)
Pytorch: Memory usage: 760188 (kb)
Pytorch: Memory usage: 815892 (kb)
Pytorch: Memory usage: 861828 (kb)
Pytorch: Memory usage: 905388 (kb)
Pytorch: Memory usage: 938916 (kb)
Pytorch: Memory usage: 966900 (kb)
Pytorch: Memory usage: 993036 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
Pytorch: Memory usage: 1001484 (kb)
jtremblay@office:~/code/Personal-git/dqn$ python memory_issue.py 
[2017-03-06 12:39:22,433] Making new env: PongDeterministic-v3
Numpy: Memory usage: 113936 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)
Numpy: Memory usage: 130988 (kb)

As you can see, storing numpy arrays is much more stable. This may not look like much, but when I run my script with a replay size of one million frames it crashes quickly.

Should I avoid storing torch tensors? To be honest, I quite like keeping everything as torch tensors; it saves me a few torch.from_numpy calls. Is there a way to release memory used by torch? I was not able to find anything on that subject in the documentation.

I can provide more examples with learning loops if needed.
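
One simple diagnostic for this kind of question is to force a garbage-collection pass before each ru_maxrss reading, to rule out objects that are merely waiting to be collected. A minimal sketch (the log_memory helper is just an illustration):

import gc
import resource

def log_memory(tag):
    gc.collect()  # collect any unreachable objects before measuring
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('%s: Memory usage: %s (kb)' % (tag, peak))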

That’s weird, I’ll look into that tomorrow. Thanks for posting the script.

I look forward to hearing from you. Is there a way to force-delete a tensor? I also tried using a storage object instead and it did not work. Also, here is the version I am using:

import torch
torch.__version__
'0.1.9_2'

[edit] I just tested with the most recent version (‘0.1.10+ac9245a’) and I observe the same problem.

It should get freed as soon as it goes out of scope (last reference to it is gone).
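
A minimal sketch of that behaviour (assuming Linux, where ru_maxrss is reported in kilobytes): allocate a large tensor, drop the last reference, and allocate again; since the first block can be freed and reused, the peak should stay roughly flat rather than doubling:

import resource
import torch

def peak_rss_kb():
    # ru_maxrss is the peak resident set size, the same counter used in the scripts above
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

x = torch.ones(1000, 1000, 100)    # roughly 400 MB of float32
print('after first alloc :', peak_rss_kb())

del x                              # last reference gone, so the storage can be reclaimed
x = torch.ones(1000, 1000, 100)    # if the old block was freed, the allocator can reuse it
print('after second alloc:', peak_rss_kb())   # the peak should stay roughly flat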

OK, this is not what I am observing though. Am I keeping weird references to the tensors somewhere? I have “fixed” my script by storing numpy arrays, but I am losing a lot of performance: it takes about twice as long as storing torch tensors. Is there a fast way to copy a numpy array into an existing tensor of the same size, e.g. avoiding torch.from_numpy()?
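
For reference, one way to refill a preallocated tensor of the same size in place is copy_ (just a sketch; and as the reply below notes, from_numpy itself is essentially free):

import numpy as np
import torch

buf = torch.zeros(110, 80)                      # preallocated destination tensor
frame = np.random.rand(110, 80).astype('float32')

buf.copy_(torch.from_numpy(frame))              # in-place copy into buf's existing storage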

torch.from_numpy should be very fast - it only allocates a new torch.Tensor object that reuses the same memory as the numpy array. It’s nearly free :confused: I’m surprised that it makes any difference for you.
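
A quick way to see the sharing (minimal sketch): modifying the numpy array is visible through the tensor, because they use the same storage:

import numpy as np
import torch

arr = np.zeros((2, 3), dtype=np.float32)
t = torch.from_numpy(arr)    # wraps the same underlying buffer, no copy made

arr[0, 0] = 42.0
print(t[0, 0])               # the tensor sees the change, since storage is shared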

I have done more memory tests and you are right: the tensor gets cleared from memory when there is no more reference to it. I am now trying to figure out where references are kept in the script I shared. Using clone wherever I could helped reduce the footprint, but there is still a leak. Here is the updated code.

import resource

import argparse
import gym
import numpy as np
from itertools import count
from collections import namedtuple
import os 

import torch
import random
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.autograd as autograd
from torch.autograd import Variable
import torchvision.transforms as T
import cv2
import pickle
import glob
import time
import subprocess

# Class
class ReplayMemory(object):
    '''
    A simple class to wrap around the concept of memory
    this helps for managing how much data is used. 
    '''
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = []
        self.position = 0
        
    def push(self, *args):
        """Saves a transition."""
        if len(self.memory) < self.capacity:
            self.memory.append(None) 
        self.memory[self.position] = Transition(*args)
        self.position = (self.position + 1) % self.capacity
        
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
    
    def __len__(self):
        return len(self.memory)

# Functions
def ProcessState(state,torchOutput=True):
    img = cv2.cvtColor(state, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (imageShape[1],imageShape[0])).astype('float32')
    if torchOutput:
        img = torch.from_numpy(img)
    img /= 255
    img -= 0.5 
    img *= 2
    return img

# Variables
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))

imageShape = (110,80)
env = gym.make('PongDeterministic-v3')
action = 0 
memory = []
reward = 0
done = False 
# Example with pytorch
for i_episode in range(25):
    # break
    print ('Pytorch: Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    obser = env.reset()
    obser = ProcessState(obser).clone()

    state = torch.ones((3,imageShape[0],imageShape[1])).clone()    
    state = torch.cat((state.clone(),obser.view(1,imageShape[0],imageShape[1])),0).clone()

    for t in range(10000): 
        obser, reward, done, _ = env.step(0)

        # process the new observation
        obser = ProcessState(obser).clone()

        state = torch.cat((state.clone(),obser.view(1,imageShape[0],imageShape[1])),0).clone()
        
        memory.append({'state':state[:-1].clone(), 'action': action, 'state1':state[1:].clone(), 
            'reward':reward, 'done':done})
        if len(memory) > 32:
            memory = memory[1:]

        state = state[1:].clone()

        if done:
            break
# quit()
# memory = ReplayMemory(32)
# Numpy
for i_episode in range(25):
    print ('Numpy: Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    obser = env.reset()
    obser = ProcessState(obser,False)

    state = np.zeros((3,imageShape[0],imageShape[1]))
    state = np.concatenate([state, obser.reshape((1,imageShape[0],imageShape[1]))])

    for t in range(10000): 
        obser, reward, done, _ = env.step(0)

        # process the new observation
        obser = ProcessState(obser,False)

        # state = torch.cat((state,obser.view(1,imageShape[0],imageShape[1])),0)
        state = np.concatenate([state, obser.reshape((1,imageShape[0],imageShape[1]))])
        
        memory.append({'state':state[:-1], 'action': action, 'state1':state[1:], 
            'reward':reward, 'done':done})
        if len(memory) > 32:
            memory = memory[1:]
        state = state[1:]

        if done:
            break

Here is my output:

(py3) jtremblay@office:~/code/Personal-git/dqn$ python memory_issue.py 
[2017-03-07 12:19:41,554] Making new env: PongDeterministic-v3
Pytorch: Memory usage: 115628 (kb)
Pytorch: Memory usage: 131180 (kb)
Pytorch: Memory usage: 132500 (kb)
Pytorch: Memory usage: 133556 (kb)
Pytorch: Memory usage: 135932 (kb)
Pytorch: Memory usage: 137252 (kb)
Pytorch: Memory usage: 137780 (kb)
Pytorch: Memory usage: 138308 (kb)
Pytorch: Memory usage: 139364 (kb)
Pytorch: Memory usage: 139892 (kb)
Pytorch: Memory usage: 140420 (kb)
Pytorch: Memory usage: 140684 (kb)
Pytorch: Memory usage: 141476 (kb)
Pytorch: Memory usage: 141476 (kb)
Pytorch: Memory usage: 142004 (kb)
Pytorch: Memory usage: 142268 (kb)
Pytorch: Memory usage: 143060 (kb)
Pytorch: Memory usage: 143588 (kb)
Pytorch: Memory usage: 143852 (kb)
Pytorch: Memory usage: 143852 (kb)
Pytorch: Memory usage: 144116 (kb)
Pytorch: Memory usage: 144380 (kb)
Pytorch: Memory usage: 144380 (kb)
Pytorch: Memory usage: 144644 (kb)
Pytorch: Memory usage: 144644 (kb)
Numpy: Memory usage: 144908 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)
Numpy: Memory usage: 154932 (kb)

I have also tried to create a simpler script that replicates the behaviour:

import torch
import resource
import numpy as np

import argparse
parser = argparse.ArgumentParser(description='Memory issue')
parser.add_argument('--numpy',    action='store_true')
args = parser.parse_args()


a = [None for _ in range(10)]
# print (a)
j = 0

if args.numpy:
    state = np.ones((3,200,200))
    state = np.concatenate([state,np.random.rand(1,200,200)],0)
else:
    state = torch.ones((3,200,200))
    state = torch.cat((state,torch.rand(1,200,200)),0)

for i in range(5000):
    if i % 400 == 0:
        if args.numpy:
            print ('Numpy Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
        else:
            print ('Torch Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    if j > 8:
        j = -1
    j+=1

    if args.numpy:
        state = np.concatenate([state,np.random.rand(1,200,200)],0)
    else:
        state = torch.cat((state,torch.rand(1,200,200)),0)
    a[j] = state[0:3]
    state = state[1:]

Here is the output:

(py3) jtremblay@office:~/code/Personal-git/dqn$ python memory_simple.py
Torch Memory usage: 82428 (kb)
Torch Memory usage: 95536 (kb)
Torch Memory usage: 96024 (kb)
Torch Memory usage: 96024 (kb)
Torch Memory usage: 96024 (kb)
Torch Memory usage: 96088 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96660 (kb)
Torch Memory usage: 96808 (kb)
Torch Memory usage: 96808 (kb)
Torch Memory usage: 96808 (kb)
Torch Memory usage: 96808 (kb)
Torch Memory usage: 96808 (kb)
Torch Memory usage: 96808 (kb)
(py3) jtremblay@office:~/code/Personal-git/dqn$ python memory_simple.py --numpy
Numpy Memory usage: 81132 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)
Numpy Memory usage: 96064 (kb)

I am not sure whether this helps or not. As for the time taken to store numpy arrays and then convert them into torch tensors, I will run more experiments and report here.

I ran a quick experiment with 4 data points in total, and I get better performance when storing torch tensors rather than numpy arrays. The experiment consisted of running the Q-network for 20 episodes in a deterministic environment:

Torch: 2 min and 2 min 12 sec.
Numpy: 3 min 34 sec. and 3 min 26 sec.

The only difference in the code is converting the numpy arrays into torch tensors in order to run inference and the learning process. I hope this helps.

Wow, that’s weird. I just ran your test script and I can’t reproduce the issue. If I increase the number of loop iterations the memory usage stabilizes at 93420KB for me (and 94440KB for numpy). Maybe it only happens in Python 2; I’ll need to try.

Nope, can’t reproduce. On Python 2.7 it takes a bit longer to stabilize, but it stops at 98668KB.

I think this is just the kernel or the default memory allocator being smart / doing some caching. Some allocators/kernels do this, and I’ve seen it in other settings (unrelated to this).

Yes, the worrying part for me was that @jtremblay mentioned that it went OOM, so I thought it might be a leak and not just an allocator strategy. But I can’t reproduce it in any way.

Thank you for running the test. The smaller snippet of code also stabilizes for bigger loops on my machine (96876 kb). But the longer snippet (the one which uses the gym environment) does not stabilize. I am not sure why.

I think I might have been wrong: I ran a very long experiment and after a while the memory usage stabilizes. I am sorry about this.

I decided to run a longer test with the script storing states coming from the gym environment:

import resource

import argparse
import gym
import numpy as np
from itertools import count
from collections import namedtuple
import os 

import torch
import random
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.autograd as autograd
from torch.autograd import Variable
import torchvision.transforms as T
import cv2
import pickle
import glob
import time
import subprocess

# Functions
def ProcessState(state,torchOutput=True):
    img = cv2.cvtColor(state, cv2.COLOR_BGR2GRAY)
    img = cv2.resize(img, (imageShape[1],imageShape[0])).astype('float32')
    if torchOutput:
        img = torch.from_numpy(img)
    img /= 255
    img -= 0.5 
    img *= 2
    return img

# Variables
Transition = namedtuple('Transition', ('state', 'action', 'next_state', 'reward', 'done'))

imageShape = (110,80)
env = gym.make('PongDeterministic-v3')
action = 0 
memory = []
reward = 0
done = False 
# Example with pytorch
for i_episode in range(5000):
    if i_episode % 500 == 0:
        print (str(i_episode)+' Pytorch: Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    
    obser = env.reset()
    obser = ProcessState(obser).clone()

    state = torch.ones((3,imageShape[0],imageShape[1])).clone()    
    state = torch.cat((state.clone(),obser.view(1,imageShape[0],imageShape[1])),0).clone()

    for t in range(10000): 
        obser, reward, done, _ = env.step(0)

        # process the new observation
        obser = ProcessState(obser).clone()

        state = torch.cat((state.clone(),obser.view(1,imageShape[0],imageShape[1])),0).clone()
        
        memory.append({'state':state[:-1].clone(), 'action': action, 'state1':state[1:].clone(), 
            'reward':reward, 'done':done})
        if len(memory) > 32:
            memory = memory[1:]

        state = state[1:].clone()

        if done:
            break
# quit()
# memory = ReplayMemory(32)
# Numpy
for i_episode in range(50000):
    if i_episode % 500 == 0:
        print (str(i_episode)+' Numpy: Memory usage: %s (kb)' % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    obser = env.reset()
    obser = ProcessState(obser,False)

    state = np.zeros((3,imageShape[0],imageShape[1]))
    state = np.concatenate([state, obser.reshape((1,imageShape[0],imageShape[1]))])

    for t in range(10000): 
        obser, reward, done, _ = env.step(0)

        # process the new observation
        obser = ProcessState(obser,False)

        # state = torch.cat((state,obser.view(1,imageShape[0],imageShape[1])),0)
        state = np.concatenate([state, obser.reshape((1,imageShape[0],imageShape[1]))])
        
        memory.append({'state':state[:-1], 'action': action, 'state1':state[1:], 
            'reward':reward, 'done':done})
        if len(memory) > 32:
            memory = memory[1:]
        state = state[1:]

        if done:
            break

I am still getting issues when storing pytorch tensors. For some reason a reference is kept to some tensors and they do not get cleaned out of memory. Here is the output I got (the pytorch test ran for 6 hours). I stopped the numpy run after 3000 episodes as it showed stability.

(py3) jtremblay@office:~/code/Personal-git/dqn$ python memory_issue.py 
[2017-03-08 09:23:04,502] Making new env: PongDeterministic-v3
0 Pytorch: Memory usage: 116916 (kb)
500 Pytorch: Memory usage: 159720 (kb)
1000 Pytorch: Memory usage: 172392 (kb)
1500 Pytorch: Memory usage: 188232 (kb)
2000 Pytorch: Memory usage: 204864 (kb)
2500 Pytorch: Memory usage: 221232 (kb)
3000 Pytorch: Memory usage: 236540 (kb)
3500 Pytorch: Memory usage: 252908 (kb)
4000 Pytorch: Memory usage: 268744 (kb)
4500 Pytorch: Memory usage: 282472 (kb)
(py3) jtremblay@office:~/code/Personal-git/dqn$ python memory_issue.py 
[2017-03-08 11:30:58,943] Making new env: PongDeterministic-v3
0 Numpy: Memory usage: 116532 (kb)
500 Numpy: Memory usage: 129508 (kb)
1000 Numpy: Memory usage: 129508 (kb)
1500 Numpy: Memory usage: 129508 (kb)
2000 Numpy: Memory usage: 129508 (kb)
2500 Numpy: Memory usage: 129508 (kb)
3000 Numpy: Memory usage: 129508 (kb)

There might be something I do not understand about manipulating the tensors; in this context I am using torch.cat, clone and from_numpy. Using the same code with numpy arrays does not create any memory instability. I thought that using clone everywhere I could would force any reference to previous tensors to be freed. If I do not use clone, this is the memory usage I get after 500 episodes:

(py3) jtremblay@office:~/code/Personal-git/dqn$ python memory_issue.py 
[2017-03-08 12:51:20,207] Making new env: PongDeterministic-v3
0 Pytorch: Memory usage: 115008 (kb)
500 Pytorch: Memory usage: 492140 (kb)

Without the clone calls, the memory usage is much larger. Also, I never invoke the copy function in the numpy part. I am extremely confused by this behaviour.
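
For what it’s worth, general slicing behaviour may explain part of why clone changes the footprint (independently of any bug): a slice such as state[:-1] is a view that keeps the whole underlying storage alive, while .clone() copies only the sliced elements into fresh storage. A minimal sketch:

import torch

big = torch.rand(4, 110, 80)

view = big[:-1]           # a view: shares big's storage, so all 4 frames stay alive
copy = big[:-1].clone()   # fresh storage holding only the 3 sliced frames

print(view.storage().size())   # 4 * 110 * 80 = 35200, the whole original buffer
print(copy.storage().size())   # 3 * 110 * 80 = 26400, only the copied part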

Ok, I’ve found and fixed the problem! Thanks for the report, the patch should be in master soon.

That is great :stuck_out_tongue: Thank you for double-checking everything!

Quick question @apaszke and @jtremblay:
Did you implement DQN with the same memory configuration as stated in the original paper (1 million transitions)?

I am currently facing a similar issue, where Python starts to use up gigabytes of memory in an end-to-end learning setting and the Python process gets killed at around 114 thousand transitions in memory.
I’m working on a MacBook Pro with 16 GB of RAM.
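
As a rough back-of-envelope estimate (assuming the original paper’s setup of 84x84 uint8 frames stacked 4 deep, with state and next state stored separately), a naive 1-million-transition buffer is far larger than 16 GB:

frames_per_state = 4
bytes_per_frame  = 84 * 84            # uint8, one byte per pixel
transitions      = 10**6

naive_bytes = transitions * 2 * frames_per_state * bytes_per_frame
print(naive_bytes / 1e9, 'GB')        # about 56 GB, far beyond 16 GB of RAM

Stored as float32 instead of uint8 it would be roughly four times larger, which is why uint8 storage and frame sharing matter so much.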

@denizs
You can use this approach from the OpenAI baselines implementation, which greatly reduces memory requirements:
https://github.com/openai/baselines/blob/master/baselines/common/atari_wrappers_deprecated.py#L152
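
Roughly, the idea in that wrapper (paraphrased, not the exact baselines code) is to keep references to the individual frames and only stack them into an array when a transition is actually sampled, so consecutive states that share 3 of their 4 frames do not duplicate memory. A minimal sketch:

import numpy as np

class LazyFrames(object):
    '''Hold references to individual frames; stack them only on access.'''
    def __init__(self, frames):
        self._frames = frames                     # list of (1, H, W) uint8 arrays

    def __array__(self, dtype=None):
        out = np.concatenate(self._frames, axis=0)
        if dtype is not None:
            out = out.astype(dtype)
        return out

# usage sketch: consecutive states share 3 of their 4 frame references
frames = [np.zeros((1, 84, 84), dtype=np.uint8) for _ in range(5)]
state = LazyFrames(frames[0:4])
next_state = LazyFrames(frames[1:5])
batch = np.asarray(state)                         # materialized only here, at sample time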

Awesome, thanks :slight_smile: