CUDA memory management

edowson · March 22, 2019, 6:33pm

Hi,

After training a DQN for several hundred episodes, I get a CUDA out-of-memory error.

Traceback (most recent call last):
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_ddqn.py", line 533, in <module>
    optimize_model()
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_ddqn.py", line 436, in optimize_model
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0].detach()
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/project/ros-kinetic-alphapilot/catkin_ws/src/alphapilot_openai_ros/ardrone_race_track/src/ardrone_v1_ddqn.py", line 285, in forward
    x = F.relu(self.bn2(self.conv2(x)))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 862, in relu
    result = torch.relu(input)
RuntimeError: CUDA out of memory. Tried to allocate 155.12 MiB (GPU 0; 11.75 GiB total capacity; 8.74 GiB already allocated; 163.25 MiB free; 455.51 MiB cached)

Which parts of the model, or which other variables, should I inspect to find where the leak is?

github.com

edowson/alphapilot_openai_ros/blob/master/ardrone_race_track/src/ardrone_v1_ddqn.py

# -*- coding: utf-8 -*-
"""ARDrone ddqn training script.

This script is based on PyTorch DQN example by Adam Paszke.

ref: https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html
"""

__copyright__ = "Copyright 2019, Elvis Dowson"
__license__ = "MIT"
__author__ = "Elvis Dowson <elvis.dowson@gmail.com>"

import cv2
import gym
import math
import random
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import PIL

This file has been truncated.

edowson · March 22, 2019, 6:58pm

I think I figured out why GPU memory usage keeps increasing. It is because of the replay buffer.
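
The post does not say how the replay buffer ends up holding GPU memory. One common cause, assumed here, is that every stored transition is a CUDA tensor, so the buffer pins more device memory as it fills. Below is a minimal sketch of one way to avoid that: keep stored tensors on the CPU and move only the sampled batch to the training device. The class name `CpuReplayMemory` and its method signatures are hypothetical, not taken from the linked script.

```python
import random
from collections import deque, namedtuple

import torch

Transition = namedtuple("Transition", ("state", "action", "next_state", "reward"))


class CpuReplayMemory:
    """Hypothetical replay buffer that keeps every stored tensor on the CPU."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is reached,
        # so the buffer's memory footprint is bounded.
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, next_state, reward):
        def to_cpu(t):
            # detach() drops any autograd graph; cpu() copies the data off the GPU.
            # next_state may be None for a terminal transition.
            return None if t is None else t.detach().cpu()

        self.memory.append(
            Transition(to_cpu(state), to_cpu(action), to_cpu(next_state), to_cpu(reward))
        )

    def sample(self, batch_size, device):
        def to_device(t):
            return None if t is None else t.to(device)

        # Only the sampled transitions are moved to the training device.
        batch = random.sample(list(self.memory), batch_size)
        return [Transition(*(to_device(t) for t in tr)) for tr in batch]

    def __len__(self):
        return len(self.memory)
```

With a buffer like this, the training loop would call `memory.sample(batch_size, device)` at each optimization step. GPU usage from the buffer is then limited to one batch at a time, at the cost of a host-to-device copy per step.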