Low-level error running my Deep Q-Learning algorithm

I am running my Deep Q-Learning algorithm on FloydHub, and when the algorithm runs I get the following error:

2018-01-22 20:27:42,694 INFO - *** Error in 'python': free(): invalid next size (fast): 0x00007f3794370de0 ***
2018-01-22 20:27:42,704 INFO - ======= Backtrace: =========
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f52a3f247e5]
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f52a3f2ce0a]
2018-01-22 20:27:42,705 INFO - /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f52a3f3098c]
2018-01-22 20:27:42,705 INFO - /usr/lib/x86_64-linux-gnu/libcudnn.so.5(cudnnDestroyConvolutionDescriptor+0x9)[0x7f52839fcff9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(+0x2deda9)[0x7f526d129da9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f526e07f704]

This is a rented machine, so I can’t debug anything on it. It also seems to be a C-level problem, so it isn’t caught by the Python API, and I don’t know exactly what in my code is triggering the error.

Any suggestions?

GitHub link:

FloydHub link:

It definitely sounds like something went wrong…

Does the code run on a regular machine? Are you using the GPU?

Yes, it does. If you remove all the .cuda() calls, it works on the CPU.

The code I uploaded to GitHub is the GPU-only version, and that is the one producing the error I mentioned.
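
To switch between the CPU and GPU runs without deleting the .cuda() calls by hand, one option is to route them through a single flag. This is only a sketch: `USE_GPU` and `maybe_cuda` are hypothetical names, not something from the repository or from PyTorch itself.

```python
# Hypothetical helper: one switch decides whether objects are moved to the GPU.
try:
    import torch
    USE_GPU = torch.cuda.is_available()
except ImportError:
    # PyTorch is not installed in this interpreter; stay on the CPU.
    torch = None
    USE_GPU = False


def maybe_cuda(obj, use_gpu=None):
    """Return obj.cuda() when the GPU is enabled, otherwise obj unchanged."""
    if use_gpu is None:
        use_gpu = USE_GPU
    return obj.cuda() if use_gpu else obj
```

With this, something like `model = maybe_cuda(model)` replaces `model = model.cuda()`, and setting `USE_GPU = False` gives the CPU-only run for comparison.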

Does it look more like a PyTorch problem?

If I had to guess, it looks more like a PyTorch problem. I’m having a little trouble running your code, but I’ll try again later today and report back.

One thing you could try in the meantime is building PyTorch from source and running the code on a GPU with that build. There have been a couple of fixes for memory errors, and this could be one of them.
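
Since the crash happens in native code and never surfaces as a Python exception, it may also help to enable `faulthandler`, which dumps the Python traceback when the process receives a fatal signal such as SIGABRT (the signal glibc raises after a `free()` check fails). The sketch below assumes Python 3, where `faulthandler` is in the standard library; on Python 2.7 a backport of the same name should be installable from PyPI. The function name and log path are made up for illustration.

```python
import faulthandler


def enable_crash_traceback(path):
    """Write Python tracebacks to `path` if the process dies on a fatal signal.

    Returns the open file object. The caller must keep it open, because
    faulthandler writes to its file descriptor at crash time.
    """
    log_file = open(path, "w")
    # Installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Call this before importing torch or starting the training loop.
crash_log = enable_crash_traceback("crash_traceback.log")
```

After a crash, the log file should show which Python line was executing when the native code aborted, which narrows down which layer or call is involved.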


It definitely is a PyTorch error.

I was installing PyTorch 0.3.0.post4 for Python 2.7 with CUDA 8 from its public .whl file, and that version gave me the error. Link: http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl

So I decided to downgrade to an earlier version: 0.2.0.post3 for the same Python and CUDA versions.
Link: http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl
This time it ran without the error, so I’d guess it is a bug in the latest public release.

I was able to run my DQL algorithm successfully:

How do I run your code? I downloaded it from your GitHub repository, but running python main.py gives:

(py27) [rzou@devgpu226.prn2 ~/deep-q-learning] python main.py
Loading Memory data into Replay Memory Instance...
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    load_existing=True, data_dir=FLAGS.in_dir)
  File "/home/rzou/deep-q-learning/utils/models.py", line 28, in __init__
  File "/home/rzou/deep-q-learning/utils/models.py", line 111, in load
    self.__init__(self.memory_size, load_existing=False)
TypeError: __init__() takes at least 4 arguments (3 given)

It is fixed now.

You can run python main.py now and it should work.

Let me know how it goes.

I’m running it with PyTorch 0.3.0_post4 and it hasn’t errored out yet. How long would you say it usually takes for that to happen?

A few seconds or minutes. Are you using the same CUDA version and the same Linux distro?

Check the console in this run https://www.floydhub.com/diegoalejogm/projects/atari/20

I am probably not on the same Linux distro (I’m on CentOS 6 right now), but I am using Python 2.7 and CUDA 8. Which Linux distro are you on?
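
To compare the two setups precisely, a short script like the one below could be run on both machines and the output pasted here. It is a rough sketch: `describe_environment` is a made-up name, and it assumes `torch.version.cuda` and `torch.backends.cudnn.version()` are available in the installed PyTorch build.

```python
import platform
import sys


def describe_environment():
    """Collect the version details that matter when comparing two machines."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this interpreter
        return info
    info["torch"] = torch.__version__
    # CUDA version the binary was built against (None for CPU-only builds).
    info["cuda_build"] = getattr(torch.version, "cuda", None)
    # Only query cuDNN when a GPU is actually usable.
    if torch.cuda.is_available():
        info["cudnn"] = torch.backends.cudnn.version()
    else:
        info["cudnn"] = None
    return info


for key, value in sorted(describe_environment().items()):
    print("{}: {}".format(key, value))
```

The cuDNN line is probably the most interesting one here, since the backtrace ends in libcudnn.so.5.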