Low-level error running my Deep Q-Learning algorithm

diegoalejogm · January 22, 2018, 9:40pm

I am running my Deep Q-learning algorithm on FloydHub, and when it runs I get the following error:

2018-01-22 20:27:42,694 INFO - *** Error in 'python': free(): invalid next size (fast): 0x00007f3794370de0 ***
2018-01-22 20:27:42,704 INFO - ======= Backtrace: =========
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f52a3f247e5]
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f52a3f2ce0a]
2018-01-22 20:27:42,705 INFO - /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f52a3f3098c]
2018-01-22 20:27:42,705 INFO - /usr/lib/x86_64-linux-gnu/libcudnn.so.5(cudnnDestroyConvolutionDescriptor+0x9)[0x7f52839fcff9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(+0x2deda9)[0x7f526d129da9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f526e07f704]
...

This is a rented machine, so I can’t debug anything. It also seems to be a C-level problem, so it isn’t being caught by the Python API, and I don’t know exactly what in my code is triggering the error.

Any suggestions?

GitHub link:

FloydHub link:
https://www.floydhub.com/diegoalejogm/projects/atari/20/code

richard · January 22, 2018, 10:00pm

It definitely sounds like something went wrong…

Does the code run on a regular machine? Are you using the GPU?

diegoalejogm · January 22, 2018, 10:06pm

Yes, it does. If you remove all the .cuda() calls, it works on the CPU.

The code I uploaded to GitHub is meant to run on the GPU only, and that is where I’m getting the error I mentioned.
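Rather than deleting every .cuda() call by hand to compare CPU and GPU runs, a single switch can decide where things live. This is only a sketch: maybe_cuda and USE_CUDA are hypothetical names, not part of the repository, and the helper works on anything that has a .cuda() method (tensors, Variables, modules).

```python
# Decide once whether a GPU is usable; fall back to CPU if torch is missing.
try:
    import torch
    USE_CUDA = torch.cuda.is_available()
except ImportError:
    USE_CUDA = False


def maybe_cuda(obj, use_cuda=USE_CUDA):
    """Return obj.cuda() when use_cuda is true, otherwise obj unchanged.

    Hypothetical helper: obj is anything exposing a .cuda() method.
    """
    return obj.cuda() if use_cuda else obj


# Usage (assumed names): model = maybe_cuda(model); batch = maybe_cuda(batch)
```

With this, the same script can be run on both devices by flipping one flag, which makes it easier to confirm that only the GPU path crashes.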

diegoalejogm · January 22, 2018, 11:05pm

Does it look more like a PyTorch problem?

richard · January 23, 2018, 3:10pm

If I had to guess, it looks more like a PyTorch problem. I’m having a little trouble running your code, but I’ll try again later today and report back.

One thing that you could try in the meantime is building PyTorch from source and running the code on a GPU with that. There have been a couple of fixes for memory errors, and this could be one of them.
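Another option, since the crash happens in native code and never surfaces as a Python exception, is the faulthandler module. It installs handlers for fatal signals (including the SIGABRT that glibc raises on heap corruption) and dumps the Python stack when one arrives, which at least shows which Python line was executing. It is in the standard library from Python 3.3; for Python 2.7 I believe it is available as a backport on PyPI. A minimal sketch, where the function name and log path are placeholders:

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Write the Python stack of all threads to log_path on a fatal signal.

    Returns the open file; the caller must keep a reference to it so the
    file descriptor stays valid for the lifetime of the process.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Usage (placeholder path), at the very top of the training script:
#   crash_log = enable_crash_traceback("crash_traceback.log")
```

This does not fix anything by itself, but on a rented machine where a debugger is not available it can narrow the problem down to a specific call.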

diegoalejogm · January 23, 2018, 3:32pm

It definitely is a PyTorch error.

I was installing PyTorch 0.3.0.post4 for Python 2.7 with CUDA 8 from its public .whl file, and that version gave me the error. Link: http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl

So I decided to downgrade to an earlier version, 0.2.0.post3, for the same Python and CUDA versions.
Link: http://download.pytorch.org/whl/cu80/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl
This time it ran without the crash, so I’d guess it is an error in the latest public release.

I was able to run my DQL algorithm successfully.

richard · January 23, 2018, 4:42pm

How do I run your code? I downloaded it from your GitHub repository, but running python main.py gives:

(py27) [rzou@devgpu226.prn2 ~/deep-q-learning] python main.py
Loading Memory data into Replay Memory Instance...
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    load_existing=True, data_dir=FLAGS.in_dir)
  File "/home/rzou/deep-q-learning/utils/models.py", line 28, in __init__
    self.load()
  File "/home/rzou/deep-q-learning/utils/models.py", line 111, in load
    self.__init__(self.memory_size, load_existing=False)
TypeError: __init__() takes at least 4 arguments (3 given)
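For reference, this kind of TypeError usually means load() re-invokes __init__ without one of its required positional arguments. The sketch below reproduces the pattern with a hypothetical stand-in class; the parameter names (state_shape in particular) are assumptions, and the real signature in utils/models.py may differ.

```python
class ReplayMemory(object):
    """Hypothetical stand-in for the class in utils/models.py."""

    def __init__(self, memory_size, state_shape, load_existing=False,
                 data_dir=None):
        self.memory_size = memory_size
        self.state_shape = state_shape  # assumed required argument
        self.data_dir = data_dir
        self.items = []
        if load_existing:
            self.load()

    def load_broken(self):
        # Mirrors the failing call: state_shape is not passed, so Python
        # raises TypeError before any state is reset.
        self.__init__(self.memory_size, load_existing=False)

    def load(self):
        # Reset state by passing every required argument explicitly.
        self.__init__(self.memory_size, self.state_shape,
                      load_existing=False, data_dir=self.data_dir)
        # ... reading saved transitions from self.data_dir would go here ...
```

Moving the reset logic into a separate method instead of calling __init__ again would avoid this class of bug altogether.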

diegoalejogm · January 25, 2018, 5:59am

It is fixed now.

You can run python main.py and it should work.

Let me know how it goes.

richard · January 25, 2018, 8:50pm

I’m running it with PyTorch 0.3.0_post4 and it hasn’t erred out yet. How long would you say it usually takes for that to happen?

diegoalejogm · January 26, 2018, 1:03am

A few seconds or minutes. Are you using the same CUDA version and the same Linux distro?

Check the console in this run: https://www.floydhub.com/diegoalejogm/projects/atari/20

richard · January 26, 2018, 4:07pm

I am probably not on the same Linux distro (I’m on CentOS 6 right now), but I am using Python 2.7 and CUDA 8. Which Linux distro are you on?
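To compare the two setups, a short script along these lines could be run on both machines. It is a sketch: environment_report is a made-up name, and the torch attributes are read defensively because I have not checked that they behave the same way in every release.

```python
import platform


def environment_report():
    """Collect the version details relevant to this crash into a dict."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),  # kernel and OS details
    }
    try:
        import torch
    except ImportError:
        report["torch"] = None
        return report
    report["torch"] = torch.__version__
    # CUDA version the installed binary was built against, if exposed.
    report["cuda"] = getattr(torch.version, "cuda", None)
    try:
        report["cudnn"] = torch.backends.cudnn.version()
    except Exception:  # cuDNN missing or not loadable
        report["cudnn"] = None
    return report


if __name__ == "__main__":
    for key, value in sorted(environment_report().items()):
        print("%s: %s" % (key, value))
```

The cuDNN entry is worth a look here, since the backtrace ends in libcudnn.so.5.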