I am running my Deep Q-learning algorithm on FloydHub, and when it runs I get the following error:
2018-01-22 20:27:42,694 INFO - *** Error in 'python': free(): invalid next size (fast): 0x00007f3794370de0 ***
2018-01-22 20:27:42,704 INFO - ======= Backtrace: =========
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f52a3f247e5]
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f52a3f2ce0a]
2018-01-22 20:27:42,705 INFO - /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f52a3f3098c]
2018-01-22 20:27:42,705 INFO - /usr/lib/x86_64-linux-gnu/libcudnn.so.5(cudnnDestroyConvolutionDescriptor+0x9)[0x7f52839fcff9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(+0x2deda9)[0x7f526d129da9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f526e07f704]
This is a rented machine, so I can’t debug anything. It also seems to be a C-level problem, so it isn’t being caught by the Python API, and I don’t know exactly what in my code is generating the error.
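One way to narrow down which Python line triggers a native crash like this, even on a machine where no debugger can be attached, is the faulthandler module. glibc aborts the process with SIGABRT when it detects heap corruption, and faulthandler can dump the Python traceback at that moment. This is only a sketch: faulthandler is in the standard library from Python 3.3 on, and for Python 2.7 it assumes the faulthandler backport from PyPI is installed. The helper name enable_crash_tracebacks is made up for illustration.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump Python tracebacks to `stream` when the process gets a fatal signal.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. The stream must stay open for as long as the handler is enabled.
    """
    if stream is None:
        # Assumption: the hosted job forwards the original stderr to its log.
        stream = sys.__stderr__
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()


# Hypothetical usage, as the first statements of main.py:
#     enable_crash_tracebacks()
```

The traceback shows where Python was when the abort happened, which is not necessarily where the memory was corrupted, so treat it as a starting point rather than a diagnosis.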
It definitely sounds like something went wrong…
Does the code run on a regular machine? Are you using the GPU?
Yes, it does. If you remove all the .cuda() calls, it definitely works on the CPU.
The code I uploaded to GitHub should work for GPU only, but I’m getting the error I mentioned.
Does it look more like a PyTorch problem?
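On removing the .cuda() calls: rather than deleting them by hand, the CPU and GPU paths can be switched with a single flag, which makes it easier to confirm that the crash only occurs on the CUDA path. A minimal sketch, where to_device and USE_GPU are hypothetical names and not part of the repository:

```python
def to_device(obj, use_gpu):
    """Return obj moved to the GPU if use_gpu is true, otherwise unchanged.

    Works for anything that exposes a .cuda() method, such as tensors,
    Variables, and nn.Module instances.
    """
    return obj.cuda() if use_gpu else obj


# Hypothetical usage inside the training script:
#     USE_GPU = torch.cuda.is_available()
#     model = to_device(model, USE_GPU)
#     batch = to_device(batch, USE_GPU)
```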
If I had to guess, it definitely looks more like a PyTorch problem. I’m having a little trouble running your code, but I’ll try again later today and report back.
One thing you could try in the meantime is building PyTorch from source and running the code on a GPU with that. There have been a couple of fixes to memory errors, and this could be one of them.
It definitely is a PyTorch error.
I was installing PyTorch 0.3.0.post4 for Python 2.7 with CUDA 8 from its public .whl file, and that version gave me the error. Link: http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
So I decided to downgrade to an earlier version: 0.2.0.post3 for the same Python and CUDA versions.
This time it ran successfully, so I’d guess it is a bug in the latest public release.
I was able to run my DQL algorithm successfully:
How do I run your code? I downloaded it from your GitHub repository, but running python main.py gives:
(py27) [email@example.com ~/deep-q-learning] python main.py
Loading Memory data into Replay Memory Instance...
Traceback (most recent call last):
File "main.py", line 55, in <module>
File "/home/rzou/deep-q-learning/utils/models.py", line 28, in __init__
File "/home/rzou/deep-q-learning/utils/models.py", line 111, in load
TypeError: __init__() takes at least 4 arguments (3 given)
It is fixed now.
You can run python main.py and it should work.
Let me know how it goes.
I’m running it with PyTorch 0.3.0_post4 and it hasn’t errored out yet. How long would you say it usually takes for that to happen?
A few seconds or minutes. Are you using the same CUDA version and the same Linux distro?
Check the console output of this run: https://www.floydhub.com/diegoalejogm/projects/atari/20
I am probably not on the same Linux distro (I’m on CentOS 6 right now), but I am using Python 2.7 and CUDA 8. Which Linux distro are you on?
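To compare setups like this without guessing, both sides can print the same short report and diff the results. The sketch below is one possible way to do that; environment_report is a hypothetical helper, and the getattr guard is there because I am assuming, not asserting, that torch.version.cuda is present in every release discussed here.

```python
import platform


def environment_report():
    """Collect the version details that matter when reproducing a CUDA crash."""
    report = {
        "python": platform.python_version(),
        "platform": platform.platform(),  # includes kernel and distro hints
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed here; report that instead of failing.
        report["torch"] = None
        return report

    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    # CUDA version the binary was built against, if the attribute exists.
    report["cuda_build"] = getattr(torch.version, "cuda", None)
    # cuDNN version as seen by PyTorch.
    report["cudnn"] = torch.backends.cudnn.version()
    return report


# Hypothetical usage:
#     for key, value in sorted(environment_report().items()):
#         print("%s: %s" % (key, value))
```

The backtrace above goes through libcudnn.so.5, so the cuDNN entry is worth comparing alongside the CUDA and distro versions.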