I am running my Deep Q-learning algorithm on FloydHub, and when it runs I get the following error:
2018-01-22 20:27:42,694 INFO - *** Error in 'python': free(): invalid next size (fast): 0x00007f3794370de0 ***
2018-01-22 20:27:42,704 INFO - ======= Backtrace: =========
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f52a3f247e5]
2018-01-22 20:27:42,704 INFO - /lib/x86_64-linux-gnu/libc.so.6(+0x7fe0a)[0x7f52a3f2ce0a]
2018-01-22 20:27:42,705 INFO - /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f52a3f3098c]
2018-01-22 20:27:42,705 INFO - /usr/lib/x86_64-linux-gnu/libcudnn.so.5(cudnnDestroyConvolutionDescriptor+0x9)[0x7f52839fcff9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(+0x2deda9)[0x7f526d129da9]
2018-01-22 20:27:42,705 INFO - /usr/local/lib/python2.7/site-packages/torch/_C.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f526e07f704]
...
This is a rented machine, so I can’t debug anything on it. The error also seems to come from C code, so it isn’t caught by the Python API, and I don’t know exactly what in my code triggers it.
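One way to see which Python line was executing when the native code aborts is the `faulthandler` module. glibc reports `free(): invalid next size` by calling `abort()`, which raises SIGABRT, and `faulthandler` dumps the Python traceback on that signal. The following is only a minimal sketch: `faulthandler` is in the standard library from Python 3.3 onward (on Python 2.7 a backport of the same name from PyPI would be needed), and `train()` is a hypothetical placeholder for the real training loop.

```python
import faulthandler
import sys

# Dump the Python traceback of every thread if the process receives a fatal
# signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL). sys.__stderr__ is the
# original stderr stream, so the dump still works if sys.stderr was replaced.
faulthandler.enable(file=sys.__stderr__, all_threads=True)


def train():
    """Hypothetical placeholder for the real Deep Q-learning training loop."""
    return "done"


if __name__ == "__main__":
    train()
```

On Python 3 the same handler can be switched on without touching the code by running `python -X faulthandler main.py` or setting the `PYTHONFAULTHANDLER` environment variable. The dump only shows where Python was when the crash was detected, which may be later than the point where the memory was actually corrupted.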
My guess is that this is a PyTorch problem rather than something in your code. I’m having a little trouble running your code, but I’ll try again later today and report back.
One thing you could try in the meantime is building PyTorch from source and running the code on a GPU with that build. There have been a couple of fixes to memory errors, and this could be one of them.
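When comparing a source build against the packaged one, it helps to confirm which build the job actually imports. A small sketch, assuming nothing about the project itself: `describe_torch_build` is a hypothetical helper name, and it returns an empty dict when `torch` cannot be imported.

```python
def describe_torch_build():
    """Return version details of the importable PyTorch build, or {} if there is none."""
    try:
        import torch
    except ImportError:
        return {}
    return {
        "torch": torch.__version__,                 # version string of the build
        "cuda_available": torch.cuda.is_available(),
        "cudnn": torch.backends.cudnn.version(),    # None if cuDNN is not usable
    }


if __name__ == "__main__":
    print(describe_torch_build())
```

Printing this at the start of the job makes it easy to see from the logs whether the crash also happens with the newer build.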
How do I run your code? I downloaded it from your GitHub repository, but running python main.py gives:
(py27) [rzou@devgpu226.prn2 ~/deep-q-learning] python main.py
Loading Memory data into Replay Memory Instance...
Traceback (most recent call last):
  File "main.py", line 55, in <module>
    load_existing=True, data_dir=FLAGS.in_dir)
  File "/home/rzou/deep-q-learning/utils/models.py", line 28, in __init__
    self.load()
  File "/home/rzou/deep-q-learning/utils/models.py", line 111, in load
    self.__init__(self.memory_size, load_existing=False)
TypeError: __init__() takes at least 4 arguments (3 given)
TypeError: __init__() takes at least 4 arguments (3 given)
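The traceback suggests that `load()` calls `__init__` again without one of its required arguments. A minimal sketch of that pattern and one possible fix follows. The class name `ReplayMemory` and the signature are assumptions, guessed from the `load_existing=True, data_dir=FLAGS.in_dir` call in main.py and from the argument count in the error; the real code in utils/models.py may differ.

```python
class ReplayMemory(object):
    """Hypothetical stand-in for the class defined in utils/models.py."""

    def __init__(self, memory_size, load_existing, data_dir):
        self.memory_size = memory_size
        self.data_dir = data_dir
        self.buffer = []
        if load_existing:
            self.load()

    def load(self):
        # The call in the traceback omits data_dir, which raises TypeError:
        #     self.__init__(self.memory_size, load_existing=False)
        # Passing every required argument resets the instance without the error.
        self.__init__(self.memory_size, load_existing=False,
                      data_dir=self.data_dir)
        # Reading the saved transitions from self.data_dir would go here.


if __name__ == "__main__":
    try:
        # Same mistake as in the traceback: data_dir is missing.
        ReplayMemory(100, load_existing=False)
    except TypeError as exc:
        print("TypeError:", exc)

    memory = ReplayMemory(100, load_existing=True, data_dir="data")
    print(memory.data_dir)
```

Giving `data_dir` a default value in `__init__` would also make the original call valid, but then the reloaded instance would lose the directory it was created with.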