Error in `python': free(): invalid pointer when using model.cuda() on an AWS instance

For some reason, when I call model.cuda() (following the examples), I get the following error:
*** Error in `python’: free(): invalid pointer: 0x00007f8af6c2bae0 ***

However, when I remove model.cuda(), I get no free() errors and the model trains fine. Do I have to call .cuda() on every single variable, including the criterion?

I am using Python 2 on the Udacity TensorFlow g2.2xlarge instance on Amazon AWS.

Here is a link to my code:

Thank you!

Hi,

The .cuda() operation is not in-place for tensors, so you should do input = input.cuda().
That being said, it should just raise a clean error, not crash like that.
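
For example, here is a minimal sketch of the pattern (written against current PyTorch tensor APIs, with made-up model and data names, so adapt it to your own code). Modules are moved in place by .cuda(), but tensors are not, so the result has to be assigned back:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)              # stand-in model for illustration
criterion = nn.CrossEntropyLoss()

model.cuda()      # in place for nn.Module: moves the parameters to the GPU
criterion.cuda()  # only matters if the criterion holds parameters/buffers

inputs = torch.randn(4, 10)           # stand-in batch for illustration
targets = torch.tensor([1, 0, 1, 0])

inputs = inputs.cuda()    # NOT in place for tensors: returns a new tensor,
targets = targets.cuda()  # so the result must be assigned back

outputs = model(inputs)
loss = criterion(outputs, targets)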

I changed everything to use .cuda() now, but this is the error I get instead:

THCudaCheck FAIL file=/py/conda-bld/pytorch_1490983232023/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "model.py", line 244, in <module>
    train(train_loader, model, criterion, optimizer, epoch)
  File "model.py", line 118, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: cuda runtime error (2) : out of memory at /py/conda-bld/pytorch_1490983232023/work/torch/lib/THC/generic/THCStorage.cu:66

You don’t have enough memory on the GPU; you may want to reduce the batch size.
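
As a minimal sketch (with a hypothetical stand-in dataset; in your code this would be whatever currently feeds train_loader), halving batch_size roughly halves the activations kept alive for the backward pass, which is usually the largest consumer of GPU memory during training:

import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in dataset for illustration
train_dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# a smaller batch_size means fewer activations held for backward,
# and therefore a smaller peak of GPU memory per training step
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)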

I get the same error when running the cartpole example with CUDA. However, as mentioned above, it runs fine without CUDA. The error persists even after reducing the batch_size to 2. Any solutions?

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

Fixes the error.
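
As a quick sanity check (just a sketch), you can confirm from inside Python that the preload is visible to the process; if this prints None, the export was made in a different shell than the one running the training script:

import os

# should print the tcmalloc path set by the export above
print(os.environ.get("LD_PRELOAD"))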


It works, thanks. But do you know why?

Indeed it solves the “invalid pointer error”! Can anyone explain why?

Is there any solution to this for someone on an academic institution cluster without sudo privileges?

This solved the problem, but after a few more epochs it crashed again. Any more suggestions?