Error in `python': free(): invalid pointer when using model.cuda() on an AWS instance

For some reason, when I call model.cuda() (following the examples), I get the following error:
*** Error in `python’: free(): invalid pointer: 0x00007f8af6c2bae0 ***

However, when I remove model.cuda(), I get no free() errors and the model trains fine. Do I have to call .cuda() on every single variable, including the criterion?

I am using Python 2 on the Udacity TensorFlow g2.2xlarge instance on Amazon AWS.

Here is a link to my code:

Thank you!

Hi,

The .cuda() operation is not in-place for tensors, so you should do input = input.cuda().
That being said, it should just raise a clean error, not crash like that.
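
For example, here is a minimal sketch of the pattern (written against current PyTorch tensor APIs, with made-up model and data names, so adapt it to your own code). Modules are moved in place by .cuda(), but tensors are not, so the result has to be assigned back:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)              # stand-in model for illustration
criterion = nn.CrossEntropyLoss()

model.cuda()      # in place for nn.Module: moves the parameters to the GPU
criterion.cuda()  # only matters if the criterion holds parameters/buffers

inputs = torch.randn(4, 10)           # stand-in batch for illustration
targets = torch.tensor([1, 0, 1, 0])

inputs = inputs.cuda()    # NOT in place for tensors: returns a new tensor,
targets = targets.cuda()  # so the result must be assigned back

outputs = model(inputs)
loss = criterion(outputs, targets)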

I changed everything to use .cuda() now, but this is the error I get instead:

THCudaCheck FAIL file=/py/conda-bld/pytorch_1490983232023/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "model.py", line 244, in <module>
    train(train_loader, model, criterion, optimizer, epoch)
  File "model.py", line 118, in train
    loss.backward()
  File "/home/ubuntu/anaconda3/lib/python3.5/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: cuda runtime error (2) : out of memory at /py/conda-bld/pytorch_1490983232023/work/torch/lib/THC/generic/THCStorage.cu:66

You don’t have enough memory on the GPU; you may want to reduce the batch size.
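
As a minimal sketch (with a hypothetical stand-in dataset; in your code this would be whatever currently feeds train_loader), halving batch_size roughly halves the activations kept alive for the backward pass, which is usually the largest consumer of GPU memory during training:

import torch
from torch.utils.data import DataLoader, TensorDataset

# stand-in dataset for illustration
train_dataset = TensorDataset(torch.randn(1000, 10), torch.randint(0, 2, (1000,)))

# a smaller batch_size means fewer activations held for backward,
# and therefore a smaller peak of GPU memory per training step
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)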

I get the same error when running the cartpole example with CUDA. However, as mentioned above, it runs fine without CUDA. The error persists even after reducing the batch_size to 2. Any solutions?

sudo apt-get install libtcmalloc-minimal4
export LD_PRELOAD="/usr/lib/libtcmalloc_minimal.so.4"

Fixes the error.
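
As a quick sanity check (just a sketch), you can confirm from inside Python that the preload is visible to the process; if this prints None, the export was made in a different shell than the one running the training script:

import os

# should print the tcmalloc path set by the export above
print(os.environ.get("LD_PRELOAD"))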


It works, thanks. But do you know why?

Indeed it solves the “invalid pointer error”! Can anyone explain why?

Is there any solution to this for someone on an academic institution cluster without sudo privileges?

This solved the problem, but after a few more epochs it crashed again. Any more suggestions?