Recover from CUDA Out Of Memory

job28 · November 8, 2018, 4:20am

Hi team,
I have two data generator classes: one loads all the data from a file into memory and then feeds it, and the other feeds batches from the file. My script tries the first approach and, if there is not enough memory, falls back to the second.

try:
    loader = DataLoader1()
except RuntimeError as e:
    loader = DataLoader2()

But once the RuntimeError,

RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCTensorMath.cu:35

is raised by DataLoader1, DataLoader2 also raises the same error. Is there a way I can solve this?

Thank you.

ptrblck · November 8, 2018, 9:41am

Have a look at this thread. FairSeq uses different approaches in case it runs into an OOM issue.
Maybe you could adapt them to your use case.

job28 · November 9, 2018, 2:26am

Thanks for the link. In my case, this happens before training; the model is not created yet! In the try block I’m trying to load the whole training set into memory, which sometimes fails. Since that object is never created, shouldn’t DataLoader2 in the except block be executed?
torch.cuda.empty_cache() does not seem to help here. Moreover, the documentation also states

“the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch”

ptrblck · November 9, 2018, 9:30am

torch.cuda.empty_cache() is called after the tensors were deleted.
Could you try to delete loader in the exception first, then empty the cache and see if you can recreate the loader using DataLoader2?
How did you create your DataLoader? Do you push all data onto the GPU?
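A minimal sketch of the suggested order of operations (handle the error, release what can be released, then build the second loader). The helper name `build_with_fallback` and its parameters are hypothetical; `primary` and `fallback` stand in for constructing DataLoader1 and DataLoader2. The recovery step runs after the except block has ended, because a live exception keeps a traceback that references the frames where the error was raised. Whether that matters in this particular case is an assumption, not something the thread confirms.

```python
import gc


def build_with_fallback(primary, fallback, cleanup=None):
    """Try primary(); on an out-of-memory RuntimeError, clean up and call fallback().

    primary, fallback: zero-argument callables (hypothetical stand-ins for
    constructing DataLoader1 and DataLoader2).
    cleanup: optional zero-argument callable, e.g. torch.cuda.empty_cache.
    """
    try:
        return primary()
    except RuntimeError as e:
        if 'out of memory' not in str(e):
            raise  # not an OOM error, so do not mask it
    # We only get here after an OOM error. The except block has ended,
    # so the exception and its traceback are no longer referenced here.
    gc.collect()
    if cleanup is not None:
        cleanup()
    return fallback()
```

With PyTorch installed, this could be called as `build_with_fallback(lambda: DataLoader1(args), lambda: DataLoader2(args), cleanup=torch.cuda.empty_cache)`.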

job28 · November 11, 2018, 9:59pm

In DataLoader1, the CUDA OOM occurs here, when the tensor is created inside __init__():

class DataLoader1():
    def __init__(self, args):
        ....
        self.features_vec = torch.zeros((self.num_samples, self.max_seq_len), dtype=torch.long, device=self.device)

Which is understandable, as I am trying to create a tensor of that size. As this is in DataLoader1’s __init__(), I suppose the object loader is never created. I get an undefined variable error when I try

except RuntimeError as e:
    del loader

Whereas DataLoader2 looks like this:

class DataLoader2():
    def __init__(self, args):
        ...
        self.features_vec = torch.zeros((self._batch_size, self.max_seq_len), dtype=torch.long, device=self.device)

And this too raises OOM while creating self.features_vec. Here self.device = torch.device('cuda') and I can see plenty of memory left in nvidia-smi.
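For a rough sense of scale, the two allocations above can be compared with simple arithmetic: torch.long is a 64-bit integer, so each element takes 8 bytes. The sample count, batch size and sequence length below are hypothetical values chosen only for illustration; the thread does not state the real ones.

```python
def tensor_bytes(num_rows, num_cols, bytes_per_element=8):
    """Bytes needed for a dense num_rows x num_cols tensor.

    torch.long is a 64-bit integer, hence the default of 8 bytes per element.
    """
    return num_rows * num_cols * bytes_per_element


# Hypothetical sizes, chosen only for illustration (not from the thread).
num_samples, batch_size, max_seq_len = 50_000_000, 64, 50

full = tensor_bytes(num_samples, max_seq_len)   # DataLoader1-style allocation
batch = tensor_bytes(batch_size, max_seq_len)   # DataLoader2-style allocation
print("full dataset: {:.1f} GiB".format(full / 2**30))
print("one batch:    {:.1f} KiB".format(batch / 2**10))
```

Under these assumed numbers the per-batch tensor is many orders of magnitude smaller than the full one, so a failure on the small allocation is unlikely to be about its size alone.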

ptrblck · November 11, 2018, 10:07pm

Since the exception is thrown in __init__, your loader was never actually instantiated.
That is probably why you can’t delete loader in the except block.
Is the memory still in use after the exception?

If so, could you try to create the try block inside of __init__ and empty the cache once you get the OOM error?

Are you getting the OOM error for DataLoader2 from the beginning or just after DataLoader1 could not be created?
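The point about instantiation can be reproduced without a GPU. In this sketch, `FailsInInit` is a hypothetical stand-in for a loader whose constructor raises; because the right-hand side of the assignment never returns, the name `loader` is never bound, which is why `del loader` fails. Pre-binding the name to None is one common way to keep later cleanup code simple.

```python
class FailsInInit:
    """Hypothetical stand-in for a loader whose __init__ raises."""

    def __init__(self):
        raise RuntimeError("out of memory")  # simulated failure


def loader_is_bound_after_failure():
    """Return True if `loader` exists after the constructor raised."""
    try:
        loader = FailsInInit()  # the assignment never happens
    except RuntimeError:
        pass
    try:
        loader  # looking up an unbound local raises a NameError subclass
    except NameError:
        return False
    return True


def prebound_loader():
    """Bind the name first so it always exists, even if construction fails."""
    loader = None
    try:
        loader = FailsInInit()
    except RuntimeError:
        pass
    return loader
```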

job28 · November 11, 2018, 10:15pm

That’s right, loader is not instantiated. This is DataLoader1

class DataLoader1():

    def __init__(self, trg_file_path, emb_indices, batch_size, max_seq_len, label_signature=None,
                sample_per_word=False, one_hot_labels=False, update_emb_indices=False,
                infinite_batches=False, intents_to_use=None, device=None, num_samples_key='num_samples',
                get_raw_samples_while_iter=False, debug_mode=False):
        """
        Iterator to feed trg data
        args:
            labels = [
                ('intent',[None, 'greet', 'direction']),
                ('action',[None, 'enquire', 'navigate']),
                ('subject',[None, 'thank', 'hello'])
                ]
            or
            labels = [None, 'person','location','day', 'time']
            batches_to_produce:
                -1 - feeds indefinitely
                0 - Feeds till the file exhausts
                <int value> - feeds <value> batches and exits
            intents_to_use - if a list is provided, excludes samples not included in this list
        """
        if debug_mode:
            torch.set_printoptions(threshold=10000)
        else:
            torch.set_printoptions(threshold=500)
        self.emb_indices = emb_indices
        self.batch_size = batch_size
        self.max_seq_len = max_seq_len
        self.sample_per_word = sample_per_word
        self.one_hot_labels = one_hot_labels
        self.update_emb_indices = update_emb_indices
        self.get_raw_samples_while_iter = get_raw_samples_while_iter
        self.sample_ptr = 0
        self.infinite_batches = infinite_batches
        self.num_samples_key = num_samples_key

        print ('Preparing data from file = {}'.format(trg_file_path))
        self.phrase_match, self.num_samples, self.intents_to_train, labels_from_trg_file = getClasses(
                                            trg_file_path,intents_to_get=intents_to_use,
                                            num_samples_key=num_samples_key)
        self.labels = labels_from_trg_file if label_signature is None else label_signature

        if device:
            self.device = device
        else:
            self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

        if get_raw_samples_while_iter:
            self.samples = []
        # calc feature vec size
        try:
            self.features_vec = torch.zeros((self.num_samples, self.max_seq_len), dtype=torch.long, device=self.device)
        except RuntimeError as e:
            torch.cuda.empty_cache()
            raise e

As you can see, self.features_vec is the first tensor that is created; everything else is a small, trivial Python variable. I tried clearing the cache in a try block, but that didn’t help.

Yes, I’m getting OOM in DataLoader2 just after DataLoader1 fails to instantiate. If I run just DataLoader2, it works fine (that’s how I am training now).

job28 · November 11, 2018, 10:23pm

  1 Trying to load data onto memory
  2 Preparing data from file = trg_data.txt
  3 CUDA error: out of memory
  4 Not enough memory to load all the data to GPU. Run script without the '-m' flag
  5 torch.cuda.max_memory_allocated()=0 ,torch.cuda.max_memory_cached() = 0
  6 
  7 Preparing data from file = trg_data.txt
  8 THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=35 error=2 : out of memory
  9 Traceback (most recent call last):
 10   File "train_ext.py", line 239, in <module>
 11     intents_to_train=args.intents, debug_mode=args.debug, test=args.test, load_to_memory=args.load_all_data_to_memory    )   
 12   File "train_ext.py", line 59, in train
 13     num_samples_key='num_words', get_raw_samples_while_iter=debug_mode, debug_mode=debug_mode)
 14   File "/media/storage/dev/utils.py", line 261, in __init__
 15     self.features_vec = torch.zeros((self._batch_size, self.max_seq_len), dtype=torch.long, device=self.device)
 16 RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCTensorMath.cu:35

Here’s the error. Line 4 is where the OOM occurs for DataLoader1, and from line 7 DataLoader2 tries to take over.

ptrblck · November 12, 2018, 12:05am

You could try to delete self.features_vec, in case it’s holding on to some reference, before calling torch.cuda.empty_cache(). Let me know if that works.

job28 · November 12, 2018, 12:10am

I think that, just like loader, self.features_vec is also not instantiated. I get

  File "/media/storage/dev/utils.py", line 65, in __init__
    del self.features_vec
AttributeError: DataLoader1 instance has no attribute 'features_vec'

when I do

        try:
            self.features_vec = torch.zeros((self.num_samples, self.max_seq_len), dtype=torch.long, device=self.device)
        except RuntimeError as e:
            del self.features_vec
            torch.cuda.empty_cache()
            raise e
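The AttributeError follows from the same rule as before: the right-hand side raised, so the attribute was never assigned. If a cleanup step is still wanted in the handler, it has to tolerate the attribute being absent. A minimal sketch, where the `Loader` class and its `allocate` parameter are hypothetical stand-ins for DataLoader1 and the torch.zeros call:

```python
class Loader:
    """Hypothetical loader; `allocate` stands in for the torch.zeros call."""

    def __init__(self, allocate):
        try:
            self.features_vec = allocate()
        except RuntimeError:
            # allocate() raised before the assignment, so there is nothing
            # to delete. pop() with a default is a no-op in that case
            # instead of raising AttributeError.
            self.__dict__.pop('features_vec', None)
            raise  # re-raise the original error with its traceback
```

In this situation the cleanup removes nothing, which matches the observation that there is no tensor left to free.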

ptrblck · November 12, 2018, 12:16am

Thanks for trying this approach. I’m afraid I have no other suggestions.
Let’s see what others might come up with. Maybe I’m just not seeing any obvious issue.

job28 · November 12, 2018, 12:24am

Thanks for your time @ptrblck. I will keep trying too.

job28 · November 13, 2018, 12:47am

I think I know what’s happening here. It takes a little time for the memory to clear so that it can be reused.
This is what worked!

        try:
            print("\nTrying to load data onto memory")
            data_loader = DataLoader1(trg_file_path, params)
        except RuntimeError as e:
            print ("Not enough memory to load all the data to GPU\nTrying generator approach")
            _initialized = 0
            while _initialized != 1:
                _initialized -= 1
                try:
                    data_loader = DataLoader2(trg_file_path, params)
                    _initialized = 1
                except RuntimeError as e:
                    if _initialized > -5:
                        print ("Failed to initialize, attempt {}. Let's try again".format(abs(_initialized)))
                    else:
                        raise RuntimeError(e)

Output:

Trying to load data onto memory
Preparing data from file = trg_data.txt
Not enough memory to load all the data to GPU
Trying generator approach
Preparing data from file = trg_data.txt
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=35 error=2 : out of memory
Failed to initialize, attempt 1. Let's try again
Preparing data from file = trg_data.txt

Success!!

As you can see, attempt 1 failed but attempt 2 succeeded. So now, is there a way I can wait for the memory to sync rather than running a while loop?
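This does not answer the question of how to wait for the memory, but the retry loop itself can be written as a bounded for loop. In this sketch the helper name `retry_on_oom`, the `delay` pause and the `cleanup` hook are all assumptions for illustration; `factory` stands in for constructing DataLoader2. Only the error message is kept between attempts, so the exception object and its traceback are not held while retrying.

```python
import gc
import time


def retry_on_oom(factory, max_retries=5, delay=0.0, cleanup=None):
    """Call factory() until it succeeds or max_retries attempts have failed.

    factory: zero-argument callable (hypothetical stand-in for DataLoader2).
    delay:   seconds to sleep between attempts (an assumption; the loop in
             the thread retries immediately).
    cleanup: optional zero-argument callable run between attempts,
             e.g. torch.cuda.empty_cache.
    Returns (result, attempts_used).
    """
    last_message = None
    for attempt in range(1, max_retries + 1):
        try:
            return factory(), attempt
        except RuntimeError as e:
            if 'out of memory' not in str(e):
                raise  # unrelated error, do not retry
            last_message = str(e)  # keep the text only, not the exception
        # Between attempts: collect garbage, run cleanup, optionally wait.
        gc.collect()
        if cleanup is not None:
            cleanup()
        if delay:
            time.sleep(delay)
    raise RuntimeError(last_message)
```

A call such as `data_loader, attempts = retry_on_oom(lambda: DataLoader2(trg_file_path, params))` would replace the while loop and the `_initialized` counter.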

ptrblck · November 13, 2018, 1:41am

Oh wow, that’s awesome!
Just a wild guess, but would torch.cuda.synchronize() work?

job28 · November 13, 2018, 1:44am

Thanks, I tried that but it didn’t work.

    try:
        print ("\nTrying to load all data onto Memory")
        data_loader = DataLoader1(trg_file_path, params)
    except RuntimeError as e:
        if 'out of memory' not in str(e):
            raise RuntimeError(e)
        print ("Not enough memory to load all that data\n\nTrying generator approach")
        torch.cuda.synchronize()
        _initialized = 0
        _max_retries = 5
        while _initialized != 1:
            _initialized -= 1
            try:
                data_loader = DataLoader2(trg_file_path, params)
                print("Attempt {}/{}, initialized".format(abs(_initialized), _max_retries))
                _initialized = 1
            except RuntimeError as e:
                if _initialized > -1 * _max_retries:
                    print("Attempt {}/{}, failed to initialize, trying again".format(abs(_initialized), _max_retries))
                else:
                    raise RuntimeError(e)

stdout

Trying to load data onto memory
Preparing data from file = trg_data.txt
Not enough memory to load all that data

Trying generator approach
Preparing data from file = trg_data.txt
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorMath.cu line=35 error=2 : out of memory
Attempt 1/5, failed to initialize, trying again
Preparing data from file = trg_data.txt
Attempt 2/5, initialized

Still takes 2 attempts

EDIT:Nomenclature