Hi,
I’m working on an NLP classification problem. The dataset (after preprocessing, of course) is very simple: it has two columns, a label with a dozen unique values and a column containing a large string.
I’m testing a CNN approach to this problem.
This part raises a MemoryError:
```python
dataset = load_classification_dataset(args.target_csv, tokenizer_func=character_tokenizer)

args.num_embeddings = len(dataset.vectorizer.token_vocab)
args.num_classes = len(dataset.vectorizer.label_vocab)

model = CharCNN(num_embeddings=args.num_embeddings,
                embedding_size=args.embedding_size,
                channel_size=args.channel_size,
                num_classes=args.num_classes)
```
```
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-10-493fc7e9d364> in <module>
----> 1 dataset = load_classification_dataset(args.target_csv, tokenizer_func=character_tokenizer)
      2
      3 args.num_embeddings = len(dataset.vectorizer.token_vocab)
      4 args.num_classes = len(dataset.vectorizer.label_vocab)
      5

<ipython-input-6-fb4c2153975e> in load_classification_dataset(dataset_csv, tokenizer_func)
     26                 vectorizer=vectorizer,
     27                 target_x_column='tokenized',
---> 28                 target_y_column='label')
     29
     30     return dataset

<ipython-input-4-3c018962b6ba> in __init__(self, df, vectorizer, target_x_column, target_y_column)
     32             self._vectorized[split_name] = \
     33                 vectorizer.transform(x_data_list=split_df[target_x_column].tolist(),
---> 34                                      y_label_list=split_df[target_y_column].tolist())
     35         self.vectorizer = vectorizer
     36         self.active_split = None

<ipython-input-3-a912c2226d66> in transform(self, x_data_list, y_label_list)
     67         y_vector = []
     68         for x_data, y_label in zip(x_data_list, y_label_list):
---> 69             x_vector, y_index = self.vectorize(x_data, y_label)
     70             x_matrix.append(x_vector)
     71             y_vector.append(y_index)

<ipython-input-3-a912c2226d66> in vectorize(self, x_data, y_label)
     48         x_data = self._wrap_with_start_end(x_data)
     49         #x_vector = np.zeros(self.max_seq_length).astype(np.int64)
---> 50         x_vector = np.zeros(self.max_seq_length).astype(np.uint8)
     51         x_data_indices = [self.token_vocab[token] for token in x_data]
     52         x_vector[:len(x_data_indices)] = x_data_indices

MemoryError: Unable to allocate 15.7 MiB for an array with shape (16431250,) and data type uint8
```
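If I read the error correctly, `self.max_seq_length` is about 16.4 million, i.e. every example gets padded to the length of the longest string in the dataset, so each vectorized example alone takes ~15.7 MiB and `transform` keeps appending them to `x_matrix`. A rough estimate of what that adds up to (`n_rows` is just a placeholder; I’d plug in my real row count):

```python
import numpy as np

max_seq_length = 16_431_250  # from the shape in the MemoryError above
n_rows = 10_000              # hypothetical row count, replace with the real one

per_example = max_seq_length * np.dtype(np.uint8).itemsize  # ~15.7 MiB
total_gib = per_example * n_rows / 2**30                    # ~153 GiB for 10k rows
print(f"{per_example / 2**20:.1f} MiB per example, ~{total_gib:.0f} GiB total")
```

So even though the failing allocation is only 15.7 MiB, the loop has presumably already eaten almost all available memory by the time it fails.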
Also, the part of the code that generates batches contains the following:
```python
dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                        shuffle=shuffle, drop_last=drop_last, **dataloader_kwargs)

for data_dict in dataloader:
    out_data_dict = {}
    for name, tensor in data_dict.items():
        out_data_dict[name] = data_dict[name].to(device)
    yield out_data_dict
```
It seems to me like suspicious code that could be causing the problem, but I have no idea how to solve it.
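One idea I had is to cap the sequence length in the vectorizer, so long documents are truncated to a fixed window instead of being padded out to the global maximum. A minimal sketch of what I mean (`MAX_LEN` is a hypothetical cap I’d have to tune, and `vectorize_capped` is just an illustration, not my actual method):

```python
import numpy as np

MAX_LEN = 4096  # hypothetical cap; would need tuning for my data

def vectorize_capped(x_data, token_vocab, max_len=MAX_LEN):
    # Truncate before indexing, instead of padding every example
    # to the longest string in the whole dataset.
    indices = [token_vocab[token] for token in x_data[:max_len]]
    # Passing dtype directly also avoids the float64 intermediate
    # that np.zeros(n).astype(np.uint8) creates.
    x_vector = np.zeros(max_len, dtype=np.int64)
    x_vector[:len(indices)] = indices
    return x_vector
```

With `MAX_LEN = 4096` each vector would be ~32 KiB instead of ~15.7 MiB, but I’m not sure how much the truncation would hurt the classifier.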
Does anyone have a suggestion on how to solve this issue?
Thanks a lot