NLP classification with CNN - MemoryError: Unable to allocate memory


I’m working on an NLP classification problem. The dataset (after preprocessing, of course) is very simple: it has two columns, a label with a dozen unique values and a column containing a large string.

I’m testing a CNN approach to this problem.
This part raises a MemoryError:

    dataset = load_classification_dataset(args.target_csv, tokenizer_func=character_tokenizer)

    args.num_embeddings = len(dataset.vectorizer.token_vocab)
    args.num_classes = len(dataset.vectorizer.label_vocab)

    model = CharCNN(num_embeddings=args.num_embeddings, 


    MemoryError                               Traceback (most recent call last)

    <ipython-input-10-493fc7e9d364> in <module>
    ----> 1 dataset = load_classification_dataset(args.target_csv, tokenizer_func=character_tokenizer)
          3 args.num_embeddings = len(dataset.vectorizer.token_vocab)
          4 args.num_classes = len(dataset.vectorizer.label_vocab)

    <ipython-input-6-fb4c2153975e> in load_classification_dataset(dataset_csv, tokenizer_func)
         26                                     vectorizer=vectorizer,
         27                                     target_x_column='tokenized',
    ---> 28                                     target_y_column='label')
         30     return dataset

    <ipython-input-4-3c018962b6ba> in __init__(self, df, vectorizer, target_x_column, target_y_column)
         32             self._vectorized[split_name] = \
         33                 vectorizer.transform(x_data_list=split_df[target_x_column].tolist(), 
    ---> 34                                      y_label_list=split_df[target_y_column].tolist())
         35         self.vectorizer = vectorizer
         36         self.active_split = None

    <ipython-input-3-a912c2226d66> in transform(self, x_data_list, y_label_list)
         67         y_vector = []
         68         for x_data, y_label in zip(x_data_list, y_label_list):
    ---> 69             x_vector, y_index = self.vectorize(x_data, y_label)
         70             x_matrix.append(x_vector)
         71             y_vector.append(y_index)

    <ipython-input-3-a912c2226d66> in vectorize(self, x_data, y_label)
         48         x_data = self._wrap_with_start_end(x_data)
         49         #x_vector = np.zeros(self.max_seq_length).astype(np.int64) 
    ---> 50         x_vector = np.zeros(self.max_seq_length).astype(np.uint8)
         51         x_data_indices = [self.token_vocab[token] for token in x_data]
         52         x_vector[:len(x_data_indices)] = x_data_indices

    MemoryError: Unable to allocate 15.7 MiB for an array with shape (16431250,) and data type uint8
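For context on the numbers: the failing allocation is only ~15.7 MiB, but `vectorize` creates one such vector per sample and `transform` appends them all to `x_matrix`, so the total footprint scales with the row count (and `np.zeros(self.max_seq_length)` first allocates a float64 array, ~8× larger, before the `.astype(np.uint8)` copy). A rough back-of-the-envelope estimate — the row count below is a made-up example, not my actual dataset size:

```python
max_seq_length = 16_431_250   # shape reported in the MemoryError
n_rows = 10_000               # hypothetical number of samples

bytes_per_row = max_seq_length * 1          # np.uint8 is 1 byte per element
total_gib = n_rows * bytes_per_row / 2**30  # total for all rows in x_matrix

print(f"{bytes_per_row / 2**20:.1f} MiB per row, ~{total_gib:.0f} GiB total")
# 15.7 MiB per row, ~153 GiB total
```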

Also, the part of the code that generates batches contains the following:

    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last, **dataloader_kwargs)

    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict

and this seems to me like suspicious code that could cause the problem, but I have no idea how to solve it.
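To check whether this loop could be the culprit, I put together a tiny self-contained version (assumptions: CPU device, and a toy `TensorDataset` standing in for my real dataset) — it only ever moves one batch at a time to the device:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cpu")  # assumption: CPU; the same idea applies with "cuda"

# Toy stand-in for the real dataset: 4 samples with 2 features each.
dataset = TensorDataset(torch.arange(8).view(4, 2).float(), torch.zeros(4))
dataloader = DataLoader(dataset=dataset, batch_size=2, shuffle=False)

for x, y in dataloader:
    # Only the current batch is copied to the device, not the whole dataset.
    batch = {"x": x.to(device), "y": y.to(device)}
    print(batch["x"].shape)  # torch.Size([2, 2]) for each of the two batches
```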

Does anyone have a suggestion on how to solve this issue?
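For completeness, one direction I’m considering: if `max_seq_length` ends up being the length of the single longest document in the corpus, capping the sequence length and truncating longer inputs would shrink each vector drastically. A sketch — `vectorize_capped` and the cap value are hypothetical, not the original class method:

```python
import numpy as np

MAX_LEN = 1014  # hypothetical cap on the sequence length

def vectorize_capped(token_indices, max_len=MAX_LEN):
    """Fixed-length int64 vector; sequences longer than max_len are truncated."""
    # int64 rather than uint8: uint8 silently overflows once the vocab has >255 tokens
    x_vector = np.zeros(max_len, dtype=np.int64)
    clipped = token_indices[:max_len]
    x_vector[:len(clipped)] = clipped
    return x_vector

print(vectorize_capped(list(range(2000))).shape)  # (1014,)
```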

Thanks a lot

It seems the error is raised by NumPy and might be related to this one.

Thank you. Unfortunately, none of these suggestions helped solve the issue.
Is there any other suggestion on how to solve it?

Thanks a lot