Hi,
I’m working on an NLP classification problem. The dataset (after preprocessing, of course) is very simple: it has two columns, a label with a dozen unique values and a column containing a large string.
I’m testing a CNN approach to this problem.
This part raises a MemoryError:
```python
dataset = load_classification_dataset(args.target_csv, tokenizer_func=character_tokenizer)

args.num_embeddings = len(dataset.vectorizer.token_vocab)
args.num_classes = len(dataset.vectorizer.label_vocab)

model = CharCNN(num_embeddings=args.num_embeddings,
                embedding_size=args.embedding_size,
                channel_size=args.channel_size,
                num_classes=args.num_classes)
```
```
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-10-493fc7e9d364> in <module>
----> 1 dataset = load_classification_dataset(args.target_csv, tokenizer_func=character_tokenizer)
      2
      3 args.num_embeddings = len(dataset.vectorizer.token_vocab)
      4 args.num_classes = len(dataset.vectorizer.label_vocab)
      5

<ipython-input-6-fb4c2153975e> in load_classification_dataset(dataset_csv, tokenizer_func)
     26                 vectorizer=vectorizer,
     27                 target_x_column='tokenized',
---> 28                 target_y_column='label')
     29
     30     return dataset

<ipython-input-4-3c018962b6ba> in __init__(self, df, vectorizer, target_x_column, target_y_column)
     32             self._vectorized[split_name] = \
     33                 vectorizer.transform(x_data_list=split_df[target_x_column].tolist(),
---> 34                                      y_label_list=split_df[target_y_column].tolist())
     35         self.vectorizer = vectorizer
     36         self.active_split = None

<ipython-input-3-a912c2226d66> in transform(self, x_data_list, y_label_list)
     67         y_vector = []
     68         for x_data, y_label in zip(x_data_list, y_label_list):
---> 69             x_vector, y_index = self.vectorize(x_data, y_label)
     70             x_matrix.append(x_vector)
     71             y_vector.append(y_index)

<ipython-input-3-a912c2226d66> in vectorize(self, x_data, y_label)
     48         x_data = self._wrap_with_start_end(x_data)
     49         #x_vector = np.zeros(self.max_seq_length).astype(np.int64)
---> 50         x_vector = np.zeros(self.max_seq_length).astype(np.uint8)
     51         x_data_indices = [self.token_vocab[token] for token in x_data]
     52         x_vector[:len(x_data_indices)] = x_data_indices

MemoryError: Unable to allocate 15.7 MiB for an array with shape (16431250,) and data type uint8
```
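If I read the error correctly, `self.max_seq_length` is about 16.4 million, i.e. every example gets padded to the length of the longest string in the dataset, so each vectorized example alone takes ~15.7 MiB and `transform` keeps appending them to `x_matrix`. A rough estimate of what that adds up to (`n_rows` is just a placeholder; I’d plug in my real row count):

```python
import numpy as np

max_seq_length = 16_431_250  # from the shape in the MemoryError above
n_rows = 10_000              # hypothetical row count, replace with the real one

per_example = max_seq_length * np.dtype(np.uint8).itemsize  # ~15.7 MiB
total_gib = per_example * n_rows / 2**30                    # ~153 GiB for 10k rows
print(f"{per_example / 2**20:.1f} MiB per example, ~{total_gib:.0f} GiB total")
```

So even though the failing allocation is only 15.7 MiB, the loop has presumably already eaten almost all available memory by the time it fails.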
Also, the part of the code that generates batches contains the following:
```python
dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                        shuffle=shuffle, drop_last=drop_last, **dataloader_kwargs)

for data_dict in dataloader:
    out_data_dict = {}
    for name, tensor in data_dict.items():
        out_data_dict[name] = data_dict[name].to(device)
    yield out_data_dict
```
It seems to me like suspicious code that could be causing the problem, but I have no idea how to solve it.
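One idea I had is to cap the sequence length in the vectorizer, so long documents are truncated to a fixed window instead of being padded out to the global maximum. A minimal sketch of what I mean (`MAX_LEN` is a hypothetical cap I’d have to tune, and `vectorize_capped` is just an illustration, not my actual method):

```python
import numpy as np

MAX_LEN = 4096  # hypothetical cap; would need tuning for my data

def vectorize_capped(x_data, token_vocab, max_len=MAX_LEN):
    # Truncate before indexing, instead of padding every example
    # to the longest string in the whole dataset.
    indices = [token_vocab[token] for token in x_data[:max_len]]
    # Passing dtype directly also avoids the float64 intermediate
    # that np.zeros(n).astype(np.uint8) creates.
    x_vector = np.zeros(max_len, dtype=np.int64)
    x_vector[:len(indices)] = indices
    return x_vector
```

With `MAX_LEN = 4096` each vector would be ~32 KiB instead of ~15.7 MiB, but I’m not sure how much the truncation would hurt the classifier.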
Does anyone have a suggestion on how to solve this issue?
Thanks a lot