I only have 25 GB of RAM, and every time I try to run the code below my Google Colab session crashes. Any idea how to prevent this from happening? Would encoding batch-wise work? If so, what would that look like?
max_q_len = 128
max_a_len = 64

def batch_encode(text, max_seq_len):
    return tokenizer.batch_encode_plus(
        text.tolist(),
        max_length=max_seq_len,
        padding='max_length',  # pad_to_max_length is deprecated
        truncation=True,
        return_token_type_ids=False
    )
# tokenize and encode sequences in the training set
tokensq_train = batch_encode(train_q, max_q_len)
tokens1_train = batch_encode(train_a1, max_a_len)
tokens2_train = batch_encode(train_a2, max_a_len)
My tokenizer is from Hugging Face:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-multilingual-uncased')
len(train_q) is 5023194 (the same for train_a1 and train_a2), so I am encoding roughly 5 million sequences per call.
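One way the batch-wise idea could look: instead of handing all ~5M strings to the tokenizer at once, encode a chunk at a time and merge the results, so only one chunk's worth of intermediate Python objects is alive at any moment. Below is a minimal, hedged sketch; `encode_in_chunks` and `fake_encode` are hypothetical names I introduce here (not part of transformers), and with the real tokenizer you would pass something like `lambda batch: tokenizer.batch_encode_plus(batch, max_length=..., padding='max_length', truncation=True, return_token_type_ids=False)` as `encode_fn`. Even then, 5M padded sequences held as Python lists are large, so converting each chunk to a compact NumPy array or writing it to disk may still be necessary.

```python
def encode_in_chunks(texts, encode_fn, chunk_size=10_000):
    """Encode `texts` chunk by chunk to keep peak memory low.

    `encode_fn` is assumed to take a list of strings and return a
    dict of lists (e.g. {'input_ids': [...], 'attention_mask': [...]}),
    the shape that tokenizer.batch_encode_plus returns.
    """
    merged = {}
    for start in range(0, len(texts), chunk_size):
        chunk = encode_fn(texts[start:start + chunk_size])
        # Append this chunk's encodings onto the merged result
        for key, values in chunk.items():
            merged.setdefault(key, []).extend(values)
    return merged

# Stub encoder standing in for the real tokenizer (hypothetical),
# just so the pattern is runnable without downloading a model:
def fake_encode(batch):
    return {"input_ids": [[len(t)] for t in batch]}

out = encode_in_chunks(["a", "bb", "ccc"], fake_encode, chunk_size=2)
# out["input_ids"] == [[1], [2], [3]]
```

The chunk size trades memory for overhead: larger chunks amortize the per-call cost of the fast tokenizer, smaller chunks keep the peak footprint down.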