Training runs with no end when using the Trainer pipeline from transformers

Hello,

I have been exploring the transformers pipelines with a dataset of tweets.
After loading the data and splitting it with train_test_split from scikit-learn, I prepared everything as in the code below:
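
For context, the loading and splitting step presumably looked something like this; it is only a sketch, and the CSV file name and the `text`/`label` column names are assumptions, not something shown in the original post:

import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical input file and column names (the real ones were not shown)
df = pd.read_csv("tweets.csv")

# map each label name to an integer id (label_dict is reused further down)
label_dict = {label: i for i, label in enumerate(sorted(df["label"].unique()))}

texts = df["text"].tolist()
labels = [label_dict[l] for l in df["label"]]

# split into training and validation sets
train_texts, valid_texts, train_labels, valid_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)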

# load the tokenizer
model_name = "camembert-base"
tokenizer = CamembertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

# tokenize the dataset, truncate when longer than `max_length = 512`,
# and pad with 0's when shorter than `max_length`
train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=512)

# load the model and move it to the device (all packages were imported earlier)
model = CamembertForSequenceClassification.from_pretrained(model_name, num_labels=len(label_dict)).to(device)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=4,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=4,    # batch size for evaluation
    #warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training 
    logging_steps=100,               # log & save weights each logging_steps
    evaluation_strategy="steps",     # evaluate each `logging_steps`
    learning_rate=2e-5
)

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=valid_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # function that computes the metrics of interest
)
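
For completeness, `train_dataset` and `valid_dataset` were presumably built with a small `torch.utils.data.Dataset` wrapper along these lines; the class name `TweetDataset` and the `train_labels`/`valid_labels` variables are assumptions, but the `__getitem__` line matches the one shown in the traceback below:

import torch

class TweetDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # build one sample from the tokenized encodings
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = TweetDataset(train_encodings, train_labels)
valid_dataset = TweetDataset(valid_encodings, valid_labels)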

At this point I have no errors at all; I checked every tensor and every output, and everything looks fine.
But when I come to the final step, training the model, it keeps running without ever finishing

# train the model
trainer.train()

and displays this…

***** Running training *****
  Num examples = 61601
  Num Epochs = 4
  Instantaneous batch size per device = 10
  Total train batch size (w. parallel, distributed & accumulation) = 10
  Gradient Accumulation steps = 1
  Total optimization steps = 24644
Exception in thread Thread-15:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch_xla/distributed/parallel_loader.py", line 139, in _loader_worker
    _, data = next(data_iter)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "<ipython-input-15-37abd243a789>", line 7, in __getitem__
    item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
  File "<ipython-input-15-37abd243a789>", line 7, in <dictcomp>
    item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
IndexError: list index out of range

I couldn’t interpret the exception or the error.
Can anyone help me through this, please?

The error points towards an invalid indexing operation in:

item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}

which fails in the data loading step.
I would recommend iterating over the Dataset and DataLoader for a full epoch and making sure all samples can be loaded properly.
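
One quick sanity check in that direction (a sketch, assuming the labels are stored in `train_labels`/`valid_labels` as in the wrapper sketched above) is to compare the number of tokenized examples with the number of labels, since this particular `IndexError` is typically caused by a length mismatch between the two:

# __len__ of the Dataset is based on the labels, so if there are more labels
# than tokenized examples, v[idx] will eventually run out of range
print(len(train_encodings["input_ids"]), len(train_labels))
print(len(valid_encodings["input_ids"]), len(valid_labels))

assert len(train_encodings["input_ids"]) == len(train_labels)
assert len(valid_encodings["input_ids"]) == len(valid_labels)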

Thank you for the update, but I think there is more to it than that: I found that the tokenization step alone takes more than 2 hours for a dataset of 60K tweets, which is far too slow.

train_encodings = tokenizer(train_texts, truncation=True, padding=True, max_length=512)
valid_encodings = tokenizer(valid_texts, truncation=True, padding=True, max_length=512)

Can you tell me if there is a way to tokenize the data in less time, please?
All the methods I found that work with TrainingArguments and the Trainer use a simple tokenizer on a small dataset and work well, but with my data it takes more than 2 hours :cry:
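
For reference, a single batched call to a fast tokenizer is normally quick even for tens of thousands of short texts, so it may be worth timing that call in isolation; this is just a sketch, and the explicit `list(...)` conversion is only a precaution in case `train_texts` is a pandas Series rather than a plain list:

import time

start = time.time()
# one batched call on a plain list of strings, same arguments as before
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=512)
print(f"tokenized {len(train_encodings['input_ids'])} texts in {time.time() - start:.1f}s")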


EDIT: It would be VERY helpful if you could tell me more about your proposed solution; I didn’t get what you mean by

I would recommend iterating over the Dataset and DataLoader for a full epoch and making sure all samples can be loaded properly.

I don’t know if there are faster tokenizers, but I would check what other repositories are using (e.g. HuggingFace).

Something like this:

# iterate over the Dataset directly (each item is a dict of tensors)
for idx in range(len(dataset)):
    item = dataset[idx]
    print({k: v.shape for k, v in item.items()})

# iterate over the DataLoader (each batch is a dict of batched tensors)
for batch in loader:
    print({k: v.shape for k, v in batch.items()})

This would iterate over the Dataset and the DataLoader for an entire epoch and could be used to isolate the indexing issue.


Hello,

I don’t know if there are faster tokenizers, but I would check what other repositories are using (e.g. HuggingFace).

Thank you for your reply. I appreciate your efforts :slight_smile:

This would iterate over the Dataset and the DataLoader for an entire epoch and could be used to isolate the indexing issue.

It’s still running very slowly, which is not very practical, so I am going to change my strategy!
Thanks anyway!

Iterating over the Dataset and/or the DataLoader is only a debugging step to isolate the failure; you wouldn’t need it in your actual training script.


Yes yes, I figured that out :wink:
I changed my whole script and got more problems now, haha.
Thanks anyway :wink: