I am trying to understand how DataLoader
works with multiple GPUs and multiple workers. This question has four parts.
- The first part is asked here: "Same seed across different gpus in multiple workers - Intermediate - Hugging Face Forums". (At the time of writing, that thread has no answer either; I would appreciate an answer to it as well.)
- To investigate how the data is loaded during training, I dumped the samples generated by each `rank` and `worker` combination (a sketch of the dump script follows this list). It seems that most of the work of loading the data is done by the workers of `rank_0`. (In the figure below, the `rank_0` workers have the largest dumps, 18M, while the others are only about 13K.) What could be the reason for this?
- Evaluation steps result in `nan`. Investigating the `nan` suggests that the `custom_loss` function used with transformers' `Trainer` API receives a blank input, even though there are more samples left for evaluation.
- Multi-GPU evaluation and single-GPU evaluation use a different number of samples. Further, in multi-GPU evaluation, every GPU reports its own independent metrics. I am using transformers' `Trainer.evaluate` API with `torchrun --nproc_per_node 2 eval.py` (a minimal per-rank sample-count check is sketched at the end of this post).
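To make parts 1 and 2 concrete, this is a minimal sketch of the kind of per-rank/per-worker dump I am describing; `CountingDataset` and the log-file names are placeholders, not my actual training pipeline:

import os
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class CountingDataset(IterableDataset):
    """Placeholder dataset that yields integers in place of my real samples."""
    def __iter__(self):
        # __iter__ runs inside the DataLoader worker process, so the rank/worker
        # pair that actually loads each sample can be recorded here (part 2).
        rank = os.environ.get("RANK", "0")  # set by the deepspeed/torchrun launcher
        info = get_worker_info()
        worker_id = info.id if info is not None else 0
        with open(f"dump_rank{rank}_worker{worker_id}.log", "a") as f:
            for i in range(10_000):
                f.write(f"{i}\n")
                yield i

def log_worker_seed(worker_id):
    # Also runs inside each worker process; records the seed per rank/worker
    # so the seeds can be compared across GPUs (part 1).
    rank = os.environ.get("RANK", "0")
    with open(f"seed_rank{rank}_worker{worker_id}.log", "a") as f:
        f.write(f"seed={get_worker_info().seed}\n")

if __name__ == "__main__":
    loader = DataLoader(
        CountingDataset(),
        batch_size=64,
        num_workers=4,
        worker_init_fn=log_worker_seed,
    )
    for _ in loader:
        pass  # just drive the iteration so the workers write their dumps

Launched with the same launcher as training (e.g. deepspeed --num_gpus=4 or torchrun --nproc_per_node 4), this produces one dump_rank*_worker*.log file per rank/worker pair; the sizes of those files are what the 18M vs. 13K comparison above refers to.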
Training setup (for part 2 of the question):
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

train_data_loader: IterableDataset = ...  # streaming training dataset

mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=mlm_probability
)

# DeepSpeed config passed to TrainingArguments (ZeRO disabled, fp16 off)
deepspeed_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 0
    }
}

training_arguments: TrainingArguments = TrainingArguments(  # from HF transformers
    output_dir=checkpoint_dir,
    logging_dir=tensorboard_dir,
    evaluation_strategy=evaluation_strategy,
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    max_steps=max_steps,
    dataloader_pin_memory=True,
    dataloader_num_workers=32,
    per_device_train_batch_size=64,
    accelerator_config={"split_batches": True},
    gradient_accumulation_steps=32,
    dataloader_prefetch_factor=5,
    deepspeed=deepspeed_config,
    fp16=False,
    disable_tqdm=True
)

trainer: Trainer = Trainer(...)  # from HF transformers
Run command:
deepspeed --master_port=29600 --num_gpus=4 train.py
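For parts 3 and 4, evaluation is launched separately with torchrun --nproc_per_node 2 eval.py and goes through Trainer.evaluate. A minimal, self-contained way to reproduce the per-rank sample-count comparison from part 4 would be something like the following sketch (again with a placeholder dataset rather than my actual evaluation data):

import os
from torch.utils.data import DataLoader, IterableDataset

class CountingDataset(IterableDataset):
    """Placeholder standing in for the real evaluation dataset."""
    def __iter__(self):
        return iter(range(1_000))

if __name__ == "__main__":
    rank = os.environ.get("RANK", "0")  # set by torchrun
    loader = DataLoader(CountingDataset(), batch_size=64, num_workers=2)
    n_samples = sum(len(batch) for batch in loader)
    # Compare this number between a plain single-process run and a
    # torchrun --nproc_per_node 2 launch to see how many samples each rank sees.
    print(f"rank={rank} saw {n_samples} evaluation samples")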