Why does rank 0 do most of the data-loading work? Why NaN in evaluation?

I am trying to understand how DataLoader works with multiple GPUs and multiple workers. This question has four parts.

  1. The first part is asked here: Same seed across different gpus in multiple workers - Intermediate - Hugging Face Forums. (At the time of writing there was no answer to that post either; I would appreciate an answer to it as well.)
  2. To investigate how the data is loaded during training, I dumped the samples generated by each rank and worker combination (roughly as in the sketch after this list). It seems that most of the data-loading work is done by the workers of rank_0: in the figure below, the rank_0 workers have the largest dumps (18M), while the dumps of the other ranks are only 13K. What could be the reason for this?

  3. Evaluation steps result in NaN. Investigating the NaN suggests that the custom loss function used with transformers' Trainer API receives a blank input, even though there are more samples left to evaluate.

  4. Multi-GPU evaluation and single-GPU evaluation use a different number of samples. Furthermore, in multi-GPU evaluation each GPU reports its own, independent metrics. I am using transformers' Trainer.evaluate API with torchrun --nproc_per_node 2 eval.py (see the sketch at the end of this post).

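A minimal sketch of how such a per-rank/per-worker dump can be produced (the wrapper class, file names, and source dataset below are illustrative, not the exact code I used):

import os

from torch.utils.data import IterableDataset, get_worker_info

class DumpingIterableDataset(IterableDataset):
    # Wraps a source iterable and appends every sample it yields to a
    # per-(rank, worker) file, so the file sizes show how the work is split.
    def __init__(self, source):
        self.source = source

    def __iter__(self):
        rank = int(os.environ.get("RANK", 0))           # set by deepspeed/torchrun
        info = get_worker_info()
        worker_id = info.id if info is not None else 0  # None when num_workers == 0
        with open(f"dump_rank{rank}_worker{worker_id}.txt", "a") as f:
            for sample in self.source:
                f.write(f"{sample}\n")
                yield sample
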
Training setup (for part 2 of the question):

from torch.utils.data import IterableDataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

train_data_loader: IterableDataset = ...  # custom iterable dataset
mlm_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm_probability=mlm_probability,
)

deepspeed_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "fp16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 0
    }
}

training_arguments: TrainingArguments = TrainingArguments(  # from HF Transformers
    output_dir=checkpoint_dir,
    logging_dir=tensorboard_dir,
    evaluation_strategy=evaluation_strategy,
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    max_steps=max_steps,
    dataloader_pin_memory=True,
    dataloader_num_workers=32,
    per_device_train_batch_size=64,
    accelerator_config={"split_batches": True},
    gradient_accumulation_steps=32,
    dataloader_prefetch_factor=5,
    deepspeed=deepspeed_config,
    fp16=False,
    disable_tqdm=True
)

trainer: Trainer = Trainer(...)  # from HF Transformers

Run command:

deepspeed --master_port=29600 --num_gpus=4 train.py
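
Evaluation setup (for parts 3 and 4 of the question), a minimal sketch of eval.py: the custom-loss wiring is simplified and the helper names are illustrative; the pieces actually in use are Trainer.evaluate and the torchrun command below.

from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Part 3: during evaluation this is sometimes called with a blank
        # `inputs`, and the reported eval loss comes out as nan.
        outputs = model(**inputs)
        loss = outputs.loss
        return (loss, outputs) if return_outputs else loss

model = ...          # MLM model restored from checkpoint_dir
eval_dataset = ...   # same IterableDataset pipeline as in training

trainer = CustomTrainer(
    model=model,
    args=training_arguments,
    eval_dataset=eval_dataset,
    data_collator=mlm_collator,
)
metrics = trainer.evaluate()   # with 2 processes, each rank prints its own metrics (part 4)
print(metrics)

Evaluation run command:

torchrun --nproc_per_node 2 eval.py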