I’m training a GPT model on The Pile dataset using a single node with 8 A100 GPUs. The dataset is large, at around 950 GB. I launch distributed training with torchrun because it is easy to use. However, the data loading step appears to run 8 times in parallel, once per process. I suspect this causes heavy disk read contention and, as a result, slow loading.
On a single GPU, the dataset takes about 10 minutes to load. With all 8 GPUs, loading takes over an hour.
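To make the duplication concrete: torchrun starts one worker process per GPU, and every worker runs the same script from the top, so any loading code executes once per process. torchrun also exports RANK and WORLD_SIZE to each worker, which a loader can use to read only its own portion of the data. The following is a minimal sketch of that idea, assuming the dataset is stored as multiple shard files; the helper name `shard_for_rank` and the file names are hypothetical and not part of my code.

```python
import os


def shard_for_rank(files, rank, world_size):
    """Return the subset of `files` that the given rank should read."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    # Sort first so every process sees the same order, then take every
    # world_size-th file starting at this rank's offset.
    return sorted(files)[rank::world_size]


if __name__ == "__main__":
    # torchrun sets RANK and WORLD_SIZE for each worker process;
    # fall back to a single process when they are absent.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # Hypothetical shard names, for illustration only.
    files = [f"pile_{i:02d}.jsonl" for i in range(30)]
    mine = shard_for_rank(files, rank, world_size)
    print(f"rank {rank} of {world_size} reads {len(mine)} files")
```

One caveat with this kind of split: when the file count is not divisible by the number of processes, some ranks receive fewer files, and ranks that run a different number of training steps can stall synchronized training unless that is handled separately.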
Here’s the command I’m using to initiate the training:
torchrun -m --nproc_per_node=8 gpt train
And this is the general structure of the relevant training code:
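from omegaconf import DictConfig
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# GPT and prepare_data_module are project-specific and not shown here.


def train(cfg: DictConfig):
    # Setup data and dataloader
    data_module = prepare_data_module(**cfg.data)
    train_dataloader = data_module.train_dataloader()
    val_dataloader = data_module.val_dataloader()

    # Extract tokenizer from datamodule
    tokenizer = data_module.tokenizer

    # Setup model and optimizer
    model = GPT(**cfg.model)

    # Setup data collator
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

    train_args = TrainingArguments(**cfg.train_args)
    # Note: Trainer normally expects Dataset objects for train_dataset and
    # eval_dataset and builds its own DataLoaders from them.
    trainer = Trainer(
        model=model,
        tokenizer=tokenizer,
        args=train_args,  # the keyword is `args`, not `train_args`
        data_collator=data_collator,
        train_dataset=train_dataloader,
        eval_dataset=val_dataloader,
    )
    trainer.train()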
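In the function above, every one of the 8 processes calls prepare_data_module independently. One commonly used pattern for this situation is to let rank 0 do the expensive preparation first while the other ranks wait, and then have the others load whatever rank 0 cached. The sketch below illustrates that pattern. It assumes that prepare_data_module writes a reusable on-disk cache, which may or may not be true of my implementation, and the name `rank_zero_first` is hypothetical.

```python
import contextlib
import os


def _barrier():
    """Synchronize all ranks if a process group is initialized; otherwise do nothing."""
    try:
        import torch.distributed as dist
    except ImportError:
        return
    if dist.is_available() and dist.is_initialized():
        dist.barrier()


@contextlib.contextmanager
def rank_zero_first():
    """Run the wrapped block on rank 0 before any other rank runs it."""
    # torchrun exports RANK to every worker process.
    rank = int(os.environ.get("RANK", "0"))
    if rank != 0:
        # Non-zero ranks wait here until rank 0 reaches its barrier below.
        _barrier()
    try:
        yield rank
    finally:
        if rank == 0:
            # Rank 0 has finished the block; release the waiting ranks.
            _barrier()
```

Inside train, the preparation call would be wrapped as `with rank_zero_first(): data_module = prepare_data_module(**cfg.data)`. The barrier only has an effect once the process group has been initialized (for example with torch.distributed.init_process_group); without that, the ranks are not coordinated. Hugging Face's TrainingArguments also provides a main_process_first context manager built on the same idea. Note that this only avoids repeating the preparation work; all 8 processes still read the cached result afterwards.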
I am seeking insights from the community on potential optimizations to resolve this data loading bottleneck. Are there any recommendations or best practices for efficiently handling data loading and distribution when training large-scale models like GPT with PyTorch?
Looking forward to your thoughts and suggestions!