Pretraining BERT using torchrun on 3 GPUs

I am trying to run mlm_pretrain.py with a Slurm script and torchrun on a 3-GPU server. Since the Hugging Face Trainer does DDP by default, I have not changed my code. The issue is that with 3 GPUs the training time is about 3 times longer than pretraining on a single GPU, and the warning below is thrown for each device ID.

[rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance.

Slurm script:
#SBATCH --gres=gpu:3
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --qos=normal
torchrun --nproc-per-node 3 train_continual_with_loss.py
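For reference, a minimal version of the batch script would look roughly like the following (the job name, task count, and environment path are placeholders, not my exact script):

#!/bin/bash
#SBATCH --job-name=mlm_pretrain
#SBATCH --ntasks=1
#SBATCH --gres=gpu:3
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --qos=normal

# Placeholder: activate whatever environment holds torch/transformers
source ~/envs/bert/bin/activate

# A single Slurm task; torchrun itself spawns one worker process per GPU
torchrun --nproc-per-node 3 train_continual_with_loss.py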

Detailed log:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/307 [00:00<?, ?it/s][rank0]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank2]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W reducer.cpp:1389] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
11%|█▏ | 35/307 [43:31<5:43:06, 75.69s/it

Did you try to disable find_unused_parameters by setting it to False or by dropping it? If so, how did the performance improve?

I have not changed any lines for the distributed run in the MLM code, since the HF Trainer handles that when launched with torchrun. After this warning appeared, I tried wrapping the model in DDP myself with find_unused_parameters=False, but it threw an index error, probably because I would also have to handle the dataset sharding for DDP and make other code changes.
What is the ideal way to set find_unused_parameters=False when using torchrun with the HF Trainer, with minimal code changes?

I’m not familiar enough with HF’s Trainer class and don’t know where these arguments are set.

Thank you @ptrblck. I have figured it out. I was going about it the wrong way by wrapping the model in DDP myself; all I had to do was add this argument to the HF TrainingArguments, which has significantly reduced the training time on the GPUs:

training_args = TrainingArguments(
    output_dir=output_dir,
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    do_eval=True,
    eval_strategy="epoch",
    gradient_accumulation_steps=4,
    prediction_loss_only=True,
    ddp_find_unused_parameters=False,
)
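The rest of the script stays a standard Trainer MLM setup; a minimal sketch is below (the model, tokenizer, and dataset names are placeholders, not the exact code from mlm_pretrain.py):

from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Dynamic masking for MLM; mlm_probability=0.15 is the BERT default
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,                  # plain model, no manual DDP wrapping
    args=training_args,           # the TrainingArguments defined above
    data_collator=data_collator,
    train_dataset=train_dataset,  # placeholder: your tokenized datasets
    eval_dataset=eval_dataset,
)
trainer.train()

When this is launched with torchrun --nproc-per-node 3, the Trainer wraps the model in DistributedDataParallel internally and passes ddp_find_unused_parameters through as find_unused_parameters, so the extra autograd-graph traversal from the warning goes away.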
