Hi,
I’m working with the Hugging Face Transformers library to run a Pegasus model. It appears that the Trainer class handles multi-GPU training automatically once the GPU devices are made visible (e.g., via the CUDA_VISIBLE_DEVICES variable). However, when I train on multiple GPUs, only one GPU reaches near-100% utilization while the others sit at literally zero. The command and the nvidia-smi output are below. I’m using PyTorch 1.6.0 with CUDA 10.
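As a sanity check before handing off to Trainer, you can confirm how many devices the process is actually allowed to see. This is a minimal sketch (the helper name `visible_gpu_count` is my own, not from Transformers); it mirrors what `torch.cuda.device_count()` would report under a `CUDA_VISIBLE_DEVICES` restriction, using only the standard library:

```python
import os

def visible_gpu_count(env=os.environ):
    """Count GPUs exposed via CUDA_VISIBLE_DEVICES.

    An unset variable means no restriction (return None to signal
    'all physical GPUs visible'); an empty string means no GPUs.
    """
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None  # unrestricted: every physical GPU is visible
    devices = [d for d in value.split(",") if d.strip()]
    return len(devices)

# With the command below (CUDA_VISIBLE_DEVICES=0,1,2) this reports 3,
# so all three GPUs should be visible to the script.
print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1,2"}))
```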
CUDA_VISIBLE_DEVICES=0,1,2 python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google/pegasus-reddit_tifu \
--do_predict \
--train_file $DS_BASE_DIR/train.json \
--validation_file $DS_BASE_DIR/validation.json \
--test_file $DS_BASE_DIR/test.json \
--output_dir /home/code-base/user_space/saved_models/pegasus/ \
--per_device_train_batch_size=2 \
--per_device_eval_batch_size=2 \
--overwrite_output_dir \
--predict_with_generate \
--text_column text \
--summary_column summary \
--num_beams 5
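For context: when a multi-GPU run is started with plain `python` as above, Trainer falls back to `torch.nn.DataParallel`. To get one process per GPU (DistributedDataParallel), the script would instead be launched through PyTorch's distributed launcher. This is a sketch of that alternative invocation, not something I have run here; the remaining script arguments are elided and stay as in the command above:

```shell
# One worker process per GPU; Trainer picks up --local_rank and uses DDP
CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch \
  --nproc_per_node 3 \
  examples/pytorch/summarization/run_summarization.py \
  --model_name_or_path google/pegasus-reddit_tifu \
  ...
```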
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   76C    P0   290W / 300W |  16082MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   43C    P0    72W / 300W |   4060MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   43C    P0    72W / 300W |   4044MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Any hints or advice?