PyTorch GPU utilization is zero on all but one GPU

Hi,

I’m working with the Hugging Face Transformers library to run the PEGASUS model. The Trainer class appears to handle multi-GPU training automatically when the GPU devices are made visible (i.e., via the CUDA_VISIBLE_DEVICES flag). However, when I run on multiple GPUs, only one of them sits at nearly 100% utilization while the others stay at literally zero. Below are the command and the nvidia-smi output, followed by a sketch of the behavior I expected. I’m using PyTorch 1.6.0 with CUDA 10.

CUDA_VISIBLE_DEVICES=0,1,2 python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path google/pegasus-reddit_tifu \
    --do_predict \
    --train_file $DS_BASE_DIR/train.json \
    --validation_file $DS_BASE_DIR/validation.json \
    --test_file $DS_BASE_DIR/test.json \
    --output_dir /home/code-base/user_space/saved_models/pegasus/ \
    --per_device_train_batch_size=2 \
    --per_device_eval_batch_size=2 \
    --overwrite_output_dir \
    --predict_with_generate \
    --text_column text \
    --summary_column summary \
    --num_beams 5
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   76C    P0   290W / 300W |  16082MiB / 16160MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   43C    P0    72W / 300W |   4060MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   43C    P0    72W / 300W |   4044MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
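
For reference, here is roughly the multi-GPU wrapping I expected the Trainer to perform. This is a minimal sketch based on my reading of the transformers source (the exact code differs between versions), using a stand-in nn.Linear instead of PEGASUS:

import torch
from torch import nn

# Stand-in for the PEGASUS model; the Trainer wraps whatever model it is given.
model = nn.Linear(16, 4)

# The Trainer derives its n_gpu from the devices visible to the process.
n_gpu = torch.cuda.device_count()
print(f"visible GPUs: {n_gpu}")

if n_gpu > 1:
    # DataParallel replicates the model across all visible GPUs and splits
    # each batch among them, which is why per_device_train_batch_size is
    # effectively multiplied by n_gpu.
    model = nn.DataParallel(model)

if torch.cuda.is_available():
    model = model.to("cuda:0")

Given that, I expected all three GPUs to show activity during the run.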

Any hints or advice?

Could you link to the Trainer code that would implement this logic?
Based on the nvidia-smi output, it doesn’t seem to be the case, or it isn’t working as intended.
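
In the meantime, a quick sanity check with standard PyTorch calls would confirm how many devices the process actually sees. Run this with the same CUDA_VISIBLE_DEVICES=0,1,2 setting as the command above:

import torch

print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # expect 3 if all three GPUs are visible
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))

If this prints three devices, the GPUs are visible to the process, and the question becomes whether the prediction path actually distributes work across them.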