Can't utilise more than one GPU with transformers model

Hi.
I’m trying to use the MarianMT models from the Hugging Face transformers library for back-translation. I’m running on a machine with 8 GPUs, but nvidia-smi shows utilization on only one of them even though I wrapped the models in nn.DataParallel(). Can anyone help me with this issue?

import numpy as np
import torch
import torch.nn as nn
from transformers import MarianMTModel, MarianTokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
target_langs = ['fr', 'wa', 'frp', 'oc', 'ca', 'rm', 'lld', 'fur', 'lij', 'lmo', 'es', 'pt',
                'gl', 'lad', 'an', 'mwl', 'it', 'co', 'nap', 'scn', 'vec', 'sc', 'ro', 'la']

def translate(texts, model, tokenizer, language="fr"):
    with torch.no_grad():
        # Prepend the target-language token expected by the multilingual Marian models.
        template = lambda text: f"{text}" if language == "en" else f">>{language}<< {text}"
        src_texts = [template(text) for text in texts]
        encoded = tokenizer.prepare_seq2seq_batch(src_texts,
                                                  truncation=True,
                                                  max_length=300,
                                                  return_tensors="pt").to(device)
        # generate() is not exposed by the DataParallel wrapper, so call it on .module.
        translated = model.module.generate(**encoded)
        translated_texts = tokenizer.batch_decode(translated, skip_special_tokens=True)
        return translated_texts


def back_translate(texts, source_lang="en", target_lang="fr"):
    # Translate from source to target language
    fr_texts = translate(texts, target_model, target_tokenizer, 
                         language=target_lang)

    # Translate from target language back to source language
    back_translated_texts = translate(fr_texts, en_model, en_tokenizer, 
                                      language=source_lang)
    
    return back_translated_texts



target_model_name = 'Helsinki-NLP/opus-mt-en-de'
target_tokenizer = MarianTokenizer.from_pretrained(target_model_name)
target_model = MarianMTModel.from_pretrained(target_model_name)

en_model_name = 'Helsinki-NLP/opus-mt-de-en'
en_tokenizer = MarianTokenizer.from_pretrained(en_model_name)
en_model = MarianMTModel.from_pretrained(en_model_name)

target_model = nn.DataParallel(target_model)
target_model = target_model.to(device)  # same performance if I add .half()
target_model.eval()

en_model = nn.DataParallel(en_model)
en_model = en_model.to(device)  # same performance if I add .half()
en_model.eval()

for i, (x1, x2, label) in enumerate(loader):
    with torch.no_grad():
        # x1 and x2 are batches of strings.
        bk_x1 = back_translate(x1, source_lang="en", target_lang=np.random.choice(target_langs))
        bk_x2 = back_translate(x2, source_lang="en", target_lang=np.random.choice(target_langs))




Here is the nvidia-smi output. Utilization is low because of the small batch size (16), but if I increase the batch size I get a CUDA out-of-memory error. I can also see that only one GPU is doing any work, so the Marian model may not be getting parallelized correctly, which would explain both the out-of-memory error and the slow performance. If so, what would be the solution?

    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  GeForce GTX 108...  Off  | 00000000:1B:00.0 Off |                  N/A |
    | 42%   78C    P2   199W / 250W |   9777MiB / 11178MiB |     91%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  GeForce GTX 108...  Off  | 00000000:1C:00.0 Off |                  N/A |
    | 29%   36C    P8    10W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  GeForce GTX 108...  Off  | 00000000:1D:00.0 Off |                  N/A |
    | 31%   36C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  GeForce GTX 108...  Off  | 00000000:1E:00.0 Off |                  N/A |
    | 35%   41C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   4  GeForce GTX 108...  Off  | 00000000:3D:00.0 Off |                  N/A |
    | 29%   34C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   5  GeForce GTX 108...  Off  | 00000000:3F:00.0 Off |                  N/A |
    | 30%   31C    P8     8W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   6  GeForce GTX 108...  Off  | 00000000:40:00.0 Off |                  N/A |
    | 31%   38C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   7  GeForce GTX 108...  Off  | 00000000:41:00.0 Off |                  N/A |
    | 30%   37C    P8     9W / 250W |      2MiB / 11178MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A     58780      C   python                          10407MiB |
    |    1   N/A  N/A     58780      C   python                              0MiB |
    |    2   N/A  N/A     58780      C   python                              0MiB |
    |    3   N/A  N/A     58780      C   python                              0MiB |
    |    4   N/A  N/A     58780      C   python                              0MiB |
    |    5   N/A  N/A     58780      C   python                              0MiB |
    |    6   N/A  N/A     58780      C   python                              0MiB |
    |    7   N/A  N/A     58780      C   python                              0MiB |
    +-----------------------------------------------------------------------------+






FYI: I’m using

PyTorch 1.7.0
transformers 4.0.1
CUDA 10.1

@ptrblck, I wonder if you have any idea regarding this issue.

Are you seeing no utilization on the other GPUs at all, or just low utilization?
In the latter case, note that nn.DataParallel has some drawbacks, such as higher memory usage on the default device and worse performance than DistributedDataParallel, which is why we recommend using DDP.
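For reference, here is a minimal sketch of what the multi-GPU setup could look like without nn.DataParallel: one process per GPU, each holding its own model replica and translating a shard of the data. The launcher command, the shard splitting, and the placeholder texts list are illustrative assumptions, not part of the original code; since generate() involves no backward pass, a DDP wrapper is not strictly needed for inference-only back-translation.

# Sketch only: one process per GPU, launched e.g. with
#   torchrun --nproc_per_node=8 backtranslate_mp.py
# (or python -m torch.distributed.launch --use_env on older PyTorch),
# which sets the RANK / WORLD_SIZE / LOCAL_RANK env vars for each process.
import os
import torch
import torch.distributed as dist
from transformers import MarianMTModel, MarianTokenizer

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    rank, world_size = dist.get_rank(), dist.get_world_size()
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    # Each process loads its own full copy of the model onto its own GPU.
    # generate() does no backward pass, so a DDP wrapper adds nothing here.
    model_name = "Helsinki-NLP/opus-mt-en-de"  # same model as in the question
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device).eval()

    texts = ["this is a placeholder sentence"] * 128  # replace with real data
    shard = texts[rank::world_size]  # this rank's slice of the corpus

    with torch.no_grad():
        batch = tokenizer(shard, truncation=True, max_length=300,
                          padding=True, return_tensors="pt").to(device)
        out = model.generate(**batch)
        translations = tokenizer.batch_decode(out, skip_special_tokens=True)
    # Collect results per rank, e.g. write to a rank-specific file or all_gather.

if __name__ == "__main__":
    main()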

Thanks for your reply. According to what nvidia-smi shows, the other GPUs have zero utilization. I tried DistributedDataParallel and the utilization problem is solved, but there seem to be problems with 1D batch normalization in distributed training (I described mine in 106969, and I see that others have the same problem).
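As a side note on the batch-norm issue: if the BatchNorm1d layers live in your own downstream model (the Marian translation models themselves don't use batch norm), one commonly suggested option is converting them to SyncBatchNorm before wrapping with DDP, so that the running statistics are synchronized across processes. A minimal sketch, using a made-up model for illustration and assuming the process group has already been initialized as in the setup above:

import os
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical model with BatchNorm1d layers, standing in for whatever model
# hits the issue; assumes dist.init_process_group() has already been called.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
my_model = nn.Sequential(nn.Linear(128, 64), nn.BatchNorm1d(64), nn.ReLU())

# Replace every BatchNorm*d with SyncBatchNorm, then wrap with DDP as usual.
my_model = nn.SyncBatchNorm.convert_sync_batchnorm(my_model)
my_model = my_model.to(local_rank)
my_model = DDP(my_model, device_ids=[local_rank])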