Torch not able to utilize GPU RAM properly

I am training an ALBERT language model using Hugging Face Transformers. While training on my p3dn instance, I notice that GPU 0's memory is almost completely used, but the other GPUs have around 50% of their RAM unused. The largest batch size I can fit on this system is 85; anything above that causes an OOM error.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:16.0 Off |                    0 |
| N/A   77C    P0   291W / 300W |  30931MiB / 32510MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   71C    P0   255W / 300W |  18963MiB / 32510MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   71C    P0    95W / 300W |  18963MiB / 32510MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   68C    P0    89W / 300W |  18963MiB / 32510MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   68C    P0    78W / 300W |  18963MiB / 32510MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   69C    P0    96W / 300W |  18963MiB / 32510MiB |     65%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   69C    P0    79W / 300W |  18963MiB / 32510MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   74C    P0    80W / 300W |  18963MiB / 32510MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

I was using the default setting, which uses data parallel.
I also tried distributed training with python -m torch.distributed.launch --nproc_per_node 8 test_lm.py, but it started a new job for each and every GPU:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
/language_model/lm/lib/python3.6/site-packages/transformers/tokenization_utils.py:830: FutureWarning: Parameter max_len is deprecated and will be removed in a future release. Use model_max_length instead.
  category=FutureWarning,
(these two messages are printed once by each of the 8 launched processes)

Can anyone suggest what I should do for efficient training?

Looks like other processes might have stepped into cuda:0. Have you tried setting CUDA_VISIBLE_DEVICES to make sure that each process only sees one GPU?

No, I didn't. Usually that's not the case, and I haven't experienced such an issue before.

Could you please also share the process PIDs using each device, from nvidia-smi?

PIDs for each device:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23094      C   python                                     26309MiB |
|    1     23094      C   python                                     14341MiB |
|    2     23094      C   python                                     14341MiB |
|    3     23094      C   python                                     14341MiB |
|    4     23094      C   python                                     14341MiB |
|    5     23094      C   python                                     14341MiB |
|    6     23094      C   python                                     14341MiB |
|    7     23094      C   python                                     14341MiB |
+-----------------------------------------------------------------------------+

Will using distributed training help here?

I have a similar problem. I used DistributedDataParallel and python -m torch.distributed.launch --nproc_per_node=8.

Hey @Tyan

The figure you shared looks a little different from the one @karan_purohit attached. Looks like all processes step into cuda:0, which could happen if they use cuda:0 as the default device and some tensors/contexts were unintentionally created there, e.g. when you call empty_cache() without a device context, or create a CUDA tensor without specifying device affinity.
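Just to illustrate the kind of code I mean (this snippet is not from your script; the local_rank value is made up):

import torch

local_rank = 1  # illustrative: the GPU this process is supposed to own

# Offending pattern: no device affinity, so the tensor (and a CUDA context)
# is created on the current default device, which is cuda:0 unless changed.
x = torch.randn(10).cuda()
torch.cuda.empty_cache()  # can likewise initialize a context on cuda:0 first

# Pinning the process to its device first avoids this.
torch.cuda.set_device(local_rank)
y = torch.randn(10, device=f"cuda:{local_rank}")
with torch.cuda.device(local_rank):  # or scope individual calls explicitly
    torch.cuda.empty_cache()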

Can you try setting CUDA_VISIBLE_DEVICES for all processes so that each process exclusively works on one device?
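A rough sketch of that, assuming the processes are started with torch.distributed.launch (which passes --local_rank); none of this is from the actual training script:

import argparse
import os

# Parse the launcher-provided local rank before touching CUDA.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Restrict this process to a single GPU *before* any CUDA initialization.
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")   # RANK / WORLD_SIZE come from the launcher's env vars
device = torch.device("cuda:0")   # the only visible GPU is remapped to index 0
model = nn.Linear(10, 10).to(device)
ddp_model = DDP(model, device_ids=[0], output_device=0)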

Hey @karan_purohit

Looks like there is only one process using all the GPUs in your application, while there should be 8. Did you create the DDP instances with the proper device IDs in all processes? Could you please share a minimal snippet of your Python script that reproduces this behavior?
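For comparison, a minimal per-process DDP setup for python -m torch.distributed.launch --nproc_per_node 8 would look roughly like this (illustrative only, not your test_lm.py):

import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # filled in by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)     # pin this process to its own GPU
dist.init_process_group(backend="nccl")    # rank/world size come from the launcher

model = nn.Linear(10, 10).cuda(args.local_rank)
ddp_model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)

# With this setup, nvidia-smi should show 8 different PIDs, one per GPU,
# instead of a single process that owns all devices.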

Hi, @mrshenli
Thanks a billion for your reply. I didn't set CUDA_VISIBLE_DEVICES; I set up the environment as follows:

def train(args):
    torch.backends.cudnn.benchmark = True
    dist.init_process_group('nccl')
    torch.cuda.set_device(args.local_rank)
    device = torch.device('cuda', args.local_rank)
    ....
    model = model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)
    criterion = criterion.to(device)
    ...
    
        lr_imgs = lr_imgs.to(device)
        hr_imgs = hr_imgs.to(device)

With similar settings in another model's training, GPU utilization is balanced. It's very strange.

I tried setting CUDA_VISIBLE_DEVICES; it didn't work.

Hi @Tyan

How did you set CUDA_VISIBLE_DEVICES? Is it something like os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.local_rank}" in every individual process, before running any CUDA-related code?

Besides, can you try swapping the order of the following two lines? I am not 100% sure, but ProcessGroupNCCL might create CUDA context on the default device.

    dist.init_process_group('nccl')
    torch.cuda.set_device(args.local_rank)
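i.e., roughly:

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group('nccl')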

@mrshenli
Yeah. I have tried what you said. It didn’t work. Current setting:

def train(args):
    # Env
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)
    torch.backends.cudnn.benchmark = True
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group('nccl')
    device = torch.device('cuda', args.local_rank)

That’s weird. Can you share a minimal repro so that we can debug locally?

Hi, @mrshenli
I checked the code again. I found that in utils.py, some variables occupy the GPU:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Some constants
rgb_weights = torch.FloatTensor([65.481, 128.553, 24.966]).to(device)
imagenet_mean = torch.FloatTensor([0.485, 0.456, 0.406]).unsqueeze(1).unsqueeze(2)
imagenet_std = torch.FloatTensor([0.229, 0.224, 0.225]).unsqueeze(1).unsqueeze(2)
imagenet_mean_cuda = torch.FloatTensor([0.485, 0.456, 0.406]).to(device).unsqueeze(0).unsqueeze(2).unsqueeze(3)
imagenet_std_cuda = torch.FloatTensor([0.229, 0.224, 0.225]).to(device).unsqueeze(0).unsqueeze(2).unsqueeze(3)
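
For reference, one way to avoid this would be to keep such constants on the CPU and only move them to the current process's device where they are actually used, e.g. (just a sketch, the helper function name is made up):

import torch

# Module-level constants stay on the CPU, so importing utils.py no longer
# creates a CUDA context on the default device (cuda:0).
rgb_weights = torch.FloatTensor([65.481, 128.553, 24.966])
imagenet_mean = torch.FloatTensor([0.485, 0.456, 0.406]).unsqueeze(1).unsqueeze(2)
imagenet_std = torch.FloatTensor([0.229, 0.224, 0.225]).unsqueeze(1).unsqueeze(2)

def normalize(imgs):
    # Move the constants to whatever device the input already lives on.
    mean = imagenet_mean.to(imgs.device)
    std = imagenet_std.to(imgs.device)
    return (imgs - mean) / std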

Very sorry to trouble you.


I am hitting the same issue, and I didn’t find any errors in the code. Can you help me take a look?

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def example(rank, world_size):
    # create default process group
    dist.init_process_group("gloo", init_method='tcp://127.0.0.1:6666', rank=rank, world_size=world_size)
    # create local model
    model = nn.Linear(10, 10).to(rank)
    # construct DDP model
    ddp_model = DDP(model, device_ids=[rank])
    # define loss function and optimizer
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    # forward pass
    outputs = ddp_model(torch.randn(20, 10).to(rank))
    labels = torch.randn(20, 10).to(rank)
    # backward pass
    loss_fn(outputs, labels).backward()
    # update parameters
    optimizer.step()
    print("finished rank: {}".format(rank))

def main():
    world_size = torch.cuda.device_count()
    mp.spawn(example,
        args=(world_size,),
        nprocs=world_size,
        join=True)

if __name__=="__main__":
    main()

and this is the result:
[screenshot of the result]

Did you try to set torch.cuda.set_device to the rank as was suggested before?

Following your suggestion, I set torch.cuda.set_device to the rank, and it works well.

But I still don’t understand the reason behind it. Since I used model.to(rank) and input.to(rank), shouldn’t each variable be on its own GPU?

Yes, each of these objects will be moved to the corresponding rank and I assume you are creating CUDA contexts on all devices somewhere else.
Using set_device is an easy method to avoid it as the “offending call” wouldn’t be able to initialize a context on any other device.
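
In terms of the example above, that would mean calling set_device first in the spawned worker, e.g. (a sketch only, reusing the imports from the earlier snippet):

def example(rank, world_size):
    # Pin this process to its GPU before any other CUDA work, so stray calls
    # cannot initialize a context on another device.
    torch.cuda.set_device(rank)
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:6666",
                            rank=rank, world_size=world_size)
    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    # forward / backward / step as before
    loss_fn(ddp_model(torch.randn(20, 10).to(rank)), torch.randn(20, 10).to(rank)).backward()
    optimizer.step()
    print("finished rank: {}".format(rank))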

Please tell me how CUDA_VISIBLE_DEVICES can solve Tyan’s problem. I encountered the same problem. Why do I need to configure CUDA_VISIBLE_DEVICES? Can you help me?

This is my code; if I don’t configure CUDA_VISIBLE_DEVICES, my program hangs:

# test_init_dist.py
import torch
import os
 
def init_distributed():
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    print("local_rank:" + str(local_rank))

    torch.distributed.init_process_group("nccl", rank=global_rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    torch.distributed.barrier()
    torch.distributed.all_reduce(torch.rand(1).cuda())

    print("Done initializing distributed")
 
if __name__ == "__main__":
    init_distributed()
 
 
#!/bin/bash
export N=2
export MASTER_ADDR=10.100.98.63  
export MASTER_PORT=12355
 
# Full command to launch distributed training (node 0)
python -m torch.distributed.launch \
    --nnodes=$N \
    --node_rank=0 \
    --nproc_per_node=2 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    test_init_dist.py
 
#!/bin/bash
 
export N=2
export MASTER_ADDR=10.100.98.63  
export MASTER_PORT=12355  
 
python -m torch.distributed.launch \
    --nnodes=$N \
    --node_rank=1 \
    --nproc_per_node=2 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --use_env  test_init_dist.py