Torch not able to utilize GPU RAM properly

I am training an ALBERT language model with the Hugging Face transformers library. While training on my p3dn instance, I notice that GPU 0 is almost completely used, but the others have around 50% of their RAM unused. I can only fit a batch size of 85 on this system; anything above that gives OOM.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:16.0 Off |                    0 |
| N/A   77C    P0   291W / 300W |  30931MiB / 32510MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   71C    P0   255W / 300W |  18963MiB / 32510MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   71C    P0    95W / 300W |  18963MiB / 32510MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   68C    P0    89W / 300W |  18963MiB / 32510MiB |     72%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   68C    P0    78W / 300W |  18963MiB / 32510MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   69C    P0    96W / 300W |  18963MiB / 32510MiB |     65%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   69C    P0    79W / 300W |  18963MiB / 32510MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   74C    P0    80W / 300W |  18963MiB / 32510MiB |     12%      Default |
+-------------------------------+----------------------+----------------------+

I was using the default settings, which use data parallel.
I also tried distributed training with python -m torch.distributed.launch --nproc_per_node 8 test_lm.py, but it started a new job for each and every GPU.

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Calling AlbertTokenizer.from_pretrained() with the path to a single file or url is deprecated
/language_model/lm/lib/python3.6/site-packages/transformers/tokenization_utils.py:830: FutureWarning: Parameter max_len is deprecated and will be removed in a future release. Use model_max_length instead.
  category=FutureWarning,
(the same two warnings are printed once per process, 8 times in total)

Can anyone suggest what I should do for efficient training?

Looks like other processes might have stepped onto cuda:0. Have you tried setting CUDA_VISIBLE_DEVICES to make sure that each process only sees one GPU?

No, I didn't. Usually that's not the case, and I haven't experienced such an issue before.

Could you please also share the process PIDs using each device, from nvidia-smi?

PIDs for each device:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     23094      C   python                                     26309MiB |
|    1     23094      C   python                                     14341MiB |
|    2     23094      C   python                                     14341MiB |
|    3     23094      C   python                                     14341MiB |
|    4     23094      C   python                                     14341MiB |
|    5     23094      C   python                                     14341MiB |
|    6     23094      C   python                                     14341MiB |
|    7     23094      C   python                                     14341MiB |
+-----------------------------------------------------------------------------+

Will using distributed training help here?

I have a similar problem. I used DistributedDataParallel and python -m torch.distributed.launch --nproc_per_node=8.

Hey @Tyan

The figure you shared looks a little different from the one @karan_purohit attached. It looks like all processes step onto cuda:0, which could happen if they use cuda:0 as the default device and some tensors/contexts are unintentionally created there, e.g., when you call empty_cache() without a device context, or create a CUDA tensor without specifying device affinity.

Can you try setting CUDA_VISIBLE_DEVICES for all processes so that each process exclusively works on one device?
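
A rough sketch of what I mean (assuming the script is started with torch.distributed.launch, which passes --local_rank and sets up the env:// rendezvous; adjust to however your script receives the rank):

import argparse
import os

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

# Option 1: hide every GPU except this rank's one *before* any CUDA work;
# the remaining device is then visible as cuda:0 inside this process.
os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)
torch.cuda.set_device(0)

# Option 2 (instead of option 1): keep all GPUs visible, but make this
# rank's GPU the current device, so that bare "cuda" tensors and
# empty_cache() calls no longer land on cuda:0.
# torch.cuda.set_device(args.local_rank)

dist.init_process_group(backend="nccl")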

Hey @karan_purohit

Looks like there is only one process using the GPUs in your application, while there should be 8 processes. Did you create DDP instances with the proper device ids in all processes? Could you please share a minimal snippet of your Python script that reproduces this behavior?
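
For reference, a per-process DDP setup usually looks roughly like this (a sketch only; MyModel is a placeholder for your model class, and --local_rank is what torch.distributed.launch passes to each process):

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)   # pin this process to its own GPU
dist.init_process_group(backend="nccl")

model = MyModel().to(args.local_rank)    # MyModel: placeholder for your model
ddp_model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)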

Hi, @mrshenli
Thanks a billion for your reply. I didn't set CUDA_VISIBLE_DEVICES; I set up the environment as follows:

def train(args):
    torch.backends.cudnn.benchmark = True
    dist.init_process_group('nccl')
    torch.cuda.set_device(args.local_rank)
    device = torch.device('cuda', args.local_rank)
    ....
    model = model.to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank], output_device=args.local_rank)
    criterion = criterion.to(device)
    ...
    
        lr_imgs = lr_imgs.to(device)
        hr_imgs = hr_imgs.to(device)

With similar settings in another model's training, GPU utilization is balanced. It's very strange.

I tried setting CUDA_VISIBLE_DEVICES, but it didn't work.

Hi @Tyan

How did you set CUDA_VISIBLE_DEVICES? Is it something like os.environ["CUDA_VISIBLE_DEVICES"] = f"{args.local_rank}" in every individual process, before running any CUDA-related code?

Besides, can you try swapping the order of the following two lines? I am not 100% sure, but ProcessGroupNCCL might create a CUDA context on the default device.

    dist.init_process_group('nccl')
    torch.cuda.set_device(args.local_rank)
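
i.e., roughly this order, so that the NCCL process group is created only after the current device has been pinned:

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group('nccl')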

@mrshenli
Yeah. I have tried what you said. It didn’t work. Current setting:

def train(args):
    # Env
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)
    torch.backends.cudnn.benchmark = True
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group('nccl')
    device = torch.device('cuda', args.local_rank)

That's weird. Can you share a minimal repro so that we can debug it locally?
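
One more thing worth double-checking: once CUDA_VISIBLE_DEVICES hides every device except args.local_rank, the single remaining GPU is re-indexed as cuda:0 inside that process, so torch.cuda.set_device(args.local_rank) and torch.device('cuda', args.local_rank) point at a non-existent ordinal on every rank other than 0 (and the variable only takes effect if it is set before anything initializes CUDA). A rough sketch of a consistent masked variant, keeping your argument names:

def train(args):
    # Hide all GPUs except this rank's one; it is then visible as cuda:0 here.
    # This must run before anything initializes CUDA in this process.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(args.local_rank)
    torch.backends.cudnn.benchmark = True
    torch.cuda.set_device(0)                 # the only visible device
    dist.init_process_group('nccl')
    device = torch.device('cuda', 0)
    # ... and DistributedDataParallel(model, device_ids=[0], output_device=0)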

Hi, @mrshenli
I checked the code again. I found that in utils.py some variables occupy the GPU:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Some constants
rgb_weights = torch.FloatTensor([65.481, 128.553, 24.966]).to(device)
imagenet_mean = torch.FloatTensor([0.485, 0.456, 0.406]).unsqueeze(1).unsqueeze(2)
imagenet_std = torch.FloatTensor([0.229, 0.224, 0.225]).unsqueeze(1).unsqueeze(2)
imagenet_mean_cuda = torch.FloatTensor([0.485, 0.456, 0.406]).to(device).unsqueeze(0).unsqueeze(2).unsqueeze(3)
imagenet_std_cuda = torch.FloatTensor([0.229, 0.224, 0.225]).to(device).unsqueeze(0).unsqueeze(2).unsqueeze(3)
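
One way to avoid this kind of accidental allocation on cuda:0 is to keep the module-level constants on the CPU and move them to the right device only where they are used, roughly like this (a sketch; rgb_to_y is just a hypothetical helper for illustration):

# utils.py -- keep module-level constants on the CPU
import torch

rgb_weights = torch.FloatTensor([65.481, 128.553, 24.966])
imagenet_mean = torch.FloatTensor([0.485, 0.456, 0.406]).unsqueeze(1).unsqueeze(2)
imagenet_std = torch.FloatTensor([0.229, 0.224, 0.225]).unsqueeze(1).unsqueeze(2)

def rgb_to_y(img):
    # Hypothetical helper: 'img' is an (N, 3, H, W) batch already on some GPU.
    # Move the weights to that tensor's device at call time, so each DDP
    # process only allocates on its own GPU instead of on cuda:0 at import.
    w = rgb_weights.to(img.device)
    return torch.matmul(img.permute(0, 2, 3, 1), w)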

Very sorry to trouble you.
