How to use multiple GPUs (DataParallel) for training a model that used to use one gpu

Hey! I came across the same problem. I set CUDA_VISIBLE_DEVICES='0,1,2,3' and model = torch.nn.DataParallel(model, device_ids=[0,1,2,3]), but the code still only uses GPU 0 and runs out of memory. Could you please explain more about what "each chunk of the batch will be sent to each GPU, so you should at least pass one sample for each GPU" means? Thanks!

If your batch size is 1, i.e. you are only using a single sample in each batch, you won't be able to use nn.DataParallel, since there are no samples to parallelize the workload over.
In other words, for 4 GPUs your batch size should be at least 4 (or a multiple of 4 to evenly distribute the workload).
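
As a quick illustration (a toy model made up for this thread, not your code), you can see the splitting by printing the chunk each replica receives in its forward pass:

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        # each replica gets its own chunk of the batch, split along dim 0
        print(x.device, x.shape)
        return self.fc(x)

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(ToyModel()).cuda()
    # batch size 8 on 4 GPUs -> each replica sees 2 samples;
    # a batch size of 1 cannot be split, so only the first GPU would get any work
    out = model(torch.randn(8, 10).cuda())
    print(out.shape)  # outputs gathered back on the default device: [8, 2]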

Note that you won't be able to run nn.DataParallel if your GPU already runs out of memory with a single sample. In that case you could use e.g. torch.utils.checkpoint to trade compute for memory, or use a model sharding approach, where each GPU computes a specific part of the model.
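
If a single sample already runs out of memory, a rough sketch of the checkpointing idea (with a made-up sequential model, not your architecture) would be:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# made-up deep model; checkpointing discards intermediate activations and
# recomputes them in the backward pass, trading compute for memory
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(20)]
).cuda()

x = torch.randn(4, 1024, device="cuda", requires_grad=True)
out = checkpoint_sequential(model, 4, x)  # split into 4 checkpointed segments
out.sum().backward()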

I am trying to adapt this architecture to make it compatible with fisheye images with rotated bounding boxes. I have written code which works fine on a single GPU, but it gives me 'RuntimeError: Caught RuntimeError in replica 0 on device 0' and 'RuntimeError: CUDA error: device-side assert triggered' when I try to use multiple GPUs.

I have 4 ‘GeForce GTX 1080 Ti’ GPUs on my university server which I am trying to use remotely. I wish to access the first 2 GPUs for running this code.

I tried setting os.environ['CUDA_LAUNCH_BLOCKING'] = "1", but then it takes forever to load the data. I also tried setting num_workers=0 in the DataLoader, which I read in one of the PyTorch forum discussions, but the problem still persists.
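
(For reference, CUDA_LAUNCH_BLOCKING serializes every kernel launch, so some slowdown is expected; a small sketch of the two usual ways to enable it, either per run in the shell or in the script before CUDA is initialized:)

# Option 1: enable it for a single run from the shell
#   CUDA_LAUNCH_BLOCKING=1 python train.py <args>
# Option 2: set it inside the script, before the first CUDA call
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
import torch  # import/use CUDA only after the env var is set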

I have made the following changes to the code; this is the part I am using to run on multiple GPUs:

import os
#os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"  # must be set before CUDA is initialized
import torch
from torch.nn import DataParallel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

os.makedirs("output", exist_ok=True)
os.makedirs("checkpoints", exist_ok=True)

# Get data configuration
data_config = parse_data_config(opt.data_config)
train_path = data_config["train"]
valid_path = data_config["valid"]
train_annpath = data_config["json_train"]
valid_annpath = data_config["json_val"]
class_names = load_classes(data_config["names"])

class_80 = len(class_names) == 80

# Initiate model
model = Darknet(opt.model_def).to(device)
model.apply(weights_init_normal)

# If specified, we start from a checkpoint
if opt.pretrained_weights:
    if opt.pretrained_weights.endswith(".pth"):
        model.load_state_dict(torch.load(opt.pretrained_weights))
    else:
        model.load_darknet_weights(opt.pretrained_weights)

# Use multiple GPUs if available
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0: a batch of [30, ...] is split into chunks of [10, ...] on 3 GPUs
    model = DataParallel(model)

Following is the exact error and its original traceback. I could really use some help. Thanks in advance.

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [93,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [94,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [95,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=278 error=710 : device-side assert triggered
Traceback (most recent call last):
File "/localdata/saurabh/.local/Python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/localdata/saurabh/.local/Python3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sband/.vscode-server/extensions/ms-python.python-2020.11.358366026/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
cli.main()
File "/home/sband/.vscode-server/extensions/ms-python.python-2020.11.358366026/pythonFiles/lib/python/debugpy/…/debugpy/server/cli.py", line 430, in main
run()
File "/home/sband/.vscode-server/extensions/ms-python.python-2020.11.358366026/pythonFiles/lib/python/debugpy/…/debugpy/server/cli.py", line 267, in run_file
runpy.run_path(options.target, run_name=compat.force_str("__main__"))
File "/localdata/saurabh/.local/Python3.8/lib/python3.8/runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "/localdata/saurabh/.local/Python3.8/lib/python3.8/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/localdata/saurabh/.local/Python3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/localdata/saurabh/yolov3/train.py", line 131, in <module>
loss, outputs = model(imgs, targets)
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 155, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 165, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
output.reraise()
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/_utils.py", line 395, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):

File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
output = module(*input, **kwargs)
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/localdata/saurabh/yolov3/models.py", line 287, in forward
x, layer_loss = module[0](x, targets, img_dim)
File "/localdata/saurabh/yolov3/yol/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/localdata/saurabh/yolov3/models.py", line 205, in forward
iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tangle, tcls, tconf = build_targets(
File "/localdata/saurabh/yolov3/utils/utils.py", line 405, in build_targets
noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0
RuntimeError: CUDA error: device-side assert triggered

An indexing operation is failing, as shown in the error message:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [93,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

You can rerun the script via CUDA_LAUNCH_BLOCKING=1 python script.py args to get the exact line of code that is causing this error.
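
Once the failing line is known (here the advanced indexing into noobj_mask in build_targets), a common fix is to validate or clamp the target indices before indexing. A rough sketch using the names from the traceback (the clamp/assert below is an assumption about the kind of fix needed, not the actual code from the repo):

# sketch only: b, gj, gi, anchor_ious, ignore_thres and noobj_mask are taken
# from the traceback above; the shape and the clamping are assumptions
nB, nA, nG, _ = noobj_mask.shape   # batch, anchors, grid, grid
gi.clamp_(0, nG - 1)               # keep grid indices inside the mask
gj.clamp_(0, nG - 1)
assert (b >= 0).all() and (b < nB).all(), "target batch index out of range"
noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0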

Thanks for your reply. I found the error: some bounds checks were missing. However, I am still not able to run the code. I am not getting any error, but the code also does not proceed after data loading. I tried keeping num_workers=0, as you suggested in one of the comments above (the code works fine on a single GPU), but with no luck. How should I approach this problem? This is the first time I am working with multiple GPUs as well as PyTorch, so please excuse me if the questions seem silly.

You could use a very simple model without DataLoaders etc. and just check if your data parallel approach is working. If that's not the case, you could run e.g. some NCCL samples and make sure that your GPUs can communicate with each other.
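
A minimal sketch of such a check (a toy model and random tensors, nothing from the training code) could be:

import torch
import torch.nn as nn

# 1) bare DataParallel forward/backward with a toy model and random data
model = nn.DataParallel(nn.Linear(16, 4)).cuda()
out = model(torch.randn(32, 16).cuda())
out.sum().backward()
print("DataParallel OK, output shape:", out.shape)

# 2) rough check that the first two GPUs can exchange data at all
a = torch.ones(1000, device="cuda:0")
b = a.to("cuda:1")
print("peer access 0<->1:", torch.cuda.can_device_access_peer(0, 1))
print("copy matches:", torch.equal(a.cpu(), b.cpu()))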

Okay, let me try this. Thanks for the advice. I hope I find the problem soon.

Hello,

I just upgraded my machine with a second GPU (exactly the same GTX 1080 Ti). I tried to use parallel training on a network that I had trained before, and the training has gotten slower (at least 50% slower). What could be the problem? I also noticed that the training seems to stall in the middle of an epoch.

OS: Ubuntu 20.04
Python: 3.8
NVIDIA driver: 460
PyTorch: CUDA 11.1 build

Hi,
I tried to run the code given in the tutorial. It correctly prints the number of GPUs (2 in my case), but then it gets stuck.