DataParallel not effective

Hi all,
I have a fairly large model and need to use data parallelism across multiple GPUs.
I used:

model = nn.DataParallel(model)

And there are three visible GPUs. The GPU usage is:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     12444      C   python                                     10741MiB |
|    1     12444      C   python                                      4663MiB |
|    2     12444      C   python                                      4633MiB |
+-----------------------------------------------------------------------------+

GPU 1 and GPU 2 are not fully utilized, but I cannot increase the batch size because then there will be a memory error on GPU 0.
Does anyone know how to solve this problem?

Hello, I am having the same issue. Have you solved it? Thanks

Hi, Andy-jpa
If the first GPU (id 0) is occupied all the time and has no memory to spare, you could consider using only GPUs 1 and 2 by explicitly assigning them in your code via device_ids.

Refer to: Multi-GPU Examples — PyTorch Tutorials 2.1.1+cu121 documentation

code like this:

import torch.nn as nn

def data_parallel(module, input, device_ids, output_device=None):
    # No devices given: just run the module as usual.
    if not device_ids:
        return module(input)

    # By default, gather the outputs on the first listed device.
    if output_device is None:
        output_device = device_ids[0]

    # Copy the module to every device, split the input batch across them,
    # run the replicas in parallel, and collect the results on output_device.
    replicas = nn.parallel.replicate(module, device_ids)
    inputs = nn.parallel.scatter(input, device_ids)
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)
    return nn.parallel.gather(outputs, output_device)
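
If you simply want to keep GPU 0 free, a simpler route is to pass device_ids to nn.DataParallel itself. This is only a minimal sketch, assuming your model fits on a single GPU; MyModel and the dummy input shape are placeholders for your own code:

import torch
import torch.nn as nn

model = MyModel()  # placeholder for your own model class

# DataParallel expects the module's parameters to live on device_ids[0],
# which is also where the outputs are gathered by default.
model = model.to('cuda:1')
model = nn.DataParallel(model, device_ids=[1, 2])

# Put the inputs on the source device; DataParallel scatters them across the replicas.
x = torch.randn(8, 3, 224, 224, device='cuda:1')  # dummy batch, shape is arbitrary
out = model(x)

Alternatively, you can hide GPU 0 entirely by setting CUDA_VISIBLE_DEVICES=1,2 before starting the script, in which case the remaining GPUs are re-indexed as 0 and 1.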

Hi, I have the same issue.
I tried 16 GPUs.

I think the problem is that batch_size needs to increase to use multiple GPUs efficiently, but then the gathered outputs grow as well:
each replica … roughly (batch_size / num_GPUs) × input_size
output_device … the gathered outputs from all replicas, roughly (batch_size / num_GPUs) × num_GPUs × output_size
However, the output_device must also hold a replica in the current DataParallel implementation.
Therefore I tried rewriting DataParallel so that the output_device is excluded from the replicas.
In my case, I could then increase the batch size until the output_device GPU uses 30 GB and each replica GPU uses 20 GB.
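
To make the idea concrete, here is a minimal sketch (not the actual DataParallel code, and the helper name is mine) using the same nn.parallel primitives as the example above: the replicas live on a set of GPUs, and the outputs are gathered on a separate GPU that holds no replica.

import torch
import torch.nn as nn

def data_parallel_separate_output(module, input, replica_ids, output_device):
    # The module's parameters are expected to live on replica_ids[0].
    replicas = nn.parallel.replicate(module, replica_ids)
    inputs = nn.parallel.scatter(input, replica_ids)
    replicas = replicas[:len(inputs)]
    outputs = nn.parallel.parallel_apply(replicas, inputs)
    # Gather onto a device that holds no replica, so its memory is free
    # for the concatenated outputs and the loss computation.
    return nn.parallel.gather(outputs, output_device)

# e.g. replicas on GPUs 1, 2, 3 and the gathered output on GPU 0:
# out = data_parallel_separate_output(model, batch, replica_ids=[1, 2, 3], output_device=0)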

But I'm not sure it's correct.
I know it still needs latency hiding to make the processing more efficient; the best solution would be to switch to concurrent processing.

I have been using PyTorch since last week, after switching from TensorFlow 2.
I want to know the best practices for PyTorch.

The fastest and recommended approach is DistributedDataParallel, using a single process per GPU, as described in these docs.
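
Roughly, it looks like this (a minimal sketch following the DDP tutorial pattern; the tiny nn.Linear model, the random data, and the master address/port are placeholders):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def run(rank, world_size):
    # One process per GPU: `rank` indexes both the process and the GPU it uses.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Stand-in for your own (large) model.
    model = nn.Linear(10, 10).to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    criterion = nn.MSELoss()

    # Each process works on its own shard of the data, so the per-GPU memory
    # usage stays balanced instead of piling up on GPU 0.
    inputs = torch.randn(16, 10, device=rank)
    targets = torch.randn(16, 10, device=rank)
    loss = criterion(ddp_model(inputs), targets)
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size, join=True)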

Thanks for the quick reply.
I'll try it.

DDP gives very good performance, and I no longer have to worry about memory management.
Thanks to ptrblck and the PyTorch team.