How to reduce the execution time of "forward pass" on GPU

I have a pretrained model that I only use to extract features (forward pass only).
Right now I load it onto a single GPU and run it under torch.no_grad().
My question: I have 2 GPUs in my computer. How can I use both of them to reduce the execution time?
P.S. The input to my model has size (8, 3, 256, 128).


DataParallel and DistributedDataParallel would help here.

Thanks for replying!

However, after reading the documentation for DataParallel and DistributedDataParallel, I think they would not reduce the execution time in my case, because I do not need the backward pass.

> assert any((p.requires_grad for p in module.parameters())), (
>             "DistributedDataParallel is not needed when a module "
>             "doesn't have any parameter that requires a gradient."
>         )

I will try it and report the result.

How to use DataParallel:
model = DataParallel(model, dim=<batch dim of your input>, device_ids=[main_id, other_ids …], output_device=main_id)
Note that main_id (the GPU that stores the original model parameters) should be the first entry in device_ids.
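In case a runnable version helps, here is a minimal sketch of that call. The small nn.Sequential model is only a hypothetical stand-in for the pretrained feature extractor, and the fallback branch is my own addition so the snippet also runs on a machine with fewer than 2 GPUs.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained feature extractor.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
model.eval()

gpu_count = torch.cuda.device_count()
if gpu_count >= 2:
    main_id = 0  # GPU holding the original parameters; listed first in device_ids
    device = torch.device(f"cuda:{main_id}")
    model = model.to(device)
    model = nn.DataParallel(
        model,
        device_ids=list(range(gpu_count)),
        output_device=main_id,
        dim=0,  # the batch dimension of the input
    )
else:
    # Fallback (assumption): run unwrapped on one GPU or on the CPU.
    device = torch.device("cuda:0" if gpu_count == 1 else "cpu")
    model = model.to(device)

batch = torch.randn(8, 3, 256, 128, device=device)
with torch.no_grad():
    features = model(batch)  # outputs are gathered on output_device

print(features.shape)
```

The input is created on the main device; DataParallel then splits it along dim across the listed GPUs.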

You can use DataParallel since it is easier to set up and test, but remember to:

  1. Set the batch dimension of DataParallel. The default is dim=0, but sometimes you might want to use another dimension.
    (e.g. my input size is (Time, Batch, Dim_Data), and the model requires the full time series. In this scenario I use dim=1 instead of dim=0, because with dim=0 the time series would be split into multiple fragments.)
  2. DataParallel returns a wrapped model, so run the forward pass with the wrapped model instead of the original one, and remember to…
  3. …handle the state_dict of the wrapped model before saving it to a file. DataParallel prepends the prefix “module.” to each key of the original state_dict.keys(), so you have to remove the prefix before saving the state_dict.
  4. Set the GPU with the larger memory as output_device, and also move the model parameters and your input to this GPU. output_device needs to store both the data and the model parameters, so a GPU with more memory is preferable.
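For point 3, a minimal sketch of stripping the prefix might look like the following. The helper name strip_module_prefix is made up for illustration, and the small dict stands in for wrapped_model.state_dict(), whose values would really be tensors.

```python
from collections import OrderedDict


def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Stand-in for wrapped_model.state_dict(); real values would be tensors.
wrapped_state = OrderedDict([("module.conv1.weight", 1), ("module.fc.bias", 2)])
clean = strip_module_prefix(wrapped_state)
print(list(clean))
```

With a real wrapped model, saving wrapped_model.module.state_dict() should give the same unprefixed keys, since DataParallel keeps the original model in its .module attribute.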

Here is my code with DataParallel:

import time
import torchvision.models as models
import torch
import torch.nn as nn

model = models.resnet50(num_classes=1000).to('cuda:0')
model = nn.DataParallel(model, device_ids=[0, 1], output_device=0)

batch_size = 8
image_w = 128
image_h = 128

# warm up GPU: allocating a tensor on cuda:0 initializes the CUDA context
input = torch.randn(batch_size, 3, image_w, image_h).to('cuda:0')

listTime = []
for i in range(20):
    with torch.no_grad():
        startTime = time.time()
        input = torch.randn(batch_size, 3, image_w, image_h).to('cuda:0')
        out = model(input)
        esl = time.time() - startTime
        listTime.append(esl)  # record this iteration's time for the mean below
        print("Total time of loop {} :: {}".format(i, esl))

# mean over the last 10 of the 20 iterations
meanTime = torch.mean(torch.tensor(listTime[10:]))

I tested with resnet50(). The input size is (8, 3, 128, 128).
I ran the forward pass for 20 steps and took the last 10 steps to compute the mean execution time.

Without DataParallel, meanTime = 0.0064s (single GPU),
and with DataParallel, meanTime = 0.0396s (2 GPUs).
P.S. I have 2 GPUs, as shown in the image below.
Do you have any solution for my problem?

@dat_pham_thanh Can you benchmark using at least 1000 iterations and also track throughput instead (images/s)? Mean time can be misleading since a single outlier could change the mean quite a bit.
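Something along these lines could work as a starting point. It is only a sketch: the helper name benchmark and its parameters are made up for illustration, and the dummy workload stands in for the real model(input) call. Since CUDA kernels launch asynchronously, for a GPU model one would pass sync=torch.cuda.synchronize so that each timing covers the finished work rather than just the launch.

```python
import statistics
import time


def benchmark(step, batch_size, iters=1000, warmup=50, sync=None):
    """Time `step()` and return (mean_s, median_s, images_per_s).

    `sync` is an optional callable that blocks until queued work is done
    (assumption: for CUDA the caller passes torch.cuda.synchronize).
    """
    for _ in range(warmup):  # untimed warm-up iterations
        step()
    if sync is not None:
        sync()

    times = []
    for _ in range(iters):
        start = time.perf_counter()
        step()
        if sync is not None:
            sync()
        times.append(time.perf_counter() - start)

    throughput = iters * batch_size / sum(times)
    return statistics.mean(times), statistics.median(times), throughput


# Dummy CPU workload standing in for `model(input)`.
mean_s, median_s, images_per_s = benchmark(
    lambda: sum(range(1000)), batch_size=8, iters=200, warmup=10
)
print(f"mean {mean_s:.6f}s  median {median_s:.6f}s  {images_per_s:.0f} images/s")
```

Reporting the median next to the mean makes it easier to spot whether a few slow iterations are skewing the result.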


DataParallel replicates the model, scatters the input, and gathers the outputs on every iteration. So if the input is too small, the overhead of replicating the model can outweigh the benefit of parallelizing the computation. Besides what @pritamdamania87 suggested above, could you please also try a larger batch size?
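To put rough numbers on that, here is a back-of-the-envelope sketch using the timings reported above. The perfect 2-way split of compute time is an idealized assumption of mine, not something that was measured.

```python
BATCH = 8
T_SINGLE = 0.0064  # s per iteration on a single GPU (reported above)
T_DP = 0.0396      # s per iteration with DataParallel on 2 GPUs (reported above)

throughput_single = BATCH / T_SINGLE  # about 1250 images/s
throughput_dp = BATCH / T_DP          # about 202 images/s

# Idealized assumption: compute time halves perfectly across 2 GPUs,
# so everything beyond that is replicate/scatter/gather overhead.
ideal_compute = T_SINGLE / 2
implied_overhead = T_DP - ideal_compute  # about 0.0364 s

print(f"single GPU:   {throughput_single:.0f} images/s")
print(f"DataParallel: {throughput_dp:.0f} images/s")
print(f"implied overhead per iteration: {implied_overhead * 1000:.1f} ms")
```

Under that assumption, the best possible saving per iteration is 3.2 ms, so any per-iteration overhead above that makes DataParallel slower than a single GPU at this batch size.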

Thanks for your reply!
I think you are correct. I cannot increase the batch size because it is fixed (the batch size is always 8) for each iteration.
I solved my problem by using an ONNX model instead.