Problem with DataParallel speedup

Hello, I am trying to understand how DataParallel works. I am testing this simple code to see the speedup on 2 GPUs:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 10000
data_size = 1000000

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", data.size(),
          "output_size", output.size())

I measure time with the ‘time’ utility. Execution on 1 GPU takes 3m26.624s, and execution on 2 GPUs takes approximately the same time (±5 seconds). What could be the problem?

Are you sure you’re using both GPUs with nn.DataParallel? The line:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

makes me think that it’s running on a single GPU. You could try changing it to:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In addition, one of the main benefits of data parallelism is that you can use larger batch sizes to speed up iterating through the dataset, so you can try:

batch_size = 10000 * torch.cuda.device_count()

Thank you for the answer. I changed “cuda:0” to “cuda” as you suggested and didn’t see any change. If I scale “batch_size” with “torch.cuda.device_count()”, the batch sizes will differ and the timing comparison between a single GPU and multiple GPUs won’t be fair.

Hi,

I think the problem is that your model is so small that your task is CPU-bound, so using more GPUs won’t help.
You can do the same experiment with a resnet from torchvision, for example (and a lower batch_size), to make sure you get a GPU-bound task.

I changed the code a little to use it with resnet18:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.models as models

# Parameters and DataLoaders
input_size = 224
output_size = 1000

batch_size = 256
data_size = 10000

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, 3 * size * size).view(length, 3, size, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

model = models.resnet18(pretrained=True)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", data.size(),
          "output_size", output.size())

However, using “batch_size=256” on a single GPU and on multiple GPUs I see the same time (2m6.530s).

And what is the GPU utilization when you run on a single one? When you run on two?
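For example, you can watch nvidia-smi in a second terminal while the script runs, or query the utilization from Python. A minimal sketch (torch.cuda.utilization() needs the pynvml package and a reasonably recent PyTorch, so treat it as optional):

import torch

def print_gpu_utilization():
    # Percent of time each GPU had a kernel running over the last sample
    # period, as reported by the NVIDIA driver (the same number nvidia-smi shows).
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.utilization(i)}%")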

Thank you for your replies. The problem was the heavy RandomDataset, which generates the whole dataset up front. When I reduced “data_size” and added 200 epochs, I saw a speedup on two GPUs. So the task was CPU-bound.
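For anyone hitting the same issue, here is a minimal sketch of that kind of measurement (the reduced data_size is just an example value, everything else is assumed): generate a small dataset once, then time only the 200 epochs of forward passes, so dataset creation and CUDA initialization stay out of the measured region.

import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

data_size = 1000   # reduced dataset size (example value)
batch_size = 256

# A plain tensor works as a map-style dataset: it has __len__ and __getitem__.
data = torch.randn(data_size, 3, 224, 224)
loader = DataLoader(data, batch_size=batch_size, shuffle=True)

model = models.resnet18()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

start = time.time()
with torch.no_grad():  # forward passes only, no gradients needed for timing
    for epoch in range(200):
        for batch in loader:
            output = model(batch.to(device))
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
print(f"elapsed: {time.time() - start:.1f}s")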
