Problem with DataParallel speedup

Hello, I am trying to understand how DataParallel works. I am testing this simple code to see the speedup on 2 GPUs:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 10000
data_size = 1000000

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", data.size(),
          "output_size", output.size())

I measure time with the ‘time’ utility. Execution on 1 GPU takes 3m26.624s, and execution on 2 GPUs takes approximately the same time (±5 seconds). What could be the problem?

Are you sure you’re using both GPUs with nn.DataParallel? The line:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

makes me think that it’s running on a single GPU. You could try changing it to:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In addition, one of the main benefits of data parallelism is that you can use larger batch sizes to speed up iterating through the dataset, so you can try:

batch_size = 10000 * torch.cuda.device_count()

Thank you for the answer. I changed “cuda:0” to “cuda” as you suggested and didn’t see any change. If I scale “batch_size” with “torch.cuda.device_count()”, the batch sizes will differ and the timing comparison between a single GPU and multiple GPUs won’t be fair.

Hi,

I think the problem is that your model is so small that your task is CPU-bound, so using more GPUs won’t help.
You can do the same experiment with a resnet from torchvision, for example (and a lower batch_size), to make sure you get a GPU-bound task.

I changed the code a little to use it with resnet18:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.models as models

# Parameters and DataLoaders
input_size = 224
output_size = 1000

batch_size = 256
data_size = 10000

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, 3 * size * size).view(length, 3, size, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

model = models.resnet18(pretrained=True)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", data.size(),
          "output_size", output.size())

However, using “batch_size=256” on a single GPU and on multiple GPUs I see the same time (2m6.530s).

And what is the GPU utilization when you run on a single one? When you run on two?
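For example, you can watch nvidia-smi in a second terminal while the script runs, or query the utilization from Python. A minimal sketch (torch.cuda.utilization() needs the pynvml package and a reasonably recent PyTorch, so treat it as optional):

import torch

def print_gpu_utilization():
    # Percent of time each GPU had a kernel running over the last sample
    # period, as reported by the NVIDIA driver (the same number nvidia-smi shows).
    for i in range(torch.cuda.device_count()):
        print(f"GPU {i}: {torch.cuda.utilization(i)}%")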

Thank you for your replies. The problem was the heavy RandomDataset, which generates the whole dataset up front. When I reduced “data_size” and added 200 epochs, I saw a speedup on two GPUs. So the task was CPU-bound.
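For anyone hitting the same issue, here is a minimal sketch of that kind of measurement (the reduced data_size is just an example value, everything else is assumed): generate a small dataset once, then time only the 200 epochs of forward passes, so dataset creation and CUDA initialization stay out of the measured region.

import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import torchvision.models as models

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

data_size = 1000   # reduced dataset size (example value)
batch_size = 256

# A plain tensor works as a map-style dataset: it has __len__ and __getitem__.
data = torch.randn(data_size, 3, 224, 224)
loader = DataLoader(data, batch_size=batch_size, shuffle=True)

model = models.resnet18()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model.to(device)

start = time.time()
with torch.no_grad():  # forward passes only, no gradients needed for timing
    for epoch in range(200):
        for batch in loader:
            output = model(batch.to(device))
if torch.cuda.is_available():
    torch.cuda.synchronize()  # wait for all GPU work before stopping the clock
print(f"elapsed: {time.time() - start:.1f}s")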
