Hello, I am trying to understand how DataParallel works. I am testing some simple code to see the speedup on 2 GPUs:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2
batch_size = 10000
data_size = 1000000

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())
        return output

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", data.size(),
          "output_size", output.size())
I measure the time with the ‘time’ utility. Execution on 1 GPU takes 3m26.624s, and execution on 2 GPUs takes approximately the same time (±5 seconds). What could be the problem?
Are you sure you’re using both GPUs with nn.DataParallel? The line:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
makes me think that it’s running on a single GPU. You could try changing this to:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
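One way to check the suspicion above is to look at what DataParallel itself reports. The sketch below is only illustrative: the `nn.Linear(5, 2)` model is a stand-in for the model in the question, and `uses_data_parallel` is a name introduced here. Note that `torch.device("cuda")` refers to the current CUDA device, which is device 0 by default, so the two device strings will usually behave the same way.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the model from the question.
model = nn.Linear(5, 2)

n_gpus = torch.cuda.device_count()
print("Visible GPUs:", n_gpus)

if n_gpus > 1:
    model = nn.DataParallel(model)
    # device_ids lists the GPUs that each input batch is split across.
    print("DataParallel device_ids:", model.device_ids)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Whether the model is replicated depends on the wrapper, not on the device string.
uses_data_parallel = isinstance(model, nn.DataParallel)
print("Wrapped in DataParallel:", uses_data_parallel)
```

If `device_ids` lists both GPUs, the batches are being split, and the lack of speedup has to come from somewhere else.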
In addition, one of the main benefits of data parallelism is that you can use larger batch sizes to speed up iterating through the dataset, so you can try:
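A minimal sketch of that idea, assuming a fixed per-GPU batch of 10000 samples (the name `per_gpu_batch_size` and its value are illustrative and taken from the question's `batch_size`, not from the original reply):

```python
import torch

# Assumption: keep the per-GPU workload equal to the question's batch_size.
per_gpu_batch_size = 10000

# Fall back to 1 so the same script also runs on a CPU-only machine.
n_gpus = max(1, torch.cuda.device_count())

# DataParallel splits each batch along dim 0, so every GPU still
# receives per_gpu_batch_size samples per step.
batch_size = per_gpu_batch_size * n_gpus
print("Using batch_size =", batch_size)
```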
Thank you for the answer. I changed “cuda:0” to “cuda” as you suggested and didn’t see any change. If I make “batch_size” depend on “torch.cuda.device_count()”, the batch sizes will differ, and the time comparison between single-GPU and multi-GPU runs will not be fair.
I think the problem is that your model is so small that your task is CPU-bound, so using more GPUs won’t help.
You can do the same experiment with a ResNet from torchvision, for example (and a lower batch_size), to make sure you get a GPU-bound task.
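A rough sketch of such an experiment follows. The helper names (`time_forward`, `build_resnet`, `run_benchmark`), the choice of `resnet18`, the batch size of 64 and the 224x224 synthetic input are all assumptions made for illustration. The input is generated directly on the device so that data loading does not enter the measurement.

```python
import time

import torch
import torch.nn as nn


def time_forward(model, batch, n_iters=10):
    """Return average seconds per forward pass, after one warm-up pass."""
    model.eval()
    with torch.no_grad():
        model(batch)  # warm-up pass, not timed
        if batch.is_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work to finish
        start = time.perf_counter()
        for _ in range(n_iters):
            model(batch)
        if batch.is_cuda:
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters


def build_resnet():
    # Imported lazily so the timing helper can be used without torchvision.
    from torchvision import models
    return models.resnet18()


def run_benchmark(batch_size=64):
    """Time a ResNet-18 forward pass on synthetic images (illustrative sizes)."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = build_resnet()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model.to(device)
    batch = torch.randn(batch_size, 3, 224, 224, device=device)
    return time_forward(model, batch)
```

Calling `run_benchmark()` once with all GPUs visible and once with `CUDA_VISIBLE_DEVICES=0` set in the environment should give a like-for-like comparison with the same batch size.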
Thank you for your replies. The problem was the heavy RandomDataset, which generates the dataset. When I reduced “data_size” and added 200 epochs, I saw a speedup on two GPUs. So the problem was that the task was CPU-bound.
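For anyone repeating this, here is a sketch of how the dataset generation and the epoch loop could be timed separately inside the script, instead of timing the whole process with `time`. The function name `timed_run` and the default values for `data_size` and `batch_size` are assumptions (the exact reduced size is not given above); only the 200 epochs come from the post. `TensorDataset` is used here in place of the custom RandomDataset to keep the example short.

```python
import time

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def timed_run(data_size=10000, epochs=200, batch_size=1000):
    """Return (seconds spent generating data, seconds spent in the epoch loop)."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Dataset generation happens on the CPU and is timed on its own.
    t0 = time.perf_counter()
    dataset = TensorDataset(torch.randn(data_size, 5))
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    setup_seconds = time.perf_counter() - t0

    model = nn.Linear(5, 2)
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)
    model.to(device)

    # Forward passes only, repeated over several epochs of the same data.
    t1 = time.perf_counter()
    with torch.no_grad():
        for _ in range(epochs):
            for (batch,) in loader:
                model(batch.to(device))
    if device.type == "cuda":
        torch.cuda.synchronize()  # include any GPU work still in flight
    loop_seconds = time.perf_counter() - t1

    return setup_seconds, loop_seconds
```

Comparing `loop_seconds` between a single-GPU and a two-GPU run leaves the one-off generation cost out of the comparison.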