I wonder how to use multiple GPUs

Hi. I want to use multiple GPUs, but I don’t know how to do it.

I used nn.DataParallel because it seems to be the most basic approach.

When I use 2 GPUs, the images are assigned to each GPU correctly, but the speed at which the code actually runs is the same as when using one GPU.

Below is the code I used.

import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

# resnet, learning_rate, momentum, and weight_decay are defined elsewhere in my script

# create model
print("=> creating model 'resnet18'")
model = resnet.resnet18()

# default device for the input tensors (and for the model when only one GPU is available)
device = torch.device("cuda:0")
model = model.to(device)

if torch.cuda.device_count() > 1:
    print("Using", torch.cuda.device_count(), "GPUs")
    # DataParallel will divide and allocate batch_size to all available GPUs
    model = nn.DataParallel(model, device_ids=[0, 1])

# define loss function (criterion) and optimizer
criterion = nn.NLLLoss()
softmax = nn.Softmax(dim=1)

optimizer = torch.optim.SGD(model.parameters(), learning_rate,
                            momentum=momentum,
                            weight_decay=weight_decay)

cudnn.benchmark = True


## train code
def train(train_loader, model, criterion, optimizer, epoch):

    # switch to train mode
    model.train()

    for i, (images, target) in enumerate(train_loader):

        # move the full batch to the primary GPU; DataParallel scatters it to the replicas
        images = images.to(device)
        target = target.to(device)

        out = model(images)
        # log(softmax(out)) + NLLLoss is equivalent to CrossEntropyLoss on the raw logits
        out_sm = softmax(out)
        log = torch.log(out_sm)

        loss = criterion(log, target)

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # return the loss of the last batch
    return loss
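
For reference, here is a minimal, self-contained sketch (using a hypothetical toy module called SplitCheck, not the ResNet above) of how nn.DataParallel splits a batch along dimension 0 across the listed device_ids:

import torch
import torch.nn as nn

class SplitCheck(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        # each replica only sees its slice of the batch
        print("replica input:", x.size(), "on", x.device)
        return self.fc(x)

model_check = nn.DataParallel(SplitCheck().cuda(), device_ids=[0, 1])
x = torch.randn(512, 10).cuda()
out = model_check(x)   # prints a (256, 10) chunk per GPU
print(out.size())      # (512, 2), gathered back on cuda:0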

Is there anything I missed?
Thanks

Did you increase your batch size?
If the speed is the same with the batch size doubled, you would have perfect scaling. :slight_smile:

Yes, I doubled my batch size (256 -> 512) with two GPUs.
The images are also assigned to each GPU correctly,
but the speed is still the same as using one GPU with batch size 256.

The overall speed for the complete training or for each iteration?

It is the same in both cases.
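
Roughly, the per-iteration time can be checked like this (a sketch using the variables from the code above; train_loader is the same loader passed to train(), and num_iters plus the timing loop are just for illustration):

import time

model.train()
torch.cuda.synchronize()
start = time.time()
num_iters = 50  # measure over a fixed number of iterations
for i, (images, target) in enumerate(train_loader):
    images = images.to(device)
    target = target.to(device)
    out = model(images)
    loss = criterion(torch.log(softmax(out)), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if i + 1 == num_iters:
        break
torch.cuda.synchronize()
print("avg seconds/iteration:", (time.time() - start) / num_iters)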

I’m a bit confused.
If the overall training (or epoch) takes the same amount of time, you won’t save anything.
On the other hand, if each iteration (with a doubled batch size) takes the same time, your epoch duration will be halved, so you are scaling by 2x.
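
To make that concrete, here is a small back-of-the-envelope sketch (the sample count and per-iteration time are made-up numbers; only the ratio matters):

num_samples = 1_000_000   # assumed dataset size, for illustration only
time_per_iter = 0.5       # assumed seconds per iteration, the same in both setups

epoch_time_1gpu = (num_samples / 256) * time_per_iter   # 1 GPU, batch size 256
epoch_time_2gpu = (num_samples / 512) * time_per_iter   # 2 GPUs, batch size 512

print(epoch_time_1gpu, epoch_time_2gpu)  # the 2-GPU epoch takes half as long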

Yes, that’s the part I find strange too.
My current environment is torch 1.3.0 with CUDA 10.0.
Could this environment be a problem when using more than one GPU?
When I use one GPU, it works fine.

I solved it. The reason the GPUs were not fully utilized is that I was not using enough CPU workers.
The CPU has to feed the data to the GPUs quickly, but I had limited the number of worker processes in the DataLoader (num_workers=4), so the data was not fed to the GPUs fast enough.
By increasing num_workers of the DataLoader to 16, the training speed improved and the GPUs were fully utilized.
Thank you.