8 GPUs slower than 1 GPU

I was fine-tuning Inception v3 on Colab with an NVIDIA P100 GPU, batch_size = 32, on roughly 100K images of size 299x299. Each epoch took around 8 minutes.

I then acquired some time on GCP and set up a machine with 8x Tesla V100. I connected my Colab to it using Colab SDK… Then I changed the model to run in parallel as per the tutorials and increased the batch size to 32*8. However, training is now much slower, even though I can see through nvidia-smi that the program is using all 8 GPUs.

I am using an SSD disk. I wonder if I'll have to change all layers of Inception_v3 and distribute them across GPUs. Or… is there an easier change I can make to my code below?

Let me know! This is the first time I am trying parallel processing with multiple GPUs.

This is how I build things:

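import multiprocessing
from torch.utils.data import DataLoader

# Global batch size: 32 samples per GPU across 8 GPUs
batch_size = 32*8

# Number of DataLoader worker processes: one per CPU core
num_w = multiprocessing.cpu_count()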

data_loaders = {'train': DataLoader(data['train'], batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=num_w),
                'val': DataLoader(data['val'], batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=num_w),
                'test' : DataLoader(data['test'],  batch_size=batch_size, shuffle=True, pin_memory=True, num_workers=num_w)}

(…)

#(.....)
# Download inception
    elif model_name == "inception":
        """ Inception v3
        Be careful, expects (299,299) sized images and has auxiliary output
        """
        model_ft = models.inception_v3(pretrained=use_pretrained)
        set_parameter_requires_grad(model_ft, feature_extract)
        # Handle the auxiliary net
        num_ftrs = model_ft.AuxLogits.fc.in_features
        model_ft.AuxLogits.fc = nn.Linear(num_ftrs, num_classes)
        # Handle the primary net
        num_ftrs = model_ft.fc.in_features
        model_ft.fc = nn.Linear(num_ftrs,num_classes)
        input_size = 299

    else:
        print("Invalid model name, exiting...")
        exit()

    return model_ft, input_size

# Initialize the model for this run
model_ft, input_size = initialize_model(model_name, num_classes, feature_extract, use_pretrained=True)

Then I put the model on the GPUs:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

if torch.cuda.device_count() > 1:
  print("Let's use", torch.cuda.device_count(), "GPUs!")
  model_ft = nn.DataParallel(model_ft)

model_ft = model_ft.to(device)

We generally recommend nn.parallel.DistributedDataParallel with a single process per GPU, which avoids the communication overhead of nn.DataParallel.
Could you try that and check the performance again?
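
A minimal sketch of that setup, assuming a single machine and one process per GPU. The tiny nn.Linear model and the random TensorDataset are stand-ins for model_ft and data['train'], and the names train_worker and launch, as well as the rendezvous address and port, are illustrative assumptions. It falls back to the gloo backend on CPU so it can be tried without GPUs.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train_worker(rank, world_size, epochs=1):
    """Run one training process; `rank` is the process index (one per GPU)."""
    # Rendezvous settings for a single machine (address and port are assumptions).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    use_cuda = torch.cuda.is_available()
    backend = "nccl" if use_cuda else "gloo"
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    if use_cuda:
        torch.cuda.set_device(rank)
    device = torch.device(f"cuda:{rank}" if use_cuda else "cpu")

    # Stand-in model and data; replace with model_ft and data['train'].
    model = nn.Linear(10, 2).to(device)
    ddp_model = DDP(model, device_ids=[rank] if use_cuda else None)
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))

    # Each process reads its own shard; batch_size here is per process (per GPU).
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    last_loss = float("nan")
    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # vary the shuffle order between epochs
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(ddp_model(inputs), labels)
            loss.backward()  # gradients are averaged across processes here
            optimizer.step()
            last_loss = loss.item()

    dist.destroy_process_group()
    return last_loss


def launch():
    """Start one process per visible GPU (a single CPU process if there is none)."""
    world_size = max(torch.cuda.device_count(), 1)
    mp.spawn(train_worker, args=(world_size,), nprocs=world_size, join=True)
```

launch() should be called from under an `if __name__ == "__main__":` guard in a script, since mp.spawn starts fresh Python processes. Note that the DataLoader batch size is per process in this setup, so batch_size=32 on 8 GPUs already corresponds to a global batch of 32*8.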


Thanks @ptrblck,
I’ve found this tutorial: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

This is what I needed.

It seems I will have to investigate rank and world_size further, but they appear to be related to the number of GPUs per node/machine. If I am using a single instance, maybe I'll have world_size = 1?

Or is it world_size = 8 (GPUs) with ranks [0…7]?

I think I would go with world_size = 8, since I am not working with multiple machines/racks (which might be the case for large research centres, supercomputers, etc.). However, I am not sure this is the right understanding. Any tips?
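
For reference, a small sketch of how world size and ranks are usually laid out, assuming the common one-process-per-GPU convention: world_size counts processes across all machines, and each process gets a unique rank from 0 to world_size - 1. The helper world_layout is hypothetical (it is not part of PyTorch), and the global-rank formula is the usual convention rather than something the library enforces.

```python
def world_layout(num_nodes, gpus_per_node):
    """Return world_size and a (node_rank, local_rank, global_rank) tuple per process."""
    # One process per GPU, so the world size is the total number of GPUs.
    world_size = num_nodes * gpus_per_node
    layout = [
        (node, local, node * gpus_per_node + local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]
    return world_size, layout


# Single machine with 8 GPUs, as in this thread.
world_size, layout = world_layout(num_nodes=1, gpus_per_node=8)
for node_rank, local_rank, global_rank in layout:
    print(f"node {node_rank}, local rank {local_rank}, global rank {global_rank}")
```

Under these assumptions a single 8-GPU machine has world_size = 8 with ranks 0 to 7; on one node the local rank and the global rank coincide.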
