I am trying to take advantage of PyTorch's multi-GPU support on a single machine by using `nn.DataParallel`.

Note: I am using a framework called fastai2, which builds on top of PyTorch, so my scripts will have a bit of that sprinkled in.
```python
import numpy as np
from fastai2.vision.all import *
from fastai2.distributed import *

def train():
    path = untar_data(URLs.CAMVID_TINY)

    def label_func(fn):
        return path/"labels"/f"{fn.stem}_P{fn.suffix}"

    codes = np.loadtxt(path/'codes.txt', dtype=str)
    fnames = get_image_files(path/"images")
    dls = SegmentationDataLoaders.from_label_func(
        path, bs=8, fnames=fnames, label_func=label_func, codes=codes
    )

    learner = unet_learner(dls, resnet34).to_fp16()

    # wrap the model for multi-GPU training when more than one GPU is available
    if torch.cuda.device_count() > 1:
        wrapped_model = nn.DataParallel(learner.model)
        learner.model = wrapped_model.module

    callbacks = [
        EarlyStoppingCallback(min_delta=0.001, patience=5)
    ]
    learner.fine_tune(20, freeze_epochs=2, wd=0.01, base_lr=0.0006, cbs=callbacks)
    print('Done')

if __name__ == "__main__":
    train()
```
`unet_learner` returns a `Learner` whose `.model` attribute is an `nn.Module`, which is what I am trying to wrap with `nn.DataParallel`.
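
For reference, the pattern I am trying to reproduce is the standard one from the PyTorch docs, where the `DataParallel` wrapper is kept and called directly for the forward pass. A minimal sketch (the `ToyNet` module, tensor shapes, and batch size are made up for illustration):

```python
import torch
import torch.nn as nn

# hypothetical toy module, just to illustrate the wrapping pattern
class ToyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = ToyNet()
if torch.cuda.device_count() > 1:
    # DataParallel splits each input batch across the visible GPUs
    model = nn.DataParallel(model)
model.to(device)

# the wrapped model is used directly in the training loop
out = model(torch.randn(32, 10, device=device))
print(out.shape)  # torch.Size([32, 2])
```

With that pattern, `DataParallel` splits each batch along dimension 0 and scatters the chunks across the visible GPUs.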
Problem
This does not seem to have the intended effect. I am still only able to use 1 GPU.
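
In case it is relevant, this is roughly how I check what PyTorch can see (the printed values obviously depend on the machine):

```python
import torch

print(torch.cuda.is_available())   # expect True
print(torch.cuda.device_count())   # should be > 1 for DataParallel to do anything
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```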
I tried changing the batch size (`bs` in `SegmentationDataLoaders`) as well, and that did not make any difference other than running out of GPU memory.
Any ideas on what I might be missing?