Add multiple GPUs to code for feature extraction

Hi,

I would like to use multiple GPUs in different parts of my code. I am extracting features from several different magnifications of the same image; however, using a single GPU is quite slow. I was wondering whether there is a simple way of speeding this up, perhaps by assigning a different GPU device to each input? I'm unsure how to proceed…

Check out my code below:

I have simplified it to only two magnifications (20x and 40x).

In the for loop, the inputs are moved to the available device and the features are extracted. I then concatenate the two output features and move them to the CPU before appending them to the HDF5 dataset (they are moved to the CPU because h5py expects NumPy arrays).

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

with h5py.File(path, mode='r+') as hdf5_file:
    array_40 = hdf5_file[f'{phase}_40x_arrays']
    array_20 = hdf5_file[f'{phase}_20x_arrays']
    array_all = hdf5_file[f'{phase}_all_arrays']
    array_labels = hdf5_file[f'{phase}_labels']
    array_batch_idx = hdf5_file[f'{phase}_batch_idx']
    array_paths = hdf5_file[f'{phase}_img_paths']

    batch_idx = int(array_batch_idx[0] + 1)
    print("Batch ID is restarting from {}".format(batch_idx))

    dataloaders_dict = torch.utils.data.DataLoader(
        datasets_dict, batch_size=args.batch_size,
        sampler=SequentialSampler2(datasets_dict, batch_idx),
        num_workers=args.num_workers, shuffle=False)

    for i, (inputs40x, inputs20x, paths40x, paths20x, labels) in enumerate(dataloaders_dict):

        print(f'Batch ID: {batch_idx}')

        inputs40x = inputs40x.to(device)
        inputs20x = inputs20x.to(device)

        labels = labels.to(device)
        paths = paths40x

        # delete the last fc layer.
        modules = list(resnet50.children())[:-1]
        resnet = nn.Sequential(*modules)

        x40 = resnet(inputs40x)
        x20 = resnet(inputs20x)
        x_all = torch.cat([x40, x20], dim=1)

        # add to index
        array_40[batch_idx, ...] = x40.cpu()
        array_20[batch_idx, ...] = x20.cpu()
        array_all[batch_idx, ...] = x_all.cpu()
        array_labels[batch_idx, ...] = labels[:].cpu()
        array_batch_idx[:, ...] = batch_idx
        array_paths[batch_idx, ...] = paths

        batch_idx += 1

If I understand your use case correctly, you would just want to extract the features from a resnet without the last classification layer.
If that’s correct, you could wrap the inference code in a with torch.no_grad() block to avoid storing the intermediate tensors, which would be needed for the gradient calculation.
Also calling model.eval() might be necessary to use the running stats in batchnorm layers and disable dropout.
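Something like this minimal sketch (the dummy input just stands in for one of your batches):

import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

resnet50 = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(resnet50.children())[:-1])  # drop the last fc layer
feature_extractor.to(device)
feature_extractor.eval()   # use running stats in batchnorm, disable dropout

x = torch.randn(1, 3, 224, 224, device=device)  # stand-in for one of your input batches
with torch.no_grad():                           # intermediate activations are not stored
    feats = feature_extractor(x)                # -> torch.Size([1, 2048, 1, 1])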

If you have multiple GPUs, you could also use nn.DataParallel and increase the batch size.
This will chunk the data in dim0 and send each chunk to a model replica on the different devices.
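A rough sketch of that setup with a larger (dummy) batch:

import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda")
feature_extractor = nn.Sequential(*list(models.resnet50(pretrained=True).children())[:-1])
feature_extractor = nn.DataParallel(feature_extractor).to(device)  # one replica per visible GPU
feature_extractor.eval()

x = torch.randn(16, 3, 224, 224, device=device)  # batch_size=16 is just an example
with torch.no_grad():
    feats = feature_extractor(x)                 # x is split along dim0, one chunk per GPU
print(feats.shape)                               # torch.Size([16, 2048, 1, 1])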

Hi,

Thanks for the reply.

Regarding with torch.no_grad(): would this be to avoid storing the intermediate tensors in memory? My current output for x40 is torch.Size([1, 2048, 1, 1]) (batch, features, 1, 1), so I assumed the features were being extracted correctly. Also, I'm not actually training anything. Before the code shown above, I define the resnet model as follows, where I thought I was already handling this:

resnet50 = models.resnet50(pretrained=True)
resnet50.to(device)
for param in resnet50.parameters():
    param.requires_grad = False

I didn’t include this because there is a lot of code to show!

Furthermore, I have checkpointing in place, which requires the processing to be sequential: if the loop breaks, it essentially restarts from the last image that was processed for feature extraction. I am creating an HDF5 dataset that will be used for further training later on, which is why I use a batch size of 1. I was wondering if I could use multiple GPUs to speed up the feature extraction for each input without changing the batch size, as each sample's position in the table needs to remain the same.
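For context, SequentialSampler2 is just a small custom sampler that restarts from the saved batch index; simplified, it looks roughly like this:

from torch.utils.data import Sampler

class SequentialSampler2(Sampler):
    """Yield dataset indices in order, starting from start_idx (simplified)."""

    def __init__(self, data_source, start_idx=0):
        self.data_source = data_source
        self.start_idx = start_idx

    def __iter__(self):
        return iter(range(self.start_idx, len(self.data_source)))

    def __len__(self):
        return len(self.data_source) - self.start_idx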

Cheers,

In a data parallel setup, the outputs of the model replicas are gathered in the same order as the data was chunked before the forward pass.
I.e. if you increase the batch size and make sure that the order of the samples inside the batch is correct, the extracted features will keep that order.

Freezing all parameters should also work, so you wouldn’t need the no_grad statement.
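As a rough sketch, reusing the array names and checkpoint logic from your code (sample_idx stands for the row of the first sample in the current batch):

feats40 = resnet(inputs40x)   # [batch_size, 2048, 1, 1], parameters are frozen
feats40 = feats40.cpu().numpy()

batch_size = feats40.shape[0]
array_40[sample_idx:sample_idx + batch_size, ...] = feats40   # rows keep their sequential positions
array_batch_idx[:, ...] = sample_idx + batch_size - 1         # checkpoint: last written row
sample_idx += batch_size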


Hi,

Apologies for the delayed reply.

I’ve attempted to add 2 GPUs to the code via torch.nn.DataParallel.

I get a new error:

TypeError: Can’t broadcast (4, 3, 224, 224) -> (4, 2048, 1, 1)

From my understanding, it is trying to write the raw images (batch, channels, height, width) into the dataset rather than the extracted features.
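A quick print seems to confirm that x40 still has the image shape rather than the feature shape:

print(x40.shape)  # torch.Size([4, 3, 224, 224]) instead of the expected [4, 2048, 1, 1]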

Here is my updated (and simplified) code:

# where I tell the machine to use CUDA
device = torch.device("cuda")
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")


# where I define the model (way before the extraction loop)
resnet50 = models.resnet50(pretrained=True)
resnet50 = torch.nn.DataParallel(resnet50)
resnet50.to(device)
for param in resnet50.parameters():
    param.requires_grad = False


# after defining the dataloaders, feature extraction happens here and the results are appended to the hdf5 dataset
for i, (inputs40x, inputs20x, paths40x, paths20x, labels) in enumerate(dataloaders_dict):

    inputs40x = inputs40x.to(device)
    inputs20x = inputs20x.to(device)

    labels = labels.to(device)
    paths = paths40x

    # delete the last fc layer.
    modules = list(resnet50.children())[:-1]
    resnet = nn.Sequential(*modules)
    x40 = resnet(inputs40x)
    x20 = resnet(inputs20x)

    x_all = torch.cat([x40, x20], dim=1)

    # torch.Size([1, 2048, 1, 1]) batch, feats, 1, 1
    array_40[batch_idx, ...] = x40.cpu()
    array_20[batch_idx, ...] = x20.cpu()
    array_all[batch_idx, ...] = x_all.cpu()
    array_labels[batch_idx, ...] = labels[:].cpu()
    array_paths[batch_idx, ...] = paths
    array_batch_idx[:, ...] = batch_idx

Where do I need to make changes for torch.nn.DataParallel to work? For example, in another task I was able to access the original model's layers by adding .module before .fc. It's unclear for this task, as I am purely extracting features without doing any training.

Cheers,

I would recommend setting up the model before wrapping it in nn.DataParallel.
Re-creating resnet from the wrapped model via:

modules = list(resnet50.children())[:-1]
resnet = nn.Sequential(*modules)

might have side effects and I’m not sure how this would interact with nn.DataParallel.
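I.e. something like this sketch, keeping your variable names:

import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda")

resnet50 = models.resnet50(pretrained=True)
for param in resnet50.parameters():
    param.requires_grad = False

# remove the last fc layer first, then wrap the truncated model
resnet = nn.Sequential(*list(resnet50.children())[:-1])
resnet = torch.nn.DataParallel(resnet)
resnet.to(device)
resnet.eval()

# inside the loop, call the wrapped model directly:
# x40 = resnet(inputs40x)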


Hi,

Setting up the model first appears to have worked seamlessly. I'm not sure about side effects, but I hope I have defined the model and frozen the parameters correctly. Here is what runs without producing any errors:

resnet50 = models.resnet50(pretrained=True)
for param in resnet50.parameters():
    param.requires_grad = False
# delete the last fc layer.
modules = list(resnet50.children())[:-1]
resnet = nn.Sequential(*modules)
resnet = torch.nn.DataParallel(resnet)
resnet.to(device)
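And in the extraction loop I now just call the wrapped model directly (a sketch with the same inputs as before):

with torch.no_grad():
    x40 = resnet(inputs40x.to(device))    # [batch, 2048, 1, 1]
    x20 = resnet(inputs20x.to(device))
    x_all = torch.cat([x40, x20], dim=1)  # [batch, 4096, 1, 1]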