How to correctly use GPU for tensor operations?

I am trying to reconstruct a rather complicated neural network for shape & appearance disentanglement. It applies several transformations to the tensors along the way. Now I am wondering whether and how to correctly send data to the GPU so that training is efficient without wasting GPU memory.

The first question arises within the Dataset class. It currently looks as follows:

from torch.utils.data import Dataset
from torchvision import transforms

class ImageDataset(Dataset):
    def __init__(self, images, arg):
        super(ImageDataset, self).__init__()
        self.device = arg.device
        self.brightness = arg.brightness_var
        self.contrast = arg.contrast_var
        self.saturation = arg.saturation_var
        self.hue = arg.hue_var
        self.scal = arg.scal
        self.tps_scal = arg.tps_scal
        self.rot_scal = arg.rot_scal
        self.off_scal = arg.off_scal
        self.scal_var = arg.scal_var
        self.augm_scal = arg.augm_scal
        self.images = images
        self.transforms = transforms.Compose([transforms.ToTensor(),
                                              transforms.Normalize([0.5], [0.5])])

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # Select Image
        image = self.images[index]

        # Get parameters for transformations
        tps_param_dic = tps_parameters(1, self.scal, self.tps_scal, self.rot_scal, self.off_scal,
                                       self.scal_var, self.augm_scal)
        coord, vector = make_input_tps_param(tps_param_dic)

        # Make transformations
        x_spatial_transform = self.transforms(image).unsqueeze(0).to(self.device)
        x_spatial_transform, t_mesh = ThinPlateSpline(x_spatial_transform, coord,
                                                      vector, 128, self.device)
        x_spatial_transform = x_spatial_transform.squeeze(0)
        x_appearance_transform = K.ColorJitter(self.brightness, self.contrast, self.saturation, self.hue)
        original = self.transforms(image)
        coord, vector = coord[0], vector[0]

        return original, x_spatial_transform, x_appearance_transform, coord, vector

The ThinPlateSpline function is rather complicated; it performs a TPS transformation. Some tensors are created inside it, and since I need the function again later, I have to specify the device. As an example, it contains code like this:

def ThinPlateSpline(U, coord, vector, out_size, device, move=None, scal=None):
    coord = torch.flip(coord, [2])
    vector = torch.flip(vector, [2])

    num_batch, channels, height, width = U.shape
    out_height = out_size
    out_width = out_size
    height_f = torch.tensor([height], dtype=torch.float32).to(device)
    width_f = torch.tensor([width], dtype=torch.float32).to(device)
    num_point = coord.shape[1]

The tensors height_f and width_f are therefore created on every call of the function, and I wonder whether this is a problem. Is there a better way to perform operations on my data within the architecture?

Also, should I send the data to the GPU within the Dataset class?

Thanks for your help!

Do the transforms on the GPU. Have the DataLoader return unscaled 8-bit integer images on the CPU. Once they are collated, you can transfer the whole batch to the GPU and apply the first set of transforms (self.transforms) there. Note: you would have to change the normalization mean and std to reflect the unscaled values.
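
As a rough sketch of what I mean: RawImageDataset and normalize_on_device are made-up names, and I am assuming the images are stored as uint8 in (H, W, C) layout, which may not match your data. The normalization constants are the unscaled equivalent of ToTensor() followed by Normalize([0.5], [0.5]).

```python
import torch
from torch.utils.data import DataLoader, Dataset


class RawImageDataset(Dataset):
    """Hypothetical dataset that returns unscaled uint8 images on the CPU."""

    def __init__(self, images):
        # Assumption: images is a uint8 tensor of shape (N, H, W, C).
        self.images = images

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        # No scaling, no normalization, no device transfer here.
        return self.images[index]


def normalize_on_device(batch_uint8, device):
    """Move a collated uint8 batch to `device` and normalize it there."""
    batch = batch_uint8.to(device, non_blocking=True)
    # (B, H, W, C) uint8 -> (B, C, H, W) float32
    batch = batch.permute(0, 3, 1, 2).float()
    # (x / 255 - 0.5) / 0.5 is the same as (x - 127.5) / 127.5
    return (batch - 127.5) / 127.5


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Random stand-in data: 16 RGB images of size 32x32.
images = torch.randint(0, 256, (16, 32, 32, 3), dtype=torch.uint8)
loader = DataLoader(RawImageDataset(images), batch_size=8,
                    pin_memory=torch.cuda.is_available())

batch = next(iter(loader))            # uint8, still on the CPU
x = normalize_on_device(batch, device)  # float32 in [-1, 1], on `device`
```

The point of this layout is that the batch crosses to the GPU once, as 8-bit data, and all floating-point work happens after the transfer.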

Also, the rest of the code can all be run on the GPU.
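
For the small tensors inside ThinPlateSpline, one option is to take the device from the input tensor instead of passing it around, and to create the tensors directly on that device rather than on the CPU followed by a copy. A minimal sketch, where tps_sizes is a hypothetical helper covering only the first few lines of your function:

```python
import torch


def tps_sizes(U):
    """Hypothetical helper: build height_f and width_f on U's own device."""
    num_batch, channels, height, width = U.shape
    device = U.device  # follow the input instead of a separate argument
    # device=... allocates on the target device directly, with no .to() copy.
    height_f = torch.tensor([height], dtype=torch.float32, device=device)
    width_f = torch.tensor([width], dtype=torch.float32, device=device)
    return height_f, width_f


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
U = torch.zeros(2, 3, 4, 5, device=device)
height_f, width_f = tps_sizes(U)
```

Creating two one-element tensors per call should be negligible next to the rest of the transformation, so I would not worry about it.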

Thanks, that actually helped me a lot. In fact, I removed the transformations within the DataLoader completely and now only return the image.

Another question:

For my training I am using a dataset that lies on a server. It contains about 10,000 images of size 256x256, so it is quite large. Currently I am loading the dataset as a numpy array:

# Load Datasets
data = load_images_from_folder()
train_data = np.array(data[:-1000])
train_dataset = ImageDataset(train_data, arg)
test_data = np.array(data[-1000:])
test_dataset = ImageDataset(test_data, arg)

# Prepare Dataloader & Instances
train_loader = DataLoader(train_dataset, batch_size=bn, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=bn)

The function load_images_from_folder just returns a list containing all images. Is that the correct approach?

Edit: I should add that I am asking because it takes about 5 minutes before my network starts training.

In your case the dataset fits into RAM and you are taking advantage of that. The initial slowdown you are seeing is a one-time cost, since you don’t have to load the images again afterwards (loading is usually the biggest bottleneck).

Whether to load the entire dataset into RAM depends on which is the bottleneck, the CPU or the GPU. If the GPU is the bottleneck, then there is no need to load the images into RAM.
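
If you want to avoid the up-front loading, a lazy variant reads each image only when it is requested. This is a sketch under assumptions: LazyImageDataset is a made-up name, and I am pretending the images are stored as .npy files so that the example is self-contained. With real image files you would replace np.load with your own image reader.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class LazyImageDataset(Dataset):
    """Hypothetical dataset that keeps only file paths in memory."""

    def __init__(self, paths):
        self.paths = list(paths)

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # The file is read here, on demand, not in __init__.
        return torch.from_numpy(np.load(self.paths[index]))


# Demo data: six small random uint8 "images" written to a temporary folder.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(6):
    path = os.path.join(tmp_dir, f"img_{i}.npy")
    np.save(path, np.random.randint(0, 256, (8, 8, 3), dtype=np.uint8))
    paths.append(path)

dataset = LazyImageDataset(paths)
# num_workers=0 keeps the demo simple; a value > 0 loads batches
# in background worker processes while the GPU is busy.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
batches = [batch for batch in loader]
```

Training can then start immediately, at the price of reading from disk during every epoch.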