Most efficient way of loading data

After a couple of weeks of working intensively with PyTorch, I am still wondering what the most efficient way of loading data on the fly is, i.e. without loading the entire dataset into RAM.

The dataloader tutorial reads a CSV file and then loads PNGs in every call to __getitem__().
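For reference, the pattern from that tutorial roughly boils down to something like this (a sketch only; the directory layout and the file_name/label CSV columns are made up for illustration):

import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class CsvPngDataset(Dataset):
    def __init__(self, csv_file, image_dir, transform=None):
        # The CSV with annotations is parsed once; the images stay on disk.
        self.annotations = pd.read_csv(csv_file)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        # Every call decodes a single PNG from disk.
        row = self.annotations.iloc[idx]
        image = Image.open(f"{self.image_dir}/{row['file_name']}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        label = torch.tensor(int(row['label']))
        return image, label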

I used to use HDF5, but I cannot get rid of some nasty bottlenecks, plus there is the looming danger of receiving corrupted data due to HDF5's multiprocessing issues, even though I included several checks to catch those errors.
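(One commonly suggested workaround for those multiprocessing issues is to open the HDF5 file lazily inside each worker instead of in __init__(), so no file handle is shared across processes. A rough sketch, assuming an HDF5 file with hypothetical "images" and "labels" datasets:)

import h5py
import torch
from torch.utils.data import Dataset

class LazyH5Dataset(Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.file = None  # opened lazily, once per worker process
        with h5py.File(h5_path, "r") as f:
            self.length = len(f["images"])

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        # Open the file on first access inside the worker so that no
        # handle is inherited across the fork from the main process.
        if self.file is None:
            self.file = h5py.File(self.h5_path, "r")
        image = torch.from_numpy(self.file["images"][idx])
        label = torch.tensor(self.file["labels"][idx])
        return image, label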

PyTorch offers a FakeData dataset that keeps my Titan X at about 65% utilization during training without it ever dropping, which is quite nice. Apparently it returns PIL images, but does that mean the images are really saved as PIL on disk?
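A minimal usage sketch of what I mean with torchvision's FakeData (image size, class count and worker count picked arbitrarily):

import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import FakeData

# FakeData generates a random PIL image per __getitem__() call in memory;
# nothing is stored on or read from disk.
dataset = FakeData(size=1000, image_size=(3, 480, 640),
                   num_classes=10, transform=T.ToTensor())
loader = DataLoader(dataset, batch_size=20, shuffle=True, num_workers=4)

for images, labels in loader:
    pass  # feed the batch to the model here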

Then there is the option of saving the data via torch.save() prior to the training process. Would that be beneficial under certain circumstances?
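What I have in mind is something along these lines, i.e. preprocess everything once, persist the tensors, and just torch.load() them before training (the file name and the commented-out preprocessing step are placeholders):

import torch
from torch.utils.data import DataLoader, TensorDataset

# One-off preprocessing step, done once before any training run:
# images = torch.stack([preprocess(p) for p in image_paths])
# torch.save({"images": images, "labels": labels}, "train_tensors.pt")

# At training time a single torch.load() replaces thousands of PNG reads.
blob = torch.load("train_tensors.pt")
dataset = TensorDataset(blob["images"], blob["labels"])
loader = DataLoader(dataset, batch_size=20, shuffle=True)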

My goal is to get rid of the regular GPU utilization drops after every epoch, and I have tried every avenue so far, but without much success.

I even loaded my entire dataset into RAM, roughly 2 GB, but I still got the drops…

I’d also be interested in the answers. As you say, there are GPU utilization drops after every epoch that I haven’t been able to get around. What is it that takes time here? The sampler, spinning up new workers, or something else?

There is only ever a significant performance hit for small datasets where one has to shuffle/resample often. Could I ask how quickly you go through an epoch?

I would say no. The only use case I see is if you want to do pre-processing that takes a lot of time and wouldn’t be necessary every epoch, but can’t be done before training.

Just to be clear, are you seeing the GPU drops only after each epoch, or also between batches? You are running several workers, right?

I have been experimenting with every possible number of workers, using a GPU (Titan X, 1080 Ti) as well as CPU training. The drops occur before/after every epoch. Currently an epoch takes around 8 s for images of size 480x640 and a batch size of 20.


I see. The simplest fix would be to put your data into RAM / save it in a variable, but as this wasn’t what you were after, let’s try to find another solution.

You could artificially increase your dataset size. This can be done by returning a multiple of your dataset size in the __len__() method. Then you’d get high indexes in the __getitem__() method, which you’d have to map back to your actual data, with a modulo operation for example.

With this approach the problem of putting the same image in a batch twice arises, so you’d have to write a custom sampler to avoid that. You’d probably want to drop the last batch if it’s smaller than the batch size as well.
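Roughly like this, as a sketch of the wrapper idea (the custom sampler that rules out duplicates within a batch is left out here):

import torch
from torch.utils.data import Dataset

class RepeatedDataset(Dataset):
    ''' Reports a length `repeats` times larger than the wrapped dataset,
        so one DataLoader "epoch" covers several passes over the real data. '''

    def __init__(self, base_dataset, repeats=10):
        self.base = base_dataset
        self.repeats = repeats

    def __len__(self):
        # The DataLoader sees the inflated length ...
        return len(self.base) * self.repeats

    def __getitem__(self, idx):
        # ... and indexes beyond the real size wrap around via modulo.
        return self.base[idx % len(self.base)]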

You updated your post about putting the data into RAM -> if you aren’t doing any data augmentation, you could iterate over the dataset with the dataloader once and save the data in a variable, like a list. For the second (and all following) epochs you iterate over the list instead of going through the dataloader.
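Something along these lines (the dataset here is a random-tensor placeholder; note that the cached batches keep their composition, you can only reshuffle their order):

import random
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute the real one here.
dataset = TensorDataset(torch.randn(100, 3, 480, 640),
                        torch.zeros(100, dtype=torch.long))
loader = DataLoader(dataset, batch_size=20, shuffle=True, num_workers=4)

n_epochs = 10
cached_batches = []

# Epoch 0: go through the DataLoader once and keep every batch.
for images, labels in loader:
    cached_batches.append((images, labels))
    # ... training step ...

# Remaining epochs: iterate over the plain list, no workers or sampling involved.
for epoch in range(1, n_epochs):
    random.shuffle(cached_batches)  # shuffles batch order, not batch contents
    for images, labels in cached_batches:
        pass  # ... training step ...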

Do you feel me?

Actually, that’s what I did. I saved the entire dataset into torch.tensor variables inside the __init__() function of my dataset class.

Then I accessed only specific indices in the __getitem__() method, but that did not improve the GPU drops.
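In other words, roughly this pattern (with random placeholder tensors standing in for my real data):

import torch
from torch.utils.data import Dataset

class InMemoryDataset(Dataset):
    def __init__(self):
        # Everything is built once in __init__ and kept in RAM for the whole run.
        self.images = torch.randn(100, 3, 480, 640)
        self.labels = torch.zeros(100, dtype=torch.long)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        # __getitem__ only indexes tensors that already live in RAM.
        return self.images[idx], self.labels[idx]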

Well, there seems to be some overhead from using the Dataset + DataLoader classes. Create a TensorDataset (or some other data structure) outside of your training loop without the Dataset + DataLoader machinery. Then you can come up with your own (random) indexes and simply fetch the relevant items from your data structure. This way you load everything into RAM and don’t have to deal with the DataLoader overhead each epoch.


Thank you, mate. I was thinking about that for a while. But then I cannot take advantage of multiprocessing like I can with num_workers > 0, can I?

Also, I haven’t yet completely understood the use case of TensorDataset; would you care to elaborate a little?

No, you can’t. But I don’t think you need to, since the number of workers is primarily useful when you want to fetch data from disk or perform data augmentation (or other pre-processing).

TensorDataset is just a dataset made from tensors, without the whole torch.utils.data.Dataset boilerplate. I made an example that you might find useful. Note that this example probably doesn’t play well with datasets whose size isn’t evenly divisible by the batch size, e.g. a dataset of 14 with a batch size of 4 -> 4+4+4=12, which leaves 2 items, not enough for a full batch. That’s easily fixable with an if statement :)

Also note that I hard-coded the number of labels to 1 in two places, and more_itertools needs to be pip installed.

import torch
from torch.utils.data import TensorDataset
import random
import more_itertools


def load_data():
    # Fake data. You can also load your images and convert them into tensors.
    number_images = 100
    images = torch.randn(number_images, 3, 2, 2)
    labels = torch.ones(number_images, 1)
    return TensorDataset(images, labels)


def get_batch(dataset, batch_idx):
    ''' Returns the data items given batch indexes '''

    # Set up the datastructures
    im_size = dataset[0][0].size()
    batch_size = len(batch_idx)
    batch_data = torch.empty((batch_size, *im_size))
    batch_labels = torch.empty((batch_size, 1))

    # Add data to datastructures
    for i, data_idx in enumerate(batch_idx):
        data, label = dataset[data_idx]
        batch_data[i] = data
        batch_labels[i] = label

    return batch_data, batch_labels


dataset = load_data()
data_length = len(dataset)

batch_size = 10
n_epochs = 10
for epoch in range(n_epochs):
    # Create indexes, shuffle them and split them into batches
    indexes = list(range(data_length))
    random.shuffle(indexes)
    indexes = more_itertools.chunked(indexes, batch_size)

    for batch_idx in indexes:
        images, labels = get_batch(dataset, batch_idx)
        # You can now work with your data

Thanks a lot for the detailed example. I will try that out, compare it to a custom dataloader version, and then report results here =)
