Rookie question: how to speed up data loading in PyTorch

Hi, I was trying some preliminary models on a huge single data file (~150 GB; each row represents a data point) that cannot be loaded into memory all at once. I therefore wrote the data iterator below for batch-wise loading and training, but found that loading is very slow: with a batch size of 1024 it takes ~0.9 s per batch, making data loading the bottleneck of the experiment. It seems that torch.utils.data.DataLoader is preferable for datasets where each data point can be accessed efficiently, and is not suitable for this case.

I was wondering if anyone could recommend ways to improve the loading efficiency. Many thanks in advance.

import json

import numpy as np


def data_iterator(data_path, batch_size):
    """Yield (X, y) batches from a tab-separated file too large for memory."""
    print("Loading data in {}".format(data_path))
    count = 0
    X, y = [], []
    # Open in text mode ('r'), so split('\t') operates on str, not bytes.
    with open(data_path, 'r') as fr:
        for line in fr:
            items = line.strip().split('\t')
            label = int(items[2])
            trip = json.loads(items[5])
            for frame in trip:
                count += 1
                X.append(np.asarray(frame).reshape(1, 200, 50))
                y.append(label)
                if count % batch_size == 0:
                    yield X, y
                    X, y = [], []  # fresh lists for the next batch
    if X:
        yield X, y  # emit the final, possibly smaller batch

Try to (safely) use multiple workers for loading. The torch DataLoader already has built-in support for loading with multiple worker processes and might prove useful. If you train on a GPU, loading and pre-processing the data in main memory on the CPU adds almost no overhead, since it is hidden by the GPU computation (i.e. while an iteration runs on the GPU, the CPU loads the next batch).
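For instance, a minimal sketch (the TensorDataset below is just a toy stand-in for a real dataset, and a GPU is assumed for the .cuda() calls):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data shaped like the frames in the question (1 x 200 x 50).
dataset = TensorDataset(torch.randn(10000, 1, 200, 50),
                        torch.randint(0, 2, (10000,)))

# num_workers > 0 loads batches in background worker processes;
# pin_memory=True speeds up the later host-to-device copies.
loader = DataLoader(dataset, batch_size=1024, shuffle=True,
                    num_workers=4, pin_memory=True)

for X, y in loader:
    X = X.cuda(non_blocking=True)  # copy can overlap GPU compute
    y = y.cuda(non_blocking=True)
    # ... forward / backward / optimizer step ...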

When it comes to large datasets that won't fit in RAM, a memory-mapped database can help. Perhaps try storing your (processed) data in an LMDB (Lightning Memory-Mapped Database) instance.

EDIT: If you save your data in a database, using DataLoader will be much easier. Large text files are hard to work with, since they are read line by line for efficiency and random access to a particular line is slow.
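For example, a rough one-off conversion sketch (untested; the field indices and reshape mirror the code in the question, while the key scheme, the stored __len__ record, the float32 dtype, and the map_size are just one possible choice):

import json
import pickle

import lmdb
import numpy as np


def build_lmdb(data_path, lmdb_path, map_size=200 * 2**30):
    # map_size is an upper bound on the database size (here ~200 GB).
    env = lmdb.open(lmdb_path, map_size=map_size)
    idx = 0
    with env.begin(write=True) as txn, open(data_path, 'r') as fr:
        for line in fr:
            items = line.strip().split('\t')
            label = int(items[2])
            for frame in json.loads(items[5]):
                arr = np.asarray(frame, dtype=np.float32).reshape(1, 200, 50)
                txn.put(str(idx).encode(), pickle.dumps((arr, label)))
                idx += 1
        txn.put(b'__len__', str(idx).encode())  # record count for later use
    env.close()

Once the data is in LMDB, each record can be fetched by key in roughly constant time, which is exactly the access pattern DataLoader needs.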


Hi Demtris,

Thank you very much for your detailed answer!

As you said, we are reading a large text file that won't fit in RAM, and I will definitely try the LMDB you recommended and see what happens.

I understand your point about loading on the CPU adding almost no overhead while training on the GPU, and I am curious whether there is a simple way to write a data loader that loads data independently (e.g., using a queue of pre-loaded batches, as is done in TensorFlow).

This is exactly what DataLoader does if you set num_workers > 0 (the name DataLoader is unfortunate in my opinion, since it is really an iterator).

To use DataLoader in your case, you will need to implement your own dataset and pass it to the DataLoader when you create it. Your dataset will be a class that implements __getitem__(self, index) (and __len__(self)). __getitem__ will load and return a single data point from your database (or from wherever else you choose to load your data). You could even have it read from a text file, but you might run into problems with that when using multiple workers; see the sketch below.
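As a rough sketch (untested), assuming the LMDB layout from the conversion example above, i.e. keys are stringified indices, values are pickled (frame, label) pairs, and a b'__len__' record stores the count:

import pickle

import lmdb
import torch
from torch.utils.data import Dataset


class LMDBDataset(Dataset):
    def __init__(self, lmdb_path):
        # NOTE: with num_workers > 0 you may want to open the environment
        # lazily inside each worker, since LMDB handles are not fork-safe.
        self.env = lmdb.open(lmdb_path, readonly=True, lock=False)
        with self.env.begin() as txn:
            self.length = int(txn.get(b'__len__'))

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        with self.env.begin() as txn:
            frame, label = pickle.loads(txn.get(str(index).encode()))
        return torch.from_numpy(frame), label

You would then create the loader with something like DataLoader(LMDBDataset('trips.lmdb'), batch_size=1024, shuffle=True, num_workers=4), where 'trips.lmdb' is a hypothetical path.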

See the example dataset linked here, which loads images from directories.

Got it~~ Thank you very much for your patient answer. Since a text file is usually read line by line, it seems difficult for a single-threaded DataLoader to do simple operations like shuffling, correct? I will try the LMDB solution and share my feedback soon. Thanks again :)

Hello, have you found an efficient way to load the data? Thank you~