Efficiency of dataloader and collate for large array-like datasets

I have been writing a custom dataset to handle my HDF5-stored tables, and I really like it as an abstraction and interface. I liked it so much that I played with the class and added some flexibility that should make it efficient to gather my data.

So I, a horrible, terrible newbie and PyTorch philistine, wrote the dataset as I would intuitively use it (even outside of training loops).

BUT, I found that the PyTorch dataloader handles this quite differently, and I would like to ask for your thoughts.
In particular, I am not sure whether the dataloader is really efficient when used with the canonical HDF5 implementation, or what the trade-offs are. More on that below.

Use case

I query an HDF5 database for NLP purposes. This is simply a large array-like table, where each row of the database is a sequence of words. The reason to use HDF5 (PyTables, to be exact) is that the whole corpus is large and cannot fit into memory.

In the __getitem__ function, I open the table (according to the internet, this is necessary because HDF5 is in general not thread-safe), get the data, then tokenize and augment (padding to a fixed length). All of this works great.
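For reference, a stripped-down sketch of what the class roughly looks like (the file name, the table name "/sequences" and the dummy tokenizer are placeholders, not my actual code):

import tables
import torch
from torch.utils.data import Dataset

def tokenize_and_pad(text, max_len):
    # placeholder tokenizer: map words to dummy ids, then pad/truncate to max_len
    ids = [hash(w) % 30000 for w in str(text).split()][:max_len]
    return ids + [0] * (max_len - len(ids))

class TextDataset(Dataset):
    def __init__(self, h5_path="corpus.h5", max_len=128):
        self.h5_path = h5_path
        self.max_len = max_len
        with tables.open_file(h5_path, mode="r") as f:
            self.nitems = f.root.sequences.nrows

    def __len__(self):
        return self.nitems

    def __getitem__(self, index):
        # re-open the file on every call, since HDF5 handles should not be shared across workers
        with tables.open_file(self.h5_path, mode="r") as f:
            row = f.root.sequences[index]
        tokens = tokenize_and_pad(row, self.max_len)
        return torch.tensor(tokens).unsqueeze(0), torch.tensor([index])  # (1, length), (1,)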

My intuitive approach before looking at the dataloader

The output of dataset[i], that is, a single row, should be a PyTorch tensor of tokens, a second tensor of dimension 1 with the sequence id, and potentially other relevant data.
Let’s say the return value is a list or tuple (token_tensor(1 x length), sequence_id(1)).
So far, so good.

When writing the dataset class, I thought it would make sense to allow a query of the style dataset[i:k]. I implemented this so that the dataset class returns, for n elements, a tuple (token_tensor(n x length), sequence_id(n x 1)).

This makes sense for two reasons. First, it is nice to be able to use the dataset class as an abstraction for pulling data and doing all the processing. Second, since it queries an HDF5 file by index, it is also efficient to pull multiple entries at once, given the overhead of opening the database.

So in essence, I wrote a dataset class that can return both single entries, and batches.
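With slice support, the __getitem__ above becomes roughly this (a simplified sketch, with the same placeholder names as before):

    def __getitem__(self, index):
        # accept both dataset[i] and dataset[i:k]
        if isinstance(index, slice):
            start, stop, _ = index.indices(self.nitems)
        else:
            start, stop = index, index + 1
        with tables.open_file(self.h5_path, mode="r") as f:
            rows = f.root.sequences[start:stop]   # one bulk read for the whole range
        tokens = torch.tensor([tokenize_and_pad(r, self.max_len) for r in rows])
        return tokens, torch.arange(start, stop)  # (n, length), (n,)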

Enter dataloader

The dataloader handles things differently. It always accesses a single element and collates a batch as a tuple of single elements.
That is, instead of asking the dataset for n items of the form [i:j], the dataloader queries the dataset n times to get an n-length tuple of (token_tensor(1 x length), sequence_id(1)) pairs.

This means I have to use a custom collate function to step over the elements of this tuple (a sketch of it follows after this list). And so, I am doing some operations that technically should not be necessary and that I imagine are not efficient:

  1. List operations over tuples to collate the rows. Since I am using a table- or array-style data source, the data already comes in the form of an array that can be turned into a tensor. Why transform it into tuples of single elements?
  2. I open and close (in the case of HDF5), or at least access, the database many more times, instead of pulling a larger sample straight away. PyTables, for example, can return rows at arbitrary coordinates, which is exactly what the dataloader needs to create a sample.
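For reference, the collate function essentially just re-stacks what the dataloader hands it (a simplified sketch, assuming the per-item (token_tensor, sequence_id) tuples from the sketch above):

import torch

def text_dataset_collate(samples):
    # samples is a list of per-item tuples returned by __getitem__
    token_tensors, sequence_ids = zip(*samples)
    return torch.cat(token_tensors), torch.cat(sequence_ids)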

My question

I have now set things up such that the output of dataloader+collate and dataset[i:j] are the same. That is, I can either query a batch “raw” via dataset, or I can do so with a dataloader+collate function and get the same thing.
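As a quick sanity check of that equivalence (a sketch, assuming the simplified dataset and collate function from above):

from torch.utils.data import DataLoader
import torch

raw_tokens, raw_ids = dataset[0:200]
dl = DataLoader(dataset, batch_size=200, shuffle=False, collate_fn=text_dataset_collate)
dl_tokens, dl_ids = next(iter(dl))
assert torch.equal(raw_tokens, dl_tokens)
assert torch.equal(raw_ids, dl_ids)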

I am not asking “why does the dataloader work this way”.
I realize that the dataloader works the way it does for flexibility. In many situations, what I am doing in the dataset makes no sense or isn’t possible. And given the structure of the operation, letting the dataset (rather than the dataloader) return batches was clearly not the original intention. I get that. I am also aware that all of this may be a technical necessity related to threading and the like.

But I can’t help feeling that for many use cases, in particular those where data is queried from an array-like data source, the dataloader logic may not be efficient, for the reasons mentioned above.

My question is therefore the following: what is the correct way in PyTorch to load and sample data of this form?
Online, I have almost only found a) tutorials where the data fits into memory, in which case none of this matters, b) tutorials on non-array-like data (like images), where it makes sense to read from a single file each time, or c) approaches that are similar to mine, but which of course have the disadvantages above.

Perhaps I am missing what the correct way in PyTorch is? Perhaps there are some implementations or tools that I did not find?
Or maybe I am wrong, and loading single elements is not really a performance issue in the first place?

But if that is not true, it seems that something is missing in the way PyTorch implements datasets, namely the use case of accessing array-like databases. Of course I could write my own dataset/dataloader class to work with array-like databases, but that would mean doing all the threading and optimization myself, and as a newbie I can’t hope to match the efficiency of the PyTorch devs, so that seems like a bad idea.

Further, you may say that one should not use an HDF5 database for text and should instead load the raw text samples as a dataset. But this makes no sense to me: first, one cannot really query sequences in raw text files except with a secondary index; second, basic operations such as removing non-text, segmentation, and the like should, I think, not be part of the training loop; and finally, a compressed HDF5 file is probably faster than loading text files anyway, especially because one can query by index.

I apologize if this is all wrong. I tried hard to find “efficient” and “canonical” ways to work with HDF5 files in PyTorch, and it seems that this is what people do. But that can’t really be right.

I’d be happy for your thoughts! Thank you!

Indeed, I ran a timing test comparing the two approaches. The code is as follows:

import time
start_time = time.time()

# custom batching: one slice query of 200 rows per iteration
for i in range(0, dataset.nitems - 200, 200):
    batch = dataset[i:i + 200]
    print(i)
    print(batch[1])
    print(batch[2])

dataset.close()
print("--- %s seconds ---" % (time.time() - start_time))

And for the dataloader

import time
from torch.utils.data import DataLoader

start_time = time.time()

dataloader = DataLoader(dataset, batch_size=200, shuffle=False, num_workers=16,
                        pin_memory=False, collate_fn=text_dataset_collate)

for i, batch in enumerate(dataloader):
    print(i)
    print(batch[1])
    print(batch[2])

dataset.close()
print("--- %s seconds ---" % (time.time() - start_time))

Custom batching takes

--- 2.1859922409057617 seconds ---

The dataloader takes

--- 19.42865824699402 seconds ---

I know this is not quite perfect, because printing is part of the loop (is that significant?) and because the dataset is rather small, only 200k sequences.

But the difference is staggering, even though the dataloader runs with 16 workers while the custom batching is simply a loop.
For other batch sizes, the differences are even larger. In particular, the dataloader hovers around 20 seconds, decreasing slightly with larger batch sizes, but the custom loop goes down to something like 0.05 seconds!

Profiling shows that the main difference comes from the dataloader functions next and get_data. So it really seems that loading data one element at a time is inefficient.

What am I doing wrong?

This cannot be right.

I have replicated the test with a more realistic task, encoding the sequences with BERT from pytorch-transformers on my CPU, without any parallelism other than num_workers in the dataloader.

The outcome is the same: batching via the dataset is at least 10x faster than using a dataloader.

I am 100% sure I am doing something wrong, but my dataset class is quite standard, except that I have to open and close a database file and tokenize + pad in the __getitem__ function; of course, that is the same whether I use the dataloader or custom batches.
Thoughts?

In cases where a bulk read makes sense, you should not use the DataLoader in its default one-element-at-a-time random-read style.

Here is a quick start on how to switch to bulk loading mode.

Instead, write a sampler that yields keys representing a batch of indices (e.g., a range, or a list of indices). For `DataLoader`, set `batch_size=None` and use `sampler=` for your new sampler. In the dataset’s __getitem__, the input is then whatever your custom sampler yields, i.e., something that represents a batch of indices, so use it to index into your HDF5 file and return the whole array.
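Something along these lines (a rough sketch; the sampler class and its arguments are just an example):

from torch.utils.data import DataLoader, Sampler

class RangeSampler(Sampler):
    # yields one slice per iteration; with batch_size=None the DataLoader
    # passes each yielded slice straight to dataset[...] as a single key
    def __init__(self, n_items, batch_size):
        self.n_items = n_items
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, self.n_items, self.batch_size):
            yield slice(start, min(start + self.batch_size, self.n_items))

    def __len__(self):
        return (self.n_items + self.batch_size - 1) // self.batch_size

loader = DataLoader(dataset, batch_size=None,
                    sampler=RangeSampler(len(dataset), 200),
                    num_workers=4)

for tokens, ids in loader:
    ...  # tokens already has shape (batch, length); no collate function needed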


Ahh that makes sense. Thank you!

I will start a new question for batch sampling

For `DataLoader`, set `batch_size=None` and use `sampler=` for your new sampler.

Do you mean batch_sampler= instead of sampler=? I learned from the docs that batch_sampler= yields a batch of indices at a time, which suits this case. Here’s the doc: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader

  • sampler (Sampler, optional) – defines the strategy to draw samples from the dataset. If specified, shuffle must be False.
  • batch_sampler (Sampler, optional) – like sampler, but returns a batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler, and drop_last.

I find this part slightly confusing, as there are several options.

My concerns are in this new thread:

The issue, as I see it, is that there are in fact several ways to set up “batch sampling” in the documentation. These include writing a custom sampler, using the pre-defined BatchSampler class as the sampler (both with batch_size=None), and finally just setting the batch_sampler parameter.

However, just setting batch_size and drop_last, which the docs say is equivalent, clearly does not pass a set of indices to the dataset, but rather queries single indices which are then collated.

I vow to try all these options on my problem today or tomorrow, and to report back in detail on what gives the fastest performance when reading small, medium, and large batches from an HDF5 database.

In particular, I am unsure which is better: using a custom batch sampler with a map-style dataset, or using an iterable dataset, where the iteration yields batches of a custom size.
This is what I want to test, or find out if anyone already knows…

I meant sampler=. The approach I described effectively turns off auto-collation in PyTorch, i.e., it makes the DataLoader think that it is loading a single data sample at a time, similar to

for index in sampler:
    yield convert_to_tensor(dataset[index])

It is just that, to achieve bulk loading, the sampler returns a range / list of indices, which is then used to index the dataset as a single key. If batch_sampler is used instead, the DataLoader will unpack the list yielded by the batch_sampler and index the dataset with the resulting keys one by one.
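In rough pseudocode, the two fetch paths compare like this:

# sampler= with batch_size=None: each yielded key goes to the dataset as-is
for key in sampler:                   # key can itself be a range / list / slice
    batch = dataset[key]              # one bulk read inside __getitem__

# batch_sampler=: the yielded list is unpacked into single-index lookups
for indices in batch_sampler:
    batch = collate_fn([dataset[i] for i in indices])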


Yes, you are right that there are two ways:

  1. use an IterableDataset: the sampling is in the dataset code, which runs in the worker processes.
  2. use batch_size=None and sampler=custom_sampler_that_is_really_a_batch_sampler: the sampling is done by a separate class that runs in the main process.

In general, IterableDataset gives you finer control, because you can achieve things like a data-dependent batch size or ordering. And depending on your code, having the sampling logic and the fetching logic in the same class may be cleaner. But since sampling is done in the workers, you likely need to configure each worker to make sure they don’t load duplicate batches, or load in a correlated order. To me, the second approach makes more sense if your sampling and fetching logic are relatively isolated, and it is usually less tricky to implement.
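For the first option, a rough sketch of the worker handling with get_worker_info (the bulk read is a placeholder and assumes the table stores numeric token ids; tokenization/padding would otherwise go there):

import tables
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class BatchedH5Dataset(IterableDataset):
    def __init__(self, h5_path, n_items, batch_size):
        self.h5_path = h5_path
        self.n_items = n_items
        self.batch_size = batch_size

    def __iter__(self):
        starts = list(range(0, self.n_items, self.batch_size))
        info = get_worker_info()
        if info is not None:
            # strided split so that workers read disjoint batches
            starts = starts[info.id::info.num_workers]
        for start in starts:
            stop = min(start + self.batch_size, self.n_items)
            with tables.open_file(self.h5_path, mode="r") as f:
                rows = f.root.sequences[start:stop]   # placeholder bulk read
            yield torch.as_tensor(rows), torch.arange(start, stop)

loader = DataLoader(BatchedH5Dataset("corpus.h5", 200_000, 200),
                    batch_size=None, num_workers=4)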

Re docs: the docs could use some improvements… That is largely my fault. If I find some free time, I will update them. Sorry about that.

Ahh, now I get it with regard to batch_sampler and sampler; I thought it was the other way around. This also explains why batch_sampler=… does not speed up the dataloader.

I am in the process of benchmarking all those options for my use-case.