I have been writing a custom dataset to handle my HDF5-stored tables, and I really like it as an abstraction and interface. I liked it so much that I played with the class and added some flexibility that should make it possible to gather my data efficiently.
So I, a horrible, terrible newbie and PyTorch philistine, wrote the dataset as I would intuitively use it (even outside of training loops).
BUT, I found that the PyTorch dataloader handles this quite differently, and I would like to ask for your thoughts.
In particular, I am not sure if the dataloader is really efficient when used with the canonical HDF5 implementation, or what the trade-offs are. More on that below.
Use case
I query an HDF5 database for NLP purposes. This is simply a large array-like table, where each row is a sequence of words. The reason to use HDF5 (PyTables, to be exact) is that the whole corpus is large and cannot fit into memory.
In the `__getitem__` function, I open the table (according to the internet, this is necessary for threading, as HDF5 is in general not thread-safe), get the data, tokenize, and augment (pad to a fixed length). All of this works great.
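For concreteness, the per-item version looks roughly like this. This is only a sketch, not my actual code: the node name `sequences`, the `vocab` word-to-id mapping, and the padding scheme are placeholders.

```python
import tables
import torch
from torch.utils.data import Dataset

class H5SequenceDataset(Dataset):
    """Sketch of the per-item HDF5 dataset; names and schema are placeholders."""

    def __init__(self, h5_path, max_len, vocab):
        self.h5_path = h5_path
        self.max_len = max_len
        self.vocab = vocab  # hypothetical word -> id mapping, 0 = padding/unknown
        with tables.open_file(self.h5_path, mode="r") as h5:
            self._len = h5.root.sequences.nrows  # assumed node: one word sequence per row

    def __len__(self):
        return self._len

    def __getitem__(self, idx):
        # Open the file on every access: HDF5 handles are not safe to share
        # across DataLoader workers, so each read gets its own handle.
        with tables.open_file(self.h5_path, mode="r") as h5:
            words = [w.decode("utf-8") for w in h5.root.sequences[idx]]
        ids = [self.vocab.get(w, 0) for w in words][: self.max_len]
        ids += [0] * (self.max_len - len(ids))  # pad to fixed length
        return torch.tensor(ids, dtype=torch.long), torch.tensor([idx])
```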
My intuitive approach before looking at the dataloader
The output of `dataset[i]`, that is, a single row, should be a PyTorch tensor of tokens, a different tensor of dimension 1 with the sequence id, and potentially other relevant data. Let's say the return value is a list or tuple `(token_tensor(1 x length), sequence_id(1))`.
So far, so good.
When writing the dataset class, I thought it would make sense to allow a query of the style `dataset[i:k]`. I implemented this so that the dataset class returns, for n elements, a tuple `(token_tensor(n x length), sequence_id(n x 1))`.
This makes sense for two reasons. First, it's nice to be able to use the dataset class as an abstraction for pulling data and doing all the processing. Second, since it is querying an HDF5 file by index, it is also efficient to pull multiple entries at once, given the overhead of opening the database.
So in essence, I wrote a dataset class that can return both single entries, and batches.
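In code, the slicing support amounts to replacing the earlier `__getitem__` with something like the following (same placeholder names as above); the point is that a slice turns into a single PyTables read:

```python
    def __getitem__(self, key):
        # Accept both dataset[i] and dataset[i:k]; a slice becomes one PyTables read.
        single = not isinstance(key, slice)
        idxs = [key] if single else list(range(*key.indices(self._len)))
        with tables.open_file(self.h5_path, mode="r") as h5:
            rows = [h5.root.sequences[key]] if single else h5.root.sequences[key]
        token_rows = []
        for row in rows:
            words = [w.decode("utf-8") for w in row]
            tok = [self.vocab.get(w, 0) for w in words][: self.max_len]
            token_rows.append(tok + [0] * (self.max_len - len(tok)))
        tokens = torch.tensor(token_rows, dtype=torch.long)  # (n x length)
        seq_ids = torch.tensor(idxs).unsqueeze(1)            # (n x 1)
        if single:
            return tokens[0], seq_ids[0]                     # single row: (length,), (1,)
        return tokens, seq_ids
```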
Enter dataloader
Dataloader handles things differently. It always accesses a single element, and collates a batch as a tuple of single elements.
That is, instead of asking the dataset for n items of the form `[i:j]`, the dataloader queries the dataset n times to get an n-length tuple of `(token_tensor(1 x length), sequence_id(1))` elements.
This means I have to use a custom collate function to step over the elements of this tuple (see the sketch after the list below). So I am doing some operations that technically should not be necessary and that I'd imagine are not efficient:
- List operations over tuples to collate the rows. Since I am using a table- or array-style data source, the data already comes in the form of an array that can be turned into a tensor directly. Why transform it into tuples of single elements?
- I open and close (in the case of HDF5), or at least access, the database many, many more times, instead of pulling a larger sample straight away. PyTables, for example, can return rows at arbitrary coordinates, which is exactly what the dataloader would need to create a sample.
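Roughly, the collate function just re-stacks what was already contiguous in the file (again a sketch, matching the placeholder return types above):

```python
import torch

def collate_rows(batch):
    # batch is a list of n tuples (token_tensor(length), sequence_id(1))
    # handed over by the dataloader; stack them back into batch tensors.
    tokens = torch.stack([tokens for tokens, _ in batch])   # (n x length)
    seq_ids = torch.stack([seq_id for _, seq_id in batch])  # (n x 1)
    return tokens, seq_ids
```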
My question
I have now set things up such that the output of dataloader + collate and of `dataset[i:j]` is the same. That is, I can either query a batch “raw” via the dataset, or I can do so with a dataloader + collate function and get the same thing.
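The equivalence I mean looks roughly like this, using the placeholder names from the sketches above and assuming no shuffling:

```python
import torch
from torch.utils.data import DataLoader

ds = H5SequenceDataset("corpus.h5", max_len=128, vocab=vocab)  # placeholder path/values

# Path 1: pull a batch "raw" through the dataset's slice support (one HDF5 read).
tokens_a, ids_a = ds[0:32]

# Path 2: let the dataloader assemble the same batch item by item (32 HDF5 reads).
loader = DataLoader(ds, batch_size=32, shuffle=False, collate_fn=collate_rows)
tokens_b, ids_b = next(iter(loader))

assert torch.equal(tokens_a, tokens_b) and torch.equal(ids_a, ids_b)
```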
I am not asking “why does the dataloader work this way”.
I realize that the dataloader works the way it does for flexibility. In many situations, what I am doing in the dataset makes no sense or isn’t possible. And given the structure of the operation, letting the dataset (as opposed to the dataloader) return batches is clearly not the original intention. I get that. I am also conscious of the possibility that all this is a technical necessity related to threading and the like.
But I can’t help but feel that for many use cases, in particular those where data is queried from an array-like data source, the logic with dataloader may not be efficient for the reasons mentioned above.
My question is therefore the following: what is the correct way in PyTorch to load and sample data of this form?
Online, I have found almost only a) tutorials where the data fits into memory, in which case none of this matters, b) tutorials on non-array-like data (such as images), where it makes sense to read from single files each time, or c) approaches that are similar to mine, but which of course have the disadvantages above.
Perhaps I am missing what the correct way in PyTorch is? Perhaps there are some implementations or tools that I did not find?
Or maybe I am wrong, and loading single elements is not really a performance issue in the first place?
But if that is not the case, it seems that something is missing in the way PyTorch implements datasets, namely the use case of accessing array-like databases. Of course, I could write my own dataset/dataloader class to work with array-like databases. But that would mean doing all the threading and optimization myself, and as a newbie I can’t hope to match the efficiency of the PyTorch devs, so it seems like a bad idea.
Further, you may say that one should not use an HDF5 DB for text and should instead load the raw text samples as a dataset. But this makes no sense to me: first, one cannot really query sequences in raw text files except with a secondary database; second, basic operations such as removing non-text, segmentation, and the like should, I think, not be part of the training loop; and finally, a compressed HDF5 file is probably faster than loading text files anyway, especially because one can query by index.
I apologize if this is all wrong. I tried hard to find “efficient” and “canonical” ways to work with HDF5 files in PyTorch, and it seems that this is what people do. But that can’t really be right.
I’d be happy for your thoughts! Thank you!