Bulk Batch Sampling Options

IngoMarquart · October 31, 2019, 12:11pm

I am essentially looking for the best practice for loading bulk batches of data. I find very little definitive information in the docs or online, even though I’d imagine this is quite a common use case.

Task description

My specific use case has two important features:

The data does not fit into memory and needs to be fetched from a table / array style database. However, given that it is array-like, it can be indexed very easily (or, in the case of a DB, queried even non-sequentially). An adjacent use case would be data that is chunked in some way.
As a result, it is inefficient to use the standard dataloader behavior (including the batch_size parameter), because the dataset is queried with single indices, which leads to a) many small queries against the database and b) many collate operations (in case a custom collate is necessary).

In my use case, even a custom “dumb” loop over a dataset written to take ranges of indices is 10x faster than the default dataloader with 16 workers on 16 CPU cores.

Proposed options

I gather the following options. Assume my dataset is able to take lists of indices and return a finished batch consisting of a data tensor and some labeling tensor, but can also take a single index to work with the standard batching.
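To make that assumption concrete, here is a minimal sketch of such a dataset. The class name RangeQueryDataset is hypothetical, and an in-memory tensor stands in for the database, so a real version would replace the indexing with an actual query.

```python
import torch
from torch.utils.data import Dataset


class RangeQueryDataset(Dataset):
    """Hypothetical map-style dataset; an in-memory tensor stands in for the DB."""

    def __init__(self, num_rows=10, num_features=4):
        values = torch.arange(num_rows * num_features, dtype=torch.float32)
        self.data = values.reshape(num_rows, num_features)
        self.labels = torch.arange(num_rows) % 2

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        # An int returns a single row; a list of ints returns a finished
        # batch fetched with one indexed lookup instead of many small ones.
        return self.data[index], self.labels[index]


dataset = RangeQueryDataset()
single_x, single_y = dataset[3]          # one row, for standard batching
batch_x, batch_y = dataset[[0, 1, 2]]    # a whole batch in one call
print(single_x.shape, batch_x.shape)
```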

In the dataloader, set batch_size to None. Write a sampler class with my desired batch_size and an iter function that yields successive lists of indices of the type [1,2,3,…batch_size]. An example may be this code:
https://gist.github.com/SsnL/205a4cd2e4e631a42cc9d8e879a296dc
Which also implements chunking.
I would then use a custom collate function, that just returns the data from the dataset as is.
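A minimal sketch of this first option follows. It is not the code from the gist above; ToyDataset, SequentialBatchIndexSampler and passthrough_collate are hypothetical names, and the in-memory tensor again stands in for the database.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler


class ToyDataset(Dataset):
    """Hypothetical dataset whose __getitem__ accepts a list of indices."""

    def __init__(self, n=10):
        self.data = torch.arange(n, dtype=torch.float32).unsqueeze(1)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        return self.data[index]


class SequentialBatchIndexSampler(Sampler):
    """Hypothetical sampler that yields lists of consecutive indices."""

    def __init__(self, data_len, batch_size):
        self.data_len = data_len
        self.batch_size = batch_size

    def __iter__(self):
        for start in range(0, self.data_len, self.batch_size):
            stop = min(start + self.batch_size, self.data_len)
            yield list(range(start, stop))

    def __len__(self):
        # Number of batches, counting a final partial batch.
        return (self.data_len + self.batch_size - 1) // self.batch_size


def passthrough_collate(batch):
    # The dataset already returned a finished batch, so hand it back as is.
    return batch


dataset = ToyDataset()
sampler = SequentialBatchIndexSampler(len(dataset), batch_size=4)
# batch_size=None disables automatic batching, so each list yielded by the
# sampler is passed to dataset[...] in one call.
loader = DataLoader(dataset, sampler=sampler, batch_size=None,
                    collate_fn=passthrough_collate)
shapes = [tuple(batch.shape) for batch in loader]
print(shapes)
```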
It also seems that there is a “BatchSampler” class already implemented, e.g.
BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False). This wraps another sampler (in my case, a SequentialSampler) and should yield the same indices, right? So there is no need to write my own sampler?
This option would again mean setting batch_size to None and passing this object as the sampler of the dataloader. Is it truly the same, or is there some sort of unnecessary overhead in either case?
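As a quick check of the indices, the built-in BatchSampler can be compared against a hand-written equivalent. The helper sequential_batches is hypothetical and exists only for the comparison.

```python
from torch.utils.data import BatchSampler, SequentialSampler


def sequential_batches(n, batch_size):
    """Hand-written equivalent of the wrapped sampler, for comparison only."""
    return [list(range(start, min(start + batch_size, n)))
            for start in range(0, n, batch_size)]


built_in = list(BatchSampler(SequentialSampler(range(10)),
                             batch_size=3, drop_last=False))
by_hand = sequential_batches(10, 3)
print(built_in)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
print(built_in == by_hand)
```

This only compares the indices that are produced; it says nothing about any difference in overhead between the two.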
Dataloader itself has a “batch_sampler=” parameter.
The documentation is quite confusing. It says on the one hand:
"For map-style datasets, users can alternatively specify batch_sampler, which yields a list of keys at a time."
This suggests it is exactly the same as my previous two points. However, later it says:
"The batch_size and drop_last arguments essentially are used to construct a batch_sampler from sampler."
But the batch_sampler thus supposedly created by batch_size clearly does not query batches from the dataset. Instead, it queries single indices and collates the results via collate_fn. That behavior is inefficient in this use case, and I have tested it to be at least an order of magnitude too slow.
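The difference can be observed with a small dataset that records the keys it is asked for. This is a sketch with a hypothetical RecordingDataset, run with the default num_workers=0 so that the recorded keys stay in the main process.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler


class RecordingDataset(Dataset):
    """Hypothetical dataset that records every key passed to __getitem__."""

    def __init__(self, n=6):
        self.data = torch.arange(n, dtype=torch.float32)
        self.seen_keys = []

    def __len__(self):
        return len(self.data)

    def __getitem__(self, key):
        self.seen_keys.append(key)
        return self.data[key]


def make_batch_sampler(dataset):
    return BatchSampler(SequentialSampler(dataset), batch_size=3, drop_last=False)


# Variant A: batch_sampler=... keeps automatic batching switched on, so the
# loader fetches one index at a time and collates the samples afterwards.
ds_a = RecordingDataset()
for _ in DataLoader(ds_a, batch_sampler=make_batch_sampler(ds_a)):
    pass

# Variant B: the same object passed as sampler=... with batch_size=None
# switches automatic batching off, so each list goes to __getitem__ whole.
ds_b = RecordingDataset()
for _ in DataLoader(ds_b, sampler=make_batch_sampler(ds_b), batch_size=None):
    pass

print(ds_a.seen_keys)  # [0, 1, 2, 3, 4, 5]
print(ds_b.seen_keys)  # [[0, 1, 2], [3, 4, 5]]
```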
Finally, there is also IterableDataset. I mention this because it has been proposed for similar use cases (out-of-memory data). I am unsure, however, what the performance implications are. Clearly, if the iter function loads one data row at a time, we are again inefficient. But the IterableDataset could be set up with a batch size and itself iterate over batches (perhaps using an internal sampler generator). This would then be the same as a map-style dataset with a batch sampler, right?
Is there some sort of consensus on which of these options would be better? Sure, the map-style dataset is more flexible, but I do in principle iterate over the whole dataset anyway. Would such a “batched IterableDataset” be faster for some reason?
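A minimal sketch of such a “batched IterableDataset”, assuming the data can be read in contiguous slices. BatchedIterableDataset is a hypothetical name and the tensor is a stand-in for the database.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class BatchedIterableDataset(IterableDataset):
    """Hypothetical iterable dataset that yields whole batches."""

    def __init__(self, n=10, batch_size=4):
        self.data = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.batch_size = batch_size

    def __iter__(self):
        # One contiguous slice (one "query") per yielded batch.
        for start in range(0, len(self.data), self.batch_size):
            yield self.data[start:start + self.batch_size]


# batch_size=None tells the loader that the dataset already yields batches.
loader = DataLoader(BatchedIterableDataset(), batch_size=None)
shapes = [tuple(batch.shape) for batch in loader]
print(shapes)
```

One caveat with this variant: with num_workers > 0 every worker receives its own copy of the dataset, so __iter__ would have to split the ranges between workers (for example via torch.utils.data.get_worker_info()) to avoid yielding duplicate batches.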

Question

Is my sense of these options correct? Or are there even more? Do you have a sense of what would be efficient?

Further question:

A custom collate function here would just return the inputs, which are already batches and, in my case, already torch tensors (because of the operations performed after loading the data).
Since there are no pointers in Python, is there something to be aware of here when it comes to passing PyTorch tensors? Clearly, we do not want to modify or copy the tensors provided by the dataset, or incur any more overhead. Is a simple return fine?
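A small check of what a plain return does, as a sketch with a hypothetical passthrough_collate. It only covers the function call itself, not any transfer between worker processes.

```python
import torch


def passthrough_collate(batch):
    # Returning the argument hands back a reference to the same object;
    # nothing is copied here.
    return batch


batch = (torch.zeros(4, 3), torch.ones(4))
out = passthrough_collate(batch)

same_object = out is batch
same_storage = out[0].data_ptr() == batch[0].data_ptr()
print(same_object, same_storage)
```

Since the returned tensors are the very same objects, any in-place modification made later in the training loop would also be visible to whoever else holds a reference to them.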

deathcrush · March 14, 2022, 8:13pm

I was wondering what your conclusions were on this problem. I am facing a similar use case!