Yes, the latter. Here's a full description of what I am doing:

I have a custom `Dataset` that stores a large dataset in a compressed way, and I have code to efficiently decompress a minibatch onto the CPU, which I implement through `.__getitem__(index)`, where `index` is batched. `.__getitem__` returns a custom data type, `Data`, which lets me conveniently work with many different fields, i.e. `data.x`, `data.y`, `data.meta`, `data.source`, etc., each of which is a CPU Tensor whose first dimension is minibatch-sized. `Data` also has a couple of other convenience functions, including `.pin_memory()`, which returns a new instance of `Data` where all of the Tensors now live in pinned memory.
So, I want a `DataLoader` which takes a batch of indices, passes it to `.__getitem__`, takes the resulting `Data` object and calls `.pin_memory()` on it, and then holds the result in a queue so it can later be fed to my training loop. The first half (`.__getitem__` on a batch of indices) was solved by the forum post you linked. The second half (calling `.pin_memory()` and queueing the result) is what I'm struggling with now.
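If I understand the `DataLoader` internals correctly, its pin-memory thread falls back to calling `.pin_memory()` on any batch object that defines that method, so combining the `BatchSampler` trick from that forum post with `pin_memory=True` may already do everything I want. A sketch of the wiring (the `Data` fields, sizes, and the toy dataset are made up; the real dataset decompresses the requested minibatch):

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler

class Data:
    # hypothetical stand-in for the custom Data type described above
    def __init__(self, x, y):
        self.x, self.y = x, y

    def pin_memory(self):
        # called by the DataLoader's pin-memory thread (see below)
        return Data(self.x.pin_memory(), self.y.pin_memory())

class CompressedDataset(Dataset):
    # toy stand-in: the real version decompresses the requested minibatch
    def __init__(self, n=64, d=4):
        self.xs = torch.randn(n, d)
        self.ys = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.xs)

    def __getitem__(self, index):
        # `index` is a whole minibatch of indices (a list), not a single int
        idx = torch.as_tensor(index)
        return Data(self.xs[idx], self.ys[idx])

ds = CompressedDataset()
loader = DataLoader(
    ds,
    # BatchSampler yields lists of indices; batch_size=None disables
    # automatic batching, so each list reaches __getitem__ as-is
    sampler=BatchSampler(SequentialSampler(ds), batch_size=8, drop_last=False),
    batch_size=None,
    collate_fn=lambda b: b,  # Data is already a collated minibatch
    num_workers=0,
    # the pin-memory thread calls .pin_memory() on any object defining it;
    # guarded on CUDA so the sketch also runs on CPU-only machines
    pin_memory=torch.cuda.is_available(),
)

batch = next(iter(loader))
```

With `num_workers > 0`, the same setup would give the prefetch queue I'm after, since workers fetch and pin batches ahead of the training loop.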