DataLoader: direct multi-index instead of collate for batches on map-style datasets

I’m trying to implement a batched data loader. When I implement a BatchSampler, it gives me batches of indices like [1, 2, 3], [4, 5, 6], etc. Were I to load the data myself directly, I would just use multi-indexing: train_set[idx] where idx = [1, 2, 3] or whatever. (Both are tensors, of course.) But it seems like the DataLoader insists on using a collate function: it returns collate([train_set[1], train_set[2], train_set[3]]).

Is there any way to disable this behavior? I’m using a weird custom object, so doing many individual queries is expensive, as is a collate_fn call. And I’ve already set it up so that we can multi-select quickly. I can see no good reason why we should select each element separately and then collate.

Is there a way to do this?

You could try to use a BatchSampler as shown in this code snippet; it passes multiple indices to __getitem__ at once, so you can load multiple samples in a single call.
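A minimal sketch of that pattern (the toy dataset and batch size are placeholders): wrapping a `BatchSampler` and passing it as the plain `sampler` with `batch_size=None` disables automatic batching, so the DataLoader hands the whole index list to `__getitem__` instead of collating per-sample results.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, SequentialSampler

class MultiIndexDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # idx is a *list* of indices when driven by a BatchSampler,
        # so this is one fancy-indexing call instead of N scalar lookups.
        return self.data[idx]

ds = MultiIndexDataset(torch.arange(10).float())
loader = DataLoader(
    ds,
    sampler=BatchSampler(SequentialSampler(ds), batch_size=3, drop_last=False),
    batch_size=None,  # disable automatic batching; no per-sample collate
)
batches = list(loader)
# each element of `batches` is the tensor returned by one __getitem__ call
```

With automatic batching disabled, the default collate step is just `default_convert`, which passes tensors through unchanged.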

Thanks, this is just what I was after.

A follow-up: my dataloader uses a custom data type, Data. Data has a function pin_memory(self), which returns another object of custom type Data but in pinned memory. However, when I set pin_memory=True in my dataloader, suddenly my iterable yields objects of type dict instead of Data. Any idea what might be going on?

I don’t know offhand how custom objects are treated by the DataLoader or how they might interact with memory pinning. What’s your use case for pinning the memory in your custom object and then again in the DataLoader? Or are you trying to use the DataLoader to call into your custom pin_memory method?

Yes, the latter. Here’s a full description of what I am doing:

I have a custom Dataset that stores a large dataset in compressed form, and I have code to efficiently decompress a minibatch onto the CPU, which I implement through .__getitem__(index) where index is a batch of indices. .__getitem__ returns a custom data type, Data, which lets me conveniently work with many different fields (e.g. data.x, data.y, data.meta, data.source), each of which is a CPU tensor whose first dimension is the minibatch size. Data also has a couple of other convenience functions, including .pin_memory(), which returns a new instance of Data where all of the tensors now live in pinned memory.

So, I want a DataLoader which takes a batch of indices, passes it to .__getitem__, calls .pin_memory on the resulting Data object, and then holds the result in a queue so it can later be fed to my training loop. The first half (__getitem__ on a batch of indices) was solved by the forum post you linked. The second half (calling .pin_memory and returning the result) is what I’m struggling with now.
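For context, a minimal sketch of such a container (the `Data` name and the `.x`/`.y` fields come from the description above; the exact implementation is my assumption):

```python
import torch

class Data:
    """Hypothetical sketch of the batch container described above:
    each field is a CPU tensor whose first dimension is the minibatch size."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def pin_memory(self):
        # Return a *new* Data whose tensors live in pinned (page-locked)
        # memory, ready for fast asynchronous transfer to the GPU.
        return Data(self.x.pin_memory(), self.y.pin_memory())
```

The idea is that when collate_fn returns an object exposing pin_memory(), a DataLoader with pin_memory=True should call that method rather than rebuilding the object itself.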

Interesting use case. I’m still unsure what’s causing the dict creation, but I guess that something might have changed in the object such that the collate_fn is now treating it differently.
Could you try to create a custom collate_fn and return the desired type?

I’ve been using collate_fn=lambda x: x in an effort to avoid doing anything weird, and I’ve confirmed that both the input and the output are of the custom type Data.

I traced it back to _utils.pin_memory.pin_memory, which uses an if/else chain to handle the various cases for pinning. One of these branches checks whether the object is a dict, and if so, returns a plain dict with each item pinned. Since my custom object is a dict subclass, this branch triggers instead of my class’s custom pin_memory.

I’ve monkey-patched the function in my own PyTorch build, so I’m good, but it might be worth fixing in the library. Maybe that function could be modified so the dict branch only triggers when the object is exactly a dict, rather than any dict subclass.
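For anyone hitting the same issue, a sketch of such a patch might look like this. Note that `torch.utils.data._utils.pin_memory` is a private module whose contents and function signatures vary between PyTorch versions, so treat this as illustrative rather than a supported fix:

```python
import torch
from torch.utils.data._utils import pin_memory as _pin_memory_module

_original_pin_memory = _pin_memory_module.pin_memory

def _patched_pin_memory(data, *args, **kwargs):
    # If the object defines its own pin_memory() method, call it directly.
    # This runs before the library's dict branch, which would otherwise
    # intercept dict subclasses and return a plain dict.
    if not isinstance(data, torch.Tensor) and hasattr(data, "pin_memory"):
        return data.pin_memory()
    return _original_pin_memory(data, *args, **kwargs)

_pin_memory_module.pin_memory = _patched_pin_memory
```

Patching the module attribute works here because the DataLoader's pinning loop looks the function up as a global in that same module.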

If you think this behavior is wrong and could be improved, feel free to create a feature request on GitHub and describe your exact use case. I’m not deeply familiar with all of the checks for pinning tensors, but I would be afraid of breaking other use cases.