I’m trying to implement a batched data loader. When I implement a BatchSampler, it gives me elements like this: [1,2,3], [4,5,6], etc. Were I to load the data myself directly, I would just use multi-indexing: train_set[idx] where idx=[1,2,3] or whatever. (Both are tensors, of course.) But it seems like the DataLoader insists on using a collate function: it returns collate([train_set[1], train_set[2], train_set[3]]).
Is there any way to disable this behavior? I’m using a weird custom object, so doing many individual queries is expensive, as is a collate_fn call. And I’ve already set it up so that we can multi-select quickly. I can see no good reason to select each element separately and then collate.
Is there a way to do this?
You could try to use a BatchSampler, as given in this code snippet, which would pass multiple indices to __getitem__ and thus let you load multiple samples in a single call.
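A minimal sketch of that approach (the class and variable names here are illustrative, not from the original snippet): passing a BatchSampler as the sampler while setting batch_size=None disables automatic batching, so the DataLoader hands the whole index list to __getitem__ in one call and skips the per-sample collate step.

```python
import torch
from torch.utils.data import DataLoader, Dataset, BatchSampler, SequentialSampler

class MultiIndexDataset(Dataset):
    """Illustrative dataset whose __getitem__ accepts a list of indices."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, indices):
        # `indices` is a list of ints yielded by the BatchSampler,
        # so we can multi-index the underlying tensor in a single call.
        return self.data[torch.as_tensor(indices)]

data = torch.arange(10).float()
ds = MultiIndexDataset(data)
sampler = BatchSampler(SequentialSampler(ds), batch_size=3, drop_last=False)
# batch_size=None disables automatic batching: each element the sampler
# yields (a list of indices) is passed straight to __getitem__.
loader = DataLoader(ds, sampler=sampler, batch_size=None)
```

With batch_size=None the default collate is replaced by default_convert, which leaves the returned tensor untouched.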
Thanks, this is just what I was after.
A follow-up: my dataloader uses a custom data type, Data. Data has a function pin_memory(self), which returns another object of custom type Data but in pinned memory. However, when I set pin_memory=True in my dataloader, suddenly my iterable yields objects of type dict instead of Data. Any idea what might be going on?
No, I don’t know how custom objects are treated in the DataLoader and how they might interact with pinning the memory. What’s your use case that you need to pin the memory in your custom object and then again in the DataLoader? Or are you trying to use the DataLoader to call into your custom pin_memory method?
Yes, the latter. Here’s a full description of what I am doing:
I have a custom Dataset that stores a large dataset in a compressed way, and I have code to efficiently decompress a minibatch onto the CPU, which I implement through .__getitem__(index), where index is batched. .__getitem__ returns a custom data type, Data, which lets me conveniently work with many different fields, i.e. data.x, data.y, data.meta, data.source, etc., each of which is a CPU tensor whose first dimension is minibatch-sized. Data also has a couple of other convenience functions, including .pin_memory(), which returns a new instance of Data where all of the tensors now live in pinned memory.
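For concreteness, here is a hypothetical sketch of a Data container matching this description (the real class isn’t shown in the thread); note that it subclasses dict, which becomes relevant below.

```python
import torch

class Data(dict):
    """Illustrative batch container: tensor fields accessible as attributes,
    plus a pin_memory() that returns a new pinned-memory instance."""

    def __getattr__(self, key):
        # Expose dict entries as attributes: data.x, data.y, etc.
        try:
            return self[key]
        except KeyError as e:
            raise AttributeError(key) from e

    def pin_memory(self):
        # Return a new Data whose tensors live in pinned (page-locked) memory.
        return Data({k: v.pin_memory() for k, v in self.items()})

batch = Data(x=torch.randn(4, 3), y=torch.zeros(4))
if torch.cuda.is_available():
    # Pinning requires a CUDA-capable build/runtime.
    pinned = batch.pin_memory()
```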
So, I want a DataLoader which takes a batch of indices, passes it to .__getitem__, takes the resulting Data object, calls .pin_memory() on it, and then holds the result in a queue so it can later be fed to my training loop. The first half (__getitem__ on a batch of indices) was solved by the forum post you linked. The second half (calling .pin_memory and returning the result) is what I’m struggling with now.
Interesting use case. I’m still unsure what’s causing the dict creation, but I guess that something might have changed in the object such that the collate_fn is now treating it differently.
Could you try to create a custom collate_fn and return the desired type?
I’ve been using collate_fn=lambda x: x in an effort to avoid doing anything weird. I’ve confirmed that the input and output are both of custom type Data.
I traced it back to _utils.pin_memory.pin_memory, which uses an if/else chain to dispatch between a number of pinning strategies. One of the branches checks whether the object is a collections.abc.Mapping and, if so, returns a plain dict with each item pinned. Since my custom object is a dict subclass, that branch triggers instead of my class’s custom pin_memory.
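To illustrate the problem, here is a simplified stand-in for that dispatch logic (not the actual PyTorch source): because the Mapping check comes before any check for a custom pin_memory method, a dict subclass gets rebuilt as a plain dict and its own pin_memory is never called.

```python
import collections.abc

def pin_memory_dispatch(data):
    # Simplified stand-in for _utils.pin_memory.pin_memory's if/else chain.
    if isinstance(data, collections.abc.Mapping):
        # The Mapping branch wins for ANY dict subclass and returns a
        # plain dict, discarding the original type.
        return {k: pin_memory_dispatch(v) for k, v in data.items()}
    elif hasattr(data, "pin_memory"):
        return data.pin_memory()
    return data

class Data(dict):
    def pin_memory(self):
        # Custom pinning (stubbed out here) that preserves the Data type.
        return Data({k: v for k, v in self.items()})

out = pin_memory_dispatch(Data(x=1))
print(type(out))  # plain dict, not Data
```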
I’ve monkey-patched the function in my own PyTorch build, so I’m good, but it might be worth fixing in the library. Maybe you could modify that function so that the Mapping branch triggers only if the object is exactly a dict.
If you think this behavior is wrong and could be improved, feel free to create a feature request on GitHub and let us know what your exact use case is. I’m not deeply familiar with all checks for pinning tensors, but would be afraid of breaking other use cases.