Is this possible?
I’m asking because, for example, today I had a couple of models, and for each of them I’d like to use a different batch_size. I initially created a dataloader with a batch_size of, say, 32, and now I want to increase it to, say, 128, but I don’t want to create a new dataloader.
Is this possible?
I would appreciate any kind of help in this regard.
This shouldn’t be allowed in the current version anymore, and you’ll get a ValueError:
ValueError: batch_size attribute should not be set after DataLoader is initialized
Creating a new DataLoader should be cheap, so I would recommend initializing a new DataLoader.
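For reference, here is a minimal sketch that reproduces the error. The toy dataset and the variable names are made up for illustration; only the DataLoader behavior comes from the reply above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset with 256 samples (illustrative only).
dataset = TensorDataset(torch.arange(256, dtype=torch.float32).unsqueeze(1))
loader = DataLoader(dataset, batch_size=32)

# Assigning to batch_size after construction is rejected by DataLoader.
try:
    loader.batch_size = 128
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
print(loader.batch_size)  # the original value is left untouched
```

The loader keeps its original batch size after the failed assignment, so nothing is silently half-applied.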
Thanks, but why not?
Why isn’t this possible? How do I know what kind of DataLoader was passed?
Say I have the following:
def return_dl(
    dataloader: DataLoader,
    batch_size: int = None,
) -> DataLoader:
    if batch_size is not None:
        dataloader.batch_size = batch_size
    return dataloader
Here, dataloader could be any DataLoader subclass, e.g. NeighborLoader from PyTorch Geometric. Why should this remain disallowed?
The ValueError was added here when persistent_workers was introduced, since changing internal attributes won’t have an effect anymore. I’m sure the code owners would like to hear your concerns, and PRs should be welcome in case you want to allow changing these attributes for valid use cases.
Thanks for the reply. But then how would one handle my use case? You receive an already instantiated DataLoader, and you have to change its batch size while keeping everything else the same. What would be the best way to do so? How can one initialize a new DataLoader that is exactly like the old one but with a different batch size?
Exactly as you’ve described: you could re-initialize a new DataLoader. Usually the creation of the DataLoader is cheap compared to the actual data loading and processing (assuming you are using lazy data loading).
Another complication (besides persistent workers) is that each worker pre-fetches batches in advance. If your code now manipulates internal state (such as the batch size), how and when should the change become visible?
Maybe there could be a clear API design, so your ideas are welcome.
Thanks for being very active here!
Yep, I see. But re-initializing a new DataLoader is not easy, as I do not know how this DataLoader was initialized! In my design, the user passes a dataloader and I have to change its batch size. Why? Well, because that dataloader (or rather the data behind it) is used to train two different models. The second model benefits from huge batch sizes. To solve the problem, I see no other solution but to force the user to create two dataloaders, unless there is an easy way to create a new DataLoader from an existing one.
You should still be able to read the attributes of the passed DataLoader and could create the new one based on these attributes. Note that you might also need to check the internal sampler and its attributes in order to restore it. If the user created a custom sampler and you don’t have its source code (I don’t know how that would work, but it seems you don’t even know how the Dataset is defined), you might not be able to restore it.
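A rough sketch of that idea, under some loud assumptions: rebuild_with_batch_size is a hypothetical helper (not part of PyTorch), it always returns a plain DataLoader (so a subclass such as NeighborLoader would not be preserved), it reuses the existing sampler object instead of reconstructing it, and it does not copy every possible constructor argument. It also refuses loaders that were built with a custom batch_sampler or with batching disabled, since there is no single batch size to swap in that case.

```python
import torch
from torch.utils.data import DataLoader, IterableDataset, TensorDataset


def rebuild_with_batch_size(loader: DataLoader, batch_size: int) -> DataLoader:
    """Hypothetical helper: copy a DataLoader's public attributes into a new
    plain DataLoader that differs only in batch_size."""
    if loader.batch_size is None:
        # Custom batch_sampler or batch_size=None: not handled in this sketch.
        raise ValueError("cannot infer how to re-batch this DataLoader")

    kwargs = dict(
        batch_size=batch_size,
        num_workers=loader.num_workers,
        collate_fn=loader.collate_fn,
        pin_memory=loader.pin_memory,
        drop_last=loader.drop_last,
        timeout=loader.timeout,
        worker_init_fn=loader.worker_init_fn,
        generator=loader.generator,
        persistent_workers=loader.persistent_workers,
    )
    if loader.num_workers > 0:
        # prefetch_factor is only meaningful with worker processes.
        kwargs["prefetch_factor"] = loader.prefetch_factor
    if not isinstance(loader.dataset, IterableDataset):
        # Reuse the same sampler object so shuffling behavior is kept.
        # Iterable-style datasets do not accept a sampler.
        kwargs["sampler"] = loader.sampler
    return DataLoader(loader.dataset, **kwargs)


# Usage with a toy dataset (illustrative only).
dataset = TensorDataset(torch.arange(256, dtype=torch.float32).unsqueeze(1))
old_loader = DataLoader(dataset, batch_size=32, shuffle=True)
new_loader = rebuild_with_batch_size(old_loader, 128)
first_batch = next(iter(new_loader))[0]
```

Because the sampler is shared rather than copied, both loaders draw indices from the same sampler instance; whether that is acceptable depends on the sampler, which is exactly the caveat about custom samplers mentioned above.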