Chained Dataloaders

Hello everyone,

The framework I need to plug into expects a single DataLoader instance, yet I need to train on two different datasets, each with a different collate_fn. So I can’t simply use ConcatDataset; I need something like a ConcatDataloader. Is there such a beast somewhere? Do you see other ways to solve this issue?
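
Something like this rough sketch is what I have in mind (ChainedDataLoader is just a placeholder name I made up, and the datasets/collate functions below stand in for my real ones):

```python
from torch.utils.data import DataLoader

class ChainedDataLoader:
    """Placeholder idea: iterate over several DataLoaders one after another."""
    def __init__(self, *loaders):
        self.loaders = loaders

    def __iter__(self):
        for loader in self.loaders:
            yield from loader

    def __len__(self):
        return sum(len(loader) for loader in self.loaders)

# loader_a = DataLoader(dataset_a, batch_size=32, collate_fn=collate_a)
# loader_b = DataLoader(dataset_b, batch_size=32, collate_fn=collate_b)
# train_loader = ChainedDataLoader(loader_a, loader_b)  # what I'd hand to the framework
```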

Many thanks.

Would the output of the different Datasets and collate functions be the same or would they be different?
In the latter case I guess you don’t want to shuffle the indices/samples, since this would create “mixed” batches?
Maybe you could create a custom collate_fn, which uses different workflows based on the passed indices? I.e. would it work if you combined the two collate functions and changed the behavior based on the currently loaded data?
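
Something along these lines might work; it’s just a rough sketch and assumes each sample carries a marker telling which dataset it came from (you would have to adapt that to your actual data):

```python
def collate_a(samples):
    # your first collate_fn
    ...

def collate_b(samples):
    # your second collate_fn
    ...

def combined_collate(samples):
    # assumption for this sketch: each sample is a (data, source) tuple,
    # where source is e.g. "a" or "b"
    source = samples[0][-1]
    if any(sample[-1] != source for sample in samples):
        raise RuntimeError("mixed batch: got samples from both datasets")
    return collate_a(samples) if source == "a" else collate_b(samples)
```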

Thanks for your response @ptrblck, I overlooked “mixed batches”. This unified DataLoader would be used for BERT pre-training. The research paper suggests using one sequence length first and then the other: “To speed up pretraining in our experiments, we pre-train the model with a sequence length of 128 for 90% of the steps. Then, we train the rest 10% of the steps of sequence of 512 to learn the positional embeddings”.
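
If I controlled the training loop myself, the schedule would simply look like the sketch below (loader_128, loader_512 and train_step are all placeholder names), but the framework only accepts a single DataLoader instance:

```python
total_steps = 1_000_000                    # made-up number of pre-training steps
phase1_steps = int(0.9 * total_steps)      # 90% of steps with sequence length 128
phase2_steps = total_steps - phase1_steps  # remaining 10% with sequence length 512

def run_phase(loader, num_steps):
    step = 0
    while step < num_steps:
        for batch in loader:
            train_step(batch)              # hypothetical training step
            step += 1
            if step >= num_steps:
                break

# run_phase(loader_128, phase1_steps)
# run_phase(loader_512, phase2_steps)
```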

It would be easier to define the custom collate_fn if the indices were passed to it, but it only receives the samples/data directly, right? Perhaps there is another approach now that the context is clearer?

If you want to select specific samples, I think you should implement a custom sampler.
Samplers are used to create the indices, which are passed to the Dataset.__getitem__ and can thus implement e.g. shuffling or weighted sampling.
Once the samples are loaded they will be passed to the collate_fn, which will then create a batch (by default it will just torch.stack the samples).
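
To make that concrete, here is a rough sketch (just an example, not an existing utility) of a batch sampler that keeps batches from spanning the two datasets inside a ConcatDataset; you could combine it with the merged collate_fn from above:

```python
import torch
from torch.utils.data import Sampler

class NonMixingBatchSampler(Sampler):
    """Yields batches of indices that never span the two datasets of a
    ConcatDataset (the first dataset occupies indices [0, len_a))."""
    def __init__(self, len_a, len_b, batch_size, shuffle=True):
        self.len_a, self.len_b = len_a, len_b
        self.batch_size = batch_size
        self.shuffle = shuffle

    def _batches(self, length, offset):
        # shuffle within one dataset only, then shift into its index range
        idx = torch.randperm(length) if self.shuffle else torch.arange(length)
        idx = (idx + offset).tolist()
        for i in range(0, length, self.batch_size):
            yield idx[i:i + self.batch_size]

    def __iter__(self):
        # first all batches from dataset A, then all batches from dataset B
        yield from self._batches(self.len_a, 0)
        yield from self._batches(self.len_b, self.len_a)

    def __len__(self):
        ceil_div = lambda n: -(-n // self.batch_size)
        return ceil_div(self.len_a) + ceil_div(self.len_b)

# usage sketch:
# loader = DataLoader(concat_dataset,
#                     batch_sampler=NonMixingBatchSampler(len(dataset_a), len(dataset_b), 32),
#                     collate_fn=combined_collate)
```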

Would that make sense or am I still misunderstanding the use case?