I am working with multiple files, and multiple training samples in each file. I will use
ConcatDataset as described here:
I need negative samples in addition to my true samples, and the negative samples must be randomly selected from across all of the training data files. So I am wondering: would the returned batch be a random consecutive chunk from a single random file, or would the batch span multiple random indexes across all the data files?
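To make the question concrete, here is a minimal pure-Python sketch of the index mapping I understand `ConcatDataset` to do (the `TinyConcatDataset` class is my own illustrative stand-in, not the real implementation). My understanding is that a shuffled sampler permutes *global* indices over the whole concatenation, so consecutive draws can land in different files:

```python
import random
from bisect import bisect_right
from itertools import accumulate

class TinyConcatDataset:
    """Illustrative stand-in for torch.utils.data.ConcatDataset:
    maps one global index onto (which file, local index in that file)."""
    def __init__(self, datasets):
        self.datasets = datasets
        # cumulative sizes, e.g. [3, 7] for file lengths 3 and 4
        self.cumulative = list(accumulate(len(d) for d in datasets))

    def __len__(self):
        return self.cumulative[-1]

    def __getitem__(self, idx):
        file_idx = bisect_right(self.cumulative, idx)
        local = idx if file_idx == 0 else idx - self.cumulative[file_idx - 1]
        return self.datasets[file_idx][local]

# Two "files" of samples
file_a = ["a0", "a1", "a2"]
file_b = ["b0", "b1", "b2", "b3"]
concat = TinyConcatDataset([file_a, file_b])

# With shuffle=True a DataLoader draws from a permutation of the
# global range, so one batch can mix samples from both files.
random.seed(0)
order = random.sample(range(len(concat)), len(concat))
batch = [concat[i] for i in order[:4]]
```

If this mental model is right, then shuffling gives batches that span files, and only `shuffle=False` would produce consecutive chunks from one file at a time.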
If it helps to know why I need this exact setup: I am trying to train on a TPU with PyTorch XLA.
Normally, for negative samples, I would just use a second DataLoader. However, I am trying to train on TPUs with PyTorch XLA (the alpha was released just a few days ago: https://github.com/pytorch/xla ), and to do that I need to pass my DataLoader to a torch_xla.distributed.data_parallel.DataParallel object, e.g. model_parallel(train_loop_fn, train_loader), as can be seen in these example notebooks
So I am now limited to a single DataLoader, which must handle both the true samples and the negative samples that need to be randomly selected from all of my files.
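One workaround I am considering (an assumption about my own setup, not something from the XLA examples): wrap the concatenated data in a dataset whose `__getitem__` returns a (true, negative) pair, with the negative drawn uniformly from all files, so a single DataLoader carries both. The `PairDataset` name and structure here are hypothetical:

```python
import random

class PairDataset:
    """Hypothetical sketch: each item yields (true_sample, negative_sample),
    with the negative drawn uniformly at random from ALL samples,
    regardless of which file the true sample came from."""
    def __init__(self, samples, seed=None):
        self.samples = samples  # e.g. a ConcatDataset over all files
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        true = self.samples[idx]
        # draw from len-1 slots and skip idx, so the negative
        # is never the anchor sample itself
        neg_idx = self.rng.randrange(len(self.samples) - 1)
        if neg_idx >= idx:
            neg_idx += 1
        return true, self.samples[neg_idx]

all_samples = ["a0", "a1", "b0", "b1"]  # pretend these came from several files
ds = PairDataset(all_samples, seed=0)
true, neg = ds[0]
```

A DataLoader over this dataset would then yield batches of pairs, and the single-loader restriction from the DataParallel API would no longer matter.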