[CLOSED] DataLoader in parallel?

Hi, thank you always for your help.

I am writing a dataloader using the torch.utils.data.DataLoader class.
I want to load the same-indexed files from different folders at once.

For example, provided I have folders 0, 1, …, 9, and each folder has the same indexed files from 00 to 99, like this:
folder 0 — 00.txt, 01.txt, …, 99.txt
folder 1 — 00.txt, 01.txt, …, 99.txt

folder 9 — 00.txt, 01.txt, …, 99.txt

I prepared a loading list file in which the indices are shuffled, and then tried to load them as usual:

for batch_index, inputs in enumerate(self.data_loader):
    process(inputs)

What I expect is that, if the loading list indicates 07.txt, the loader takes all the 07.txt files from the 10 folders above. But it seems that the files from the different folders get shuffled independently of each other. I guess this is caused by the parallel loading.
Is there a good way to load them correctly?

Thank you in advance.

If I understand the use case correctly, you would like to return the samples from the .txt files having the same index.
This would also mean that you would increase the batch size by 10 (number of folders).
If that’s correct, you could use the index in your __getitem__ and load the corresponding .txt file from all 10 folders using this index.
Would that work or am I misunderstanding your use case?
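
A minimal sketch of that idea, assuming the folder layout from the post above (folders 0–9, files 00.txt–99.txt). parse_txt is a hypothetical helper you would replace with your own parsing, and it assumes every file yields a tensor of the same shape so they can be stacked:

import os
import torch
from torch.utils.data import Dataset, DataLoader

class MultiFolderDataset(Dataset):
    def __init__(self, root, num_folders=10, num_files=100):
        self.root = root
        self.num_folders = num_folders
        self.num_files = num_files

    def __len__(self):
        # One sample per file index (00.txt ... 99.txt)
        return self.num_files

    def __getitem__(self, index):
        # Load the file with the same index from every folder,
        # so a single sample already contains all 10 variants.
        samples = []
        for folder in range(self.num_folders):
            path = os.path.join(self.root, str(folder), f"{index:02d}.txt")
            samples.append(self.parse_txt(path))
        return torch.stack(samples)  # shape: [num_folders, ...]

    def parse_txt(self, path):
        # Hypothetical parser: turn one .txt file into a tensor
        # in whatever way fits your data.
        with open(path) as f:
            return torch.tensor([float(x) for x in f.read().split()])

loader = DataLoader(MultiFolderDataset("data"), batch_size=4, shuffle=True, num_workers=2)

With batch_size=4 each batch then has shape [4, 10, ...], i.e. the effective batch size is multiplied by the number of folders, as noted above.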


Hi @ptrblck, thank you always for your help!
Yes, that is exactly what I want to do, and I now see your point about the batch size, which I had not understood at first.
I tried your idea in my code and it returns what look like the correct items.
Thank you very much!
