Is a multiprocess DataLoader for an IterableDataset keeping the order of the original iterable?

Hi,

I have a question here.

I derive from IterableDataset so it can yield my data. In particular, my input is a list whose order I care about, because later I need to make sure that my output still follows the same order as the input. Although it is more complicated than this, it can essentially be thought of as [1, 2, 3, 4, …].

Then I use Dataloader with multiple workers like the following

loader = DataLoader(iterable_dataset, batch_size=256, num_workers=2, worker_init_fn=worker_init_fn)
for batch in loader:
    print(batch)
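(The worker_init_fn is not shown in my snippet. With an IterableDataset, every worker would otherwise iterate the whole list and produce duplicates, so mine shards the input per worker, roughly like the usual pattern built on torch.utils.data.get_worker_info(). The slicing itself can be sketched without torch; the shard function below is a hypothetical stand-in for what worker_init_fn sets up:)

```python
import math

def shard(data, worker_id, num_workers):
    """Give each worker a contiguous slice of the data.

    Stand-in for the usual worker_init_fn pattern, where worker_id and
    num_workers would come from torch.utils.data.get_worker_info().
    """
    per_worker = math.ceil(len(data) / num_workers)
    start = worker_id * per_worker
    return data[start:start + per_worker]

# With two workers, worker 0 handles 1 and 2, worker 1 handles 3 and 4:
data = [1, 2, 3, 4]
print(shard(data, 0, 2))  # -> [1, 2]
print(shard(data, 1, 2))  # -> [3, 4]
```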

However, I find that the batches strictly follow the original order of the iterable, [1, 2, 3, 4]. Even when I deliberately delay the first worker (which outputs 1 and 2, the second worker handling 3 and 4), the for loop still yields data in the original order, i.e., 1, 2, 3, 4. That makes me believe that although the DataLoader workers process data in parallel (I timed each worker right before its yield, and I am pretty sure both workers start working separately at time 0), they coordinate to always preserve the original order of the dataset. For example, even if 2, 3, and 4 are all ready earlier, they will be held back until 1 finishes, so that the only allowed yield order is 1, 2, 3, 4.
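(If this hypothesis is right, the DataLoader would be parking finished batches in a buffer and re-emitting them in submission order; in PyTorch's source this kind of reordering happens inside _MultiProcessingDataLoaderIter, which tracks the next expected index. A torch-free sketch of that mechanism, with an arrival order I made up for illustration:)

```python
def in_order(arrivals):
    """Re-emit (index, batch) results in index order, buffering any
    batch that arrives before its predecessors have been yielded.

    arrivals: iterable of (send_index, batch) pairs in completion order.
    """
    buffer = {}    # out-of-order results parked here, keyed by index
    rcvd_idx = 0   # next index the consumer is allowed to see
    for idx, batch in arrivals:
        buffer[idx] = batch
        # Drain every batch that is now contiguous with what was yielded.
        while rcvd_idx in buffer:
            yield buffer.pop(rcvd_idx)
            rcvd_idx += 1

# Batches 1-3 finish before batch 0, yet the consumer sees 0, 1, 2, 3:
arrivals = [(1, "b1"), (2, "b2"), (3, "b3"), (0, "b0")]
print(list(in_order(arrivals)))  # -> ['b0', 'b1', 'b2', 'b3']
```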

Is this expected behavior? It looks true to me, and it feels sub-optimal, since it is not a real first-come, first-served queue. If this is the case, I am okay with it, but I am curious which lines of the source code take care of that. If it is not, could you please hint at what misunderstanding or code error on my side makes it look like this?

Thank you so much!


cc @vincentqb for dataloader questions