Creating a batch from the two workers would require waiting for both of them to finish their task (fetching data) before returning the batch, which is inefficient and (I guess, depending on the architecture, data structure, storage, etc.) pointless compared to a single process loading sample1, sample2, sample3 and sample4 and then returning the batch.
For your first point, @apaszke only showed the result, but I want to know the reason.
For the second, having multiple workers handle one batch may not be slower than a single worker.
In my test case, I use 200 samples and set the batch size to 50. The result shows that only 4 subprocesses do any work, even though I set num_workers=16 (I have 16 CPUs). So I think alex.veuthey is right.
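
To make the arithmetic behind that observation concrete, here is a rough sketch in plain Python. It is not PyTorch's actual code: it assumes a simplified model in which every batch is handed whole to a single worker, round-robin, and busy_workers is a hypothetical helper name I made up for illustration.

```python
import math

def busy_workers(num_samples, batch_size, num_workers):
    """Map worker id -> list of batch indices it would load.

    Simplified model (an assumption, not the real DataLoader code):
    each batch is assigned whole to one worker, in round-robin order.
    """
    num_batches = math.ceil(num_samples / batch_size)
    assignment = {}
    for batch_idx in range(num_batches):
        worker_id = batch_idx % num_workers
        assignment.setdefault(worker_id, []).append(batch_idx)
    return assignment

# The numbers from my test: 200 samples, batch size 50, 16 workers.
assignment = busy_workers(num_samples=200, batch_size=50, num_workers=16)
print(len(assignment))  # 4
print(assignment)       # {0: [0], 1: [1], 2: [2], 3: [3]}
```

Under that model there are only 200 / 50 = 4 batches per epoch, so at most 4 of the 16 workers ever receive one, which matches what I saw. With a smaller batch size (say 10) there would be 20 batches and every worker would get at least one.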