Hi, I read the docs and understood that we do not need to worry about data duplication with map-style datasets, but with iterable-style datasets we do need to be careful about data duplication, and so using `worker_init_fn` in the DataLoader is recommended.
First, did I understand that right?
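To make the duplication concern concrete, here is a minimal plain-Python sketch of my understanding (no torch involved; the toy `data` list and `num_workers` value are made up for illustration): a map-style dataset's sampler splits *indices* across workers, while a naive iterable-style dataset replays the whole stream in every worker.

```python
data = list(range(6))   # toy dataset
num_workers = 2

# Map-style: the sampler hands each worker a disjoint set of indices,
# so every sample is read exactly once overall.
map_style = [data[i] for w in range(num_workers)
             for i in range(w, len(data), num_workers)]
assert sorted(map_style) == data  # no duplication

# Iterable-style, naive: each worker holds its own copy of the dataset
# object and iterates the full stream, duplicating every sample.
iterable_naive = [x for _ in range(num_workers) for x in data]
assert len(iterable_naive) == num_workers * len(data)
```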
Secondly, when I was trying the IterDataPipes from TorchData, I passed the `worker_init_fn` argument to the DataLoader, and PyTorch gave a deprecation warning:
/usr/local/lib/python3.7/dist-packages/torch/utils/data/backward_compatibility.py:4: UserWarning: Usage of backward_compatibility.worker_init_fn is deprecated as DataLoader automatically applies sharding in every worker
  warnings.warn("Usage of backward_compatibility.worker_init_fn is deprecated")
So, does this mean we do not need to worry about duplication with IterDataPipes, since sharding is automatically applied in every worker according to this warning?
If yes, why do we then need to chain the `sharding_filter()` datapipe? Isn't the purpose of `sharding_filter()` to make sure sharding is done in every worker process?
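For context, here is my mental model of what `sharding_filter()` does, sketched in plain Python as round-robin element filtering (the function name, worker ids, and counts below are illustrative, not the TorchData implementation):

```python
def sharding_filter_sketch(stream, worker_id, num_workers):
    # Keep only every num_workers-th element, offset by worker_id,
    # mimicking round-robin sharding across worker processes.
    for i, x in enumerate(stream):
        if i % num_workers == worker_id:
            yield x

# Two "workers" over the same stream of 6 elements:
shards = [list(sharding_filter_sketch(range(6), w, 2)) for w in range(2)]
assert shards == [[0, 2, 4], [1, 3, 5]]  # disjoint shards, full coverage
```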
Thanks in advance!