worker_init_fn deprecated in DataLoader

Hi, I read the docs and understood that we do not need to worry about data duplication with map-style datasets. With iterable-style datasets, however, we need to be careful about data duplication, which is why use of worker_init_fn is recommended in the DataLoader.
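To make the duplication issue concrete, here is a minimal plain-Python sketch (no torch involved; all names are illustrative) of why an iterable-style dataset gets replicated across workers, and how a worker_init_fn-style fix shards it:

```python
def iterable_dataset():
    # Stand-in for an iterable-style dataset of 6 samples.
    yield from range(6)

num_workers = 2

# Each worker process gets its own copy of the iterator, so without
# sharding every worker yields the full stream.
naive = [item for _ in range(num_workers) for item in iterable_dataset()]
# every sample appears num_workers times

# A worker_init_fn-style fix: worker k keeps only every
# num_workers-th sample, offset by its worker id.
sharded = [
    item
    for worker_id in range(num_workers)
    for i, item in enumerate(iterable_dataset())
    if i % num_workers == worker_id
]
# each sample appears exactly once across all workers
```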

First, did I understand it right?

Secondly, when I was trying the IterDataPipes from TorchData, I passed backward_compatibility.worker_init_fn as the worker_init_fn argument to the DataLoader, and PyTorch gave a deprecation warning:

/usr/local/lib/python3.7/dist-packages/torch/utils/data/backward_compatibility.py:4: UserWarning: Usage of backward_compatibility.worker_init_fn is deprecated as DataLoader automatically applies sharding in every worker
warnings.warn("Usage of backward_compatibility.worker_init_fn is deprecated"

So, does this mean we do not need to worry about duplication with IterDataPipes as sharding is automatically applied in every worker according to this warning?

If yes, why do we then need to chain the sharding_filter() datapipe? Isn’t the purpose of sharding_filter() to make sure sharding is done in every worker process?

Thanks in advance!

In TorchData 0.3.0 and earlier, you needed both backward_compatibility.worker_init_fn for DataLoader and .sharding_filter() in your pipeline for sharding to work properly in every worker process.

You no longer need to import and use backward_compatibility.worker_init_fn in the latest TorchData version (0.4.0) because of a change to DataLoader. However, you still need to add .sharding_filter() to your data pipeline at the point where you would like to shard your data. See this tutorial for an example.
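The placement point matters because everything downstream of the shard runs only on each worker's share of the data. A plain-Python sketch of this (not the torchdata API; the round-robin-by-position sharding and the `decode` transform are illustrative assumptions):

```python
calls = {"decode": 0}

def decode(x):
    # Stand-in for an expensive per-sample transform; counts its calls.
    calls["decode"] += 1
    return x * 10

def shard(stream, num_workers, worker_id):
    # Mimics sharding semantics: worker k keeps every
    # num_workers-th element starting at offset k.
    for i, item in enumerate(stream):
        if i % num_workers == worker_id:
            yield item

data = range(8)
num_workers = 2

# Shard early: each worker decodes only its own half of the stream.
early = [
    decode(x)
    for w in range(num_workers)
    for x in shard(data, num_workers, w)
]
assert calls["decode"] == 8  # 8 samples, each decoded once in total

# Shard late: every worker decodes the full stream before sharding,
# so the expensive work is duplicated in each worker.
calls["decode"] = 0
late = [
    x
    for w in range(num_workers)
    for x in shard((decode(s) for s in data), num_workers, w)
]
assert calls["decode"] == 16  # every worker decoded every sample
```

Both orderings yield the same samples overall; sharding early simply avoids redundant work per worker.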
