Hi All,
I am using AWS S3 to store a large dataset that has been sharded into tar files. While scaling up my training, I found that the main bottleneck is the fsspec component: it takes roughly 5 seconds to load the data from each tar file.
If you have run into a similar problem and have any experience or insights to share, I would greatly appreciate your help.
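For anyone who wants to reproduce the measurement, a minimal sketch of how one might time the fsspec read in isolation (the S3 URL is a placeholder; `time_fsspec_read` is a hypothetical helper, not part of my pipeline):

```python
import time
import fsspec

def time_fsspec_read(url: str) -> float:
    """Open a file via fsspec and read it fully, returning elapsed seconds."""
    start = time.perf_counter()
    with fsspec.open(url, mode="rb") as f:
        f.read()
    return time.perf_counter() - start

# e.g. time_fsspec_read("s3://my-bucket/shard-000.tar")
```

This helps distinguish whether the 5 seconds comes from fsspec/S3 itself or from the rest of the datapipe.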
Thank you for your time!
Setup:
torch==2.0
pytorch_lightning==2.0.2
torchdata==0.6
TorchData datapipe:
datapipe = (
    torchdata.datapipes.iter.IterableWrapper(data_dir)
    .shuffle()                        # shuffle shard order
    .open_files_by_fsspec(mode='rb')  # open remote tar files via fsspec
    .load_from_tar()                  # iterate over tar members
    .groupby(group_by_filename, group_size=2, guaranteed_group_size=2)
    .sharding_filter()                # split work across workers
    .shuffle(buffer_size=self.batch_size)
    .map(self.to_sampels)
)
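For context, `group_by_filename` pairs up the two tar members that belong to the same sample (e.g. data + label). A minimal sketch of what such a key function could look like (my actual implementation may differ):

```python
import os

def group_by_filename(item):
    """Key function for groupby: load_from_tar yields (path, stream)
    tuples, so group members that share the same filename stem."""
    path, _stream = item
    return os.path.splitext(os.path.basename(path))[0]
```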