Slow AWS S3 data loading using TorchData open_files_by_fsspec

Hi All,

I am currently using AWS S3 to store a large dataset that has been split into tar files. While scaling up my training process, I found that the main bottleneck is the fsspec component: loading the data stored in the tar files takes approximately 5 seconds.
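To isolate the fsspec cost from the rest of the pipeline, I timed a raw read of a single shard outside the datapipe, roughly like this (the bucket path below is a placeholder, not my real layout):

import time
import fsspec

# Placeholder URL -- substitute one of your actual tar shards.
url = "s3://my-bucket/shards/shard-000000.tar"

start = time.perf_counter()
with fsspec.open(url, mode="rb") as f:
    data = f.read()  # full download of the shard
elapsed = time.perf_counter() - start
print(f"Read {len(data)} bytes in {elapsed:.2f}s")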

If you have encountered a similar challenge and have any experience or insights to share, I would greatly appreciate your assistance.

Thank you for your time!

Setup:

torch==2.0
pytorch_lightning==2.0.2
torchdata==0.6

TorchData datapipe:

import torchdata.datapipes.iter as dpi

datapipe = (
    dpi.IterableWrapper(data_dir)               # tar shard paths/URLs on S3
    .shuffle()                                  # shuffle shard order
    .open_files_by_fsspec(mode='rb')            # open each shard via fsspec
    .load_from_tar()                            # iterate over members of each tar
    .groupby(group_by_filename, group_size=2, guaranteed_group_size=2)
    .sharding_filter()                          # split work across workers
    .shuffle(buffer_size=self.batch_size)       # sample-level shuffle
    .map(self.to_sampels)                       # build a training sample
)
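A minimal way to consume this pipeline, assuming torchdata 0.6's DataLoader2 (the worker count below is illustrative, not my actual setting):

from torchdata.dataloader2 import DataLoader2, MultiProcessingReadingService

# Illustrative worker count; tune for your machine.
rs = MultiProcessingReadingService(num_workers=4)
dl = DataLoader2(datapipe, reading_service=rs)

for sample in dl:
    ...  # training step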

I had the same bottleneck when streaming data from a Google Cloud Storage bucket. Since you're already using tar files, you could use WebDataset to stream the data - it might be faster than going through IterableWrapper. A rough sketch is below.
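Something like this, assuming the tars hold image samples stored as jpg/cls pairs (the bucket name and shard range are placeholders, and the "pipe:" URL streams each shard through the AWS CLI):

import webdataset as wds

# Placeholder shard pattern; "pipe:" shells out to the AWS CLI per shard.
urls = "pipe:aws s3 cp s3://my-bucket/shards/shard-{000000..000099}.tar -"

dataset = (
    wds.WebDataset(urls, shardshuffle=True)  # shuffle shard order
    .shuffle(1000)                           # sample-level shuffle buffer
    .decode("pil")                           # decode images to PIL
    .to_tuple("jpg", "cls")                  # yield (image, label) pairs
)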

I also found a blog post by GCP on this, but I've never tried it myself as my data is in CSV format.