We have lots of images in S3 and want to train a model on them.
There is a bucket containing many images, but not all of them are labeled; maybe a million out of several million images have labels.
The plan is to make a CSV file containing the S3 paths and labels, then fetch the images, convert them to WebDataset files, and upload those to another S3 bucket. Training will then read from the WebDataset files.
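To make the CSV-to-tar step concrete, here's a minimal sketch using only the stdlib `tarfile` module. The sample keys, the `.jpg`/`.cls` naming, and the fake image bytes are assumptions for illustration; in a real run the rows would come from the CSV manifest and the bytes from S3 (e.g. via boto3).

```python
import io
import tarfile

def write_webdataset_shard(rows, shard_path):
    """Write (key, image_bytes, label) samples into one WebDataset-style tar.

    WebDataset groups files by a shared key: for each sample we emit
    <key>.jpg (the image bytes) and <key>.cls (the label as text).
    """
    with tarfile.open(shard_path, "w") as tar:
        for key, image_bytes, label in rows:
            for name, payload in (
                (f"{key}.jpg", image_bytes),
                (f"{key}.cls", str(label).encode()),
            ):
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

# Hypothetical usage: two tiny fake samples so the sketch runs end to end.
samples = [
    ("img_000000", b"\xff\xd8fakejpeg0", 3),
    ("img_000001", b"\xff\xd8fakejpeg1", 7),
]
write_webdataset_shard(samples, "shard-000000.tar")

with tarfile.open("shard-000000.tar") as tar:
    print(sorted(tar.getnames()))
    # ['img_000000.cls', 'img_000000.jpg', 'img_000001.cls', 'img_000001.jpg']
```

The point of the key-pairing convention is that a WebDataset loader can regroup `.jpg` and `.cls` members back into one training sample.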
Questions for y'all:
Is creating WebDataset files necessary? I had hoped it wasn't, but I'm hearing that it will be, to avoid networking bottlenecks.
Would it make sense to use the torchdata library's datapipes in the WebDataset creation pipeline? torchdata seems aimed at loading data for training, but does it also make sense for general-purpose processing?
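For context on why datapipes feel tempting for processing too: they are essentially composable lazy iterables. Here's the shape of that idea in plain generators (the file names and label table are made up, and this is a stand-in for torchdata's API, not a use of it):

```python
def source(paths):
    # Yield raw records lazily; in torchdata this role would be played
    # by an IterDataPipe over the CSV manifest.
    for p in paths:
        yield p

def attach_labels(records, labels):
    # Per-record transform: map each path to (path, label) and drop
    # unlabeled images -- the kind of stage datapipes chain together.
    for r in records:
        if r in labels:
            yield r, labels[r]

labels = {"a.jpg": 0, "c.jpg": 1}  # hypothetical label table
pipeline = attach_labels(source(["a.jpg", "b.jpg", "c.jpg"]), labels)
print(list(pipeline))  # [('a.jpg', 0), ('c.jpg', 1)]
```

Because each stage is lazy, nothing is held in memory beyond the record in flight, which is why the same composition style works for offline processing as well as training-time loading.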
More generally, if you have any good advice to offer, it would really help me out. Thanks.
I'm doing a multiprocessing/async thing that seems like it might work well enough, but I'm looking forward to using the S3 integration.
Based on performance testing in this article, I plan to use SageMaker Fast File Mode (FFM) when actually training. So for me the S3 integration is purely for creating tarfiles.
Other notes on the intended setup: I'm planning to use .tar.bz2 for archiving, since Python's tarfile module can write bz2 natively and torchdata supports it. I will write to a single large tarfile. I won't even have to make separate folders for train and test, since I can include an indicator in the filenames and then split with Demultiplexer? Apparently files from the tarfile get pulled out in pseudorandom order, so I don't need to worry about them being aligned by filename in some problematic way, although there is also torchdata's Shuffler, which shuffles a buffer of a given size.
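A runnable stdlib sketch of the .tar.bz2 plan: `tarfile` handles bz2 via the `"w:bz2"` mode, and the split indicator can live in the member name. The filenames and the prefix rule are made up, and the hand-rolled split loop below is just a stand-in for what torchdata's Demultiplexer would do.

```python
import io
import tarfile

# One bz2-compressed tarfile, with the train/test split encoded in the
# member-name prefix instead of separate folders (hypothetical names).
samples = [
    ("train/img_0.jpg", b"aa"),
    ("test/img_1.jpg", b"bb"),
    ("train/img_2.jpg", b"cc"),
]
with tarfile.open("data.tar.bz2", "w:bz2") as tar:
    for name, payload in samples:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Later, a Demultiplexer-style split keyed on the filename prefix:
train, test = [], []
with tarfile.open("data.tar.bz2", "r:bz2") as tar:
    for member in tar:
        (train if member.name.startswith("train/") else test).append(member.name)

print(train)  # ['train/img_0.jpg', 'train/img_2.jpg']
print(test)   # ['test/img_1.jpg']
```

One caveat worth checking before committing to a single archive: a plain tarfile is read sequentially in insertion order, so any pseudorandom ordering has to come from the reader side (e.g. a Shuffler buffer), not from the archive itself.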