TorchData: Advice on loading images from S3

MatthewCaseres · July 1, 2022, 5:48pm

Problem statement

We have lots of images in S3 and want to train a model on them.

The bucket contains several million images, but only about a million of them are labeled.

The plan is to make a CSV file containing the S3 paths and labels. Then we need to fetch the images, convert them to WebDataset format, and upload the result to another S3 bucket. We will then train from those WebDataset files.
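The CSV-to-WebDataset step described above could look roughly like the sketch below. It relies on the WebDataset convention that files belonging to one sample share a basename and differ only in extension. The column names `s3_path` and `label`, the helper names, and the `fetch` callable are all assumptions made for illustration; `fetch` stands in for whatever downloads an object (for example a thin wrapper around boto3).

```python
import csv
import io
import tarfile
from pathlib import Path, PurePosixPath
from typing import Callable, Iterable, List, Tuple


def read_manifest(csv_text: str) -> List[Tuple[str, str]]:
    """Parse CSV text with 's3_path' and 'label' columns (assumed names)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["s3_path"], row["label"]) for row in reader]


def _add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory blob to an open tar archive under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def write_shards(
    rows: Iterable[Tuple[str, str]],
    fetch: Callable[[str], bytes],  # hypothetical downloader: S3 path -> bytes
    out_dir: str,
    samples_per_shard: int = 1000,
) -> List[Path]:
    """Write samples into numbered tar shards; returns the shard paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shard_paths: List[Path] = []
    tar = None
    for index, (s3_path, label) in enumerate(rows):
        if index % samples_per_shard == 0:
            # Start a new shard every samples_per_shard samples.
            if tar is not None:
                tar.close()
            path = out / f"shard-{index // samples_per_shard:06d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        key = f"{index:08d}"
        suffix = PurePosixPath(s3_path).suffix.lstrip(".").lower() or "bin"
        # Image and label share the same key, so they form one sample.
        _add_bytes(tar, f"{key}.{suffix}", fetch(s3_path))
        _add_bytes(tar, f"{key}.cls", label.encode("utf-8"))
    if tar is not None:
        tar.close()
    return shard_paths
```

Keeping `fetch` as a parameter means the sharding logic can be tried locally with a fake downloader before pointing it at the real bucket; the upload of the finished shards to the second bucket is left out here.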

Questions for y'all

Is the creation of WebDataset files necessary? I had hoped it wasn’t, but I am hearing that it is needed to avoid networking bottlenecks.
Would it make sense to use the torchdata library’s datapipes in the pipeline that creates the WebDataset files? It seems like torchdata is meant for loading data during training, but does it also make sense to use it for general data processing?

More generally, any other advice you can offer would really help me out. Thanks.

tom · July 2, 2022, 5:37pm

If you can, check out the brand new PyTorch release, which sports S3 integration.

Best regards

Thomas

MatthewCaseres · July 3, 2022, 5:35am

That was the plan, but then I ran into a bug:

github.com/pytorch/data

S3FileLister: ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)

opened 08:04PM - 01 Jul 22 UTC

MatthewCaseres

### 🐛 Describe the bug

The code that I am running is:

```py
import torchdata.datapipes as dp

s3_urls = dp.iter.IterableWrapper(["s3://bucket/key"]).list_files_by_s3(request_timeout_ms=100)

print(next(iter(s3_urls)))
```

The full readout that I am seeing is here:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_11947/2457993211.py in <cell line: 5>()
      3 s3_urls = dp.iter.IterableWrapper(["s3://bucket/key"]).list_files_by_s3(request_timeout_ms=100)
      4
----> 5 print(next(iter(s3_urls)))

~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/utils/data/datapipes/_typing.py in wrap_generator(*args, **kwargs)
    512                         response = gen.send(None)
    513                 else:
--> 514                     response = gen.send(None)
    515
    516                 while True:

~/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torchdata/datapipes/iter/load/s3io.py in __iter__(self)
     56         for prefix in self.source_datapipe:
     57             while True:
---> 58                 urls = self.handler.list_files(prefix)
     59                 yield from urls
     60                 if not urls:

ValueError: curlCode: 77, Problem with the SSL CA cert (path? access rights?)
This exception is thrown by __iter__ of S3FileListerIterDataPipe(length=-1, source_datapipe=IterableWrapperIterDataPipe)
```

I can successfully run the following code:

```py
import boto3

s3 = boto3.resource('s3')
object = s3.Object('bucket', 'key')

# Download the file from S3
object.download_file('./test.tfrecords')
```

### Versions

Unsure if relevant, but I am on an EC2 instance running the Deep Learning AMI.

```
PyTorch version: 1.12.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: version 3.22.3
Libc version: glibc-2.27

Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-1080-aws-x86_64-with-glibc2.27
Is CUDA available: True
CUDA runtime version: 11.5.119
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 510.47.03
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-boto3-s3==1.21.0
[pip3] mypy-boto3-sagemaker==1.21.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.22.4
[pip3] numpydoc==1.2.1
[pip3] torch==1.12.0
[pip3] torch-model-archiver==0.5.3b20220226
[pip3] torch-workflow-archiver==0.2.4b20220513
[pip3] torchaudio==0.11.0
[pip3] torchdata==0.4.0
[pip3] torchserve==0.5.3b20220226
[pip3] torchtext==0.12.0
[pip3] torchvision==0.12.0
[conda] blas 2.115 mkl conda-forge
[conda] blas-devel 3.9.0 15_linux64_mkl conda-forge
[conda] captum 0.5.0 0 pytorch
[conda] cudatoolkit 11.5.1 hcf5317a_10 conda-forge
[conda] libblas 3.9.0 15_linux64_mkl conda-forge
[conda] libcblas 3.9.0 15_linux64_mkl conda-forge
[conda] liblapack 3.9.0 15_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 15_linux64_mkl conda-forge
[conda] magma-cuda115 2.6.1 0 pytorch
[conda] mkl 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] mkl-service 2.4.0 py39hb699420_0 conda-forge
[conda] mkl_fft 1.3.1 py39h1fd5c3a_3 conda-forge
[conda] mkl_random 1.2.2 py39h8b66066_1 conda-forge
[conda] numpy 1.22.4 py39hc58783e_0 conda-forge
[conda] numpydoc 1.2.1 pyhd8ed1ab_0 conda-forge
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch 1.12.0 pypi_0 pypi
[conda] torch-model-archiver 0.5.3 py39_0 pytorch
[conda] torch-workflow-archiver 0.2.4 py39_0 pytorch
[conda] torchaudio 0.11.0 py39_cu115 pytorch
[conda] torchdata 0.4.0 pypi_0 pypi
[conda] torchserve 0.5.3 py39_0 pytorch
[conda] torchtext 0.12.0 py39 pytorch
[conda] torchvision 0.12.0 py39_cu115 pytorch
```

For now I’m using a multiprocessing/async workaround that seems like it might work well enough, but I’m looking forward to using the S3 integration.
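For anyone wanting a similar stopgap, here is a minimal sketch of concurrent downloading. It uses a thread pool rather than multiprocessing or asyncio, on the assumption that the work is mostly network-bound, and it is not the code actually used in this thread. The function name `fetch_all` and the `fetch` callable (S3 path in, bytes out) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, Tuple


def fetch_all(
    paths: Iterable[str],
    fetch: Callable[[str], bytes],  # hypothetical downloader: S3 path -> bytes
    max_workers: int = 32,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, data) pairs in input order while downloads overlap."""
    path_list = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs fetch concurrently but returns results in input order.
        for path, data in zip(path_list, pool.map(fetch, path_list)):
            yield path, data
```

Note that `pool.map` submits every path up front, so for millions of objects it is probably better to call this on one chunk of the path list at a time.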

Based on the performance testing in this article, I plan to use SageMaker Fast File Mode (FFM) for the actual training, so for me the S3 integration would be used purely to create the tar files.

Other notes on the intended setup:

I plan to use .tar.bz2 for archiving, since Python can write it natively and the torchdata library supports it. I will write to a single large tar file. I shouldn’t even have to make different folders for train and test, since I can include indicators in the filenames and then use Demultiplexer, if I understand it correctly. Apparently files get pulled out of the tar file in pseudorandom order, so there should be no need to worry about them being aligned by filename in some problematic way, and there is also the torchdata Shuffler, which shuffles within a buffer of some size.
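A small standard-library sketch of the single-archive idea, assuming the split indicator is a `train/` or `test/` prefix on each member name (that naming scheme and both function names are made up for illustration). In a torchdata pipeline the routing would presumably go through `demux` with a classifier function over the file name instead of the dictionary used here.

```python
import io
import tarfile
from typing import Dict, Iterable, List, Tuple


def write_archive(path: str, samples: Iterable[Tuple[str, bytes]]) -> None:
    """Write (member_name, data) pairs into one bzip2-compressed tar file."""
    with tarfile.open(path, "w:bz2") as tar:
        for name, data in samples:
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


def split_by_prefix(path: str) -> Dict[str, List[Tuple[str, bytes]]]:
    """Group archive members by the first path component of their name."""
    splits: Dict[str, List[Tuple[str, bytes]]] = {}
    with tarfile.open(path, "r:bz2") as tar:
        for member in tar:  # members come back in the order they were written
            if not member.isfile():
                continue
            split = member.name.split("/", 1)[0]
            data = tar.extractfile(member).read()
            splits.setdefault(split, []).append((member.name, data))
    return splits
```

With Python's `tarfile`, at least, members are returned in the order they were written, so it may be worth shuffling the rows before writing or relying on a shuffle buffer at read time.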

Dor_Biton · October 3, 2022, 2:18pm

I had the same issue, and setting an environment variable helped:
export S3_VERIFY_SSL='0'
This worked for me.
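The same setting can also be applied from inside Python, for example in a notebook where the shell environment is awkward to change. This is a minimal sketch that assumes the torchdata S3 handler reads `S3_VERIFY_SSL` from the process environment when the datapipe is created (the exact timing is not confirmed here), so the assignment goes before any S3 datapipe is built.

```python
import os

# Assumption: the variable must be set before the S3 datapipe is constructed.
# "0" turns off SSL certificate verification for the S3 handler.
os.environ["S3_VERIFY_SSL"] = "0"
```

Since this disables certificate verification, it is probably best treated as a workaround until the CA certificate path on the machine is sorted out.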