Using a Google Cloud Storage bucket for datasets

Since there is continued interest from the community in loading datasets from AWS S3, Google Cloud, and Azure, we have included a tutorial in our documentation on this topic.

I think the code base may be a bit ahead of what’s documented, as I found that the following worked in a notebook environment in VSCode. Just wanted to share for anyone else working through ways to use webdataset with GCP buckets. (If I’ve missed the documentation on gs://-style URLs, please reply with a link for reference.)

I haven’t tested this in a VM environment yet, but I don’t see why it wouldn’t work in that context too.

# Handle GCP login; this launches an interactive login sequence
!gcloud auth application-default login
# Set the active project
!gcloud config set project your-cool-project
# wds is the usual alias for the webdataset package
import webdataset as wds
train_dataset = wds.WebDataset("gs://your-bucket/yourshard_{00000..00005}.tar")
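
If the gs:// form above doesn't work in your setup, one fallback that I believe is covered by webdataset's `pipe:` URL scheme is to stream each shard through the `gsutil` CLI. Here is a rough sketch, not tested against a real bucket. The helper name `gs_to_pipe_url` is made up for illustration, and it assumes `gsutil` is installed and on your PATH.

```python
# Sketch only: rewrite a gs:// shard URL into webdataset's "pipe:" form,
# which reads each shard from the stdout of a shell command.
# Assumption: the gsutil CLI is installed and already authenticated.

def gs_to_pipe_url(gs_url: str) -> str:
    """Hypothetical helper: prefix a gs:// URL with 'pipe:gsutil cat '."""
    if not gs_url.startswith("gs://"):
        raise ValueError(f"expected a gs:// URL, got {gs_url!r}")
    return f"pipe:gsutil cat {gs_url}"

# Same placeholder bucket and shard pattern as in the post above.
shards = "gs://your-bucket/yourshard_{00000..00005}.tar"
pipe_url = gs_to_pipe_url(shards)
print(pipe_url)
# prints: pipe:gsutil cat gs://your-bucket/yourshard_{00000..00005}.tar
```

The resulting string would then be passed to `wds.WebDataset(pipe_url)` in place of the plain gs:// URL. The brace pattern is left untouched so webdataset can expand it into the individual shards.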