Using a Google Cloud Storage bucket for a dataset

I’ve been stuck with this for a while now, and I’m certain this is a problem people have dealt with in the past.

I’m trying to use torch dataloaders on a Google Cloud VM with a GPU attached, and I want to run training on ImageNet. My options are either to get a persistent disk and download ImageNet onto it, or to download ImageNet into a GCS bucket.

If I go the bucket route, does anyone know the best way to interface it with a torch dataloader? I’ve looked into WebDataset as well, and it seems cool - but I’d likely have to pass in URLs (which isn’t necessarily a problem).

Long story short, I’m not sure whether I should mount the bucket, whether things “just work” if I use WebDataset, or overall what the best/most economical path forward is.

Thanks very much!

You may check out the TorchData library we just released. We do have a DataPipe that takes a URL from Google Drive: data/online.py at 652986b14e893e08edaea2c519e21bc61706de5c · pytorch/data · GitHub
Let us know if this covers your request.

I don’t think Google Drive is the same as GCS. Being able to load data from AWS S3 buckets and GCP GCS buckets are features I am looking forward to.

The fsspec and iopath DataPipes here should allow you to load data from S3. Let us know if that works for you.
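
For example, a minimal sketch along these lines should work (the bucket path below is a placeholder, and this assumes s3fs is installed so fsspec can resolve s3:// URLs):

from torchdata.datapipes.iter import FSSpecFileLister

# list objects under an S3 prefix (placeholder path) and open them lazily;
# each element yielded by the opener should be a (path, file stream) pair
dp = FSSpecFileLister(root="s3://my-bucket/imagenet/train")
dp = dp.open_file_by_fsspec(mode="rb")
for path, stream in dp:
    print(path)
    break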

I don’t think we have looked into Google Cloud Storage, but I’ll make a note of it.


Thanks! I’ll check whether those functionalities allow using GCS buckets.

Hi,
I’ve also started to check this option out.
What I have found is the following (see the sketch after this list for how I sanity-check the credentials with fsspec directly):

  1. There is an issue with passing credentials to fsspec in FSSpecFileListerIterDataPipe. A possible solution is to add a token argument to the fsspec.core.url_to_fs() call:
    fs, path = fsspec.core.url_to_fs(self.root, token="path/to/json")
  2. The output of fs.protocol is a tuple, ('gcs', 'gs').
    I’ve currently patched the code to take the relevant prefix for our GCS path.
    I hope to update once I get the pipeline running.
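
For reference, this is roughly how I sanity-check the credentials with fsspec/gcsfs directly, outside of the DataPipe (bucket name and credentials path are placeholders):

import fsspec

# check that gcsfs can resolve the bucket when given a service-account token
fs, path = fsspec.core.url_to_fs("gs://my-bucket/images", token="path/to/credentials.json")
print(fs.protocol)  # ('gcs', 'gs') - a tuple, not a single string
print(fs.ls(path))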

Hi @sephib, thanks for spotting the issue, we appreciate it!

Please open an issue (and a PR if you have a proposed fix) on GitHub.

We currently have a PR: Improve fsspec DataPipe to accept extra keyword arguments by ejguan · Pull Request #495 · pytorch/data · GitHub

Let us know if it doesn’t resolve your need to pass a token argument.
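
Once that lands, something along these lines should let you pass the credentials through the DataPipe (untested against GCS on our side; the idea is that extra keyword arguments get forwarded to fsspec, and the bucket and credentials paths below are placeholders):

from torchdata.datapipes.iter import FSSpecFileLister

# hypothetical usage once extra kwargs are forwarded to fsspec/gcsfs
dp = FSSpecFileLister(root="gs://my-bucket/images", masks=["*.png"],
                      token="path/to/credentials.json")
dp = dp.open_file_by_fsspec(mode="rb")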

👌
BTW, I’m currently investigating two strange issues with the token hotfix:

  1. I tried to follow the DataPipe/DataLoader tutorial, but when trying to iterate over the DataLoader the system doesn’t do anything - it doesn’t time out, it just keeps on running:
...
for idx, batch in enumerate(dl):
    print(idx)
  2. Another issue is that when running the following example twice, the first time the system works, but the second time (without restarting the kernel) fs.protocol in the open_file_by_fsspec class returns a file URI without the gs:// prefix and thus fails:
datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
ds = Mapper(file_dp, PIL_open)
for i in ds:     
    print(f'{i=}')
    show_image(i)
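
(For context, PIL_open is a small helper of mine, roughly along these lines - exact details may differ:)

from PIL import Image

def PIL_open(item):
    # item is the (path, file stream) pair yielded by open_file_by_fsspec
    path, stream = item
    img = Image.open(stream)
    img.load()  # force the read while the stream is still open
    return img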

For 1, can you let us know what version of the libraries (torch and torchdata) you have installed? And what is the exact code snippet that you are running?

  2. By “the hotfix”, do you mean you are running the code from this PR, or the fix that you described above?

cc: @ejguan

What do you mean by running it twice? Do you mean iterating ds twice, or running the script twice?

Does your image_bucket start with gs://?

  1. The versions are:
Name: torch
Version: 1.11.0

Name: torchdata
Version: 0.3.0

The above code is the exact code that I’m running. Here is a simpler version:

datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
list(file_dp)

After this, if you run

datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
list(file_dp)

again, you get a FileNotFoundError.
Once I restart the kernel it works.
However, if I run the script from the command line as below, there is no problem - I’m guessing it is something regarding how the Jupyter kernel uses fsspec.

python my_script.py

  2. The hotfix is my local one, where I added the token= argument to fsspec.core.url_to_fs.

Yes - my image_bucket URI starts with gs://.

This is interesting. I am not sure if this is something to do with the caching within fsspec. Could you please open an issue on GitHub?
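
One quick (unverified) thing to check is whether fsspec’s filesystem instance caching is involved, for example by forcing a fresh instance and seeing whether the second run still fails (paths below are placeholders):

import fsspec

# fsspec caches filesystem instances by their constructor arguments;
# skip_instance_cache forces a fresh GCSFileSystem each time
fs = fsspec.filesystem("gs", token="path/to/credentials.json",
                       skip_instance_cache=True)
print(fs.ls("my-bucket/images"))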

Regarding the DataPipe/DataLoader tutorial - this is the code that I’m trying to run, which never returns:

def build_datapipes(root_dir=image_bucket):
    datapipe = FSSpecFileLister(root=root_dir, masks=['*.png'])
    file_dp = datapipe.open_file_by_fsspec(mode='rb')  
    datapipe = file_dp.map(PIL_open)
    return datapipe

datapipe = build_datapipes()
dl = DataLoader(dataset=datapipe, batch_size=1, num_workers=1)
list(dl)

When I kill the process, I can see in the stack trace that it is waiting in python3.9/multiprocessing/connection.py, line 936.

Are you running this in ipykernel (a notebook)? I believe notebooks have some problems with multiprocessing (num_workers > 0).

No - I’m running it from a regular Python environment:

python my_script.py


Running with

dl = DataLoader(dataset=datapipe)

solved the issue.
I will need to understand why num_workers is causing issues on my setup.
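
One thing I still want to rule out (just a guess, not verified) is whether my script needs the standard multiprocessing entry-point guard, since worker processes re-import the module:

# guess: wrap the DataLoader usage in the standard entry-point guard so
# worker processes (num_workers > 0) can re-import the module safely
if __name__ == '__main__':
    datapipe = build_datapipes()
    dl = DataLoader(dataset=datapipe, batch_size=1, num_workers=1)
    list(dl)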

Opened an issue on GitHub
