Using a Google Cloud Storage bucket for a dataset

I’ve been stuck with this for a while now, and I’m certain this is a problem people have dealt with in the past.

I’m trying to use torch dataloaders on a Google Cloud VM with a GPU attached, and I want to run training on ImageNet. My options are either to get a persistent disk and download ImageNet onto it, or to download ImageNet into a GCS bucket.

If I go the bucket route, does anyone know the best way to interface it with a torch dataloader? I’ve looked into WebDataset as well, and it seems cool - but I’d likely have to pass in URLs (which isn’t necessarily a problem).

Long story short, I’m not sure whether I should mount the bucket, whether things “just work” if I use WebDataset, or overall what the best/most economical path forward is.

Thanks very much!

You may check out the TorchData library we just released. We do have a DataPipe that takes a URL from Google Drive: data/online.py at 652986b14e893e08edaea2c519e21bc61706de5c · pytorch/data · GitHub
Let us know if this covers your request.

I don’t think Google Drive is the same as GCS. Being able to load data from AWS S3 buckets and GCP GCS buckets are features I am looking forward to.

The fsspec and iopath DataPipes here should allow you to load data from S3. Let us know if that works for you.
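
For example, a minimal sketch along these lines should work (the bucket path below is a placeholder, and this assumes s3fs is installed so fsspec can resolve s3:// URLs):

from torchdata.datapipes.iter import FSSpecFileLister

# list objects under an S3 prefix (placeholder path) and open them lazily;
# each element yielded by the opener should be a (path, file stream) pair
dp = FSSpecFileLister(root="s3://my-bucket/imagenet/train")
dp = dp.open_file_by_fsspec(mode="rb")
for path, stream in dp:
    print(path)
    break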

I don’t think we have looked into Google Cloud Storage, but I’ll make a note of it.


Thanks! I’ll check whether those functionalities allow using GCS buckets.

Hi,
I’ve also started to check this option out.
What I have found is the following (see the sketch after this list for how I sanity-check the credentials with fsspec directly):

  1. There is an issue with passing credentials to fsspec in FSSpecFileListerIterDataPipe. A possible solution is to add a token argument to the fsspec.core.url_to_fs() call:
    fs, path = fsspec.core.url_to_fs(self.root, token="path/to/json")
  2. The output of fs.protocol is a tuple, ('gcs', 'gs').
    I’ve currently patched the code to take the relevant prefix for our GCS path.
    I hope to update once I get the pipeline running.
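
For reference, this is roughly how I sanity-check the credentials with fsspec/gcsfs directly, outside of the DataPipe (bucket name and credentials path are placeholders):

import fsspec

# check that gcsfs can resolve the bucket when given a service-account token
fs, path = fsspec.core.url_to_fs("gs://my-bucket/images", token="path/to/credentials.json")
print(fs.protocol)  # ('gcs', 'gs') - a tuple, not a single string
print(fs.ls(path))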

Hi @sephib, thanks for spotting the issue, we appreciate it!

Please open an issue (and a PR if you have a proposed fix) on GitHub.

We currently have a PR: Improve fsspec DataPipe to accept extra keyword arguments by ejguan · Pull Request #495 · pytorch/data · GitHub

Let us know if it doesn’t resolve your need to pass a token argument.
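
Once that lands, something along these lines should let you pass the credentials through the DataPipe (untested against GCS on our side; the idea is that extra keyword arguments get forwarded to fsspec, and the bucket and credentials paths below are placeholders):

from torchdata.datapipes.iter import FSSpecFileLister

# hypothetical usage once extra kwargs are forwarded to fsspec/gcsfs
dp = FSSpecFileLister(root="gs://my-bucket/images", masks=["*.png"],
                      token="path/to/credentials.json")
dp = dp.open_file_by_fsspec(mode="rb")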

👌
BTW, I’m currently investigating two strange issues with the token hotfix:

  1. I tried to follow the DataPipe/DataLoader tutorial, but when trying to iterate over the DataLoader the system doesn’t do anything - it doesn’t time out, it just keeps on running:
...
for idx, batch in enumerate(dl):
    print(idx)
  2. Another issue is that when running the following example twice, the first time the system works, but the second time (without restarting the kernel) fs.protocol in the open_file_by_fsspec class returns a file URI without the gs:// prefix and thus fails:
datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
ds = Mapper(file_dp, PIL_open)
for i in ds:     
    print(f'{i=}')
    show_image(i)
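
(For context, PIL_open is a small helper of mine, roughly along these lines - exact details may differ:)

from PIL import Image

def PIL_open(item):
    # item is the (path, file stream) pair yielded by open_file_by_fsspec
    path, stream = item
    img = Image.open(stream)
    img.load()  # force the read while the stream is still open
    return img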

For 1, can you let us know what version of the libraries (torch and torchdata) you have installed? And what is the exact code snippet that you are running?

  2. By “the hotfix”, do you mean you are running the code from this PR, or the fix that you described above?

cc: @ejguan

What do you mean by running it twice? Do you mean iterating ds twice, or running the script twice?

Does your image_bucket start with gs://?

  1. The versions are:
Name: torch
Version: 1.11.0

Name: torchdata
Version: 0.3.0

The above code is the exact code that I’m running. Here is a simpler version:

datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
list(file_dp)

After this, if you run

datapipe = FSSpecFileLister(root=image_bucket, masks=['*.png'])
file_dp = datapipe.open_file_by_fsspec(mode='rb')
list(file_dp)

again, you get a FileNotFoundError.
Once I restart the kernel it works.
However, if I run the script from the command line as below, there is no problem - I’m guessing it is something regarding how the Jupyter kernel uses fsspec.

python my_script.py

  2. The hotfix is my local one, where I added the token= argument to fsspec.core.url_to_fs.

Yes - my image_bucket URI starts with gs://.

This is interesting. I am not sure if this is something to do with the caching within fsspec. Could you please open an issue on GitHub?
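
One quick (unverified) thing to check is whether fsspec’s filesystem instance caching is involved, for example by forcing a fresh instance and seeing whether the second run still fails (paths below are placeholders):

import fsspec

# fsspec caches filesystem instances by their constructor arguments;
# skip_instance_cache forces a fresh GCSFileSystem each time
fs = fsspec.filesystem("gs", token="path/to/credentials.json",
                       skip_instance_cache=True)
print(fs.ls("my-bucket/images"))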

Regarding the DataPipe/DataLoader tutorial - this is the code that I’m trying to run, which never returns:

def build_datapipes(root_dir=image_bucket):
    datapipe = FSSpecFileLister(root=root_dir, masks=['*.png'])
    file_dp = datapipe.open_file_by_fsspec(mode='rb')  
    datapipe = file_dp.map(PIL_open)
    return datapipe

datapipe = build_datapipes()
dl = DataLoader(dataset=datapipe, batch_size=1, num_workers=1)
list(dl)

When I kill the process, I can see in the stack trace that it is waiting in python3.9/multiprocessing/connection.py, line 936.

Are you running this in ipykernel (a notebook)? I believe notebooks have some problems with multiprocessing (num_workers > 0).

No - I’m running it from a regular Python environment:

python my_script.py


Running with

dl = DataLoader(dataset=datapipe)

solved the issue.
I will need to understand why num_workers is causing issues on my setup.
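
One thing I still want to rule out (just a guess, not verified) is whether my script needs the standard multiprocessing entry-point guard, since worker processes re-import the module:

# guess: wrap the DataLoader usage in the standard entry-point guard so
# worker processes (num_workers > 0) can re-import the module safely
if __name__ == '__main__':
    datapipe = build_datapipes()
    dl = DataLoader(dataset=datapipe, batch_size=1, num_workers=1)
    list(dl)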

Opened an issue on GitHub
