How does one download a dataset from a file automatically with PyTorch?

pinocchio · March 24, 2020, 6:01pm

I want to download a dataset from a specific URL to a specific path.

I tried the following:

    from torchvision.datasets.utils import download_and_extract_archive

    ## download mini-imagenet
    url = 'https://drive.google.com/file/d/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR'
    filename = 'miniImagenet.tgz'
    root = '~/tmp/'
    download_and_extract_archive(url, root, filename)

but it didn’t work.

Why? How do we fix it?

Error:

Traceback (most recent call last):
  File "/Users/me/pytorch_playground.py", line 79, in <module>
    download_mini_imagenet()
  File "/Users/me/pytorch_playground.py", line 72, in download_mini_imagenet
    download_and_extract_archive(url, root, filename)
  File "/Users/me/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 268, in download_and_extract_archive
    extract_archive(archive, extract_root, remove_finished)
  File "/Users/me/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 250, in extract_archive
    raise ValueError("Extraction of {} not supported".format(from_path))
ValueError: Extraction of /Users/me/tmp/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR not supported

Related GitHub issue: https://github.com/pytorch/vision/issues/1028

simaiden · March 24, 2020, 6:09pm

That’s because a Google Drive sharing URL is not a direct download link. Try this instead:

    from torchvision.datasets.utils import download_file_from_google_drive

and then extract the archive yourself, or adapt the code to use torchvision.datasets.utils.extract_archive.
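
To see why the original call fails, look at the file name in the traceback: the archive was saved as 1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR, the last component of the URL, not as miniImagenet.tgz. In the torchvision versions I have seen, the third positional parameter of download_and_extract_archive is extract_root rather than filename, so the file name falls back to the URL's base name, which has no archive extension for extract_archive to recognise. Treat that signature as an assumption and check your installed version. A minimal stdlib-only sketch of the fallback, plus a hypothetical helper (drive_file_id is my own name, not part of torchvision) that pulls out the ID download_file_from_google_drive expects:

```python
import os
import re

url = 'https://drive.google.com/file/d/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR'

# The fallback file name is just the last URL component, with no extension.
fallback_name = os.path.basename(url)
extension = os.path.splitext(fallback_name)[1]


def drive_file_id(share_url):
    """Hypothetical helper: return the ID from a '.../file/d/<id>' sharing URL."""
    match = re.search(r'/file/d/([^/?#]+)', share_url)
    if match is None:
        raise ValueError('not a Google Drive file URL: {}'.format(share_url))
    return match.group(1)


print(fallback_name)        # 1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR
print(repr(extension))      # ''
print(drive_file_id(url))   # 1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR
```

The helper only handles URLs of the /file/d/<id> form shown in the question; other Google Drive link styles would need their own pattern.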

pinocchio · March 24, 2020, 6:35pm

How do you extract the contents? Does it depend on the archive format? Any examples?
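
On the format question: extraction does depend on the archive type, and helpers usually pick the extractor from the file extension, which is why a file saved without .tgz cannot be extracted. If you would rather extract the file yourself, one option that needs only the standard library is shutil.unpack_archive. The function name extract_to and its default destination below are illustrative choices, not torchvision API:

```python
import os
import shutil


def extract_to(archive_path, to_path=None):
    """Extract archive_path into to_path (default: the archive's own folder)."""
    archive_path = os.path.abspath(os.path.expanduser(archive_path))
    if to_path is None:
        to_path = os.path.dirname(archive_path)
    to_path = os.path.abspath(os.path.expanduser(to_path))
    # The format is inferred from the extension (.zip, .tar, .tar.gz, .tgz, ...).
    shutil.unpack_archive(archive_path, to_path)
    return to_path
```

For the file in this thread that would be a call like extract_to('~/tmp/miniImagenet.tgz'). Only extract archives you trust, since archive members can carry unexpected paths.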

pinocchio · March 24, 2020, 6:56pm

Do you know how to NOT download the file if the dataset has already been downloaded?

I was reading:

    def _check_integrity(self):
        zip_filename = self._get_target_folder()
        if not check_integrity(join(self.root, zip_filename + '.zip'), self.zips_md5[zip_filename]):
            return False
        return True

and it doesn’t seem to be doable in my case because I do not have an MD5 hash for this dataset…
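
Without a published MD5, a plain existence check may already be enough. As far as I can tell, torchvision's check_integrity(fpath, md5=None) only tests that the file exists when no hash is passed, but treat that as an assumption and check your installed version. A standalone sketch of the same idea, using only the standard library (already_downloaded and file_md5 are hypothetical names of mine):

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks to limit memory use."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()


def already_downloaded(path, expected_md5=None):
    """True if the file exists and, when a hash is given, matches that hash."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return False
    if expected_md5 is None:
        # No hash available: existence is the only check we can make.
        return True
    return file_md5(path) == expected_md5
```

You could guard the download with a check like already_downloaded('~/tmp/miniImagenet.tgz'), and run file_md5 once on a copy you trust to record your own checksum for later runs. An existence-only check cannot detect a partially downloaded file.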

pinocchio · March 24, 2020, 7:03pm

Temporary solution:

    import os
    from torchvision.datasets.utils import download_file_from_google_drive, extract_archive

    def download_and_extract_miniImagenet(root):
        ## download miniImagenet
        # The file ID is the last component of the sharing URL:
        # https://drive.google.com/file/d/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR
        file_id = '1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR'
        filename = 'miniImagenet.tgz'
        download_file_from_google_drive(file_id, root, filename)
        # Path the file was saved to (this is what download_file_from_google_drive does)
        fpath = os.path.join(root, filename)
        ## extract downloaded dataset
        # Expand '~' so extract_archive receives a real path
        from_path = os.path.expanduser(fpath)
        extract_archive(from_path)

pinocchio · March 28, 2020, 9:54pm

I am getting errors with this approach. See: How does one download dataset from gdrive using pytorch?