How does one download a dataset from a file automatically with PyTorch?

I want to download a dataset from a specific URL to a specific path.

I tried the following:

    from torchvision.datasets.utils import download_and_extract_archive

    ## download mini-imagenet
    url = 'https://drive.google.com/file/d/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR'
    filename = 'miniImagenet.tgz'
    root = '~/tmp/'
    download_and_extract_archive(url, root, filename)

but it didn’t work.

Why? How do we fix it?

Error:

    Traceback (most recent call last):
      File "/Users/me/pytorch_playground.py", line 79, in <module>
        download_mini_imagenet()
      File "/Users/me/pytorch_playground.py", line 72, in download_mini_imagenet
        download_and_extract_archive(url, root, filename)
      File "/Users/me/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 268, in download_and_extract_archive
        extract_archive(archive, extract_root, remove_finished)
      File "/Users/me/lib/python3.7/site-packages/torchvision/datasets/utils.py", line 250, in extract_archive
        raise ValueError("Extraction of {} not supported".format(from_path))
    ValueError: Extraction of /Users/me/tmp/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR not supported

Related GitHub issue: https://github.com/pytorch/vision/issues/1028

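That’s because a Google Drive sharing URL is not a direct download link. As the traceback shows, the file was saved under the last part of the URL (1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR), which has no archive extension, so extract_archive does not know how to unpack it. Note also that the third positional argument of download_and_extract_archive is extract_root, not filename, so the file name has to be passed as filename=... for it to be used. Try this instead: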

    from torchvision.datasets.utils import download_file_from_google_drive

and extract the archive yourself, or adapt the code and use torchvision.datasets.utils.extract_archive.
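
To make the difference between the sharing URL and the file ID concrete, here is a minimal sketch that pulls the ID out of a URL of the form used in the question. The helper name drive_file_id is made up for illustration and is not part of torchvision, and it assumes the URL contains a /file/d/<id> segment.

```python
import re


def drive_file_id(url):
    """Return the file ID from a Google Drive sharing URL (hypothetical helper)."""
    # Assumes the '/file/d/<id>' URL layout used in the question.
    match = re.search(r"/file/d/([A-Za-z0-9_-]+)", url)
    if match is None:
        raise ValueError("no Google Drive file ID found in {!r}".format(url))
    return match.group(1)


url = "https://drive.google.com/file/d/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR"
file_id = drive_file_id(url)
print(file_id)
```

The resulting file_id is the kind of value download_file_from_google_drive takes as its first argument, in place of the full URL.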


How do you extract the contents? Does it depend on the archive format? Any examples?
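
On the format question: the traceback above suggests that extraction is chosen from the file name, since a file without an archive extension was rejected. Below is a standard-library-only sketch of the same idea, assuming shutil.unpack_archive is an acceptable stand-in for the torchvision helper. The function name extract_next_to_archive is made up for illustration.

```python
import os
import shutil


def extract_next_to_archive(archive_path):
    """Unpack an archive into the directory that contains it (illustrative helper)."""
    archive_path = os.path.abspath(os.path.expanduser(archive_path))
    target_dir = os.path.dirname(archive_path)
    # shutil infers the format from the extension (.zip, .tar, .tar.gz, .tgz, ...)
    # and raises shutil.ReadError for extensions it does not recognize.
    shutil.unpack_archive(archive_path, target_dir)
    return target_dir
```

With this, a file saved as miniImagenet.tgz would be unpacked as a gzipped tar, while a file saved without an extension would fail, much like the error in the question.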

Do you know how to NOT download the file if the dataset has already been downloaded?

I was reading:

    def _check_integrity(self):
        zip_filename = self._get_target_folder()
        if not check_integrity(join(self.root, zip_filename + '.zip'), self.zips_md5[zip_filename]):
            return False
        return True

and it doesn’t seem to be doable with mine because I do not have an MD5 hash for this dataset…
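
Without an MD5 hash, one option is to check whether the archive is already on disk before downloading. A minimal sketch under that assumption follows. The helper name already_downloaded is made up for illustration, and it only checks that the file exists, so it cannot detect a corrupted or partial download.

```python
import os


def already_downloaded(root, filename):
    """Return True if root/filename exists as a regular file (illustrative helper)."""
    fpath = os.path.join(os.path.expanduser(root), filename)
    # Existence check only: no checksum, so a truncated file still counts as present.
    return os.path.isfile(fpath)
```

Guarding the download step with something like `if not already_downloaded(root, 'miniImagenet.tgz'):` would skip the download on later runs. I believe torchvision's check_integrity behaves similarly when called without an md5 argument, but that is worth confirming against the installed version.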

Temporary solution:

    import os
    from torchvision.datasets.utils import download_file_from_google_drive, extract_archive

    def download_and_extract_miniImagenet(root):
        # Google Drive needs the file ID, not the sharing URL:
        # url = 'https://drive.google.com/file/d/1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR'
        file_id = '1rV3aj_hgfNTfCakffpPm7Vhpr1in87CR'
        filename = 'miniImagenet.tgz'
        download_file_from_google_drive(file_id, root, filename)
        # this is the path download_file_from_google_drive saves to
        fpath = os.path.join(os.path.expanduser(root), filename)
        # extract the downloaded dataset next to the archive
        extract_archive(fpath)

I am having errors with this: How does one download dataset from gdrive using pytorch?