Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None

Hello,

Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

The error above showed up after calling build_vocab_from_iterator() here:

vocab_src = build_vocab_from_iterator(
    yield_tokens(train + val + test, tokenize_de, index=0),
    min_freq=2,
    specials=["<s>", "</s>", "<blank>", "<unk>"],
)

I am using Google Colab to run the notebook from this repo. Link to notebook: Google Colab

Thank you in advance.

The problem comes from the Multi30k dataset (source URL: “http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz”), which is not accessible right now.


Note: the torchtext.vocab.build_vocab_from_iterator() call in the Google Colab notebook above is what triggers the download of this dataset. (Sorry for not being specific when describing the problem.)

I’ve been experiencing the same problem for some time and can’t seem to find a solution.

I think you could use other datasets instead.

I think I found working URLs for the train, validation, and test files. You can override the Multi30k URLs to replace the original invalid ones:

from torchtext.datasets import multi30k, Multi30k

multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

If you do not want to update the Multi30k URLs, you can download the files from the URLs above and put the tar.gz files in the torch cache directory. On my machine, the directory is /root/.cache/torch/text/datasets/Multi30k. Copy the tar.gz files into that directory and run the code. PyTorch will uncompress the archives and extract the train.de and train.en files.
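
In case it helps, here is a minimal sketch of the manual copy step in Python. The helper name copy_archives_to_cache is made up for illustration, and the cache directory is an assumption: it is the path from my machine, and other torchtext versions may look somewhere else, so check where your version expects the files.

```python
import shutil
from pathlib import Path


def copy_archives_to_cache(source_dir, cache_dir):
    """Copy every *.tar.gz file in source_dir into cache_dir.

    Hypothetical helper: cache_dir must be the directory your torchtext
    version actually reads from (an assumption you need to verify).
    Returns the list of copied target paths.
    """
    source = Path(source_dir)
    cache = Path(cache_dir).expanduser()
    # Create the cache directory if it does not exist yet.
    cache.mkdir(parents=True, exist_ok=True)

    copied = []
    for archive in sorted(source.glob("*.tar.gz")):
        target = cache / archive.name
        shutil.copy2(archive, target)
        copied.append(target)
    return copied


# Example call (paths are assumptions, adjust to your setup):
# copy_archives_to_cache(".", "/root/.cache/torch/text/datasets/Multi30k")
```

The function only copies files; it does not download or verify them, so the archives still have to pass whatever checks torchtext applies when it loads them.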

On Google Colab, when I try to use the archives from the URLs above, I get this error in the block with the build_vocabulary() function:

RuntimeError: The computed hash 6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36 of /root/.torchtext/cache/Multi30k/mmt16_task1_test.tar.gz does not match the expectedhash 0681be16a532912288a91ddd573594fbdd57c0fbb81486eff7c55247e35326c2. Delete the file manually and retry.

Do you know how to fix this?

Here is the code I inserted before that:

!rm *.tar.gz
!rm -rf /root/.torchtext/cache/Multi30k/*

# datasets.multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
# datasets.multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
# datasets.multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"


!wget https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz
!wget https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz
!wget https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz
!cp -f *.tar.gz /root/.torchtext/cache/Multi30k/
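
For anyone debugging the same message: the hashes in the error are 64 hex characters long, which matches a SHA-256 digest. A minimal sketch for computing that digest of a downloaded archive, so it can be compared with the expected value from the error, might look like this. The helper name sha256_of_file is hypothetical, and this only shows whether the mirrored file differs from what the library expects; it does not fix the mismatch.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at path.

    Hypothetical helper; reads the file in chunks so large archives
    do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Example call (path taken from the error message above):
# print(sha256_of_file("/root/.torchtext/cache/Multi30k/mmt16_task1_test.tar.gz"))
```

If the printed digest equals the "computed hash" in the error, the file was read correctly and the archive itself simply differs from the one the library was built against.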

Were you able to fix this?