build_vocab_from_iterator does not work in notebook

Hi. I was trying to run this notebook, but the following line times out:

vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                min_freq=1,
                                                specials=special_symbols,
                                                special_first=True)

Specifically, it raises a TimeoutError: [Errno 110] Connection timed out, and the last line of the trace is:
Exception: Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

What can be done to circumvent this issue? Thanks in advance for any help you can provide.

Met the same problem today.
I tried to access the file link directly, but it fails too, saying "The requested URL could not be retrieved".
I guess the database server is down or their network is broken.

Hope it can recover soon… Or are there any alternatives?

Already tried archive.org, but sadly the files are unavailable there too.

I am having exactly the same problem. I am new to NLP, so I am not sure what to do. Perhaps somebody knows another source of English/German (or other language) sentence pairs that we can use instead. I will check back later.

The maintainer of this GitHub file, PyTorch-NLP/multi30k.py at master · PetrochukM/PyTorch-NLP, claims the following:

Status:
    Host ``www.quest.dcs.shef.ac.uk`` forgot to update their SSL
    certificate; therefore, this dataset does not download securely.
References:
    * http://www.statmt.org/wmt16/multimodal-task.html
    * http://shannon.cs.illinois.edu/DenotationGraph/

He seems to have constructed a workaround, but I have not managed to get it to work.

The page is not delivering any content, though.

Same problem… I tried to email the site owner but got no response…

I have those files but don’t know where to put them to make them available for everyone.

Is it possible to put them on Dropbox or Google Drive and share them using a public link?

First author of the Multi30K dataset here :wave:.

I didn’t know these were being used in a PyTorch tutorial, so we are working on hosting these files elsewhere. Alternatively, if someone understands how the files are being used by torchtext.datasets.Multi30K, would one solution be to re-route the data loading to the Multi30K GitHub repository?

I'm a beginner, but I found the source code of torchtext.datasets.multi30k here. One may change the URLs and MD5s to make it work ~

That is precisely what I was thinking, but I do not own that repo, so I couldn't do it.

Please note that I have been working on the following code:

http://nlp.seas.harvard.edu/annotated-transformer/

This code uses the same Multi30K dataset. I was able to get the code to work by using another data file. The basic idea is that the training, validation, and test sets are all lists of tuples, where each tuple is a pair of corresponding sentences, one per language. This insight is nice since it makes it easy to create any language pairing you would like (a small sketch of that structure follows at the end of this post). Here is my implementation in Colab along with lots of notes:

Hope this helps. Any comments are welcome.
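
To make the tuple structure concrete, here is a minimal sketch (the sentences are made up, and load_pairs is a hypothetical helper, not part of the Annotated Transformer code):

    # Each split is just a list of (source, target) sentence pairs.
    train = [
        ("Zwei Hunde spielen im Schnee.", "Two dogs play in the snow."),
        ("Ein Mann schläft in einem grünen Raum.", "A man sleeps in a green room."),
    ]
    val = [("Eine Frau liest ein Buch.", "A woman reads a book.")]
    test = [("Ein Kind isst einen Apfel.", "A child eats an apple.")]

    # Swapping in another language pair then only means building these
    # lists from your own parallel files, one sentence per line:
    def load_pairs(src_path, tgt_path):
        with open(src_path, encoding="utf-8") as f_src, \
             open(tgt_path, encoding="utf-8") as f_tgt:
            return [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]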

The slightly different way the dataset is downloaded here works right now.

Thank you for this link, Yaniel. This is a very nice, compact, and up-to-date implementation of a transformer using PyTorch!

-Alex

I think I found the train, val, and test file URLs; you can change the multi30k URL entries to replace the original invalid ones:

from torchtext.datasets import multi30k, Multi30k

multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

If you do not want to update the URLs in multi30k, you can just download the files from the URLs above and put the tar.gz files into the torch cache directory. On my machine, the directory is /root/.cache/torch/text/datasets/Multi30k. Copy the tar.gz files into that directory and run the code; torchtext will uncompress them and produce the train.de and train.en files.
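
For illustration, here is a minimal sketch of that manual-download route (the cache path is the default from the post above and may differ on your machine; the mirror URLs are the ones listed earlier):

    import os
    import urllib.request

    # Default torchtext cache location for Multi30k; adjust to match
    # your machine (e.g. /root/.cache/... when running as root).
    cache_dir = os.path.expanduser("~/.cache/torch/text/datasets/Multi30k")
    os.makedirs(cache_dir, exist_ok=True)

    mirror = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k"
    for name in ("training.tar.gz", "validation.tar.gz", "mmt16_task1_test.tar.gz"):
        target = os.path.join(cache_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(f"{mirror}/{name}", target)

    # torchtext should now find the archives in the cache and extract
    # train.de / train.en etc. instead of trying to download them.

If your torchtext version still complains about a hash mismatch, combine this with the MD5 overrides shown in the next post.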

I have changed the following lines in your code:

    print("Building German Vocabulary ...")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))

to

    print("Building German Vocabulary ...")
    from torchtext.datasets import multi30k
    multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
    multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
    multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

    multi30k.MD5["train"] = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
    multi30k.MD5["valid"] = "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c"
    multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

    train, val, test = datasets.Multi30k(language_pair=("de", "en"))

and it worked.
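
For completeness, here is a minimal end-to-end sketch combining the URL override with the vocabulary-building code from the original notebook. It assumes the spaCy tokenizers used by the PyTorch translation tutorial (de_core_news_sm and en_core_web_sm) are installed, and it reuses the tutorial's names (token_transform, yield_tokens, special_symbols); add the multi30k.MD5 overrides from above if your torchtext version validates hashes:

    from torchtext.data.utils import get_tokenizer
    from torchtext.vocab import build_vocab_from_iterator
    from torchtext.datasets import multi30k, Multi30k

    # Re-route the broken quest.dcs.shef.ac.uk downloads to the mirror.
    multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
    multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

    SRC_LANGUAGE, TGT_LANGUAGE = "de", "en"
    special_symbols = ["<unk>", "<pad>", "<bos>", "<eos>"]

    token_transform = {
        SRC_LANGUAGE: get_tokenizer("spacy", language="de_core_news_sm"),
        TGT_LANGUAGE: get_tokenizer("spacy", language="en_core_web_sm"),
    }

    def yield_tokens(data_iter, language):
        # Each sample is a (de, en) sentence pair.
        language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
        for sample in data_iter:
            yield token_transform[language](sample[language_index[language]])

    vocab_transform = {}
    for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
        train_iter = Multi30k(split="train", language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
        vocab_transform[ln] = build_vocab_from_iterator(
            yield_tokens(train_iter, ln),
            min_freq=1,
            specials=special_symbols,
            special_first=True,
        )
        vocab_transform[ln].set_default_index(vocab_transform[ln]["<unk>"])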
