build_vocab_from_iterator does not work in notebook

Hi. I was trying to run this notebook, but the following line times out:

vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                min_freq=1,
                                                specials=special_symbols,
                                                special_first=True)

Specifically, it raises a TimeoutError: [Errno 110] Connection timed out, and the last line of the trace is:
Exception: Could not get the file at http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz. [RequestException] None.

What can be done to circumvent this issue? Thanks in advance for any help you can provide.

Met the same problem today.
I tried to access the file link directly, but it fails too, saying "The requested URL could not be retrieved".
I guess the database server is down or their network is broken.

Hope it can recover soon… Or are there any alternatives?

Already tried archive.org, but sadly the files are unavailable there too.

I am having exactly the same problem. I am new to NLP, so I am not sure what to do. Perhaps somebody knows another source of English/German (or other language) sentence pairs that we can use instead. I will check back later.

The maintainer of this GitHub file, PyTorch-NLP/multi30k.py at master · PetrochukM/PyTorch-NLP, claims the following:

Status:
    Host ``www.quest.dcs.shef.ac.uk`` forgot to update their SSL
    certificate; therefore, this dataset does not download securely.
References:
    * http://www.statmt.org/wmt16/multimodal-task.html
    * http://shannon.cs.illinois.edu/DenotationGraph/

He seems to have constructed a workaround, but I have not managed to get it to work.

The page is not delivering any content, though.

Same problem… I tried to email the site owner but got no response…

I have those files but don’t know where to put them to make them available for everyone.

Is it possible to put them on Dropbox or Google Drive and share them using a public link?

First author of the Multi30K dataset here :wave:.

I didn’t know these were being used in a PyTorch tutorial, so we are working on hosting these files elsewhere. Alternatively, if someone understands how the files are being used by torchtext.datasets.Multi30K, would one solution be to re-route the data loading to the Multi30K GitHub repository?

I'm a beginner, but I found the source code of torchtext.datasets.multi30k here. One may change the URLs and MD5s to make it work ~

That is precisely what I was thinking, but I do not own that repo, so I couldn't do it.

Please note that I have been working on the following code:

http://nlp.seas.harvard.edu/annotated-transformer/

This code uses the same Multi30K dataset. I was able to get the code to work by using another data file. The basic idea is that the training, validation, and test sets are all lists of tuples, where each tuple is a pair of corresponding sentences, one per language. This insight is nice since it makes it easy to create any language pairing you would like (a small sketch of that structure follows at the end of this post). Here is my implementation in Colab along with lots of notes:

Hope this helps. Any comments are welcome.
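
To make the tuple structure concrete, here is a minimal sketch (the sentences are made up, and load_pairs is a hypothetical helper, not part of the Annotated Transformer code):

    # Each split is just a list of (source, target) sentence pairs.
    train = [
        ("Zwei Hunde spielen im Schnee.", "Two dogs play in the snow."),
        ("Ein Mann schläft in einem grünen Raum.", "A man sleeps in a green room."),
    ]
    val = [("Eine Frau liest ein Buch.", "A woman reads a book.")]
    test = [("Ein Kind isst einen Apfel.", "A child eats an apple.")]

    # Swapping in another language pair then only means building these
    # lists from your own parallel files, one sentence per line:
    def load_pairs(src_path, tgt_path):
        with open(src_path, encoding="utf-8") as f_src, \
             open(tgt_path, encoding="utf-8") as f_tgt:
            return [(s.strip(), t.strip()) for s, t in zip(f_src, f_tgt)]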

The slightly different way the dataset is downloaded here works right now.

Thank you for this link, Yaniel. This is a very nice, compact, and up-to-date implementation of a transformer using PyTorch!

-Alex

I think I found the train, val, and test file URLs; you can change the multi30k URL entries to replace the original invalid ones:

from torchtext.datasets import multi30k, Multi30k

multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

If you do not want to update the URLs in multi30k, you can just download the files from the URLs above and put the tar.gz files into the torch cache directory. On my machine, the directory is /root/.cache/torch/text/datasets/Multi30k. Copy the tar.gz files into that directory and run the code; torchtext will uncompress them and produce the train.de and train.en files.
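
For illustration, here is a minimal sketch of that manual-download route (the cache path is the default from the post above and may differ on your machine; the mirror URLs are the ones listed earlier):

    import os
    import urllib.request

    # Default torchtext cache location for Multi30k; adjust to match
    # your machine (e.g. /root/.cache/... when running as root).
    cache_dir = os.path.expanduser("~/.cache/torch/text/datasets/Multi30k")
    os.makedirs(cache_dir, exist_ok=True)

    mirror = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k"
    for name in ("training.tar.gz", "validation.tar.gz", "mmt16_task1_test.tar.gz"):
        target = os.path.join(cache_dir, name)
        if not os.path.exists(target):
            urllib.request.urlretrieve(f"{mirror}/{name}", target)

    # torchtext should now find the archives in the cache and extract
    # train.de / train.en etc. instead of trying to download them.

If your torchtext version still complains about a hash mismatch, combine this with the MD5 overrides shown in the next post.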

I have changed the following lines in your code:

    print("Building German Vocabulary ...")
    train, val, test = datasets.Multi30k(language_pair=("de", "en"))

to

    print("Building German Vocabulary ...")
    from torchtext.datasets import multi30k
    multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
    multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
    multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

    multi30k.MD5["train"] = "20140d013d05dd9a72dfde46478663ba05737ce983f478f960c1123c6671be5e"
    multi30k.MD5["valid"] = "a7aa20e9ebd5ba5adce7909498b94410996040857154dab029851af3a866da8c"
    multi30k.MD5["test"] = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"

    train, val, test = datasets.Multi30k(language_pair=("de", "en"))

and it worked.
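
For completeness, here is a minimal end-to-end sketch combining the URL override with the vocabulary-building code from the original notebook. It assumes the spaCy tokenizers used by the PyTorch translation tutorial (de_core_news_sm and en_core_web_sm) are installed, and it reuses the tutorial's names (token_transform, yield_tokens, special_symbols); add the multi30k.MD5 overrides from above if your torchtext version validates hashes:

    from torchtext.data.utils import get_tokenizer
    from torchtext.vocab import build_vocab_from_iterator
    from torchtext.datasets import multi30k, Multi30k

    # Re-route the broken quest.dcs.shef.ac.uk downloads to the mirror.
    multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
    multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"

    SRC_LANGUAGE, TGT_LANGUAGE = "de", "en"
    special_symbols = ["<unk>", "<pad>", "<bos>", "<eos>"]

    token_transform = {
        SRC_LANGUAGE: get_tokenizer("spacy", language="de_core_news_sm"),
        TGT_LANGUAGE: get_tokenizer("spacy", language="en_core_web_sm"),
    }

    def yield_tokens(data_iter, language):
        # Each sample is a (de, en) sentence pair.
        language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
        for sample in data_iter:
            yield token_transform[language](sample[language_index[language]])

    vocab_transform = {}
    for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
        train_iter = Multi30k(split="train", language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
        vocab_transform[ln] = build_vocab_from_iterator(
            yield_tokens(train_iter, ln),
            min_freq=1,
            specials=special_symbols,
            special_first=True,
        )
        vocab_transform[ln].set_default_index(vocab_transform[ln]["<unk>"])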
