Recently I've wanted to pre-train an attention model using my own idea.
Since WMT14 is no longer available in torchtext.datasets, I'm trying to build a customized torchtext dataset from the WMT14 English-German txt files published by the Stanford Natural Language Processing Group.
So the vocabulary will be based on WMT14.
I have already preprocessed the training data into .json with one sentence pair per line, in the format:
{"German": "german sentence", "English": "English sentence"}
I followed the tutorial:
https://pytorch.org/tutorials/beginner/translation_transformer.html
and was thinking I could build something like the Multi30k dataset and plug it in.
I've been struggling with the customized dataset recently and hope I can get some advice from the community.
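What I have so far is roughly the following: a minimal map-style dataset over the .json file, which a DataLoader (with the tutorial's collate_fn) should be able to consume just like Multi30k, since it yields (src, tgt) string pairs. This is only a sketch assuming one JSON object per line; the file path and sample sentences below are made up for the demo.

```python
import json
import os
import tempfile

class WMT14Pairs:
    """Map-style dataset over a JSON-lines file of translation pairs.

    Assumes each line is one object: {"German": "...", "English": "..."}.
    Implements __len__/__getitem__, so torch's DataLoader can use it
    directly without subclassing torch.utils.data.Dataset.
    """

    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            # Skip blank lines; parse one JSON object per line.
            self.pairs = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        ex = self.pairs[idx]
        # Return (source, target) string pair, like Multi30k yields.
        return ex["German"], ex["English"]

# Demo with a tiny made-up sample file (stands in for the real WMT14 dump).
sample = [
    {"German": "Guten Morgen.", "English": "Good morning."},
    {"German": "Wie geht es dir?", "English": "How are you?"},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    for ex in sample:
        f.write(json.dumps(ex) + "\n")
    tmp_path = f.name

ds = WMT14Pairs(tmp_path)
print(len(ds))   # 2
print(ds[0])     # ('Guten Morgen.', 'Good morning.')
os.unlink(tmp_path)
```

My plan was to build the vocab with build_vocab_from_iterator over this dataset and then reuse the tutorial's tokenization/collation pipeline, but I'm not sure this is the idiomatic way to replace Multi30k.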