Building a custom torchtext dataset for WMT14

I want to pre-train an attention model based on my own idea.
Since WMT14 is no longer available in torchtext.datasets, I am trying to build a custom torchtext dataset from the WMT14 English-German txt files provided by The Stanford Natural Language Processing Group.
The vocabulary will therefore be built from WMT14.

I have already preprocessed the training data into a .json file with the format:
{"German": "german sentence", "English": "English sentence"}
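
For reference, here is a minimal sketch of how a file like this could be wrapped as a map-style dataset that returns (German, English) tuples, similar to what Multi30k yields. It assumes the file holds one JSON object per line (JSON Lines), which the post does not confirm. The class name JsonPairDataset and its parameters are made up for illustration. It uses only the standard library; an object with __len__ and __getitem__ should be usable with torch.utils.data.DataLoader without subclassing anything.

```python
import json


class JsonPairDataset:
    """Map-style dataset of (source, target) sentence pairs.

    Hypothetical helper: assumes one JSON object per line, each with
    the keys "German" and "English".
    """

    def __init__(self, path, src_key="German", tgt_key="English"):
        self.pairs = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # tolerate blank lines
                record = json.loads(line)
                self.pairs.append((record[src_key], record[tgt_key]))

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        # Same shape as a Multi30k sample: (source sentence, target sentence)
        return self.pairs[idx]
```

This loads every pair into memory, which may be heavy for the full WMT14 training set; reading the file lazily would be the alternative if memory becomes a problem.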

I followed the tutorial:
https://pytorch.org/tutorials/beginner/translation_transformer.html
and am wondering whether I can build something like the Multi30k dataset and use it in the same way.
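
The tutorial builds one vocabulary per language with torchtext's build_vocab_from_iterator and the special tokens <unk>, <pad>, <bos>, <eos>. As a rough illustration of that step on custom pairs, here is a pure-Python stand-in. It is not the torchtext implementation: the function names build_vocab and encode are hypothetical, and lowercased whitespace splitting replaces the spaCy tokenizers used in the tutorial.

```python
from collections import Counter

# Same special tokens, in the same order, as the tutorial (ids 0 to 3)
SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]


def build_vocab(sentences, min_freq=1, specials=SPECIALS):
    """Map each token to an integer id, with the special tokens first."""
    counter = Counter()
    for sentence in sentences:
        # Assumption: naive whitespace tokenization instead of spaCy
        counter.update(sentence.lower().split())
    # Sort by descending frequency, then alphabetically, so ids are stable
    tokens = sorted(
        (tok for tok, count in counter.items() if count >= min_freq),
        key=lambda tok: (-counter[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(list(specials) + tokens)}


def encode(sentence, vocab):
    """Turn a sentence into ids, wrapped in <bos> and <eos>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in sentence.lower().split()]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]


# Tiny in-memory example; real pairs would come from the WMT14 file
pairs = [("ein Hund läuft", "a dog runs"), ("ein Mann", "a man")]
vocab_de = build_vocab(de for de, _ in pairs)
vocab_en = build_vocab(en for _, en in pairs)
```

Words that are not in the vocabulary fall back to the <unk> id, which mirrors setting a default index on the vocabulary in the tutorial.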

I have been struggling with the custom dataset and would appreciate any advice from the community.