While working on a machine translation task (using the Transformer model), I ran into a problem. Since I want the model to predict the <eos> token but not receive it as an input, I simply slice the <eos> token off the end of the sequence. Thus:
trg = [sos, x_1, x_2, x_3, eos]
trg[:-1] = [sos, x_1, x_2, x_3]
But when I use torchtext to generate my datasets, I find that the last elements of a sentence are <pad> tokens, such as:
trg = [sos, x_1, x_2, x_3, eos, pad, pad, pad]
trg[:-1] = [sos, x_1, x_2, x_3, eos, pad, pad]
so slicing the last element does not remove the <eos> token. How can I solve this?
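Concretely, here is a minimal sketch of what I mean (the token ids are just illustrative):

```python
SOS, EOS, PAD = 2, 3, 1  # illustrative token ids

# Unpadded target: slicing off the last element removes <eos>, as intended.
trg = [SOS, 10, 11, 12, EOS]
assert trg[:-1] == [SOS, 10, 11, 12]

# Padded target from torchtext: slicing only removes a <pad>; <eos> remains
# inside the decoder input.
trg_padded = [SOS, 10, 11, 12, EOS, PAD, PAD, PAD]
assert trg_padded[:-1] == [SOS, 10, 11, 12, EOS, PAD, PAD]
assert EOS in trg_padded[:-1]
```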
Can you open an issue on the torchtext GitHub with a code snippet? The translation dataset has not been rewritten and there may be some bugs there. But we can try to have some people take a look at it.
OK, I will do it. And could you tell me how to solve this problem?
Please post the issue along with some code snippets so we can help you figure out a solution.
This is my code:
import spacy
from torchtext.data import Field, BucketIterator
from torchtext.datasets import IWSLT

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    # Tokenizes German text from a string into a list of strings (tokens) and reverses it
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    # Tokenizes English text from a string into a list of strings (tokens)
    return [tok.text for tok in spacy_en.tokenizer(text)]

SRC = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>')
TRG = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>')

train_data, valid_data = IWSLT.splits(exts=('.de', '.en'), fields=(SRC, TRG),
                                      test=None,
                                      filter_pred=lambda x: len(vars(x)['src']) <= max_seq_len and
                                      len(vars(x)['trg']) <= max_seq_len)

train_iter, valid_iter = BucketIterator.splits(
    (train_data, valid_data), batch_size=batch_size, device=device)
When I print the data in train_iter, the output is as follows:
As I said at the beginning, I think the last element of src and trg should be the <eos> token, which is 3, but in fact the last element of src and trg is 1, that is, the <pad> token.
May I ask, have you resolved this issue?
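For later readers: one common way around this (my own sketch, not something from the torchtext docs) is to keep the trg[:-1] slice as-is and instead ignore padded positions when computing the loss, e.g. with CrossEntropyLoss(ignore_index=pad_idx) in PyTorch. The leftover <eos> in the decoder input is then harmless: every position whose shifted target trg[1:] is <pad> contributes nothing to the loss. A pure-Python illustration of the masking idea (the token ids and probabilities are made up):

```python
import math

PAD = 1  # illustrative pad index

def masked_nll(log_probs, targets, pad_idx=PAD):
    """Average negative log-likelihood, skipping positions whose target is <pad>.

    log_probs: one dict per position, mapping token id -> log-probability
    targets:   gold token ids for those positions (the shifted target trg[1:])
    """
    losses = [-lp[t] for lp, t in zip(log_probs, targets) if t != pad_idx]
    return sum(losses) / len(losses)

# Shifted target for [sos, 10, 11, eos, pad, pad] is [10, 11, eos, pad, pad].
targets = [10, 11, 3, PAD, PAD]
# Fake per-position log-probabilities; only the three non-pad positions count.
log_probs = [{10: math.log(0.5)}, {11: math.log(0.5)}, {3: math.log(0.5)},
             {PAD: math.log(0.9)}, {PAD: math.log(0.9)}]

loss = masked_nll(log_probs, targets)
assert abs(loss - math.log(2)) < 1e-9  # pad positions contributed nothing
```

This mirrors what ignore_index does inside PyTorch's loss, so the extra <eos>/<pad> tokens left after slicing never affect training.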