Can you open an issue on the torchtext GitHub with a code snippet? The translation dataset has not been rewritten and there may be some bugs there, but we can try to have some people take a look at it.
```python
import spacy
import torch
from torchtext.data import Field, BucketIterator
from torchtext.datasets import IWSLT

spacy_de = spacy.load('de')
spacy_en = spacy.load('en')

def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens) and reverses it.
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens).
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]

# These names are defined elsewhere in the original script; the values below
# are typical assumptions, not taken from the original post.
init_token = '<sos>'
eos_token = '<eos>'
max_seq_len = 100
min_freq = 2
batch_size = 128
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

SRC = Field(tokenize=tokenize_de,
            init_token=init_token,
            eos_token=eos_token,
            lower=True,
            batch_first=True)
TRG = Field(tokenize=tokenize_en,
            init_token=init_token,
            eos_token=eos_token,
            lower=True,
            batch_first=True)

train_data, valid_data = IWSLT.splits(exts=('.de', '.en'),
                                      fields=(SRC, TRG),
                                      test=None,
                                      filter_pred=lambda x: len(vars(x)['src']) <= max_seq_len and
                                                            len(vars(x)['trg']) <= max_seq_len)

SRC.build_vocab(train_data, min_freq=min_freq)
TRG.build_vocab(train_data, min_freq=min_freq)

train_iter, valid_iter = BucketIterator.splits(
    (train_data, valid_data),
    batch_size=batch_size,
    device=device)
```
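For reference, the indices 1 and 3 come from how the legacy torchtext `Field.build_vocab` orders its special tokens: `<unk>`, then `pad_token`, then `init_token`, then `eos_token`. A minimal sketch of that ordering (plain Python, not torchtext's actual code, and assuming `'<sos>'`/`'<eos>'` were used for `init_token`/`eos_token`):

```python
# Sketch of the legacy torchtext special-token ordering in the vocab.
unk_token, pad_token = '<unk>', '<pad>'
init_token, eos_token = '<sos>', '<eos>'  # assumed values

specials = [unk_token, pad_token, init_token, eos_token]
stoi = {tok: i for i, tok in enumerate(specials)}

# stoi['<pad>'] is 1 and stoi['<eos>'] is 3, which matches the indices
# mentioned in this thread.
```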
As I said at the beginning, I think the last element of src and trg should be the <eos> token, which is index 3, but in fact the last element of src and trg is 1, the <pad> token.
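This is expected for any sequence shorter than the longest one in its batch: the iterator pads every example to the batch's max length, so `<eos>` still appears, but `<pad>` tokens come after it. A small sketch of that padding behavior using `torch.nn.utils.rnn.pad_sequence` (the token ids 5 and 7 are made up for illustration; 1 and 3 are the `<pad>`/`<eos>` indices from this thread):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX, EOS_IDX = 1, 3

# Two "sentences" of different lengths, each already ending in <eos>.
seqs = [torch.tensor([5, 7, EOS_IDX]),
        torch.tensor([5, EOS_IDX])]

# Pad to the max length in the batch, as BucketIterator does.
batch = pad_sequence(seqs, batch_first=True, padding_value=PAD_IDX)

# batch[0] is [5, 7, 3]: ends in <eos>, no padding needed.
# batch[1] is [5, 3, 1]: <eos> is present, but <pad> comes after it,
# so the *last* element of the row is 1, not 3.
```

So the last column of a padded batch is only `<eos>` for the longest example(s); everything else ends in `<pad>`.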