Torchtext text classification with a custom tokenizer

When I use the IMDB dataset I use TEXT and LABEL fields to specify how to tokenize and preprocess the data. Then I build the vocab, where I can choose min_freq and max_size.
How can I do all of the above when using YelpReviewPolarity, for example?
Looking at the code and tutorial it seems I cannot change the tokenizer.
Also, how can I use BucketIterator with the YelpReviewPolarity dataset?

You’re in luck! Torchtext does allow you to use a custom tokenizer:

import spacy
from torchtext.data import Field

nlp = spacy.load('en')  # loaded only for its tokenizer

def custom_tokenizer(text):
    return [token.text for token in nlp.tokenizer(text)]

TEXT = Field(sequential=True, tokenize=custom_tokenizer)
LABEL = Field(sequential=False)
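If you'd rather avoid the spaCy dependency, any callable that maps a string to a list of tokens can be passed as `tokenize`. A minimal regex-based sketch (`simple_tokenizer` is just an illustrative name, not a torchtext API):

```python
import re

def simple_tokenizer(text):
    # lowercase and split on non-word characters; drop empty pieces
    return [tok for tok in re.split(r'\W+', text.lower()) if tok]

print(simple_tokenizer("This movie was great!"))
# → ['this', 'movie', 'was', 'great']
```

You would then pass it the same way: `Field(sequential=True, tokenize=simple_tokenizer)`.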

It looks like the YelpReviewPolarity dataset is formatted in CSV. The easiest way to parse it would be with the TabularDataset class:

from torchtext.data import TabularDataset

train_td, test_td = TabularDataset.splits(
    path='path-to-yelp-review-polarity-data/', train='train.csv', test='test.csv',
    format='csv', skip_header=False, fields=[('text', TEXT), ('label', LABEL)])

# The vocab only exists after build_vocab; this is also where
# min_freq and max_size come in (pick values to taste)
TEXT.build_vocab(train_td, min_freq=2, max_size=25000)
LABEL.build_vocab(train_td)

text_vocab = TEXT.vocab
label_vocab = LABEL.vocab
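For intuition, min_freq and max_size behave conceptually like this pure-Python sketch (not torchtext's actual implementation; `build_vocab` on the field is the real API):

```python
from collections import Counter

def sketch_build_vocab(token_lists, min_freq=1, max_size=None):
    # count every token across all examples
    counts = Counter(t for toks in token_lists for t in toks)
    # drop tokens rarer than min_freq, keep at most max_size of the rest
    kept = [t for t, c in counts.most_common() if c >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    # reserve the first ids for specials like <unk>/<pad>, as torchtext does
    return {tok: i + 2 for i, tok in enumerate(kept)}

vocab = sketch_build_vocab([["a", "b", "a"], ["b", "c"]], min_freq=2)
# 'c' appears once, so it falls below min_freq and is excluded
```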

I haven’t used the BucketIterator but it’s not too different from the standard Iterator. I haven’t tested this but try:

from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    (train_td, test_td),
    sort_key=lambda x: len(x.text),
    batch_sizes=(batch_size, test_batch_size))
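For what it's worth, the idea behind BucketIterator is to group examples of similar length so each batch needs minimal padding. A pure-Python sketch of that bucketing idea (`bucket_batches` is a hypothetical helper, not a torchtext API):

```python
def bucket_batches(examples, batch_size):
    # sort by length so each batch holds similarly-sized examples,
    # which is what keeps padding to a minimum
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

texts = [["a"], ["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"]]
for batch in bucket_batches(texts, 2):
    # pad each batch only to its own max length, not the global max
    width = max(len(x) for x in batch)
    padded = [x + ["<pad>"] * (width - len(x)) for x in batch]
    print(padded)
```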

You could use my authorship attribution training notebook as a reference to see how it all comes together:

All I’m doing here is feeding words into an RNN to make a categorical prediction. This isn’t the best way to perform authorship attribution, but it does use the torchtext tools you are interested in. I hope this helps!

Oh ok, thanks. I can use this standard way of opening custom CSV files.
I was wondering why the API is different between the IMDB and the TextClassification classes.
IMDB works with fields and TextClassification doesn't.