Torchtext text classification with a custom tokenizer

When I use the IMDB dataset I use TEXT and LABEL fields to specify how to tokenize and preprocess the data. Then I build the vocab, where I can choose min_freq and max_size.
How can I do all of the above when using YelpReviewPolarity, for example?
Looking at the code and tutorial it seems I cannot change the tokenizer.
Also, how can I use BucketIterator with the YelpReviewPolarity dataset?

You’re in luck! Torchtext does allow you to use a custom tokenizer:

import spacy
from torchtext.data import Field

nlp = spacy.load('en')  # loaded only for its tokenizer

def custom_tokenizer(text):
    return [token.text for token in nlp.tokenizer(text)]

TEXT = Field(sequential=True, tokenize=custom_tokenizer)
LABEL = Field(sequential=False)
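If you'd rather avoid the spaCy dependency, any callable that maps a string to a list of tokens can be passed as `tokenize`. A minimal regex-based sketch (`simple_tokenizer` is just an illustrative name, not a torchtext API):

```python
import re

def simple_tokenizer(text):
    # lowercase and split on non-word characters; drop empty pieces
    return [tok for tok in re.split(r'\W+', text.lower()) if tok]

print(simple_tokenizer("This movie was great!"))
# → ['this', 'movie', 'was', 'great']
```

You would then pass it the same way: `Field(sequential=True, tokenize=simple_tokenizer)`.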

It looks like the YelpReviewPolarity dataset is formatted in CSV. The easiest way to parse it would be with the TabularDataset class:

from torchtext.data import TabularDataset

train_td, test_td = TabularDataset.splits(
    path='path-to-yelp-review-polarity-data/', train='train.csv', test='test.csv',
    format='csv', skip_header=False, fields=[('text', TEXT), ('label', LABEL)])

# The vocab only exists after build_vocab; this is also where
# min_freq and max_size come in (pick values to taste)
TEXT.build_vocab(train_td, min_freq=2, max_size=25000)
LABEL.build_vocab(train_td)

text_vocab = TEXT.vocab
label_vocab = LABEL.vocab
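For intuition, min_freq and max_size behave conceptually like this pure-Python sketch (not torchtext's actual implementation; `build_vocab` on the field is the real API):

```python
from collections import Counter

def sketch_build_vocab(token_lists, min_freq=1, max_size=None):
    # count every token across all examples
    counts = Counter(t for toks in token_lists for t in toks)
    # drop tokens rarer than min_freq, keep at most max_size of the rest
    kept = [t for t, c in counts.most_common() if c >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]
    # reserve the first ids for specials like <unk>/<pad>, as torchtext does
    return {tok: i + 2 for i, tok in enumerate(kept)}

vocab = sketch_build_vocab([["a", "b", "a"], ["b", "c"]], min_freq=2)
# 'c' appears once, so it falls below min_freq and is excluded
```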

I haven’t used the BucketIterator but it’s not too different from the standard Iterator. I haven’t tested this but try:

from torchtext.data import BucketIterator

train_iter, test_iter = BucketIterator.splits(
    (train_td, test_td),
    sort_key=lambda x: len(x.text),
    batch_sizes=(batch_size, test_batch_size))
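For what it's worth, the idea behind BucketIterator is to group examples of similar length so each batch needs minimal padding. A pure-Python sketch of that bucketing idea (`bucket_batches` is a hypothetical helper, not a torchtext API):

```python
def bucket_batches(examples, batch_size):
    # sort by length so each batch holds similarly-sized examples,
    # which is what keeps padding to a minimum
    ordered = sorted(examples, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

texts = [["a"], ["a", "b", "c"], ["a", "b"], ["a", "b", "c", "d"]]
for batch in bucket_batches(texts, 2):
    # pad each batch only to its own max length, not the global max
    width = max(len(x) for x in batch)
    padded = [x + ["<pad>"] * (width - len(x)) for x in batch]
    print(padded)
```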

You could use my authorship attribution training notebook as a reference to see how it all comes together:

All I’m doing here is feeding words into an RNN to make a categorical prediction. This isn’t the best way to perform authorship attribution, but it does use the torchtext tools you are interested in. I hope this helps!

Oh ok, thanks. I can use this standard way of opening custom CSV files.
I was wondering why the API is different between the IMDB and the TextClassification classes.
IMDB works with fields and TextClassification doesn't.