When I use the IMDB dataset, I use TEXT and LABEL fields to specify how to tokenize and preprocess the data. Then I build the vocab, where I can choose min_freq and max_size.
How can I do all of the above when using YelpReviewPolarity, for example?
Looking at the code and the tutorial, it seems I cannot change the tokenizer.
Also, how can I use BucketIterator with the YelpReviewPolarity dataset?
You’re in luck! Torchtext does allow you to use a custom tokenizer:
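import spacy
from torchtext.data import Field

# Assumption: `nlp` was left undefined above; a spaCy English pipeline is used here
nlp = spacy.load('en_core_web_sm')
def custom_tokenizer(text):
    return [token.text for token in nlp.tokenizer(text)]

TEXT = Field(sequential=True, tokenize=custom_tokenizer)
LABEL = Field(sequential=False)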
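If you would rather not depend on spaCy, any callable that takes a string and returns a list of tokens can be passed as `tokenize`. Here is a minimal sketch using only the standard library; `simple_tokenizer` and its regex are hypothetical names of my own, not part of torchtext:

```python
import re

# Hypothetical stand-in for the spaCy tokenizer: lowercases the text and
# splits it into runs of word characters and single punctuation marks.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenizer(text):
    """Return a list of lowercase word and punctuation tokens."""
    return _TOKEN_RE.findall(text.lower())

tokens = simple_tokenizer("Great food, terrible service!")
print(tokens)
```

It would be plugged in the same way as the spaCy version, i.e. `Field(sequential=True, tokenize=simple_tokenizer)`.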
It looks like the YelpReviewPolarity dataset is formatted in CSV. The easiest way to parse it would be with the TabularDataset class:
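from torchtext.data import TabularDataset
# Fields are matched to CSV columns by position. The Yelp polarity CSVs put the
# class label first and the review text second, so check your files and order accordingly.
train_td, test_td = TabularDataset.splits(
    path='path-to-yelp-review-polarity-data/', train='train.csv', test='test.csv',
    format='csv', skip_header=False, fields=[('label', LABEL), ('text', TEXT)])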
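# build_vocab accepts min_freq and max_size, just as with IMDB (the values here are only examples)
TEXT.build_vocab(train_td, min_freq=2, max_size=25000)
text_vocab = TEXT.vocab
LABEL.build_vocab(train_td)
label_vocab = LABEL.vocab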
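To make the effect of min_freq and max_size concrete, here is a rough pure-Python approximation of that kind of vocabulary building. It is only a sketch under my own assumptions, not torchtext's actual implementation; `build_vocab_sketch` and the toy `docs` are made up for illustration:

```python
from collections import Counter

def build_vocab_sketch(tokenized_texts, min_freq=1, max_size=None,
                       specials=("<unk>", "<pad>")):
    """Approximate a frequency-filtered vocabulary (illustrative only)."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Drop tokens seen fewer than min_freq times.
    kept = [tok for tok, freq in ordered if freq >= min_freq]
    # Keep at most max_size tokens (special tokens are not counted here).
    if max_size is not None:
        kept = kept[:max_size]
    itos = list(specials) + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

# Made-up toy corpus, already tokenized.
docs = [["good", "food"], ["good", "service"], ["bad", "food", "good"]]
itos, stoi = build_vocab_sketch(docs, min_freq=2, max_size=1)
print(itos)
```

With min_freq=2, "service" and "bad" are dropped, and max_size=1 then keeps only the most frequent remaining token.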
I haven’t used the BucketIterator, but it’s not too different from the standard Iterator. I haven’t tested this, but try:
from torchtext.data import BucketIterator
# batch_size, test_batch_size and device must be defined beforehand
train_iter, test_iter = BucketIterator.splits(
    (train_td, test_td),
    sort_key=lambda x: len(x.text),
    shuffle=True,
    sort_within_batch=False,
    batch_sizes=(batch_size, test_batch_size),
    device=device)
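For intuition about what bucketing buys you, here is a simplified sketch of the idea: group examples of similar length into the same batch so that less padding is needed. This is not how BucketIterator is actually implemented; `bucket_batches` and the sample `reviews` are hypothetical:

```python
import random

def bucket_batches(examples, batch_size, sort_key=len, shuffle=True, seed=0):
    """Group examples of similar length into batches (illustrative only)."""
    # Sort so that neighbours have similar lengths.
    ordered = sorted(examples, key=sort_key)
    # Slice the sorted list into consecutive batches.
    batches = [ordered[i:i + batch_size]
               for i in range(0, len(ordered), batch_size)]
    # Shuffle the order of the batches, not their contents.
    if shuffle:
        random.Random(seed).shuffle(batches)
    return batches

# Made-up tokenized reviews of varying length.
reviews = [["ok"], ["loved", "it"], ["would", "not", "go", "back"],
           ["meh"], ["great", "place", "overall"], ["too", "slow"]]
batches = bucket_batches(reviews, batch_size=2)
for batch in batches:
    print([len(example) for example in batch])
```

Each batch ends up holding reviews of nearly equal length, which is the property the sort_key argument above is meant to provide.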
You could use my authorship attribution training notebook as a reference to see how it all comes together: https://github.com/travis-harper/authorship-attribution/blob/master/gutenberg-10-author-gru-v5-training.ipynb
All I’m doing here is feeding words into an RNN to make a categorical prediction. This isn’t the best way to perform authorship attribution, but it does use the torchtext tools you are interested in. I hope this helps!
Oh OK, thanks. I can use this standard way of opening custom CSV files…
I was wondering why the API is different between the IMDB and the TextClassification classes.
IMDB works with fields and TextClassification doesn’t.