Custom torchtext.data problem

I have already read the tutorial "Text Classification with TorchText".
I want to do my own custom text classification task, but there are some differences between torch.utils.data and torchtext.data.
Take a look at the code from the tutorial:

import torch
import torchtext
from torchtext.datasets import text_classification
NGRAMS = 2
import os
if not os.path.isdir('./.data'):
    os.mkdir('./.data')
train_dataset, test_dataset = text_classification.DATASETS['AG_NEWS'](
    root='./.data', ngrams=NGRAMS, vocab=None)
BATCH_SIZE = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In this example, the datasets are loaded after being downloaded.
Suppose I have custom data in a file such as "train.csv".
This "train.csv" is a table with two columns.
One column is named "text" and contains the text.
The other column is named "label" and contains my target.

text label
this is an apple 0
NBA is started today 1
We are family 2

How do I write a standard torchtext.data class in this case?

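We have to start by declaring the fields:
import torch
from nltk.tokenize import word_tokenize  # assumption: word_tokenize is NLTK's tokenizer
from torchtext.data import Field, LabelField
TEXT = Field(tokenize=word_tokenize)  # splits each string into tokens; integer ids are assigned once the vocab is built
LABEL = LabelField(dtype=torch.float)  # for multi-class targets used with CrossEntropyLoss, torch.long is likely what you want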

Then construct the datasets; this step tells the fields which data to work on:

datafields = [("text", TEXT), ("labels", LABEL)]
trn, tst = torchtext.data.TabularDataset.splits(
    path='Data',
    train='train.csv',
    test='test.csv',
    format='csv',
    skip_header=True,
    fields=datafields)

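Once you have followed these steps, your datasets are ready. The next step is to build the vocabulary with TEXT.build_vocab(trn). A LabelField uses a vocabulary by default as well, so you will most likely also need LABEL.build_vocab(trn).

Then define your batch size and construct the iterators.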
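
To make the last two steps less abstract, here is a rough, library-free sketch of what building a vocabulary and producing padded batches amounts to. It does not use torchtext at all: the names build_vocab, numericalize and batches are hypothetical helpers written for this illustration, tokenization is a plain str.split, and the ordering of the special tokens is an assumption.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # special tokens; ids 0 and 1 are an assumption of this sketch


def build_vocab(token_lists, min_freq=1):
    """Map every token seen at least min_freq times to an integer id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = [UNK, PAD] + sorted(t for t, c in counts.items() if c >= min_freq)
    return {tok: i for i, tok in enumerate(itos)}


def numericalize(tokens, stoi):
    """Replace tokens by their ids, falling back to the <unk> id."""
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


def batches(examples, stoi, batch_size):
    """Yield (padded_ids, labels) for consecutive chunks of examples."""
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        ids = [numericalize(toks, stoi) for toks, _ in chunk]
        width = max(len(seq) for seq in ids)
        # Pad every sequence in the chunk to the length of the longest one.
        padded = [seq + [stoi[PAD]] * (width - len(seq)) for seq in ids]
        yield padded, [label for _, label in chunk]


# The three rows from the question, already split into tokens.
examples = [
    ("this is an apple".split(), 0),
    ("NBA is started today".split(), 1),
    ("We are family".split(), 2),
]
stoi = build_vocab(toks for toks, _ in examples)
all_batches = list(batches(examples, stoi, batch_size=2))
```

With a batch size of 2 this gives two batches, the second holding only the last row. The torchtext iterators additionally move the data onto a device and can group examples of similar length, which this sketch leaves out.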

Hope this will help!!

I think it's very similar to the text classification datasets in torchtext. You should be able to reuse most of the building blocks there. The only part you need to change is the _csv_iterator function. In the original datasets the label comes before the text, whereas in your case the text comes before the label.
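
As an illustration of the column swap, here is a minimal sketch of an iterator that reads rows with the text in the first column and the label in the second. It is not the torchtext function: csv_rows_text_first and ngrams_of are hypothetical names, the tokenizer is a plain lowercase split, and the labels are assumed to be 0-based integers as in the question.

```python
import csv


def ngrams_of(tokens, n):
    """Return the tokens followed by all 2..n-grams joined with spaces."""
    out = list(tokens)
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out


def csv_rows_text_first(lines, ngrams=2, skip_header=True):
    """Yield (label, tokens) from CSV lines whose columns are text, label.

    `lines` can be an open file or any iterable of strings.
    """
    reader = csv.reader(lines)
    if skip_header:
        next(reader, None)
    for row in reader:
        if not row:
            continue  # ignore blank lines
        text, label = row[0], int(row[1])
        yield label, ngrams_of(text.lower().split(), ngrams)


# In-memory stand-in for train.csv; with a real file, pass
# open("train.csv", encoding="utf8", newline="") instead.
sample_lines = [
    "text,label",
    "this is an apple,0",
    "NBA is started today,1",
    "We are family,2",
]
parsed = list(csv_rows_text_first(sample_lines, ngrams=2))
```

Each item pairs the label with the unigrams followed by the bigrams of the row, which is the kind of input the n-gram based tutorial model works with.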

We suggest using the new abstraction (the text classification datasets follow it) because it is more compatible with torch.utils.data. Feel free to open an issue on GitHub if you have any questions.
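
To show what "compatible with torch.utils.data" means in practice: a map-style dataset only has to support len() and integer indexing. The sketch below is deliberately torch-free so it runs anywhere. TextLabelDataset and collate are hypothetical names, the tiny vocabulary is made up for the example, and the flat ids plus offsets layout is one possible batching choice, not the only one.

```python
class TextLabelDataset:
    """Map-style dataset: supports len() and integer indexing."""

    def __init__(self, rows, stoi, unk_id=0):
        # rows: list of (text, label) pairs; stoi: token -> id mapping.
        self.rows = rows
        self.stoi = stoi
        self.unk_id = unk_id

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        text, label = self.rows[idx]
        ids = [self.stoi.get(tok, self.unk_id) for tok in text.lower().split()]
        return label, ids


def collate(batch):
    """Flatten a batch into (ids, offsets, labels).

    offsets[i] is the position in ids where the i-th text starts.
    """
    labels = [label for label, _ in batch]
    ids, offsets = [], []
    for _, seq in batch:
        offsets.append(len(ids))
        ids.extend(seq)
    return ids, offsets, labels


rows = [("this is an apple", 0), ("NBA is started today", 1), ("We are family", 2)]
stoi = {"<unk>": 0, "is": 1, "apple": 2, "nba": 3, "family": 4}  # toy vocabulary
dataset = TextLabelDataset(rows, stoi)
flat_ids, offsets, labels = collate([dataset[i] for i in range(len(dataset))])
```

With PyTorch installed, an object like this can be handed to torch.utils.data.DataLoader together with collate_fn=collate; in that case you would convert the returned lists to tensors inside collate.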