Efficient way to use torchtext TabularDataset on large dataset

I ran into a problem when loading a large dataset with torchtext TabularDataset:
it takes too much RAM.

I tried to load it as shown below, which is too slow to run:

from torchtext import data  # or torchtext.legacy.data in newer torchtext versions
tokenize = lambda x: x.split()
TEXT = data.Field(sequential=True, tokenize=tokenize)
LABEL = data.LabelField()
fields = [('customer_review', TEXT), ('polarity', LABEL)]
train, test = data.TabularDataset.splits(
    path='.', format='csv',
    train="/content/drive/My Drive/cleaned_train.csv",
    test="/content/drive/My Drive/cleaned_test.csv",
    fields=fields)

Is there a more efficient way to load it?


If your dataset is indeed too large to fit into memory, you need to split it and train on the different “sub-datasets” one after another. In this case, I would actually preprocess the text data completely, so that the files already contain the sequences of indices, to speed up training. Otherwise, you would have to redo the preprocessing potentially in every epoch.
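
Here is a minimal sketch of such offline preprocessing, assuming the same two-column CSV layout (customer_review, polarity) with no header row and the same whitespace tokenization; the file names and the min_freq cutoff are placeholders:

import csv
from collections import Counter
import torch

def build_vocab(csv_path, min_freq=2):
    # count tokens over the training file only
    counter = Counter()
    with open(csv_path, newline='', encoding='utf-8') as f:
        for review, _ in csv.reader(f):
            counter.update(review.split())
    itos = ['<pad>', '<unk>'] + [w for w, c in counter.items() if c >= min_freq]
    return {w: i for i, w in enumerate(itos)}

def encode_file(csv_path, vocab, out_path):
    # convert every review to a list of indices once, reuse it every epoch
    examples = []
    with open(csv_path, newline='', encoding='utf-8') as f:
        for review, label in csv.reader(f):
            ids = [vocab.get(tok, 1) for tok in review.split()]  # 1 = <unk>
            examples.append((torch.tensor(ids), label))
    torch.save(examples, out_path)  # load later with torch.load

vocab = build_vocab('cleaned_train.csv')
encode_file('cleaned_train.csv', vocab, 'cleaned_train_ids.pt')
encode_file('cleaned_test.csv', vocab, 'cleaned_test_ids.pt')

The saved files can then be loaded piece by piece (or wrapped in a regular Dataset/DataLoader), and no tokenization or vocabulary lookup happens during training.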

As a side comment, when using tokenize = lambda x: x.split() I hope your input text has whitespace before and after punctuation marks, which is not a given in user-generated data.
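
For example, a plain str.split() keeps the punctuation glued to the tokens, so "movie" and "movie," end up as different vocabulary entries:

"Great movie, loved it!".split()
# -> ['Great', 'movie,', 'loved', 'it!']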


Thanks, but doing all of this is quite messy.
However, when using spaCy for tokenization it runs smoothly without taking too much RAM.
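
For reference, a sketch of plugging a spaCy tokenizer into the same Field setup (the en_core_web_sm model name is an assumption; any installed spaCy model works):

import spacy
nlp = spacy.load('en_core_web_sm')  # assumed model; use whatever you have installed
spacy_tokenize = lambda x: [tok.text for tok in nlp.tokenizer(x)]  # tokenizer only, skips the full pipeline
TEXT = data.Field(sequential=True, tokenize=spacy_tokenize)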

To run it in PyTorch I had to split the training CSV into 5 pieces.