Loading data using torchtext

Does torchtext support loading files as large as 2 GB?
I am unable to load a 2.7 GB JSON file using torchtext.


Do you get a specific error message or does your code just hang?
Could you post a code snippet reproducing this error?

The code neither hangs nor gives any error message. I am trying this in a notebook, and the cell has just been executing for a long time (hours).
Code:

from torchtext import data

train_data, test_data = data.TabularDataset.splits(
    path=path, train="train.json", test="test.json",
    format="json", fields=fields)

Is the train.json available somewhere?
If not, could you post a few sample rows so that I can create a dummy file and try it on my machine?

I tried using a small dataset in the same format: with just two samples it loaded in about a minute. I understand my original file has a lot more samples, but it has still been loading for 3-4 hours.

The file looks like normal JSON, with one object per line:
{key1: value, key2: value}
{}
{}…
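
A quick way to sanity-check such a file before handing it to torchtext is to parse it line by line yourself; a minimal sketch, assuming the train.json filename from the snippet above:

import json

# TabularDataset's "json" format reads the file line by line,
# so every line must be a complete JSON object on its own.
n = 0
with open("train.json") as f:
    for line in f:
        json.loads(line)  # raises json.JSONDecodeError on a malformed line
        n += 1
print(n, "records parsed")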

I have the same problem here. I was trying to load a file that is 1.4 GB, and the process got killed after a few minutes:

>>> REF = data.Field(lower=True, tokenize=tokenize_char, init_token='<sos>',eos_token='<eos>')
>>> SRC = data.Field(lower=True, tokenize=tokenize_char)
>>> train = data.TabularDataset('./train.csv', format='csv', fields=[('src', SRC), ('ref', REF)])
Killed

The same code works fine for a smaller dataset.

Can anyone tell me why this is happening? Thanks!
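
For what it's worth, TabularDataset builds the whole dataset in memory as a list of Example objects, so a file that is 1.4 GB on disk can grow well past the available RAM once parsed and tokenized, at which point the OS kills the process. One possible workaround, sketched below under the assumption that the data can be processed in pieces (the shard size and output filenames are illustrative), is to split the CSV into smaller shards and load them one at a time:

import csv

SHARD_ROWS = 100_000  # illustrative shard size

# Split train.csv into train_shard0.csv, train_shard1.csv, ...
with open("train.csv", newline="") as f:
    reader = csv.reader(f)
    writer, out, shard = None, None, 0
    for i, row in enumerate(reader):
        if i % SHARD_ROWS == 0:
            if out:
                out.close()
            out = open(f"train_shard{shard}.csv", "w", newline="")
            writer = csv.writer(out)
            shard += 1
        writer.writerow(row)
    if out:
        out.close()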