Batch loading a large dataset to fit in memory

xdwang0726 · November 9, 2021, 2:52am

I have a very large NLP dataset in a CSV file (loading it all into memory takes 187 GB). How can I read and batch such a large dataset without loading it all into memory, i.e. read it batch by batch? (I would prefer not to split the large file into smaller pieces, by the way.) Thanks!

ptrblck · November 9, 2021, 4:36am

You could use pandas.read_csv with its chunksize argument to read chunks from the file:

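import pandas as pd

chunksize = 10 ** 6  # number of rows per chunk
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # placeholder for your own per-chunk processing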
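
If you need fixed-size mini-batches rather than million-row chunks, one option is to slice each chunk as it arrives. This is only a rough sketch: `iter_batches`, `csv_path`, and `batch_size` are names made up for illustration and are not part of pandas.

```python
import pandas as pd


def iter_batches(csv_path, batch_size, chunksize=10 ** 6):
    """Yield DataFrames of at most batch_size rows, one chunk in memory at a time.

    Hypothetical helper for illustration only.
    """
    # read_csv with chunksize returns an iterator, so only one chunk
    # (chunksize rows) is held in memory at any point.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # Slice the in-memory chunk into smaller mini-batches.
        for start in range(0, len(chunk), batch_size):
            yield chunk.iloc[start:start + batch_size]
```

Note that the last batch of each chunk will be smaller than `batch_size` unless `chunksize` is a multiple of it. For training in PyTorch, a generator like this could probably be wrapped in a `torch.utils.data.IterableDataset`, but I have not tested that on a file of this size.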