I have a very large NLP dataset in CSV format (loading it all into memory takes 187 GB). How can I read and batch such a large dataset without loading it all into memory, i.e. read it in batches? (I would prefer not to split the large file into smaller pieces, by the way.) Thanks!
You could use pandas.read_csv with its chunksize argument to read chunks from the file:
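import pandas as pd

chunksize = 10 ** 6  # rows per chunk; tune to fit available memory
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # each chunk is a DataFrame with at most chunksize rows
    process(chunk)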
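If what you need in the end is small training batches rather than million-row DataFrames, the chunked reader can be wrapped in a generator. The following is only a sketch under assumptions: `iter_batches`, `text_column`, and `batch_size` are hypothetical names introduced here, and it assumes the text you want lives in a single column of the CSV.

```python
import pandas as pd


def iter_batches(csv_path, text_column, batch_size=32, chunksize=10_000):
    """Yield lists of strings from one CSV column without loading the whole file.

    Hypothetical helper: the names and defaults are illustrative assumptions.
    """
    # usecols keeps only the needed column, which reduces memory per chunk.
    reader = pd.read_csv(csv_path, usecols=[text_column], chunksize=chunksize)
    for chunk in reader:
        texts = chunk[text_column].astype(str).tolist()
        # Slice the in-memory chunk into mini-batches.
        for start in range(0, len(texts), batch_size):
            yield texts[start:start + batch_size]
```

Only one chunk is held in memory at a time. Note that the last batch of each chunk can be shorter than `batch_size`; picking a `chunksize` that is a multiple of `batch_size` limits this to the very end of the file.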