I have a very large NLP dataset in CSV format (loading it all into memory takes about 187 GB). How can I read and process such a large dataset in batches, without reading it all into memory at once? (I'd prefer not to split the large file into smaller pieces.) Thanks!
You could use `pandas.read_csv` with its `chunksize` argument to read the file in chunks:

```python
import pandas as pd

chunksize = 10 ** 6  # rows per chunk; tune to fit your memory budget
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # each chunk is an ordinary DataFrame
```
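Here is a small self-contained sketch of the same pattern, using an in-memory `StringIO` buffer in place of your huge file and a row count in place of your real processing step (both are just stand-ins for illustration):

```python
import io
import pandas as pd

# Hypothetical tiny CSV standing in for the 187 GB file.
csv_data = "text,label\nhello world,0\nfoo bar,1\nbaz qux,0\nspam eggs,1\n"

chunksize = 2  # tiny for demonstration; use something like 10**6 for a real file
total_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=chunksize):
    # Each chunk is a regular DataFrame of at most `chunksize` rows, so
    # per-batch work (tokenization, feature extraction, ...) stays in memory bounds.
    total_rows += len(chunk)

print(total_rows)  # 4
```

Because only one chunk lives in memory at a time, peak usage is roughly proportional to `chunksize`, not to the file size.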