I’m looking for best practices for handling large amounts of data. Right now I have a single large data file with about 250 million samples, taking up around 120 GB in ASCII format. The data is stored in a horrible layout, so I will need to go through all of it and transform it anyway.
But I don’t have any experience working with such large datasets, so I don’t know what the best approach is. Should I split the data into smaller files (would that be easier for the dataloader to handle)? Should I convert the data into a binary format, or is that irrelevant?
What is the best way to store the data so that I can read in only parts of it?
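
To make the question concrete, here is a rough sketch of the kind of thing I imagine: stream the text file once, write it into a memory-mapped NumPy `.npy` file, and later load only a slice of rows. This is not my real code. It assumes (unlike my actual file) that every line is one sample with a fixed number of whitespace-separated numbers, that the sample count is known up front (e.g. from a counting pass), and that `float32` precision is enough. The names `ascii_to_npy` and `read_rows` are made up for illustration.

```python
import numpy as np


def ascii_to_npy(txt_path, npy_path, n_samples, n_features, chunk_rows=100_000):
    """Stream a text file into a memory-mapped .npy file, chunk by chunk.

    Assumption: one sample per line, n_features whitespace-separated numbers.
    Returns the number of rows actually written.
    """
    # Creates the .npy file on disk without holding the full array in RAM.
    out = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=np.float32, shape=(n_samples, n_features)
    )
    row = 0
    buf = []
    with open(txt_path) as f:
        for line in f:
            buf.append([float(x) for x in line.split()])
            if len(buf) == chunk_rows:
                out[row:row + len(buf)] = np.asarray(buf, dtype=np.float32)
                row += len(buf)
                buf = []
    if buf:  # write the last, partially filled chunk
        out[row:row + len(buf)] = np.asarray(buf, dtype=np.float32)
        row += len(buf)
    out.flush()
    del out  # release the memory map
    return row


def read_rows(npy_path, start, stop):
    """Load only rows [start, stop) from disk; the rest of the file is not read."""
    data = np.load(npy_path, mmap_mode="r")
    return np.array(data[start:stop])  # copy the slice into ordinary memory
```

With something like this, a dataset class could call `read_rows` (or index the memory map directly) per batch. Is that a reasonable direction, or is there a better format for this?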