Data processing in a batched way

  1. You are right. Every batch will grab 10 chunks of size 3600.

  2. I would suggest preprocessing your data in the __getitem__ method, since you will most likely wrap your Dataset in a DataLoader, which can load the batches using multi-processing.
    This way, the DataLoader can prepare some batches in the background while your training loop is still busy.

  3. I have created a small example snippet to play around with:

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader

# Create dummy csv data
nb_samples = 110
a = np.arange(nb_samples)
df = pd.DataFrame(a, columns=['data'])
df.to_csv('data.csv', index=False)


# Create Dataset
class CSVDataset(Dataset):
    def __init__(self, path, chunksize, nb_samples):
        self.path = path
        self.chunksize = chunksize
        # number of chunks (integer division, so a trailing partial chunk is dropped)
        self.len = nb_samples // self.chunksize

    def __getitem__(self, index):
        x = next(
            pd.read_csv(
                self.path,
                skiprows=index * self.chunksize + 1,  # +1, since we skip the header
                chunksize=self.chunksize,
                names=['data']))
        x = torch.from_numpy(x.data.values)
        return x

    def __len__(self):
        return self.len


dataset = CSVDataset('data.csv', chunksize=10, nb_samples=nb_samples)
loader = DataLoader(dataset, batch_size=10, num_workers=1, shuffle=False)

for batch_idx, data in enumerate(loader):
    print('batch: {}\tdata: {}'.format(batch_idx, data))

The last batch will only contain 10 samples (a single chunk).
Is this code of any help?
I think your number of samples should be divisible by the chunksize without a remainder, because otherwise the last chunk will have a different size. At least I cannot come up with a fast solution at the moment. :wink:


So each batch will have 100 examples if I see that correctly? In general, I would maybe recommend converting the CSV file to HDF5, so that you could have true shuffling.

Once converted, in __getitem__, you could then use something like

with h5py.File('your_dataset.h5', 'r') as h5f:
    features = h5f['features'][index]
    target = h5f['targets'][index]
return features, target

Could be a bit slower though, since you’d require more __getitem__ calls (100 instead of 10) to get e.g. a batch size of 100.
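
If that overhead matters, you could also keep the chunked indexing and read a slice of rows per __getitem__ call instead. A rough, untested sketch (the class name HDF5ChunkDataset is made up here; the 'features'/'targets' keys just follow the placeholder names above):

import h5py
import torch
from torch.utils.data import Dataset

class HDF5ChunkDataset(Dataset):
    """Returns `chunksize` consecutive rows per __getitem__ call."""
    def __init__(self, path, chunksize):
        self.path = path
        self.chunksize = chunksize
        with h5py.File(path, 'r') as h5f:
            self.nb_samples = h5f['features'].shape[0]

    def __len__(self):
        # drop a trailing partial chunk so all items have the same shape
        return self.nb_samples // self.chunksize

    def __getitem__(self, index):
        start = index * self.chunksize
        end = start + self.chunksize
        with h5py.File(self.path, 'r') as h5f:
            features = torch.from_numpy(h5f['features'][start:end])
            targets = torch.from_numpy(h5f['targets'][start:end])
        return features, targets

The trade-off is that shuffling then happens at the chunk level again rather than per sample.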


That’s exactly what I mean! It’s very helpful, thank you!

Right now I realize that the 3rd point is the key problem I’m facing.

A real dataset usually has an irregular number of lines, such as 361234, which is not divisible by the chunk size without a remainder. So the chunk size matters rather than the batch size, and the lazy way to handle this runtime error is to make the number of lines divisible by the chunk size, right?

You are right, the lazy way is to drop the remainder of your data, if that’s possible (see the small sketch below). :wink:
However, have a look at @rasbt 's suggestion as well, which seems to have some advantages (true shuffling etc.).
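
Just to make the “drop the remainder” idea concrete, a tiny sketch using the numbers from your post and the CSVDataset from above (the floor division in __len__ already takes care of it):

# 361234 lines with chunksize=3600 gives 100 full chunks;
# the remaining 1234 lines are simply never indexed
nb_samples = 361234
chunksize = 3600

print(nb_samples // chunksize)                              # 100 chunks
print(nb_samples - (nb_samples // chunksize) * chunksize)   # 1234 dropped samples

dataset = CSVDataset('data.csv', chunksize=chunksize, nb_samples=nb_samples)
print(len(dataset))                                         # 100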

Yes, thanks for mentioning it. I also have problems with data persistence.

  1. Split data into train_set and test_set
    Previously I loaded all the data into memory and indexed it randomly to generate train_set and test_set, but right now I don’t know how to do this better. The only way I know is to split the .csv file in advance and then process the parts by chunking. Is there a better way?

  2. Store the processed data
    The data processing procedure will expand the data 20-30x, so it’s better to dump the result. Is there a recommended way/format to do this? I guess I can do it in the __getitem__ function.

I would recommend looking into HDF5. The handling is similar to numpy arrays (with the indexing), but the dataset is not loaded into memory until you access it. I just wrote a quick example for converting a CSV file into HDF5, using Iris for illustration purposes; here, imagine the Iris dataset is a super large dataset that doesn’t fit into memory:

import pandas as pd
import numpy as np
import h5py

# suppose this is a large CSV that does not
# fit into memory:
csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# Get the number of lines in the CSV file if it's on your hard drive:
#num_lines = subprocess.check_output(['wc', '-l', csv_path])
#num_lines = int(num_lines.split()[0])
num_lines = 150
num_features = 4

class_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

# use 10,000 or 100,000 or so for large files
chunksize = 10

# this is your HDF5 database:
with h5py.File('iris.h5', 'w') as h5f:

    # use num_lines-1 if the csv file has a column header
    dset1 = h5f.create_dataset('features',
                               shape=(num_lines, num_features),
                               compression=None,
                               dtype='float32')
    dset2 = h5f.create_dataset('labels',
                               shape=(num_lines,),
                               compression=None,
                               dtype='int32')

    # change the range start from 0 -> 1 if your csv file contains a column header
    for i in range(0, num_lines, chunksize):

        df = pd.read_csv(csv_path,
                header=None,      # no header, define column names manually later
                nrows=chunksize,  # number of rows to read at each iteration
                skiprows=i)       # skip rows that were already read

        # map the class names to integer labels
        df[4] = df[4].map(class_dict)

        features = df.values[:, :4].astype('float32')
        labels = df.values[:, 4].astype('int32')

        # use i-1 and i-1+chunksize if the csv file has a column header
        dset1[i:i+chunksize, :] = features
        dset2[i:i+chunksize] = labels

Once you have converted the dataset, you can check and work with it as follows:

with h5py.File('iris.h5', 'r') as h5f:
    print(h5f['features'].shape)
    print(h5f['labels'].shape)

prints

(150, 4)
(150,)

with h5py.File('iris.h5', 'r') as h5f:
    print('First feature entry', h5f['features'][0])
    print('First label entry', h5f['labels'][0])

prints

First feature entry [ 5.0999999   3.5         1.39999998  0.2       ]
First label entry 0

After that, you can use the HDF5 database to set up a DataLoader via PyTorch.
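
For example, a rough, untested sketch of such a Dataset for the iris.h5 file created above (IrisH5Dataset is just a made-up name; note that the int32 labels are returned as Python ints so the default collate turns them into a LongTensor, which nll_loss / CrossEntropyLoss expect):

import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class IrisH5Dataset(Dataset):
    def __init__(self, path):
        self.path = path
        with h5py.File(path, 'r') as h5f:
            self.num_samples = h5f['features'].shape[0]

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        with h5py.File(self.path, 'r') as h5f:
            features = torch.from_numpy(h5f['features'][index])
            # int32 in the file -> Python int, collated into a LongTensor
            label = int(h5f['labels'][index])
        return features, label


loader = DataLoader(IrisH5Dataset('iris.h5'), batch_size=32, shuffle=True, num_workers=0)

for features, labels in loader:
    print(features.shape, labels.shape)  # torch.Size([32, 4]) torch.Size([32])
    break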


I just put up a full example here (https://github.com/rasbt/deep-learning-book/blob/master/code/model_zoo/pytorch_ipynb/custom-data-loader-csv.ipynb) that might be handy as a template for your scenario


If you have put your data into an HDF5 dataset, with dimension 0 being the batch dimension, you could use the code below to sample batches directly from disk while splitting into training and validation.

train_loader, valid_loader = train_valid_loaders(DatasetFromHDF5(data_path,'data'),
                                                 valid_fraction=0.1,
                                                 batch_size=BATCH_SIZE,
                                                 pin_memory=use_gpu)


import torch
import h5py
import math
import numpy as np
from torch.utils.data.dataset import Dataset
from torch.utils.data.sampler import SubsetRandomSampler

class DatasetFromHDF5(Dataset):
    def __init__(self, filename, dataset):
        h5f = h5py.File(filename, 'r')
        self.data = h5f[dataset]

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, item):
        return self.data[item].astype(float)

def train_valid_loaders(dataset, valid_fraction=0.1, **kwargs):
    num_train = len(dataset)
    indices = list(range(num_train))
    split = int(math.floor(valid_fraction * num_train))

    if not ('shuffle' in kwargs and not kwargs['shuffle']):
        # np.random.seed(random_seed)
        np.random.shuffle(indices)
    if 'num_workers' not in kwargs:
        kwargs['num_workers'] = 1

    train_idx, valid_idx = indices[split:], indices[:split]
    train_sampler = SubsetRandomSampler(train_idx)
    valid_sampler = SubsetRandomSampler(valid_idx)

    train_loader = torch.utils.data.DataLoader(dataset,
                                               sampler=train_sampler,
                                               **kwargs)
    valid_loader = torch.utils.data.DataLoader(dataset,
                                               sampler=valid_sampler,
                                               **kwargs)
    return train_loader, valid_loader

The source code lives here


I’m following your code to convert CSV to HDF5. There is an error when I set num_workers greater than 2:
KeyError: ‘Unable to open object (bad object header version number)’

Is it safer, when using HDF5, to set num_workers=1 in the DataLoader?

@ptrblck

Is it a bottleneck to process the data in __getitem__?

I mean, every time the Dataset fetches data through the __getitem__ method, it has to wait for the data processing. This is done by the CPU, right? Although the DataLoader loads the batches using multi-processing (on the CPU?), it’s still slower than batch-loading simple data and batch-processing it with data.cuda (on the GPU), right?

I found it slows down when I use process(data) in __getitem__, but I haven’t compared it with data.cuda, since my GPU environment is on a remote server and is difficult to debug.

It shouldn’t be a bottleneck and I would even recommend using it!
The __getitem__ method is usually called by the CPU and I’m not sure if it’s possible to call it from the GPU or just to transfer the calculations to it.
Using num_workers > 1 will use multi-processing on the CPU.
AFAIK even num_workers=1 creates another process, while num_workers=0 processes the data in the main process.

It’s not necessarily slower, since the data can be loaded and processed by your CPUs while the GPU is busy performing the forward and backward passes. After the GPU finishes, it can just grab the new batch, which is probably already waiting for it. You should also use pin_memory=True in your DataLoader to speed up the data transfer between CPU and GPU; a small sketch of that is below.

GPU operations are called asynchronously, so they are not blocking your CPU.
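
For illustration, a rough sketch of that pattern (the dummy data, model, criterion and optimizer are made up just to keep it self-contained):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# dummy data, model, loss and optimizer just for the sketch
dataset = TensorDataset(torch.randn(1000, 4), torch.randint(0, 2, (1000,)))
model = nn.Linear(4, 2).to('cuda')
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# pin_memory=True makes the DataLoader put the batches into page-locked memory
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

for data, target in loader:
    # non_blocking=True lets the host-to-device copy overlap with other work
    data = data.to('cuda', non_blocking=True)
    target = target.to('cuda', non_blocking=True)

    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()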


num_workers=1 spawns a separate process to load the data (I double-checked in top when running :slight_smile: )

If you try to use more than one worker, my guess is they will get in each other’s way when trying to read/write the file; I think that option is more geared towards loading data from sources where it’s easy to have multiple simultaneous handles.

In any case I think disk I/O is likely to be the bottleneck in your case, so I wouldn’t expect increasing the number of workers to improve performance much.


By the way, PyCharm’s support for remote development is awesome. I’ve configured my pytorch project to use a remote interpreter on an AWS box, and together with configuring it as a remote server I just write code on my laptop in PyCharm, press ‘debug’, and it auto-uploads the code to AWS and runs it there while I debug on my local laptop! So I can set breakpoints, inspect variables, etc., in my local IDE against the process running on AWS – really nice, I can’t recommend it enough.


Yes, it’s really a good way to do remote work.
But now I have to deal with duo authentication and a remote task queue… :sob:

Finally, I finished the training procedure successfully!

I learn the chunk part from @ptrblck, hdf5 part from @rasbt, and the sampler part from @Egor_Kraev, thanks for all of you guys!

There are two problems I have to mention, in case other people spend time on them.

  1. KeyError: unable to open HDF5 file

[… in h5py.h5o.open
KeyError: ‘Unable to open object (bad object header version number)’]

It’s easy to get this error when num_workers is greater than 2, but even num_workers=1 can produce it, just much less frequently. It might be an h5py or HDF5 problem; rerunning the code can bypass it.

  2. RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes’ failed

[return torch._C._nn.nll_loss(input, target, weight, size_average, ignore_index, reduce)
RuntimeError: Assertion `cur_target >= 0 && cur_target < n_classes’ failed. at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THNN/generic/ClassNLLCriterion.c:87]

If you run on the GPU, it might instead show up as a “device-side assert triggered” error:

/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THCUNN/ClassNLLCriterion.cu:57: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [28,0,0] Assertion t >= 0 && t < n_classes failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.c line=32 error=59 : device-side assert triggered

This is a very weird error, since I’m definitely sure my targets are in [0, n_classes-1], actually just 0 and 1. Also, it doesn’t always happen: sometimes everything is fine and sometimes it shows up, just like error 1. The only thing I suspected was that the targets are int32 in HDF5 but int64 in a LongTensor. So I printed them out and was shocked by the contents. Some batches print 0 and 1, but the failing batches print something like 0.0…0e0 and 1.12…6e9, before I cast them. I was stuck on this error for a long time until I removed all num_workers settings. Besides, I found that the smaller the batch size, the less frequently this error shows up.

I think hdf5/h5py might have some multi-processing reading bugs, which are far beyond my ability to fix. So I have to sacrifice training time to get the output model. :disappointed_relieved:

Hm, are you trying to open the HDF5 file multiple times via h5py.File maybe?

I tried 2 ways, but both of them show those errors sometimes.

First way:

if __name__ == '__main__':
   ...
   with h5py.File(h5_path, 'r') as h5f:
         dataset = HDF5Dataset(h5f, transform)
         train_loader, valid_loader = loaders(dataset, valid_fraction=0.1, **kwargs)
         model = train_model(net, train_loader, valid_loader, criterion, optimizer, nepochs)
   ...

Second way

class HDF5Dataset(Dataset):
    def __init__(self, file, transform=None):
        self.h5f = h5py.File(file, 'r')
...

if __name__ == '__main__':
    dataset = HDF5Dataset(file, transform)
    train_loader, valid_loader = loaders(dataset, valid_fraction=0.1, **kwargs)
    model = train_model(net, train_loader, valid_loader, criterion, optimizer, nepochs)

    dataset.h5f.close()

Hm, sorry, I do not really have a good explanation for why this might not work. On my machine, I haven’t encountered this issue yet… However, I saw that there’s a dedicated page in the h5py docs that discusses the use of parallel processing with the same HDF5 file – I suppose this is more geared towards reading and writing though (as opposed to just reading): http://docs.h5py.org/en/latest/mpi.html

On the other hand, I found a statement on stackoverflow from someone saying that parallel reads should be fine: https://stackoverflow.com/questions/29251839/is-it-possible-to-do-parallel-reads-on-one-h5py-file-using-multiprocessing

Since you are also getting the error with num_workers=1, maybe there’s a general issue with the database. Maybe you could just try to iterate over the HDF5 file manually for one training loop to try to isolate this issue from the DataLoader
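
Another thing you could try (just a rough sketch, I haven’t verified that it avoids the error) is to not keep a file handle around in __init__ at all and instead open the file lazily inside __getitem__, so that each worker process ends up with its own handle; the class and dataset names below are made up:

import h5py
from torch.utils.data import Dataset

class LazyHDF5Dataset(Dataset):
    def __init__(self, file, dataset='data', transform=None):
        self.file = file
        self.dataset = dataset
        self.transform = transform
        self.h5f = None
        # open once just to read the length, then close again
        with h5py.File(file, 'r') as h5f:
            self.length = h5f[dataset].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # opened on first access, i.e. once per worker process
        if self.h5f is None:
            self.h5f = h5py.File(self.file, 'r')
        x = self.h5f[self.dataset][index]
        if self.transform is not None:
            x = self.transform(x)
        return x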

Hey,

I had the very same problem reading h5 in parallel.

There is a good alternative to HDF5 files: numpy memmaps

https://docs.scipy.org/doc/numpy-1.15.1/reference/generated/numpy.memmap.html

They work like numpy arrays: you can slice them, etc. But they stay on your hard drive, so they can be as big as you want. AND you can read them in parallel.
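
A rough sketch of what that could look like (the file name, shape and MemmapDataset class are made up; note that np.memmap stores no metadata, so you need to pass the dtype and shape yourself):

import numpy as np
import torch
from torch.utils.data import Dataset

shape = (1000, 3600)   # (num_samples, features_per_sample), just example numbers

# write the preprocessed data once, chunk by chunk
mmap = np.memmap('data.mmap', dtype='float32', mode='w+', shape=shape)
for start in range(0, shape[0], 100):
    mmap[start:start + 100] = np.random.rand(100, shape[1])
mmap.flush()

class MemmapDataset(Dataset):
    def __init__(self, path, shape, dtype='float32'):
        # read-only view; rows are only read from disk when indexed
        self.data = np.memmap(path, dtype=dtype, mode='r', shape=shape)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index):
        # copy the row so the returned tensor owns its memory
        return torch.from_numpy(np.array(self.data[index]))

dataset = MemmapDataset('data.mmap', shape=shape)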

Perhaps you can have a look at the chunk dataset added to libtorch.

The key idea is exactly this scenario, where data loading can happen in chunks and we can decouple prefetchers and transforms (collate) into different threads. We have used this and added Python bindings for it.
