I would recommend looking into HDF5. Its handling is similar to NumPy arrays (including the indexing), but the dataset is not loaded into memory until you access it. Below is a quick example that converts a CSV file into HDF5, using the Iris dataset for illustration; imagine that the Iris dataset is a super large dataset that doesn't fit into memory.
import pandas as pd
import numpy as np
import h5py

# suppose this is a large CSV that does not
# fit into memory:
csv_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# get the number of lines in the CSV file if it's on your hard drive:
#num_lines = subprocess.check_output(['wc', '-l', csv_path])
#num_lines = int(num_lines.split()[0])
num_lines = 150
num_features = 4

class_dict = {'Iris-setosa': 0,
              'Iris-versicolor': 1,
              'Iris-virginica': 2}

# use 10,000 or 100,000 or so for large files
chunksize = 10

# this is your HDF5 database:
with h5py.File('iris.h5', 'w') as h5f:

    # use num_lines-1 if the csv file has a column header
    dset1 = h5f.create_dataset('features',
                               shape=(num_lines, num_features),
                               compression=None,
                               dtype='float32')
    dset2 = h5f.create_dataset('labels',
                               shape=(num_lines,),
                               compression=None,
                               dtype='int32')

    # change the range start from 0 to 1 if the csv file has a column header
    for i in range(0, num_lines, chunksize):

        df = pd.read_csv(csv_path,
                         header=None,      # no header; define column names manually later
                         nrows=chunksize,  # number of rows to read at each iteration
                         skiprows=i)       # skip rows that were already read

        df[4] = df[4].map(class_dict)

        features = df.values[:, :4]
        labels = df.values[:, -1]

        # use i-1 and i-1+chunksize if the csv file has a column header
        dset1[i:i+chunksize, :] = features
        dset2[i:i+chunksize] = labels
Once you have converted the dataset, you can check and work with it as follows:
with h5py.File('iris.h5', 'r') as h5f:
    print(h5f['features'].shape)
    print(h5f['labels'].shape)
prints
(150, 4)
(150,)
with h5py.File('iris.h5', 'r') as h5f:
    print('First feature entry', h5f['features'][0])
    print('First label entry', h5f['labels'][0])
prints
First feature entry [ 5.0999999 3.5 1.39999998 0.2 ]
First label entry 0
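To see the lazy-loading behavior in action, note that indexing an h5py dataset only reads the requested rows from disk. Here is a small self-contained sketch (the file name `demo.h5` and the array contents are made up for illustration):

```python
import numpy as np
import h5py

# build a small example database (stand-in for a huge one)
with h5py.File('demo.h5', 'w') as h5f:
    h5f.create_dataset('features', data=np.arange(1000.0).reshape(250, 4))

with h5py.File('demo.h5', 'r') as h5f:
    dset = h5f['features']   # no data has been read into memory yet
    chunk = dset[100:110]    # only these 10 rows are read from disk
    print(chunk.shape)       # (10, 4)
```

Because only the sliced rows are materialized as a NumPy array, this pattern works even when the `features` dataset itself is far larger than RAM.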
After that, you can use the HDF5 database to set up a DataLoader via PyTorch.
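For example, a minimal sketch of a map-style `Dataset` wrapping the `iris.h5` file created above (the class name `H5Dataset` is just my choice here; the file-handle handling assumes you may use multiple DataLoader workers, each of which should open its own handle):

```python
import h5py
import torch
from torch.utils.data import Dataset, DataLoader

class H5Dataset(Dataset):
    """Map-style dataset reading rows from an HDF5 file on demand."""

    def __init__(self, h5_path):
        self.h5_path = h5_path
        self.h5f = None  # opened lazily, once per worker process
        with h5py.File(h5_path, 'r') as h5f:
            self.length = h5f['labels'].shape[0]

    def __getitem__(self, idx):
        if self.h5f is None:
            self.h5f = h5py.File(self.h5_path, 'r')
        x = torch.from_numpy(self.h5f['features'][idx])  # one row, shape (4,)
        y = int(self.h5f['labels'][idx])
        return x, y

    def __len__(self):
        return self.length

# usage:
# loader = DataLoader(H5Dataset('iris.h5'), batch_size=32, shuffle=True)
# for features, labels in loader:
#     ...
```

Opening the file inside `__getitem__` rather than `__init__` avoids sharing a single HDF5 handle across DataLoader worker processes, which is a common source of errors with `num_workers > 0`.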