I’m quite new to PyTorch and am a bit unsure if my method of storage and retrieval of training data is efficient. To clarify, my code works as expected and runs without errors, but slowly.
I’m using h5py (which I am also new to) and tried to model my functions after the suggestions in this forum post: DataLoader, when num_worker >0, there is bug
My training data has the shape X = (400000, 1000)
and y = (400000, 3)
and is saved as an HDF5 file using:
with h5py.File(fileName, 'w') as f:
    for i in range(X.shape[0]):
        f.create_dataset('%s/data_X' % i, data=X[i])
        f.create_dataset('%s/data_y' % i, data=y[i])
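For anyone who wants to reproduce this, here is a scaled-down, self-contained version of the write step (random dummy data, a temporary file, and 100 samples instead of 400000 — the names here are just stand-ins for my real arrays):

```python
import os
import tempfile

import h5py
import numpy as np

# Toy-scale stand-ins for my real arrays (real shapes: (400000, 1000) and (400000, 3))
X = np.random.rand(100, 1000).astype(np.float32)
y = np.random.rand(100, 3).astype(np.float32)

fileName = os.path.join(tempfile.mkdtemp(), 'train.h5')

# Same layout as above: one group per sample, two datasets per group
with h5py.File(fileName, 'w') as f:
    for i in range(X.shape[0]):
        f.create_dataset('%s/data_X' % i, data=X[i])
        f.create_dataset('%s/data_y' % i, data=y[i])

with h5py.File(fileName, 'r') as f:
    print(len(f))  # number of top-level groups, one per sample -> 100
```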
Later, when I want to train my network, my retrieval code looks like this:
class H5Dataset(Data.Dataset):
    def __init__(self, h5_path):
        self.h5_path = h5_path
        self._h5_gen = None

    def __getitem__(self, index):
        if self._h5_gen is None:
            self._h5_gen = self._get_generator()
            next(self._h5_gen)
        return self._h5_gen.send(index)

    def _get_generator(self):
        with h5py.File(self.h5_path, 'r') as record:
            index = yield
            while True:
                X = record[str(index)]['data_X'][()]
                y = record[str(index)]['data_y'][()]
                index = yield X, y

    def __len__(self):
        with h5py.File(self.h5_path, 'r') as record:
            return len(record)
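The generator in `_get_generator` is there so the file handle stays open across `__getitem__` calls instead of reopening the file per sample. In case the control flow is unclear, here is a standalone sketch of the same `yield`/`send` pattern on a tiny dummy file (no PyTorch needed; file name and sizes are made up):

```python
import os
import tempfile

import h5py
import numpy as np

# Build a tiny file with 5 samples in the same one-group-per-sample layout
path = os.path.join(tempfile.mkdtemp(), 'toy.h5')
with h5py.File(path, 'w') as f:
    for i in range(5):
        f.create_dataset('%s/data_X' % i, data=np.full(3, i, dtype=np.float32))

def reader(h5_path):
    # Open the file once; afterwards each send(index) yields that sample
    with h5py.File(h5_path, 'r') as record:
        index = yield  # first next() parks the generator here
        while True:
            index = yield record[str(index)]['data_X'][()]

gen = reader(path)
next(gen)           # prime the generator up to the first bare `yield`
print(gen.send(2))  # prints [2. 2. 2.]
print(gen.send(4))  # prints [4. 4. 4.]
```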
BATCH_SIZE = 400
loader = Data.DataLoader(
    dataset=H5Dataset(fileName),
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0)
for i, (X_batch, y_batch) in enumerate(loader):
    # Training occurs here
Running one full pass over the loader (with the training step disabled) takes about 4 minutes, which to my untrained eye seems like it could be more efficient. If so, what could I do to improve it?