I was initially getting an OSError (B-tree read error) when using multiple worker processes. So I followed the advice in this thread here:
And created a dataset class like this:
import h5py
from torch.utils import data

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase

    def __getitem__(self, index):
        # Open the file on every access so each DataLoader worker gets its own handle.
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays'][index]
            label = archive[str(self.phase) + '_labels'][index]
            path = archive[str(self.phase) + '_img_paths'][index]
            return datum, label, path

    def __len__(self):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays']
            return len(datum)
if __name__ == '__main__':
    train_dataset = Features_Dataset(archive="featuresdata/train.hdf5", phase='train')
    trainloader = data.DataLoader(train_dataset, num_workers=8, batch_size=128)
    print(len(trainloader))
    # Loop variable renamed so it doesn't shadow the torch.utils.data module.
    for i, (datum, label, path) in enumerate(trainloader):
        print(path)
Now I don’t get the error anymore, but loading data is extremely slow: the 4 GPUs I’m trying to utilize sit at 0% utilization in nvidia-smi. I think there should be a fix, or I have written something completely inefficient. I have 150k instances, where the data, labels and paths live in 3 separate datasets within the H5 file; I’m not sure whether that is part of the problem.
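My guess is that reopening the HDF5 file from scratch on every single __getitem__ call is what’s killing throughput. Would a lazy-open variant like the sketch below be the right fix? The idea is to open the file once per worker process on first access and then reuse that handle, instead of paying the open/close cost per item. This is untested, and LazyFeaturesDataset is just my placeholder name:

import h5py
from torch.utils import data

class LazyFeaturesDataset(data.Dataset):
    """Sketch: open the HDF5 file once per worker process, not once per item."""
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase
        self.file = None  # opened lazily inside each worker process
        # Cache the length up front so __len__ never has to reopen the file.
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as f:
            self.length = len(f[str(self.phase) + '_all_arrays'])

    def __getitem__(self, index):
        if self.file is None:
            # First access in this process: open the file and keep the handle.
            # Each DataLoader worker is a separate process, so each one ends up
            # with its own handle and no handle is shared across processes.
            self.file = h5py.File(self.archive, 'r', libver='latest', swmr=True)
        datum = self.file[str(self.phase) + '_all_arrays'][index]
        label = self.file[str(self.phase) + '_labels'][index]
        path = self.file[str(self.phase) + '_img_paths'][index]
        return datum, label, path

    def __len__(self):
        return self.length

The brief open in __init__ happens before the workers fork and is closed by the context manager, so no open handle crosses the fork boundary, which I believe is what triggered the B-tree error in the first place. Is this the recommended pattern, or is there a better way?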