Hello all, I have a dataset that requires the use of h5 files. The dataset has 1000 images with a total size of 100 GB. I have two options:
Save each image to its own hdf5 file, so that I have 1000 hdf5 files in total. In the dataloader, I will call
import glob, os
import h5py
from torch.utils import data

class h5_loader(data.Dataset):
    def __init__(self, file_path):
        self.file_list = [f for f in glob.glob(os.path.join(file_path, '*.h5'))]
    def __getitem__(self, index):
        h5_file = h5py.File(self.file_list[index], 'r')  # open the per-sample file
        data = h5_file.get('data')
        return data
Save all images into a single hdf5 file; then the dataloader can call
class h5_loader(data.Dataset):
    def __init__(self, file_path):
        h5_file = h5py.File(file_path, 'r')
        self.data = h5_file.get('data')  # keep the dataset handle so __getitem__ can index it
    def __getitem__(self, index):
        ...
Which option should I use to speed up data loading? The first way loads each hdf5 file in the __getitem__ function, while the second way loads a single file (containing all 1000 images) in the __init__ function.
If your whole dataset fits into your RAM, you could load it from one large file inside your dataset's __init__. This would load all your data once and take a bit longer during initialization, but would be the fastest way during training, since no additional data has to be loaded.
If that is not possible, you could increase your number of workers and load the images inside your __getitem__. If I remember correctly, HDF5 works as a generator, which means you could also use one large HDF5 file. The advantage is that you don't need that much RAM; on the other hand, you would have to load your data during training, which may be a bottleneck (depending on the size of your images, the performance of your GPU, your CPU, and your number of workers).
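A minimal sketch of that first approach (load everything up front in __init__), assuming the single file contains, say, a 'data' and a 'labels' dataset (the names are just placeholders), could look like this:

import h5py
import torch
from torch.utils import data

class InMemoryH5Dataset(data.Dataset):
    def __init__(self, file_path):
        # read everything into RAM once; [...] forces a full read into numpy arrays
        with h5py.File(file_path, 'r') as f:
            self.images = torch.from_numpy(f['data'][...])
            self.labels = torch.from_numpy(f['labels'][...])
    def __getitem__(self, index):
        # pure in-memory indexing, no disk access during training
        return self.images[index], self.labels[index]
    def __len__(self):
        return self.images.shape[0]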
Hey!
I'm a first-time responder and pretty new to PyTorch and deep learning. I am also trying to train a CNN (a UNet, to be specific) with image data, and I think I'm running into a problem related to this topic.
I have an h5py database file that is too big to load into memory (~27 GB). It has 8,000 samples and each sample's shape is (14, 257, 256). It's worth mentioning that I am creating that database myself, so I can control the number of files and each file's size.
According to the solution here, I can create 8,000 different h5py files and use the __getitem__ method to get a different sample from a different file every time (in that case, does the __len__ method still return 8,000?).
I was thinking of a way to combine these two approaches by creating 8 files with 1,000 samples each. In that case the loader would look something like this(?):
class h5_loader(data.Dataset):
    def __init__(self):
        self.file_path = '/content/drive/My Drive/Project Dataset/trainFile'
    def __getitem__(self, index):
        i = index // 1000                # which of the 8 files holds this sample
        mod = index % 1000               # position of the sample inside that file
        h5_file = h5py.File(self.file_path + str(i) + '.h5', 'r')
        input = h5_file['Input'][mod]
        target = h5_file['Target'][mod]
        return input, target
    def __len__(self):
        return 8000                      # total number of samples across all files
Would that work?
What would you say is the cleanest and fastest way to load my data into the model?
I'm working on Google Colab from my 8 GB RAM laptop (I use 0 workers).
So apparently this is a very BAD idea. I tried to train my model using this option and it was very slow, and I think I figured out why.
The disadvantage of using 8,000 files (1 file for each sample) is that the __getitem__ method has to open a file every time the dataloader wants a new sample (but each file is relatively small, because it contains only one sample).
Doing what I did makes the dataloader load a 1,000-sample file just to get one sample out of it, and by the time the dataloader wants the next sample, it has to load the same file all over again.
I haven't found an optimal solution; for the meantime I've settled for a smaller dataset.
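One thing that might avoid the repeated reopening (just a sketch I haven't benchmarked, using the same placeholder path and dataset names as above) is to keep each file handle open after the first access:

class h5_loader(data.Dataset):
    def __init__(self):
        self.file_path = '/content/drive/My Drive/Project Dataset/trainFile'
        self.handles = {}                          # file index -> open h5py.File
    def _get_file(self, i):
        # open each of the 8 files at most once and reuse the handle afterwards
        if i not in self.handles:
            self.handles[i] = h5py.File(self.file_path + str(i) + '.h5', 'r')
        return self.handles[i]
    def __getitem__(self, index):
        h5_file = self._get_file(index // 1000)
        return h5_file['Input'][index % 1000], h5_file['Target'][index % 1000]
    def __len__(self):
        return 8000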
At first I tried to use John's second option (where __getitem__ loads each sample separately), but it was really slow on my computer (about 28 seconds to load each sample).
Then I tried to get creative and divide each epoch into 8 parts (of 1,000 samples each); after each part I deleted the dataloader object and created a new one with a different dataset, like this:
for j in np.arange(25):
    ######## Starting train epoch #########
    running_loss = 0
    train_loss_vector = []
    for k in range(8):
        DB_T = My_Dataloader_Train(k)
        train_loader_train = data.DataLoader(dataset=DB_T, batch_size=16, num_workers=0)
        net.train()  # setting the model to training mode
        for batch_inx, (specs, masks) in enumerate(train_loader_train):
            specs = specs.to(device).float()
            ...
It looked like it was working, but after (I think) 10 epochs Colab crashed with an 'Unable to open file' error (which is weird, because it had already succeeded in opening the same file an epoch earlier).
I was initially getting an OS B-tree error when using multiple processes, so I followed the advice in this thread here:
And created a Dataset class like this:
class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase
    def __getitem__(self, index):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays'][index]
            label = archive[str(self.phase) + '_labels'][index]
            path = archive[str(self.phase) + '_img_paths'][index]
            return datum, label, path
    def __len__(self):
        with h5py.File(self.archive, 'r', libver='latest', swmr=True) as archive:
            datum = archive[str(self.phase) + '_all_arrays']
            return len(datum)

if __name__ == '__main__':
    train_dataset = Features_Dataset(archive="featuresdata/train.hdf5", phase='train')
    trainloader = data.DataLoader(train_dataset, num_workers=8, batch_size=128)
    print(len(trainloader))
    for i, (data, label, path) in enumerate(trainloader):
        print(path)
Now I don't get an error anymore, but loading the data is super slow. Because of that, the 4 GPUs that I'm trying to utilize sit at 0% utilization. I think there should be a fix, or I have written something completely inefficient. I have 150k instances, where the data, labels, and paths are stored in 3 different datasets within the H5 file. I'm not sure if that is part of the problem.
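One change I'm considering (just a sketch, not something I've verified yet) is to stop reopening the HDF5 file on every __getitem__ call and instead open it lazily once per worker, keeping the same dataset names as above:

class Features_Dataset(data.Dataset):
    def __init__(self, archive, phase):
        self.archive = archive
        self.phase = phase
        self.file = None                      # opened lazily, once per worker process
    def _lazy_file(self):
        if self.file is None:
            self.file = h5py.File(self.archive, 'r', libver='latest', swmr=True)
        return self.file
    def __getitem__(self, index):
        f = self._lazy_file()
        datum = f[self.phase + '_all_arrays'][index]
        label = f[self.phase + '_labels'][index]
        path = f[self.phase + '_img_paths'][index]
        return datum, label, path
    def __len__(self):
        with h5py.File(self.archive, 'r') as f:
            return len(f[self.phase + '_all_arrays'])

Since each DataLoader worker gets its own copy of the dataset, the file would only be opened once per worker instead of once per sample, which should cut most of the repeated open/close overhead.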