Hi everyone, I am wondering if anyone has suggestions on how best to store the data I am about to describe.

I am working with data that can be partitioned into roughly 30,000 groups, and each of these groups has between 100 and 1,000 subgroups. Within each subgroup we have a variable number of data points (e.g., group A has 100 subgroups and its first subgroup has 10 data points, group B has 200 subgroups and its second subgroup has 100 data points, etc.).

My dataset's `__getitem__` method should be able to retrieve two random elements from a particular subgroup, so currently I store each group as a separate folder, each subgroup as a separate folder within the corresponding group folder, and each data point as a separate file within the corresponding subgroup folder. (I ended up with hundreds of thousands of folders in the end.)

This is easy to access from the dataset's point of view, but I am sure it is not the best way to represent this data. I am wondering if anyone knows the best practice for structuring hierarchical data like this **so that it is convenient to store but can also be accessed concurrently during training** (I tried storing everything as a single HDF5 file, but had problems accessing the file concurrently once I set the DataLoader's num_workers > 1).