SubsetRandomSampler not splitting data correctly

My training data has a size of 72200, and I want to split it into 5 portions so that I can feed it in without maxing out the RAM during training. I have split my data using SubsetRandomSampler, but when I check the lengths of the subsets, it's not 14440 (72200 / 5), it's 57.

num_partions = 5
total = train_loader.__len__()
partitions = [int(i*total/num_partions) for i in range(num_partions+1)]

for j in range(len(partitions)-1):
  indices = range(total)
  samples = indices[partitions[j]:partitions[j+1]]
  train_sampler = SubsetRandomSampler(samples)
  train_loader_p1 = DataLoader(MyDataset(train, 12), batch_size=256, num_workers=8, pin_memory=True,sampler=train_sampler)

This is the output I am getting:

[0, 14440, 28880, 43320, 57760, 72200]
57 72200
57 72200
57 72200
57 72200
57 72200

The subsets have a length of 57 instead of 14440. I have checked the length of samples, and it's 14440. Where am I going wrong? Also, is there a better way to do the same using random_split?

I guess you are mixing up the number of samples, which is given by the length of the Dataset, and the number of batches, which is given by the length of the DataLoader.

Based on the posted numbers, I assume train_loader uses a batch size of 1, so that len(dataset) == len(train_loader).
However, inside the loop you are using a batch size of 256. Assuming the subset now has 14440 samples, the DataLoader will yield 57 batches, since ceil(14440 / 256) = ceil(56.40625) = 57.
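You can verify the arithmetic without touching the data: with drop_last=False (the default), len(DataLoader) is ceil(num_samples / batch_size). A minimal sketch reproducing the posted numbers:

```python
import math

total = 72200        # total number of training samples
num_partitions = 5
batch_size = 256

# partition boundaries, as in your snippet
partitions = [int(i * total / num_partitions) for i in range(num_partitions + 1)]
samples_per_partition = partitions[1] - partitions[0]

# len(DataLoader) counts batches, not samples
num_batches = math.ceil(samples_per_partition / batch_size)
print(partitions)                            # [0, 14440, 28880, 43320, 57760, 72200]
print(samples_per_partition, num_batches)    # 14440 57
```

Regarding random_split: torch.utils.data.random_split(dataset, lengths) would give you the five Subset objects directly (e.g. with lengths=[14440] * 5), which you could then wrap in DataLoaders without building the index ranges yourself; each subset's length would be 14440 samples, while each loader would still report 57 batches.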
