ptinn
(akito)
May 30, 2022, 5:43pm
1
Hi,
I am encountering a strange issue where my DataLoader returns the training set at the same size as my validation set. With the implementation below I get a 5911/1478 split on my data. But after wrapping both sets in DataLoaders with batch_size=16, both loaders report a length of 93, which corresponds to 1478/16 rounded up. What might have happened to the training set with 5911 samples?
from torch.utils.data import DataLoader, random_split

def split_data(data):
    train = int(len(data) * .8)
    val = len(data) - train
    train_set, val_set = random_split(data, [train, val])
    return train_set, val_set

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=True)
ptrblck
Your code looks correct and also works for me:
import torch
from torch.utils.data import DataLoader, TensorDataset

def split_data(data):
    train = int(len(data) * .8)
    val = len(data) - train
    train_set, val_set = torch.utils.data.random_split(data, [train, val])
    return train_set, val_set

dataset = TensorDataset(torch.randn(5911 + 1478))
train_set, val_set = split_data(dataset)
print(len(train_set))
# 5911
print(len(val_set))
# 1478
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=True)
print(len(train_loader))
# 370 = 369 full batches and 1 batch with 7 samples
print(len(val_loader))
# 93 = 92 full batches and 1 batch with 6 samples
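The printed lengths follow from how a DataLoader computes its own length: with drop_last=False (the default) it is ceil(len(dataset) / batch_size), and with drop_last=True it is the floor. A minimal plain-Python check of that arithmetic (no torch required):

```python
import math

def loader_len(n_samples, batch_size, drop_last=False):
    # Mirrors len(DataLoader): drop the last partial batch, or round up to include it.
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

print(loader_len(5911, 16))  # 370 (369 full batches + 1 batch of 7)
print(loader_len(1478, 16))  # 93 (92 full batches + 1 batch of 6)
```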
ptinn
(akito)
May 30, 2022, 8:44pm
3
ptrblck:
torch.utils.data.random_split
Thank you for verifying, @ptrblck. It was an obvious mistake on my end: I was returning the same set as both the training and validation split. My bad!
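For anyone hitting the same symptom, here is a minimal plain-Python sketch of that mistake (list slicing stands in for random_split, and ceiling division stands in for len(DataLoader), so no torch is needed): returning the same subset twice makes both loaders report 93 batches.

```python
import math

def buggy_split(data):
    train = int(len(data) * .8)
    val = len(data) - train
    train_set, val_set = data[:train], data[train:]  # stand-in for random_split
    return val_set, val_set  # bug: val_set is returned as both splits

data = list(range(5911 + 1478))
train_set, val_set = buggy_split(data)
print(len(train_set), len(val_set))    # 1478 1478 -- both are the validation subset
print(math.ceil(len(train_set) / 16))  # 93, matching the observed loader length
```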