ptinn
(akito)
May 30, 2022, 5:43pm
1
Hi,
I am encountering a strange issue where my DataLoader returns the training set at the same size as my validation set. With the implementation below I get a 5911/1478 split on my data. But after wrapping both sets in DataLoaders with batch_size=16, both loaders report a length of 93, which corresponds to 1478/16 rounded up. What might have happened to the training set with 5911 samples?
from torch.utils.data import DataLoader, random_split

def split_data(data):
    train = int(len(data) * .8)
    val = len(data) - train
    train_set, val_set = random_split(data, [train, val])
    return train_set, val_set

train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=True)
ptrblck
Your code looks correct and also works for me:
import torch
from torch.utils.data import DataLoader, TensorDataset

def split_data(data):
    train = int(len(data) * .8)
    val = len(data) - train
    train_set, val_set = torch.utils.data.random_split(data, [train, val])
    return train_set, val_set

dataset = TensorDataset(torch.randn(5911 + 1478))
train_set, val_set = split_data(dataset)
print(len(train_set))
# 5911
print(len(val_set))
# 1478
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
val_loader = DataLoader(val_set, batch_size=16, shuffle=True)
print(len(train_loader))
# 370 = 369 full batches and 1 batch with 7 samples
print(len(val_loader))
# 93 = 92 full batches and 1 batch with 6 samples
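The printed lengths follow from how a DataLoader computes its own length: with drop_last=False (the default) it is ceil(len(dataset) / batch_size), and with drop_last=True it is the floor. A minimal plain-Python check of that arithmetic (no torch required):

```python
import math

def loader_len(n_samples, batch_size, drop_last=False):
    # Mirrors len(DataLoader): drop the last partial batch, or round up to include it.
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

print(loader_len(5911, 16))  # 370 (369 full batches + 1 batch of 7)
print(loader_len(1478, 16))  # 93 (92 full batches + 1 batch of 6)
```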
ptinn
(akito)
May 30, 2022, 8:44pm
3
ptrblck:
torch.utils.data.random_split
Thank you for verifying, @ptrblck. It was an obvious mistake on my end: I was returning the same set as both the training and validation split. My bad!
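For anyone hitting the same symptom, here is a minimal plain-Python sketch of that mistake (list slicing stands in for random_split, and ceiling division stands in for len(DataLoader), so no torch is needed): returning the same subset twice makes both loaders report 93 batches.

```python
import math

def buggy_split(data):
    train = int(len(data) * .8)
    val = len(data) - train
    train_set, val_set = data[:train], data[train:]  # stand-in for random_split
    return val_set, val_set  # bug: val_set is returned as both splits

data = list(range(5911 + 1478))
train_set, val_set = buggy_split(data)
print(len(train_set), len(val_set))    # 1478 1478 -- both are the validation subset
print(math.ceil(len(train_set) / 16))  # 93, matching the observed loader length
```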