How to use sklearn's train_test_split on PyTorch's dataset

Hello,

I wish to use sklearn’s train_test_split to create a validation set from the training set. I am at a loss on the next steps.

import torch
import torchvision
from sklearn.model_selection import train_test_split

# Load datasets
train_set = torchvision.datasets.CIFAR10(
    root='./data', train=True, transform=transform['train'])

# Create dataloader
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64, shuffle=True)

# Get a single batch from the dataloader
images, labels = next(iter(train_loader))

# Convert images and labels to numpy
images_np = images.to('cpu').numpy()
labels_np = labels.to('cpu').numpy()

# Split validation data from the train set
X_train, X_test, y_train, y_test = train_test_split(
    images_np, labels_np, test_size=0.2, random_state=42, shuffle=True, stratify=labels_np)

After splitting the dataset, how do I combine them back to feed into the dataloader?

I also notice that I am only converting a single batch of images, since next(iter()) yields one batch at a time. How can I convert everything in one go?

Have a look at @kevinzakka’s approach here. It might give you a good starter code for your implementation.
Since you apparently would like to split your CIFAR10 dataset in a stratified fashion, you could use the internal targets to achieve that:

import numpy as np
from sklearn.model_selection import train_test_split

targets = dataset.targets

train_idx, valid_idx = train_test_split(
    np.arange(len(targets)), test_size=0.2, random_state=42, shuffle=True, stratify=targets)

# Sanity-check the per-class counts in each split
print(np.unique(np.array(targets)[train_idx], return_counts=True))
print(np.unique(np.array(targets)[valid_idx], return_counts=True))

These indices can then be passed to a SubsetRandomSampler for each split.
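A minimal sketch of wiring the split indices into DataLoaders via SubsetRandomSampler. To keep it self-contained it uses a random TensorDataset as a stand-in for the CIFAR10 train_set; the shapes and class counts here are illustrative assumptions:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset, SubsetRandomSampler
from sklearn.model_selection import train_test_split

# Stand-in dataset: 1000 fake "images", 10 balanced classes
# (replace with your CIFAR10 train_set and its .targets)
targets = np.repeat(np.arange(10), 100)
data = torch.randn(len(targets), 3, 32, 32)
dataset = TensorDataset(data, torch.as_tensor(targets))

# Stratified split on indices only -- the dataset itself stays whole
train_idx, valid_idx = train_test_split(
    np.arange(len(targets)), test_size=0.2, random_state=42,
    shuffle=True, stratify=targets)

# Each sampler restricts its loader to one index subset and shuffles it,
# so shuffle= must be left at its default False in the DataLoader
train_loader = DataLoader(dataset, batch_size=64,
                          sampler=SubsetRandomSampler(train_idx))
valid_loader = DataLoader(dataset, batch_size=64,
                          sampler=SubsetRandomSampler(valid_idx))

print(len(train_idx), len(valid_idx))  # 800 200
```

Since both loaders wrap the same dataset, nothing has to be "combined back": the split lives entirely in the index lists.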
