Splitting PyTorch Dataset Into Separate Tensors of Labels and Images

James_H · April 9, 2020, 5:22am

Hi,
I’m currently trying to train a basic CNN on the CIFAR10 dataset, which I loaded using train_dataset = torchvision.datasets.CIFAR10(DATA_PATH, train=True, transform=transform, download=True), and was able to achieve decent accuracy. However, I noticed that I was tuning the hyperparameters to the test set and seeing the response, possibly overfitting it. So, I looked into using the GridSearchCV class of Sklearn to find the best combination of hyperparameters through K-Fold cross validation. Code below:

net = NeuralNetClassifier(module=cNet, criterion=nn.CrossEntropyLoss,
    optimizer=torch.optim.Adam, lr=learning_rate, max_epochs=100,
    batch_size=batch_size,)
params = {
    'lr': [0.001, 0.005, 0.01, 0.05, 0.1],
    'max_epochs': np.linspace(5, 30, 6, dtype=np.int64),
    'batch_size': np.exp2(np.arange(5,8))
}
gs = GridSearchCV(estimator=net, param_grid=params, refit=False, scoring='accuracy', cv=10)
gs.fit(train_dataset)
optimizednet = gs.best_estimator_
lr = optimizednet.learning_rate
batch_size = optimizednet.batch_size
num_epochs = optimizednet.max_epochs

In the process, I was unable to find a way to split train_dataset into X (images) and Y (labels) for gs.fit(), which threw this error: TypeError: fit() missing 1 required positional argument: 'y'. Does anyone know a way to split the dataset back into images and labels?
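Separately from the missing y, two details in the search setup above are worth a quick check. With refit=False, scikit-learn's GridSearchCV does not create best_estimator_, so the winning settings have to be read from best_params_ instead. Also, np.exp2 returns floats, while batch sizes are normally integers. The sketch below illustrates both points with a stand-in LogisticRegression and synthetic data (both are assumptions for illustration, not the skorch net from the post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Integer batch sizes: 2 ** [5, 6, 7] stays integer, whereas np.exp2 yields floats.
batch_sizes = (2 ** np.arange(5, 8)).tolist()

# Stand-in data and estimator, used only to show the GridSearchCV attributes.
X, y = make_classification(n_samples=120, n_features=8, random_state=0)
gs = GridSearchCV(
    estimator=LogisticRegression(max_iter=500),
    param_grid={"C": [0.1, 1.0, 10.0]},
    scoring="accuracy",
    cv=3,
    refit=False,  # no final refit, so best_estimator_ is never set
)
gs.fit(X, y)

best = gs.best_params_                               # dict of the best settings
has_best_estimator = hasattr(gs, "best_estimator_")  # False when refit=False
```

With the grid from the post, the same pattern would presumably be gs.best_params_['lr'], gs.best_params_['batch_size'] and gs.best_params_['max_epochs'].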

boto · April 11, 2020, 2:45am

Hi James!

Take a look at torchvision.datasets.CIFAR10. As you can see the __getitem__ returns a tuple with (image, target), where target corresponds to the class/label of the image. You can iterate through the dataset in the following manner:

...
dataset = torchvision.datasets.CIFAR10(DATA_PATH, train=True, transform=transform, download=True)
for i, data in enumerate(dataset): # i == Index
   image, label = data
...

Now you have your image as well as the corresponding label and you can work with it.
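As a minimal sketch of turning that loop into two arrays, the helper below collects every (image, label) pair in a single pass. The name split_xy and the toy stand-in dataset are hypothetical. It assumes each image can be converted with np.asarray, which holds for NumPy arrays and CPU tensors, and that all images share one shape:

```python
import numpy as np

def split_xy(dataset):
    """Collect a dataset of (image, label) pairs into an X array and a y array."""
    images, labels = [], []
    for image, label in dataset:  # single pass, so each item is loaded once
        images.append(np.asarray(image, dtype=np.float32))
        labels.append(int(label))
    X = np.stack(images)                     # shape (N, ...) with image shape kept
    y = np.asarray(labels, dtype=np.int64)   # shape (N,), integer class indices
    return X, y

# Toy stand-in for the real dataset: 4 random images of shape (3, 32, 32).
rng = np.random.default_rng(0)
toy_dataset = [(rng.random((3, 32, 32)), k) for k in range(4)]
X, y = split_xy(toy_dataset)
```

float32 images and int64 labels are chosen here because that is what a PyTorch CNN trained with nn.CrossEntropyLoss usually expects.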

James_H · April 16, 2020, 4:07am

Hi boto,
Sorry for the late reply. I tried splitting the dataset into x and y with

def split_XY(dataset):

    l = []
    a = torch.Tensor(50000, 3, 32, 32)
    for i, (image, label) in enumerate(train_dataset):
        a[i, :, :, :] = image
        l.append(label)
    return a, torch.Tensor(l)

and tried

x, y = split_XY(train_dataset)
gs.fit(x, y)

but when I ran it, it threw ValueError: Cannot perform a CV split if dataset and y have different lengths. That didn’t make sense to me because x is 50000 x 3 x 32 x 32 and y is 50000 x 1. Maybe I’m missing something? Thanks!

boto · April 16, 2020, 3:31pm

Which line of code throws that error?

James_H · April 16, 2020, 11:12pm

 File "cifar10.py", line 75, in <module>
    gs.fit(x, y)

boto · April 16, 2020, 11:43pm

Unfortunately, I don’t have any experience with GridSearchCV.
But the error states that x and y do not have the same length/shape, which is true as x.shape=(50000,3,32,32) and y.shape=(50000, 1). That’s all I can tell you. Maybe try looking at the skorch FAQ.
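The thread never pins down the cause of this ValueError, so the following is only a diagnostic sketch, not a fix. It checks the two properties a cross-validation split relies on: X and y have the same number of samples, and y is one-dimensional. The helper name check_xy is hypothetical:

```python
import numpy as np

def check_xy(X, y):
    """Raise a descriptive error if X and y do not line up sample by sample."""
    X = np.asarray(X)
    y = np.asarray(y)
    if len(X) != len(y):
        raise ValueError(f"X has {len(X)} samples but y has {len(y)}")
    if y.ndim != 1:
        raise ValueError(f"expected 1-D labels, got shape {y.shape}")
    return X.shape, y.shape

# Zero-filled stand-ins with the same layout as a small CIFAR10 batch.
shapes = check_xy(
    np.zeros((8, 3, 32, 32), dtype=np.float32),
    np.zeros(8, dtype=np.int64),
)
```

Running a check like this just before gs.fit(x, y) would at least show whether the arrays themselves are inconsistent or whether the mismatch arises later inside the search.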

James_H · April 17, 2020, 4:19am

Thanks for the help!

yhl3051 · October 21, 2023, 11:35pm

I am surprised that this has not been suggested yet, but the map function is a compact alternative to writing out a for loop and builds the lists directly:

import numpy as np

# Start at index 0 so the first sample is not skipped.
inds = range(len(dataset))
images = list(map(lambda i: np.copy(dataset[i][0]), inds))
labels = list(map(lambda i: np.copy(dataset[i][1]), inds))

Slicing doesn’t seem to work, from what I know. I added np.copy to prevent too many sockets from being loaded at once, but I think that is optional.
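For CIFAR10 specifically, another option is to skip per-item indexing altogether. The torchvision CIFAR10 object keeps the raw images in its data attribute, a uint8 NumPy array of shape (N, 32, 32, 3), and the labels in its targets list. The sketch below converts arrays of that layout. The function name arrays_from_raw and the fake inputs are hypothetical. Note that this route bypasses the dataset's transform, so any normalization has to be applied separately:

```python
import numpy as np

def arrays_from_raw(data, targets):
    """Convert (N, H, W, C) uint8 images and a label list into X, y arrays."""
    X = np.asarray(data, dtype=np.float32) / 255.0     # scale pixels to [0, 1]
    X = np.ascontiguousarray(X.transpose(0, 3, 1, 2))  # NHWC -> NCHW for PyTorch
    y = np.asarray(targets, dtype=np.int64)            # integer class indices
    return X, y

# Fake stand-ins with the same layout as train_dataset.data / train_dataset.targets.
fake_data = np.full((5, 32, 32, 3), 255, dtype=np.uint8)
fake_targets = [0, 1, 2, 3, 4]
X, y = arrays_from_raw(fake_data, fake_targets)
```

With the real dataset the call would presumably be arrays_from_raw(train_dataset.data, train_dataset.targets).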