What exactly is train_test_split doing to the data?

Is
test_x, test2_x, test_y, test2_y = train_test_split(test_x, test_y, test_size=0.001, random_state=134515, stratify=test_y)
doing anything (reshaping?) besides splitting the dataset? I have a trained model and I'm getting completely different accuracies (0.85 vs. 0.52!) if I make a forward pass with the whole dataset or with 99.9% of the dataset (after uncommenting the relevant line in the code below). Obviously the two images not included in test_x after the split don't explain the difference, so I'm guessing the split command is doing something else, but I can't figure out what it is!

# loading dataset

test = pd.read_csv('img_label_list04.csv')
test.head()

#---------

# loading test images

test_img = []
for img_name in tqdm(test['image_names']):
    # defining the image path
    image_path = 'd:/Pytorch/Data/leaves/set04/' + img_name
    img = imread(image_path)
    # normalizing the pixel values
    img = img/255
    # resizing the image to (224,224,3)
    img = resize(img, output_shape=(224,224,3), mode='constant', anti_aliasing=True)
    # converting the type of pixel to float 32
    img = img.astype('float32')
    # appending the image into the list
    test_img.append(img)

# converting the list to numpy array

test_x = np.array(test_img)
test_x.shape

# defining the target

test_y = test['infected_or_not'].values

#-----------------

# split the data set (or not)

test_x, test2_x, test_y, test2_y = train_test_split(test_x, test_y, test_size=0.001, random_state=134515, stratify=test_y)

# shape of test data

test_x.shape, test_y.shape

# converting the test images into torch format

test_x = test_x.reshape(len(test_x), 3, 224, 224)
test_x = torch.from_numpy(test_x)

test_y = test_y.astype(int)
test_y = torch.from_numpy(test_y)

data_y = []
label_y = []

inputs,labels = test_x, test_y

for i in tqdm(range(int(test_x.shape[0]/batch_size))):
    input_data = inputs[i*batch_size:(i+1)*batch_size]
    label_data = labels[i*batch_size:(i+1)*batch_size]
    input_data, label_data = Variable(input_data.cuda()), Variable(label_data.cuda())
    x = model.features(input_data)
    data_y.extend(x.data.cpu().numpy())
    label_y.extend(label_data.data.cpu().numpy())

#--------------------------

# converting the features into torch format

x_test = torch.from_numpy(np.array(data_y))
x_test = x_test.view(x_test.size(0), -1)
y_test = torch.from_numpy(np.array(label_y))

#--------------------------

# prediction for test set

prediction_test = []
target_test = []
permutation = torch.randperm(x_test.size()[0])
for i in tqdm(range(0,x_test.size()[0], batch_size)):
    indices = permutation[i:i+batch_size]
    batch_x, batch_y = x_test[indices], y_test[indices]

    if torch.cuda.is_available():
        batch_x, batch_y = batch_x.cuda(), batch_y.cuda()

    with torch.no_grad():
        output = model.classifier(batch_x.cuda())

    softmax = torch.exp(output).cpu()
    prob = list(softmax.numpy())
    predictions = np.argmax(prob, axis=1)
    prediction_test.append(predictions)
    target_test.append(batch_y)

# test accuracy

accuracy_test = []
for i in range(len(prediction_test)):
    np_target_test = target_test[i].cpu().numpy()
    accuracy_test.append(accuracy_score(np_target_test, prediction_test[i]))

print('test accuracy: \t', np.average(accuracy_test))

The method just splits the data using your arguments as described in the docs.
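For illustration, here is a minimal sketch (with a toy array, not your leaf data) showing that train_test_split only selects and shuffles rows; each image keeps its original (224, 224, 3) layout:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy stand-in for the image array: 10 "images" of shape (224, 224, 3)
X = np.random.rand(10, 224, 224, 3).astype('float32')
y = np.array([0, 1] * 5)

X_a, X_b, y_a, y_b = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

print(X_a.shape, X_b.shape)  # (8, 224, 224, 3) (2, 224, 224, 3)
# the rows are shuffled, but no reshaping happens
```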

I assume you want to permute the dimensions in this line of code:

test_x = test_x.reshape(len(test_x), 3, 224, 224)

which won’t work as expected, since reshape will interleave the data. Use transpose in numpy or permute in PyTorch instead to swap the dimensions.
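To make the difference concrete, here is a small sketch on a toy channels-last array (not your actual data): reshape just reinterprets the same flat memory with a new shape, while transpose actually moves the channel axis.

```python
import numpy as np

# one toy "image" in channels-last (NHWC) layout: batch of 1, 2x2 pixels, 3 channels
x = np.arange(1 * 2 * 2 * 3).reshape(1, 2, 2, 3)

wrong = x.reshape(1, 3, 2, 2)      # only reinterprets the buffer; pixels and channels get mixed up
right = x.transpose(0, 3, 1, 2)    # moves the channel axis to position 1 (NCHW)

print(np.array_equal(wrong, right))  # False
print(right[0, 0])                   # genuinely channel 0 of the image
```

In PyTorch the equivalent would be `tensor.permute(0, 3, 1, 2)` (followed by `.contiguous()` if a contiguous tensor is needed).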

Thanks for your suggestion, but unfortunately this doesn't solve the problem. To make things easier, the code pasted below includes two parts (TEST 1 and TEST 2), where the second part was created simply by pasting the first part and deleting the train_test_split line. The accuracy at the end of part 1 is 0.93 and the accuracy at the end of part 2 is 0.70!

TEST 1 after splitting the dataset and using only 99.9% of the images

#--------------------------------------------------------------------

# loading dataset

test = []
test = pd.read_csv('img_label_list04.csv')
test.head()

#---------

# loading test images

test_img = []
for img_name in tqdm(test['image_names']):
    # defining the image path
    image_path = 'd:/Pytorch/Data/leaves/set04/' + img_name
    # reading the image
    img = imread(image_path)
    # normalizing the pixel values
    img = img/255
    # resizing the image to (224,224,3)
    img = resize(img, output_shape=(224,224,3), mode='constant', anti_aliasing=True)
    # converting the type of pixel to float 32
    img = img.astype('float32')
    # appending the image into the list
    test_img.append(img)

# converting the list to numpy array

test_x = np.array(test_img)
test_x.shape

# defining the target

test_y = test['infected_or_not'].values

#-----------------

# split the data set

test_x, test2_x, test_y, test2_y = train_test_split(test_x, test_y, test_size=0.001, random_state=134515, stratify=test_y)

# shape of test data

test_x.shape, test_y.shape

# converting the test images into torch format

test_x = test_x.transpose(0,3,1,2)
test_x = torch.from_numpy(test_x)

test_y = test_y.astype(int)
test_y = torch.from_numpy(test_y)

data_test = []
label_test = []

inputs,labels = test_x, test_y

for i in tqdm(range(int(test_x.shape[0]/batch_size))):
    input_data = inputs[i*batch_size:(i+1)*batch_size]
    label_data = labels[i*batch_size:(i+1)*batch_size]
    input_data, label_data = Variable(input_data.cuda()), Variable(label_data.cuda())
    x = model.features(input_data)
    data_test.extend(x.data.cpu().numpy())
    label_test.extend(label_data.data.cpu().numpy())

#--------------------------

# converting the features into torch format

x_test = torch.from_numpy(np.array(data_test))
x_test = x_test.view(x_test.size(0), -1)
y_test = torch.from_numpy(np.array(label_test))

#--------------------------

# prediction for test set

prediction_test = []
target_test = []

permutation = torch.randperm(x_test.size()[0])

for i in tqdm(range(0,x_test.size()[0], batch_size)):
    # indices = permutation[i:i+batch_size]
    indices = range(i, i+batch_size)
    batch_x, batch_y = x_test[indices], y_test[indices]

    if torch.cuda.is_available():
        batch_x, batch_y = batch_x.cuda(), batch_y.cuda()

    with torch.no_grad():
        output = model.classifier(batch_x.cuda())

    softmax = torch.exp(output).cpu()
    prob = list(softmax.numpy())
    predictions = np.argmax(prob, axis=1)
    prediction_test.append(predictions)
    target_test.append(batch_y)

# test accuracy

accuracy_test = []
for i in range(len(prediction_test)):
    np_target_test = target_test[i].cpu().numpy()
    accuracy_test.append(accuracy_score(np_target_test, prediction_test[i]))

print('test accuracy 1: \t', np.average(accuracy_test))

TEST 2 using all the images

#--------------------------------------------------------------------

test = []
test = pd.read_csv('img_label_list04.csv')
test.head()

#---------

# loading test images

test_img = []
for img_name in tqdm(test['image_names']):
    # defining the image path
    image_path = 'd:/Pytorch/Data/leaves/set04/' + img_name
    # reading the image
    img = imread(image_path)
    # normalizing the pixel values
    img = img/255
    # resizing the image to (224,224,3)
    img = resize(img, output_shape=(224,224,3), mode='constant', anti_aliasing=True)
    # converting the type of pixel to float 32
    img = img.astype('float32')
    # appending the image into the list
    test_img.append(img)

# converting the list to numpy array

test_x = np.array(test_img)
test_x.shape

# defining the target

test_y = test['infected_or_not'].values

#-----------------

# shape of test data

test_x.shape, test_y.shape

# converting the test images into torch format

test_x = test_x.transpose(0,3,1,2)
test_x = torch.from_numpy(test_x)

test_y = test_y.astype(int)
test_y = torch.from_numpy(test_y)

data_test = []
label_test = []

inputs,labels = test_x, test_y

for i in tqdm(range(int(test_x.shape[0]/batch_size))):
    input_data = inputs[i*batch_size:(i+1)*batch_size]
    label_data = labels[i*batch_size:(i+1)*batch_size]
    input_data, label_data = Variable(input_data.cuda()), Variable(label_data.cuda())
    x = model.features(input_data)
    data_test.extend(x.data.cpu().numpy())
    label_test.extend(label_data.data.cpu().numpy())

#--------------------------

# converting the features into torch format

x_test = torch.from_numpy(np.array(data_test))
x_test = x_test.view(x_test.size(0), -1)
y_test = torch.from_numpy(np.array(label_test))

#--------------------------

# prediction for test set

prediction_test = []
target_test = []

permutation = torch.randperm(x_test.size()[0])

for i in tqdm(range(0,x_test.size()[0], batch_size)):
    # indices = permutation[i:i+batch_size]
    indices = range(i, i+batch_size)
    batch_x, batch_y = x_test[indices], y_test[indices]

    if torch.cuda.is_available():
        batch_x, batch_y = batch_x.cuda(), batch_y.cuda()

    with torch.no_grad():
        output = model.classifier(batch_x.cuda())

    softmax = torch.exp(output).cpu()
    prob = list(softmax.numpy())
    predictions = np.argmax(prob, axis=1)
    prediction_test.append(predictions)
    target_test.append(batch_y)

# test accuracy

accuracy_test = []
for i in range(len(prediction_test)):
    np_target_test = target_test[i].cpu().numpy()
    accuracy_test.append(accuracy_score(np_target_test, prediction_test[i]))

print('test accuracy 2: \t', np.average(accuracy_test))

Dear Ptrblck,

I applied:

train_test_split(WholeData1.numpy(), wholetargetArray, train_size=0.7, test_size=0.3, stratify=wholetargetArray)


WholeData1 is the whole balanced dataset (4000 positive and 4000 negative samples), and wholetargetArray is the corresponding array of 8000 targets for that dataset.
The sizes of the resulting portions are correct, and ValidLabel and TrainLabel are proper binary labels, but ValidationData1 and TrainData1 are all zeros!

I'm not sure why the data split should be zero.
Anyway, you could pass only the target array to stratify and use np.arange(len(target)) as the "data" input, to get the split indices as the output.
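A minimal sketch of that idea (using a made-up target array standing in for wholetargetArray) might look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

target = np.array([0] * 4000 + [1] * 4000)   # toy stand-in for wholetargetArray
indices = np.arange(len(target))

train_idx, valid_idx = train_test_split(
    indices, train_size=0.7, test_size=0.3, stratify=target, random_state=0)

# then index the original data and targets with the returned indices, e.g.
# (assuming WholeData1 is the data tensor from the post above):
# TrainData1, ValidationData1 = WholeData1[train_idx], WholeData1[valid_idx]
# TrainLabel, ValidLabel = target[train_idx], target[valid_idx]
```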