How can I use sklearn's KFold with ImageFolder?

skyunyoo · February 7, 2019, 6:21am

My code is…

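import torch
from sklearn.model_selection import KFold
from torchvision import transforms
from torchvision.datasets import ImageFolder

batch_size = 16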

transform = transforms.Compose([
    transforms.Resize((299, 299)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

dataset = ImageFolder('.data/',transform=transform)

kf = KFold(n_splits=5, shuffle=True)

for i, (train_index, test_index) in enumerate(kf.split(dataset)):

    trainloader = torch.utils.data.DataLoader(train_index, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
    testloader = torch.utils.data.DataLoader(test_index, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
    
    print('Fold : {}, train : {}, test : {}'.format(i+1, len(trainloader.dataset), len(testloader.dataset)))
    
    for batch_idx, (data, target) in enumerate(trainloader):
        print('Train Batch idx : {}, data shape : {}, target shape : {}'.format(batch_idx, data.shape, target.shape))

An error occurred:


Fold : 1, train : 579, test : 145

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-0fe2bfb82b09> in <module>
     16     print('Fold : {}, len train : {}, len test : {}'.format(i+1, len(trainloader.dataset), len(testloader.dataset)))
     17 
---> 18     for batch_idx, (data, target) in enumerate(trainloader):
     19         print('Train Batch idx : {}, data shape : {}, target shape : {}'.format(batch_idx, data.shape, target.shape))

ValueError: too many values to unpack (expected 2)

I don’t know how to handle train_index and test_index.
Could anyone help me?

ptrblck · February 7, 2019, 7:13am

kf.split will return the train and test indices as far as I know.
Currently you are passing these indices to a DataLoader, which will just return a batch of indices.

I think you should pass the train and test indices to a Subset to create new Datasets and pass these to the DataLoaders.
Let me know, if that works for you.
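
To illustrate the difference, here is a minimal sketch. It assumes a small TensorDataset of random tensors as a stand-in for the ImageFolder dataset (the shapes, sizes, and random_state are made up for the example). It compares what a DataLoader yields when given the raw indices with what it yields when given a Subset:

```python
import numpy as np
import torch
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset, TensorDataset

# Stand-in for the ImageFolder dataset: 20 fake "images" with binary labels.
images = torch.randn(20, 3, 8, 8)
labels = torch.arange(20) % 2
dataset = TensorDataset(images, labels)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
# KFold only needs the number of samples, so an index array is enough here.
train_index, test_index = next(iter(kf.split(np.arange(len(dataset)))))

# Raw indices passed to the DataLoader: each batch is just a tensor of indices,
# so it cannot be unpacked into (data, target).
index_loader = DataLoader(train_index, batch_size=4)
index_batch = next(iter(index_loader))
print(index_batch.shape)

# Indices wrapped in a Subset: each batch is a (data, target) pair.
train = Subset(dataset, train_index.tolist())
train_loader = DataLoader(train, batch_size=4, shuffle=True)
data, target = next(iter(train_loader))
print(data.shape, target.shape)
```

With batch_size = 16, as in the original code, the first loader would yield 16 indices per batch, which is consistent with the "too many values to unpack" error.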

skyunyoo · February 7, 2019, 7:52am

The code I modified is as follows:

for i, (train_index, test_index) in enumerate(kf.split(dataset)):
    
    train = torch.utils.data.Subset(dataset, train_index)
    test = torch.utils.data.Subset(dataset, test_index)

    trainloader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)
    testloader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=False)

Thank you for the reply. It works well!

skyunyoo · February 7, 2019, 8:16am

The labels don’t match the pictures.
My labels are ‘0’ and ‘1’, but the output is always ‘0’.
I must have missed something in the code…
Could you help me one more time?

ptrblck · February 7, 2019, 8:18am

Is your model only predicting class 0?
If so, could you create a new thread and tag me in it?
Also, could you provide some information regarding your model, training, class distribution etc. in the new thread?

skyunyoo · February 7, 2019, 8:33am

I haven’t run all of the code yet.
I only ran ‘imshow’:

import numpy as np
import matplotlib.pyplot as plt

def imshow(inp, img_num):
    inp = inp.numpy().transpose((1, 2, 0))
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    inp = std * inp + mean
    inp = np.clip(inp, 0, 1)
    print('label : {}'.format(train[img_num][1]))
    plt.imshow(inp)

imshow(train[40][0], 40)

But the output shows a label that doesn’t match the picture.

ptrblck · February 7, 2019, 8:37am

The index you are passing to your Dataset doesn’t correspond to the target for the sample.
I’m not sure, if I understand your code correctly, but you can get the target using target = train[40][1].

skyunyoo · February 7, 2019, 8:49am

I get the target with
print('label : {}'.format(train[img_num][1]))
but, as you mentioned, the target doesn’t correspond to the sample.
How could I modify the code so that the target corresponds to the sample?

ptrblck · February 7, 2019, 8:54am

ImageFolder will create the labels based on the passed folders, so if the labels are wrong, they should already be mismatched in the original Dataset before the splitting. Could you check that and see if some images might be stored in the wrong folder?

skyunyoo · February 7, 2019, 9:15am

data/
                melanoma/
                    AM(1).jpg
                    AM(2).jpg
                    AM(3).jpg
                    ...
                benign/
                    BN(1).jpg
                    BN(2).jpg
                    BN(3).jpg
                    ....

All images are stored in the right folders…
And:

print(dataset.class_to_idx)

{'benign': 0, 'melanoma': 1}

I need to think about what the problem is.

ptrblck · February 7, 2019, 11:06am

If your original Dataset is fine, the Subsets shouldn’t be changed, since only the passed indices are called in this line of code.
Could you check which index in your subset is giving the wrong class and then try to get the original index based on this information? Maybe this helps debugging which file seems to be wrong.
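
A rough sketch of that lookup, assuming a toy TensorDataset in place of the ImageFolder and a made-up list of training indices: a Subset keeps the indices it was built from in its .indices attribute, so a position in the subset can be mapped back to the original dataset.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Stand-in dataset whose "data" is simply each sample's original position.
data = torch.arange(10)
targets = torch.tensor([0] * 6 + [1] * 4)
dataset = TensorDataset(data, targets)

# Hypothetical shuffled training indices, as a KFold split might return them.
train_index = [7, 2, 9, 0, 5, 3, 8, 1]
train = Subset(dataset, train_index)

subset_idx = 2
orig_idx = train.indices[subset_idx]  # position in the original dataset

# The Subset returns exactly what the original dataset holds at orig_idx.
sub_sample, sub_target = train[subset_idx]
orig_sample, orig_target = dataset[orig_idx]
print(subset_idx, orig_idx, int(sub_target), int(orig_target))

# With an ImageFolder as the original dataset, dataset.samples[orig_idx]
# would give the (file path, class index) pair for that sample.
```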

Alternatively, you could write your own custom Dataset and return the image names along the data and target, so that debugging might be a bit easier.

skyunyoo · February 7, 2019, 1:11pm

Oh… your first reply worked well after all…
I printed the class index of every sample, and it showed that the class ‘1’ samples are all gathered at the end of the Dataset’s index range.
I’m sorry for wasting your time.
Thanks a lot for your detailed replies!

vishnu_vardhan1 · August 26, 2020, 2:22pm

Hi @skyunyoo, @ptrblck
If we want to use StratifiedKFold with ImageFolder, we have to pass the labels to split as well,
so it will be something like this: kf.split(dataset, y)
But how can we get the labels for y from the ImageFolder?

ptrblck · August 26, 2020, 6:16pm

ImageFolder contains the labels in its .targets attribute.
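
A minimal sketch of how that could look, assuming a TensorDataset stand-in where a plain list named targets plays the role of the ImageFolder's label list (class sizes and random_state are made up for the example):

```python
import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold
from torch.utils.data import Subset, TensorDataset

# Stand-in for an ImageFolder: samples sorted by class, as ImageFolder stores them.
targets = [0] * 30 + [1] * 10  # plays the role of the dataset's label list
dataset = TensorDataset(torch.randn(40, 3, 8, 8), torch.tensor(targets))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
# X is only used for its length here; y drives the stratification.
for fold, (train_index, test_index) in enumerate(skf.split(np.zeros(len(targets)), targets)):
    train = Subset(dataset, train_index.tolist())
    test = Subset(dataset, test_index.tolist())
    # Count how many samples of each class landed in this test fold.
    counts = np.bincount(np.asarray(targets)[test_index], minlength=2)
    fold_counts.append(counts.tolist())
    print("Fold {}: train={}, test={}, test class counts={}".format(
        fold + 1, len(train), len(test), counts.tolist()))
```

The Subsets can then be passed to DataLoaders exactly as in the KFold version earlier in the thread.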