Loading .npy files using torchvision

jingjing · November 1, 2018, 12:08am

Dear all,

I am trying to train my own Resnet model using .npy format files.

I am wondering whether there is a function like torchvision.datasets.ImageFolder that can load .npy files from a folder and label these NumPy arrays with their folder names.

albanD · November 1, 2018, 10:44am

Hi,

No such dataset exists at the moment.
I guess you can load the files, convert them to torch tensors, and create a TensorDataset from them.

ptrblck · November 1, 2018, 12:32pm

As an alternative to @albanD’s solution, you could also use DatasetFolder, which is basically the underlying class of ImageFolder.
Using this class you can provide your own file extensions and loader to load the samples.

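import numpy as np
import torch
from torchvision import datasets

def npy_loader(path):
    # Load one .npy file and convert it to a tensor
    sample = torch.from_numpy(np.load(path))
    return sample

dataset = datasets.DatasetFolder(
    root='PATH',
    loader=npy_loader,
    extensions=('.npy',)  # a tuple; some torchvision versions raise a TypeError for a list
)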

If you want to use transformations, you would need to convert the sample tensors to PIL.Images in your loader.
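As a rough illustration of that last point, here is a minimal sketch of a loader that returns a PIL image instead of a tensor. The name npy_pil_loader is made up for this example, and it assumes each .npy file holds a single H x W x 3 array with values in 0..255, which may not match your data.

```python
import os
import tempfile

import numpy as np
from PIL import Image


def npy_pil_loader(path):
    """Load a .npy file and return it as a PIL image.

    Assumption: the file stores one H x W x 3 array with values in 0..255.
    """
    array = np.load(path)
    return Image.fromarray(array.astype(np.uint8))


# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "sample.npy")
    np.save(demo_path, np.zeros((32, 48, 3), dtype=np.uint8))
    image = npy_pil_loader(demo_path)

print(image.size, image.mode)
```

A loader like this could be passed as loader=npy_pil_loader, so that PIL-based transforms receive the type they expect.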

jingjing · November 1, 2018, 8:06pm

Thank you ptrblck! That did work!

Tanya_Boone · May 27, 2020, 4:04pm

Hello ptrblck…I am facing the same problem but I am getting some errors…

RuntimeError: Found 0 files in subfolders of: cifar-100//cifar-100-python//pylearn2_gcn_whitened//train//
Supported extensions are: .npy

so, if I change the code:

path = 'cifar-100/cifar-100-python/pylearn2_gcn_whitened/train/train.npy'
sample_train = torch.from_numpy(np.load(path))
print(len(sample_train))  # 50000

path = 'cifar-100/cifar-100-python/pylearn2_gcn_whitened/test/test.npy'
sample_test = torch.from_numpy(np.load(path))
print(len(sample_test))  # 10000

I can load each of them into a tensor, but I don't know how to turn these tensors into a Dataset.

Note: these datasets were derived from the original CIFAR-100 using the pylearn2 library.

ptrblck · May 28, 2020, 2:27am

The use case in the original question was that instead of images in subfolders, .npy files were located in these subfolders.
Since ImageFolder only looks for image files, I suggested using DatasetFolder with a custom loader.

However, your use case is different, since you have a single .npy file with apparently all images stored inside of it.

In that case, you should stick to @albanD’s solution and use TensorDataset(torch.from_numpy(np.load(path))).
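A minimal sketch of that approach, with labels included, could look like the following. The random arrays stand in for np.load(path), and their shapes and the separate label array are assumptions for illustration only.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins for np.load(path): random data with hypothetical shapes.
images = np.random.rand(20, 3, 32, 32).astype(np.float32)
labels = np.random.randint(0, 100, size=20).astype(np.int64)

# TensorDataset indexes all tensors along the first dimension.
dataset = TensorDataset(torch.from_numpy(images), torch.from_numpy(labels))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

batch_images, batch_labels = next(iter(loader))
print(batch_images.shape, batch_labels.shape)
```

Each item of the dataset is an (image, label) tuple, so the DataLoader yields batched tensors that can go straight into a training loop.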

Tanya_Boone · May 28, 2020, 5:46am

Thanks ptrblck

I solved that by saving them in .pt files and then creating a custom Dataset.

__author__ = 'mangate'

from six.moves import cPickle as pickle
import numpy as np
import os
import fnmatch
import sys
#import matplotlib.pyplot as plt
from pylearn2.datasets.cifar100 import CIFAR100
import torch
"""
This file opens the CIFAR100 data after whitening and ZCA, as produced by the 'process_cifar_100_data' script,
which uses the pylearn2 library.
This file also rearranges the data so it can be fed into a neural net properly.
"""

image_size = 32
num_channels = 3
num_classes = 100
pixel_depth = 255.0

def unpickle(file):
    # Python 3 exposes the C-accelerated pickle implementation as _pickle
    import _pickle as cPickle
    with open(file, 'rb') as fo:
        data = cPickle.load(fo)
    return data

def rearrange_data(data,labels):
    #data = np.cast['float32'](data)
    #data = data / 255.
    #data = data - data.mean(axis=0)
    images = np.ndarray(shape=(len(labels),image_size,image_size,num_channels), dtype=np.float32)
    labels_out = np.zeros(shape=(len(labels),num_classes),dtype=np.float32)
    max = len(labels)
    for i in range(max):
        images[i] = np.reshape(data[i],(3,32,32)).transpose(1,2,0)
        #plt.imshow(images[i])
        #plt.show()
        labels_out[i][labels[i]]=1.0
    return images,labels_out

def process_data(file_name):
    data = unpickle(file_name)
    # images = data['data']
    # labels = data['fine_labels']
    images = np.asarray(data.X)
    labels = np.asarray(data.y)
    images,labels = rearrange_data(images,labels)
    return images,labels

def get_data():
    ROOT_FOLDER ="D:\\Neural_Nets\\ELU_NETWORK\\cifar100\\cifar-100-python\\pylearn2_gcn_whitened\\"
    train_images, train_labels = process_data(ROOT_FOLDER+'train.pkl')
    test_images,test_labels = process_data(ROOT_FOLDER+'test.pkl')
    # ROOT_FOLDER = '/cs/img/mangate/thesis/Cifar-100/cifar-100-python/'
    #
    # train_images, train_labels = process_data(ROOT_FOLDER+'train')
    # test_images,test_labels = process_data(ROOT_FOLDER+'test')
    print('Train data shape is', train_images.shape, 'and labels is', train_labels.shape)
    print('Test data shape is', test_images.shape, 'and labels is', test_labels.shape)
    torch.save(train_images,'train_images.pt')
    torch.save(train_labels,'train_labels.pt')
    torch.save(test_images,'test_images.pt')
    torch.save(test_labels,'test_labels.pt')
    return  train_images, train_labels, test_images,test_labels

get_data()
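
The custom Dataset itself is not shown in the post. A minimal sketch of what it might look like is below; this is not the poster's actual code. ArrayDataset is a hypothetical name, and the sketch assumes the arrays have the layout produced by rearrange_data above (N x H x W x C float32 images and one-hot float32 labels).

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class ArrayDataset(Dataset):
    """Hypothetical dataset wrapping in-memory image and label arrays."""

    def __init__(self, images, one_hot_labels):
        # NHWC -> NCHW, the layout torch.nn.Conv2d expects.
        self.images = torch.from_numpy(images).permute(0, 3, 1, 2).contiguous()
        # One-hot rows -> integer class indices.
        self.targets = torch.from_numpy(one_hot_labels).argmax(dim=1)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.targets[idx]


# Demo with random stand-in data in the same layout as rearrange_data's output.
demo_images = np.random.rand(10, 32, 32, 3).astype(np.float32)
demo_labels = np.eye(100, dtype=np.float32)[np.arange(10)]
demo_dataset = ArrayDataset(demo_images, demo_labels)

image, target = demo_dataset[3]
print(image.shape, target.item())
```

Converting the one-hot labels to class indices is a design choice made here because losses such as nn.CrossEntropyLoss are commonly used with index targets; keep the one-hot form if your loss expects it.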

Muhammad_Saqlain · July 27, 2021, 7:17am

Thank you @ptrblck for providing a solution to this problem. I was facing the same problem and luckily found this post. But when I run this code I get another error. Can you please help me solve it?

TypeError: endswith first arg must be str or a tuple of str, not list

ptrblck · July 27, 2021, 7:26am

Based on the error message it seems you are passing a list to endswith as the first argument, which is not supported, so you would need to pass either a str or a tuple of str.

text = 'lala'
text.endswith('a') # works
text.endswith(('a', 'la')) # works
text.endswith(['a', 'la'])
> TypeError: endswith first arg must be str or a tuple of str, not list

Muhammad_Saqlain · July 27, 2021, 10:29am

Thank you @ptrblck for your quick reply, but I am still confused by the same error. Below is a picture of my code. The training folder contains three subfolders named after the labels, and each subfolder contains various .npy files. Where should I change my code? Do I need to update a library or a folder? How can I load this data?

ptrblck · July 27, 2021, 7:58pm

Use extensions='.npy' or extensions=('.npy',) instead of a list, as the error is raised by DatasetFolder.
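
If the extensions come from somewhere else (e.g. a config file) and might be a list, a small helper can normalize them before they reach str.endswith. This is only a sketch; as_extension_tuple is a made-up name and not part of torchvision.

```python
def as_extension_tuple(extensions):
    """Hypothetical helper: return extensions as a tuple that str.endswith accepts."""
    if isinstance(extensions, str):
        return (extensions,)
    return tuple(extensions)


file_name = "sample_001.npy"
print(file_name.endswith(as_extension_tuple([".npy"])))           # True
print(file_name.endswith(as_extension_tuple(".npy")))             # True
print(file_name.endswith(as_extension_tuple((".npy", ".npz"))))   # True

# Note: parentheses alone do not create a tuple; the trailing comma does.
print(type((".npy")).__name__, type((".npy",)).__name__)          # str tuple
```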

Muhammad_Saqlain · July 28, 2021, 1:50am

Thanks @ptrblck. Yes, now it works.

devsgnr · February 5, 2023, 11:31pm

Hey @ptrblck - I am getting a FileNotFoundError. What am I missing?

ptrblck · February 6, 2023, 1:50am

root should point to the root directory containing subfolders with the actual data for each corresponding class. A FileNotFoundError is raised if e.g. np.load cannot load the specified file.

devsgnr · February 6, 2023, 12:58pm

Thank you very much for the nudge @ptrblck - I checked the documentation and apparently, I have to use the naming convention of class_x and class_y for it to work.

DAdak · July 21, 2023, 10:22am

No, it did not work for me. I made the following changes, which worked:

import os

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader


class NPYDatasetFile(Dataset):
    """@Deba."""

    def __init__(self, root_dir, transform=None):
        """
        Args:
            root_dir (string): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                on a sample.
        """
        self.csv_file = "face_landmarks.csv"  # must exist in root_dir
        self.landmarks_frame = pd.read_csv(os.path.join(root_dir, self.csv_file))
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.landmarks_frame)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()

        # The first CSV column holds the file name of each .npy sample
        img_name = os.path.join(self.root_dir,
                                self.landmarks_frame.iloc[idx, 0])
        image = np.load(img_name)
        # nside must be defined beforehand; this shape is specific to my data
        sample = np.zeros((1, 4 * nside, 3 * nside))
        sample[0] = image

        if self.transform:
            sample = self.transform(sample)

        return sample


# path must point to the directory containing the CSV and the .npy files
transformed_dataset = NPYDatasetFile(root_dir=path)
dataloader = DataLoader(transformed_dataset, batch_size=4,
                        shuffle=True, num_workers=4)

ptrblck · July 21, 2023, 4:19pm

It’s unclear what exactly did not work for you using the DatasetFolder approach using a custom loader explained in my post, as you are not even using this class. Also your code is neither executable nor does it show any errors.