Input numpy ndarray instead of images in a CNN

carioka_88 · May 28, 2018, 4:47pm

Hello,

I am fairly new to PyTorch.

I would like to run my CNN with some ordered datasets that I have.
I have n-dimensional arrays, and I would like to use them as the input dataset.
Is there any way to pass it with torch.utils.data.DataLoader?
Or how can I transform the n-dimensional array into a DataLoader object?

For example, right now I have something like this for images:

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                             shuffle=True, num_workers=4)
               for x in ['train', 'val']}

But what I have is something like:

>>> x
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],
       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],
       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]],
        .......
        ])

In the end, I would like to treat this n-dimensional array like the pixel values of one image.
Is there a way to pass these values to the CNN the same way the dataloaders do?

ptrblck · May 28, 2018, 5:16pm

You could create a Dataset, and load and transform your arrays there.
Here is a small example:

import torch
from torch.utils.data import Dataset, DataLoader

import numpy as np


class MyDataset(Dataset):
    def __init__(self, data, target, transform=None):
        self.data = torch.from_numpy(data).float()
        self.target = torch.from_numpy(target).long()
        self.transform = transform
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        
        if self.transform:
            x = self.transform(x)
        
        return x, y
    
    def __len__(self):
        return len(self.data)


numpy_data = np.random.randn(100, 3, 24, 24)
numpy_target = np.random.randint(0, 5, size=(100))

dataset = MyDataset(numpy_data, numpy_target)
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    num_workers=2,
    pin_memory=torch.cuda.is_available()
)

for batch_idx, (data, target) in enumerate(loader):
    print('Batch idx {}, data shape {}, target shape {}'.format(
        batch_idx, data.shape, target.shape))
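
For an array like the one in the question, where the samples have no channel axis, a minimal sketch might look like the following. It assumes the first axis indexes the samples and that each sample is a single-channel 3 x 3 grid; the names samples and labels, the label values, and the small conv layer are made up for illustration. It uses the built-in TensorDataset in place of a custom class.

```python
import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical data: 3 samples, each a 3 x 3 grid (same layout as the array in the question)
samples = np.arange(27).reshape(3, 3, 3)
labels = np.array([0, 1, 0])  # made-up class labels

# Conv2d expects (batch, channels, height, width), so add a channel axis of size 1
data = torch.from_numpy(samples).float().unsqueeze(1)  # shape (3, 1, 3, 3)
target = torch.from_numpy(labels).long()

loader = DataLoader(TensorDataset(data, target), batch_size=2, shuffle=False)

# Tiny illustrative layer; padding=1 keeps the 3 x 3 spatial size
conv = nn.Conv2d(in_channels=1, out_channels=4, kernel_size=3, padding=1)

output_shapes = []
for batch, batch_target in loader:
    out = conv(batch)
    output_shapes.append(tuple(out.shape))
    print(batch.shape, out.shape, batch_target.shape)
```

If the array is instead meant to be one multi-channel image, the channel axis already exists and only a batch axis would need to be added.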

jain26iitd · November 21, 2018, 12:13pm

@ptrblck Thanks for the code.
I used the code as below to create a custom dataloader.

# Creating custom Dataset classes

class MyDataset(Dataset):
    def __init__(self, data, target, transform=None):
        self.data = torch.from_numpy(data).float()
        self.target = torch.from_numpy(target).long()
        self.transform = transform
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        
        if self.transform:
            x = self.transform(x)
            
        return x, y
    
    def __len__(self):
        return len(self.data)

numpy_data = np.random.randn(100,3,224,224) # 100 samples, each 3 x 224 x 224 (channels x height x width)
numpy_target = np.random.randint(0,5,size=(100))

dataset = MyDataset(numpy_data, numpy_target)
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=2, pin_memory=False)  # Running on CPU

Up to this point everything works.
The part of the code below takes far too long to run.
Any help with this?

for batch_idx, (data, target) in enumerate(loader):
    print('Batch idx {}, data shape {}, target shape {}'.format(batch_idx, data.shape, target.shape))

ptrblck · November 21, 2018, 1:37pm

How long does your code take?
It runs basically immediately on my machine.

jain26iitd · November 21, 2018, 2:33pm

@ptrblck

I waited for around 20 minutes with no luck!

ptrblck · November 21, 2018, 2:41pm

Try to set num_workers=0 and run it again.
If that’s working, could you post your PyTorch version, OS, and how you’ve installed PyTorch?

jain26iitd · November 21, 2018, 3:00pm

Thanks @ptrblck

It worked!

OS : Windows 10 (64 bit)
PyTorch Version : 0.4.1
Python : 3.6.6

PyTorch installed using

pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-win_amd64.whl
pip3 install torchvision

ptrblck · November 21, 2018, 3:03pm

Thanks for the info.
Since you are using Windows, could you add the if-clause protection (if __name__ == '__main__':) as described here?

Let me know if this approach works with multiple workers.
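
A minimal sketch of what the if-clause protection could look like for the example above. The names ArrayDataset and count_batches are hypothetical, and the array sizes are shrunk to keep it quick. The idea is that the code creating and iterating the DataLoader only runs in the main process, since worker processes on Windows import the script again when they start.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class ArrayDataset(Dataset):  # hypothetical name; same idea as MyDataset above
    def __init__(self, data, target):
        self.data = torch.from_numpy(data).float()
        self.target = torch.from_numpy(target).long()

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)


def count_batches(num_workers):
    # Small random stand-in data: 20 samples of shape 3 x 8 x 8
    data = np.random.randn(20, 3, 8, 8)
    target = np.random.randint(0, 5, size=20)
    loader = DataLoader(ArrayDataset(data, target), batch_size=5,
                        shuffle=True, num_workers=num_workers)
    return sum(1 for _ in loader)


if __name__ == '__main__':
    # Only the main process reaches this point; worker processes that
    # re-import the file skip it and do not start loaders of their own.
    print(count_batches(num_workers=2))
```

The class and function definitions stay at module level so that the workers can import them.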

NIRVANALAN · April 7, 2019, 3:06pm

Hey. I ran into the same problem when loading attribute labels from a large (1 GB+) numpy array saved as .npy on the server. It seems that I cannot load them using multiple workers (num_workers>0 does not work; the code just hangs there, waiting). When I set num_workers=0, every 100 iterations take 100 seconds for loading and preprocessing the data and 50 seconds for training (forward and backward step), which is really slow compared to not loading those numpy labels. I am stuck on this problem and eagerly looking for solutions. Could you give me some advice?

Sincerely.

ptrblck · April 7, 2019, 3:12pm

Are you loading this large file once in your __init__?
If so, num_workers=0 should pre-load the data once and just slice it in __getitem__.
Using multiple workers might not be the best idea, as each worker will load the whole dataset, while the __getitem__ should be quite fast compared to the __init__.

Could you post the Dataset implementation?
Maybe using shared arrays might work better?

NIRVANALAN · April 7, 2019, 3:22pm

Thanks for replying so quickly.

My implementation is as follows:

	def __init__(self, image_list_path, height, width, attribute_list_path=None, use_attribute=False, root=None,
	             transform=None):
		self.height = height
		self.width = width
		self.root = root
		self.transform = transform
		self.image_list = []
		self.attribute_list_path = attribute_list_path
		self.num_classes = 0
		with open(image_list_path, 'r') as f:
			for line in f:
				x, y = line.split()
				self.image_list.append((x, int(y)))
				self.num_classes = max(self.num_classes, int(y) + 1)
		self.attribute_list = list(np.load(attribute_list_path))  # attribute numpy array
	
	def __len__(self):
		return len(self.image_list)
	
	def __getitem__(self, index):
		filename = self.image_list[index][0]
		if self.root is not None:
			filename = osp.join(self.root, self.image_list[index][0])
		if not osp.exists(filename):
			raise IOError('File does not exist: {}'.format(filename))
		
		try:
			img = pil_loader(filename)
		except Exception:
			raise IOError('File error: {}'.format(filename))
		img = cv2.resize(img, (self.width, self.height))
		img = torchvision.transforms.ToPILImage()(img)
		
		if self.transform is not None:
			img = self.transform(img)


		self.attribute_list[index] = dict(sorted(self.attribute_list[index].items(), key=lambda x:x[0]))
		value_list = list(self.attribute_list[index].values())  # does not support dict_values
		attr_labels = [list(v.values()) for v in value_list]

		return img, attr_labels, self.image_list[index][1]  # image label ID
	
	# else:
	# 	return img, self.image_list[index][1]  # image & ID
	
	def get_num_classes(self):
		return self.num_classes

NIRVANALAN · April 7, 2019, 3:24pm

In general, I want to load attribute labels from a large numpy file. This file is loaded in __init__ and the labels are returned in __getitem__. I wonder if multiple workers are supported in this situation…

ptrblck · April 7, 2019, 3:28pm

The code for loading the image paths looks alright, although you could also pre-create the lists and just pass them to your Dataset instead of re-creating them in the __init__.
The same applies to attribute_list_path.
Note that, if you are using multiple workers, the Dataset will be re-created for each epoch, so each worker will reload the large numpy array.
Just preload it outside of your Dataset and pass it as an argument.
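
A minimal sketch of the preloading idea, assuming a plain numeric array rather than the array of dicts used above. The class name AttributeDataset, the temporary file path, and the array contents are stand-ins for illustration only.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class AttributeDataset(Dataset):  # hypothetical name
    def __init__(self, attributes):
        # Receives the already-loaded array; no file I/O happens here
        self.attributes = torch.from_numpy(attributes).float()

    def __getitem__(self, index):
        # Only slices the preloaded data
        return self.attributes[index]

    def __len__(self):
        return len(self.attributes)


# Stand-in for the large .npy file (hypothetical path and contents)
path = os.path.join(tempfile.mkdtemp(), 'attributes.npy')
np.save(path, np.random.rand(50, 16))

# Load once, outside the Dataset, then pass the array in as an argument
attributes = np.load(path)
dataset = AttributeDataset(attributes)
loader = DataLoader(dataset, batch_size=10, num_workers=0)

batch_shapes = [tuple(batch.shape) for batch in loader]
print(batch_shapes)
```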

NIRVANALAN · April 7, 2019, 3:37pm

Thanks! That is really helpful advice and I will try it soon. By the way, do you mean that my code gets stuck because it loads too much data into memory? Will multiple workers help if I preload the numpy array and just pass it as an argument to my Dataset? And will it cause a multiprocessing resource contention problem if several workers want to read the same data (by reference)?

Your reply is really helpful, thanks again.

ptrblck · April 7, 2019, 3:41pm

I guess your code might just hang if the loading takes that long or if you run out of memory.
However, I might be mistaken: the pre-loaded dataset might also be copied when using multiple workers.
The shared array approach might speed things up.
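
One possible reading of the shared array approach is to move the preloaded tensor into shared memory with Tensor.share_memory_() before handing it to the Dataset. The sketch below is only that, a sketch: the name SharedTensorDataset and the array are made up, and whether this avoids copies in practice depends on the platform and on how the worker processes are started.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader


class SharedTensorDataset(Dataset):  # hypothetical name
    def __init__(self, shared_data):
        # Keeps a reference to the tensor; does not copy it
        self.data = shared_data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


# Stand-in for the large preloaded array
array = np.random.rand(40, 8).astype(np.float32)

# Copy into a tensor that owns its storage, then move that storage to shared memory
shared = torch.from_numpy(array).clone()
shared.share_memory_()

dataset = SharedTensorDataset(shared)
loader = DataLoader(dataset, batch_size=8, num_workers=0)

num_batches = sum(1 for _ in loader)
print(shared.is_shared(), num_batches)
```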

NIRVANALAN · April 7, 2019, 3:49pm

Thanks. I will try the pre-loading method and the shared array approach soon, and let you know which approach works better.

Thanks again for your quick and helpful reply.

Rui_Zhang · August 28, 2019, 10:44pm

Hi,

As I mentioned in another topic, I found that this implementation lacks vectorisation. When one retrieves data from the loader, MyDataset.__getitem__ is called millions of times. This becomes a bottleneck for my training on the GPU. In Keras, we know that a larger batch_size reduces the training time; here, however, batch_size has only a small effect on the training time because of the loop over the training points. Is there any suggestion to avoid this?

ptrblck · August 28, 2019, 10:57pm

Each worker in your DataLoader will create the next batch in the background by calling __getitem__ for each sample in that batch.
I’m not sure if there is a library that loads e.g. images in batches.
As explained in the other topic: using multiple workers might speed up your data loading, if your hard disk is sufficiently fast.
Have a look at this post for more information.
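
For data that is already held in memory as a tensor, one way to avoid the per-sample calls might be to let the sampler hand a whole list of indices to __getitem__. The sketch below assumes a PyTorch version in which DataLoader accepts batch_size=None (automatic batching disabled); the class name BatchedDataset is hypothetical.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader, BatchSampler, RandomSampler


class BatchedDataset(Dataset):  # hypothetical name
    def __init__(self, data, target):
        self.data = torch.from_numpy(data).float()
        self.target = torch.from_numpy(target).long()

    def __getitem__(self, indices):
        # indices is a list of ints, so one call slices out a whole batch
        return self.data[indices], self.target[indices]

    def __len__(self):
        return len(self.data)


data = np.random.randn(100, 3, 24, 24)
target = np.random.randint(0, 5, size=100)
dataset = BatchedDataset(data, target)

# The BatchSampler yields lists of 10 shuffled indices; batch_size=None tells
# the DataLoader to pass each list to __getitem__ without collating samples
sampler = BatchSampler(RandomSampler(dataset), batch_size=10, drop_last=False)
loader = DataLoader(dataset, sampler=sampler, batch_size=None)

shapes = [(tuple(x.shape), tuple(y.shape)) for x, y in loader]
print(len(shapes), shapes[0])
```

With this setup __getitem__ is called once per batch instead of once per sample, so per-sample transforms would have to be rewritten to work on a whole batch.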

Caroline_Barcelos · April 28, 2020, 11:46am

Hi,

I’m using the proposed custom dataset class, but I ran into problems when applying a transform to the converted numpy array.
The transform I’m trying to apply is:

transforms.Compose([
transforms.ToPILImage(),
transforms.Resize((224, 224)),
transforms.ToTensor(),
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

The error I get is “RuntimeError: output with shape [1, 224, 224] doesn’t match the broadcast shape [3, 224, 224]”
Could you give me some advice on how my transform should look in order to apply it to the converted numpy array?

axki · April 28, 2020, 12:58pm

Hi!

Comparing the transform you defined with the error you got, it seems that your input has only one channel while your transform expects three channels (looking at the normalization part).

You should change your normalization so that it has only one value for the mean and one for the standard deviation. Then the transform should be valid and you can apply it to your converted numpy array.