Input numpy ndarray instead of images in a CNN

Hello,

I am fairly new to PyTorch.

I would like to run my CNN with some ordered datasets that I have.
I have n-dimensional arrays, and I would like to pass them as the input dataset.
Is there any way to pass it with torch.utils.data.DataLoader?
Or how can I transform the n-dimensional array into a DataLoader object?

For example, right now I have something like this for images:

image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}
dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=4,
                                             shuffle=True, num_workers=4)
               for x in ['train', 'val']}

But what I have is something like:

>>> x
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],
       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],
       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]],
        .......
        ])

In the end, I would like to treat this n-dimensional array like the pixel values of one image.
Is there a way to pass these values to the CNN through a DataLoader, as with the images?

You could create a Dataset, and load and transform your arrays there.
Here is a small example:

import torch
from torch.utils.data import Dataset, DataLoader

import numpy as np


class MyDataset(Dataset):
    def __init__(self, data, target, transform=None):
        self.data = torch.from_numpy(data).float()
        self.target = torch.from_numpy(target).long()
        self.transform = transform
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        
        if self.transform:
            x = self.transform(x)
        
        return x, y
    
    def __len__(self):
        return len(self.data)


numpy_data = np.random.randn(100, 3, 24, 24)
numpy_target = np.random.randint(0, 5, size=(100))

dataset = MyDataset(numpy_data, numpy_target)
loader = DataLoader(
    dataset,
    batch_size=10,
    shuffle=True,
    num_workers=2,
    pin_memory=torch.cuda.is_available()
)

for batch_idx, (data, target) in enumerate(loader):
    print('Batch idx {}, data shape {}, target shape {}'.format(
        batch_idx, data.shape, target.shape))
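
If no per-sample transform is needed, the built-in torch.utils.data.TensorDataset may be enough, so a custom class is not strictly required. The following is only a minimal sketch under that assumption; the shapes are the same illustrative ones as above, and the names tensor_dataset and tensor_loader are made up for this example.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Same illustrative shapes as above: 100 samples, 3 channels, 24 x 24 values
numpy_data = np.random.randn(100, 3, 24, 24)
numpy_target = np.random.randint(0, 5, size=100)

# TensorDataset indexes every tensor it holds along the first dimension
tensor_dataset = TensorDataset(
    torch.from_numpy(numpy_data).float(),
    torch.from_numpy(numpy_target).long(),
)
tensor_loader = DataLoader(tensor_dataset, batch_size=10, shuffle=True)

# Fetch a single batch to check shapes and dtypes
first_data, first_target = next(iter(tensor_loader))
print(first_data.shape, first_target.shape)
```

The custom Dataset above remains the better fit as soon as a transform has to be applied to each sample.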

@ptrblck Thanks for the code.
I used the code as below to create a custom dataloader.

# Creating custom Dataset classes

class MyDataset(Dataset):
    def __init__(self, data, target, transform=None):
        self.data = torch.from_numpy(data).float()
        self.target = torch.from_numpy(target).long()
        self.transform = transform
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        
        if self.transform:
            x = self.transform(x)
            
        return x, y
    
    def __len__(self):
        return len(self.data)

numpy_data = np.random.randn(100,3,224,224) # 100 samples, each of shape 3 x 224 x 224 (channels x height x width)
numpy_target = np.random.randint(0,5,size=(100))

dataset = MyDataset(numpy_data, numpy_target)
loader = DataLoader(dataset, batch_size=1, shuffle=True, num_workers=2, pin_memory=False)  # Running on CPU

Up to this point everything is fine.
The part of the code below, however, takes too long to run.
Any help on this?

for batch_idx, (data, target) in enumerate(loader):
    print('Batch idx {}, data shape {}, target shape {}'.format(batch_idx, data.shape, target.shape))

How long does your code take?
It’s running on my machine basically immediately.

@ptrblck

I waited for around 20 minutes with no luck!

Try to set num_workers=0 and run it again.
If that’s working, could you post your PyTorch version, OS, and how you’ve installed PyTorch?

Thanks @ptrblck

It worked!

OS : Windows 10 (64 bit)
PyTorch Version : 0.4.1
Python : 3.6.6

PyTorch installed using

pip3 install http://download.pytorch.org/whl/cpu/torch-0.4.1-cp36-cp36m-win_amd64.whl
pip3 install torchvision

Thanks for the info.
Since you are using Windows, could you add the if-clause protection (wrapping the main code in if __name__ == '__main__':) as described here?

Let me know if this approach works with multiple workers.
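
For reference, the if-clause protection could look roughly like the sketch below. It is not the exact code from the linked page: make_loader and main are hypothetical helper names, and TensorDataset stands in for MyDataset only to keep the example short.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(num_workers):
    # Random stand-in data with the same shapes as the example above
    data = torch.from_numpy(np.random.randn(100, 3, 24, 24)).float()
    target = torch.from_numpy(np.random.randint(0, 5, size=100)).long()
    return DataLoader(TensorDataset(data, target), batch_size=10,
                      shuffle=True, num_workers=num_workers)


def main():
    loader = make_loader(num_workers=2)
    for batch_idx, (data, target) in enumerate(loader):
        print('Batch idx {}, data shape {}, target shape {}'.format(
            batch_idx, data.shape, target.shape))


if __name__ == '__main__':
    # On Windows, worker processes are started by re-importing this script.
    # The guard keeps the loop above from being executed again in each worker.
    main()
```

Everything that starts the iteration lives inside main(), so importing the file has no side effects.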


Hey. I ran into the same problem when loading attribute labels from a large (1 GB+) numpy array saved as .npy on the server. It seems that I cannot load them using multiple workers: with num_workers>0 the code just hangs there, waiting. When I set num_workers=0, every 100 iterations cost 100 seconds for loading and preprocessing the data and 50 seconds for training (forward and backward step), which is really slow compared to not loading those numpy labels. I am stuck on this problem and eagerly looking for solutions. Could you give me some advice?

Sincerely.

Are you loading this large file once in your __init__?
If so, num_workers=0 should pre-load the data once and just slice it in __getitem__.
Using multiple workers might not be the best idea, as each worker will load the whole dataset, while the __getitem__ should be quite fast compared to the __init__.

Could you post the Dataset implementation?
Maybe using shared arrays might work better?

Thanks for replying so quickly :grin:

My implementation is as follows:

	def __init__(self, image_list_path, height, width, attribute_list_path=None, use_attribute=False, root=None,
	             transform=None):
		self.height = height
		self.width = width
		self.root = root
		self.transform = transform
		self.image_list = []
		self.attribute_list_path = attribute_list_path
		self.num_classes = 0
		with open(image_list_path, 'r') as f:
			for line in f:
				x, y = line.split()
				self.image_list.append((x, int(y)))
				self.num_classes = max(self.num_classes, int(y) + 1)
		self.attribute_list = list(np.load(attribute_list_path))  # attribute numpy array
	
	def __len__(self):
		return len(self.image_list)
	
	def __getitem__(self, index):
		filename = self.image_list[index][0]
		if self.root is not None:
			filename = osp.join(self.root, self.image_list[index][0])
		if not osp.exists(filename):
			raise IOError('File does not exist: {}'.format(filename))
		
		try:
			img = pil_loader(filename)
		except Exception:
			raise IOError('File error: {}'.format(filename))
		img = cv2.resize(img, (self.width, self.height))
		img = torchvision.transforms.ToPILImage()(img)
		
		if self.transform is not None:
			img = self.transform(img)


		self.attribute_list[index] = dict(sorted(self.attribute_list[index].items(), key=lambda x:x[0]))
		value_list = list(self.attribute_list[index].values())  # does not support dict_values
		attr_labels = [list(v.values()) for v in value_list]

		return img, attr_labels, self.image_list[index][1]  # image label ID
	
	# else:
	# 	return img, self.image_list[index][1]  # image & ID
	
	def get_num_classes(self):
		return self.num_classes

In general, I want to load the attribute labels from a large numpy file. This file is loaded in __init__ and the labels are returned in __getitem__. I wonder if multiple workers are supported in this situation…

The code for loading the image paths looks alright, although you could also pre-create the lists and just pass them to your Dataset instead of re-creating them in the __init__.
The same applies to attribute_list_path.
Note that, if you are using multiple workers, the Dataset will be re-created in each worker for every epoch, so that each worker will reload the large numpy array.
Just preload it outside of your Dataset and pass it as an argument to it.
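
A minimal sketch of that preloading idea follows. It makes several assumptions: AttributeDataset, the temporary file, and the (100, 8) array shape are made up for illustration, and the attributes are plain numeric arrays rather than the dicts used in the implementation above.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class AttributeDataset(Dataset):  # hypothetical name
    def __init__(self, attributes, labels):
        # The arrays arrive already loaded; nothing is read from disk here
        self.attributes = attributes
        self.labels = labels

    def __getitem__(self, index):
        # Only cheap slicing and conversion happen per sample
        x = torch.from_numpy(self.attributes[index]).float()
        y = int(self.labels[index])
        return x, y

    def __len__(self):
        return len(self.labels)


# Stand-in for the large .npy file on the server (hypothetical path and shape)
tmp_dir = tempfile.mkdtemp()
attribute_path = os.path.join(tmp_dir, 'attributes.npy')
np.save(attribute_path, np.random.randn(100, 8))

# Load once, outside the Dataset, then pass the arrays in as arguments
attributes = np.load(attribute_path)
labels = np.random.randint(0, 5, size=100)
dataset = AttributeDataset(attributes, labels)
```

With this layout the expensive np.load call runs once in the main script, and __init__ only stores references.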

Thanks! That is really helpful advice and I will try it soon. By the way, do you mean that my code gets stuck because it loads too much data into memory? Will multiple workers help if I preload the numpy array and just pass it as an argument to my Dataset? And will it cause a multi-processing resource competition problem if several workers want to read the same data (by reference)?

Your reply is really helpful and thanks again here :laughing:

I guess your code might just hang if the loading takes that long or if you run out of memory.
However, I think I might be mistaken, and the pre-loaded dataset might also be copied when using multiple workers.
The shared array approach might speed things up.
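
One possible reading of the shared array approach is to keep the data in a tensor whose storage lives in shared memory, via Tensor.share_memory_(). The sketch below is only an assumption about what that could look like, not a tested fix for the hang; SharedTensorDataset is a hypothetical name and the shapes are illustrative.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class SharedTensorDataset(Dataset):  # hypothetical name
    def __init__(self, data, target):
        # torch.tensor copies the numpy data; share_memory_() then moves
        # the tensor's storage into shared memory
        self.data = torch.tensor(data, dtype=torch.float32).share_memory_()
        self.target = torch.tensor(target, dtype=torch.long).share_memory_()

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)


numpy_data = np.random.randn(100, 3, 24, 24)
numpy_target = np.random.randint(0, 5, size=100)
shared_dataset = SharedTensorDataset(numpy_data, numpy_target)
print(shared_dataset.data.is_shared())
```

Whether this actually avoids per-worker copies depends on the platform and on how worker processes are started, so it is worth measuring memory usage before relying on it.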

Thanks. I will try the pre-loading method and the shared array approach soon, and let you know which one works better.

Thanks again for your ardent and helpful reply. :smile:


Hi,

As I linked in another ticket, I found that this implementation lacks vectorisation. When one retrieves data from the loader, MyDataset.__getitem__ will be called millions of times. This becomes a bottleneck for my training on GPU. In Keras, we know that a larger batch_size will reduce the training time; here, however, batch_size has only a small effect on the training time because of the loop over the training points. Is there any suggestion to avoid this?

Each worker in your DataLoader will create the next batch in the background by calling __getitem__ to load the corresponding sample.
I’m not sure if there is a library to load e.g. images in a batched way.
As explained in the other topic: using multiple workers might speed up your data loading, if your hard disk is sufficiently fast.
Have a look at this post for more information.
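
If the whole dataset already fits in memory as tensors, one option that was not discussed above is to skip the per-sample __getitem__ calls and slice whole batches directly. The sketch below assumes exactly that; iterate_batches is a hypothetical helper, and it gives up DataLoader features such as worker processes and per-sample transforms.

```python
import torch


def iterate_batches(data, target, batch_size, shuffle=True):
    # Draw one permutation per pass, then index a whole batch at once
    n = data.shape[0]
    order = torch.randperm(n) if shuffle else torch.arange(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield data[idx], target[idx]


# Illustrative in-memory data: 100 samples of shape 3 x 24 x 24
data = torch.randn(100, 3, 24, 24)
target = torch.randint(0, 5, (100,))

batches = list(iterate_batches(data, target, batch_size=32))
for batch_idx, (x, y) in enumerate(batches):
    print('Batch idx {}, data shape {}, target shape {}'.format(
        batch_idx, x.shape, y.shape))
```

Each batch here costs a single indexing operation, so the Python-level loop runs once per batch rather than once per sample.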