Is a NumPy array a Dataset?

Can someone explain how this construct works without problems, when the PyTorch documentation suggests we need a TensorDataset to feed a DataLoader?

import numpy as np
from torch.utils.data import DataLoader

x1 = np.array([1, 2, 3])
d1 = DataLoader(x1, batch_size=3)

x1 = np.array([1,2,3]) isn’t a Dataset as properly defined by PyTorch. Actually, Dataset is just a very simple abstract class (pure Python).
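For reference, Dataset is roughly the following (a simplified sketch from memory; the exact source varies between PyTorch versions):

from torch.utils.data import ConcatDataset

class Dataset:
    # Subclasses are expected to override __getitem__ (and usually __len__).
    def __getitem__(self, index):
        raise NotImplementedError
    # Adding two Datasets concatenates them into a ConcatDataset.
    def __add__(self, other):
        return ConcatDataset([self, other])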

Indeed, the snippet below works as expected, i.e., it will sample correctly:

import torch
import numpy as np
from torch.utils.data import DataLoader

x = np.arange(6)
d = DataLoader(x, batch_size=2)
for e in d:
    print(e)  # tensor([0, 1]), tensor([2, 3]), tensor([4, 5])

It works mainly because the methods __len__ and __getitem__ are well defined for numpy arrays. The __add__ method is implemented too, but it performs elementwise addition rather than concatenation. Maybe the snippet below will make it clear:

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self):
        return len(self.array)
    def __getitem__(self, i):
        return self.array[i]

x = np.arange(3)
y = np.arange(3, 6)

data_x = NpDataset(x)
data_y = NpDataset(y)

# numpy's __add__ adds elementwise:
for e in DataLoader(x + y, batch_size=1):
    print(e)  # prints 3, 5, 7

# Dataset's __add__ concatenates into a ConcatDataset:
for e in DataLoader(data_x + data_y, batch_size=1):
    print(e)  # prints 0, 1, 2, 3, 4, 5


It’s nice when you can share these goodies with your friends. 🙂

import torch
import numpy as np
from torch.utils.data import DataLoader

n1 = np.array([1, 2, 3])
d1 = DataLoader(n1, batch_size=3)
t1 = torch.Tensor(n1)
d2 = DataLoader(t1, batch_size=3)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))

<torch.utils.data.dataloader.DataLoader object at 0x0000024C52514588>
[1 2 3]
<class 'numpy.ndarray'>
<torch.utils.data.dataloader.DataLoader object at 0x0000024C52515A20>
tensor([1., 2., 3.])
<class 'torch.Tensor'>

As you can see, when we feed the DataLoader with a numpy array, the dataset inside it stays a numpy array; in the second case, it stays a torch.Tensor.

As far as I can see, no internal conversion is made, and I would say it is better to feed it with Tensors, because that way we can rely on CUDA.
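For what it's worth, here is a minimal sketch of what I mean by relying on CUDA (the device-selection code is my own illustration): since the batches are already Tensors, each one can be moved to the GPU directly.

import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

t1 = torch.Tensor([1, 2, 3])
d2 = DataLoader(t1, batch_size=3)

for batch in d2:
    batch = batch.to(device)  # each batch is a Tensor, so .to() just works
    print(batch.device)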
Any comments?


Update

Further checking shows that the fed element needs to be a sequence: any kind of sequence with numeric values inside. There is no container type checking, so the DataLoader can even be constructed from a single int 0, or a list [0, 1], …

import torch
import numpy as np
from torch.utils.data import DataLoader

n1 = 0  # also tried: [0, 1], np.array([]), np.array([1, 2, 3, ...])
d1 = DataLoader(n1, batch_size=3)
t1 = torch.Tensor(n1)
d2 = DataLoader(t1, batch_size=3)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
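Note that, as far as I can tell, the missing type check only covers constructing the DataLoader. Iterating still requires __len__ and __getitem__ on the dataset, so a bare int fails once you actually loop (my own check; details may vary between versions):

from torch.utils.data import DataLoader

d = DataLoader(0, batch_size=3)  # construction succeeds: no type check
try:
    for e in d:
        print(e)
except TypeError as err:
    # int defines neither __len__ nor __getitem__, so the default
    # sequential sampler fails as soon as iteration starts
    print(err)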

No conversion is made. You are right, and I'd recommend using Tensors instead of numpy arrays. I encourage you to spend some time checking out the implementation of Dataset and DataLoader; it's pure Python, and you will eventually see that no conversion is made.

As I said, the object must have at least the methods __len__ and __getitem__ defined, otherwise it won't work. For instance, the __getitem__ method is what allows you to use the brackets operator:

>>> [1, 2, 3].__getitem__(1)
2
>>> [1, 2, 3][1]
2
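__len__ plays the analogous role for the built-in len(), which the default sampler relies on:

>>> [1, 2, 3].__len__()
3
>>> len([1, 2, 3])
3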

Yes, I checked the example you sent. Great one.

It may even be smart to implement __add__ in your original example, because this way you can alter the ConcatDataset type that you would otherwise see when you print the type.

d1 = DataLoader(data_x + data_y, batch_size=1)
print(d1.dataset)
print(type(d1.dataset))  # ConcatDataset with the default Dataset.__add__

import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
import torch.utils.data as data_utils

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self):
        return len(self.array)
    def __add__(self, ds):
        # Override the default ConcatDataset behavior: add the underlying
        # arrays elementwise and wrap the result in a TensorDataset.
        return data_utils.TensorDataset(torch.from_numpy(self.array + ds.array))
    def __getitem__(self, i):
        return self.array[i]

x = np.arange(3)
print(x)
y = np.arange(3, 6)
print(y)

data_x = NpDataset(x)
data_y = NpDataset(y)

d1 = DataLoader(x + y, batch_size=1)

print("...")
d2 = DataLoader(data_x + data_y, batch_size=1)

print(d1.dataset)
print(d2.dataset)

outputs

[0 1 2]
[3 4 5]
...
[3 5 7]
<torch.utils.data.dataset.TensorDataset object at 0x000001ABF4C8F208>

I am not sure what the best practice is.
Should we have our DataLoader.dataset as a Tensor, a numpy array, a TensorDataset, or a ConcatDataset?

I assume ConcatDataset is very good, as I can do d2.dataset.datasets and list all the `NpDataset`s inside.

But why might this be useful?
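One possible answer (my own illustrative example, reusing the NpDataset class from above): ConcatDataset lets you treat several chunks of data as one dataset while keeping the individual parts inspectable.

import numpy as np
from torch.utils.data import Dataset, DataLoader

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self):
        return len(self.array)
    def __getitem__(self, i):
        return self.array[i]

# e.g. two chunks of data that came from different files
data_x = NpDataset(np.arange(3))
data_y = NpDataset(np.arange(3, 6))

combined = data_x + data_y  # default Dataset.__add__ -> ConcatDataset

print(len(combined))        # 6: indices span both chunks
print(combined.datasets)    # the original NpDatasets, still accessible
for e in DataLoader(combined, batch_size=2):
    print(e)                # tensor([0, 1]), tensor([2, 3]), tensor([4, 5])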