Is a NumPy array a Dataset?

(Intel Novel) #1

Can someone explain how this construct works without problems when the PyTorch documentation shows we need a `TensorDataset` to feed a `DataLoader`?

``````x1 = np.array([1,2,3])
d1 = DataLoader(x1)  # the array is passed straight to the DataLoader
``````
#2

`x1 = np.array([1,2,3])` isn’t a `Dataset` as properly defined by PyTorch. Actually, `Dataset` is just a very simple abstract class (pure Python).

Indeed, the snippet below works as expected, i.e., it will sample correctly:

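``````import torch
import numpy as np
from torch.utils.data import DataLoader
x = np.arange(6)
d = DataLoader(x, batch_size=2)  # batch_size=2 is an assumed value
for e in d:
    print(e)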
``````

It works mainly because the methods `__len__` and `__getitem__` are well defined for NumPy arrays. The `__add__` method is implemented too, but for arrays it performs element-wise addition rather than the concatenation that `Dataset.__add__` provides. Maybe the snippet below will make it clear:

``````import numpy as np
from torch.utils.data import Dataset, DataLoader

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self): return len(self.array)
    def __getitem__(self, i): return self.array[i]

x = np.arange(3)
y = np.arange(3, 6)

data_x = NpDataset(x)
data_y = NpDataset(y)

# x + y is element-wise NumPy addition
for e in DataLoader(x + y, batch_size=1): print(e)  # prints 3, 5, 7

# data_x + data_y goes through Dataset.__add__, i.e. concatenation
for e in DataLoader(data_x + data_y, batch_size=1): print(e)  # prints 0, 1, 2, 3, 4, 5
``````
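
A quick way to see the difference between the two `+` operations is to inspect the type and length of each result. This is only a minimal sketch: it repeats the same `NpDataset` wrapper so that it runs on its own, and the names `summed` and `joined` are made up for the illustration.

```python
import numpy as np
from torch.utils.data import ConcatDataset, Dataset


class NpDataset(Dataset):
    """Same thin wrapper as above, repeated so the block is self-contained."""

    def __init__(self, array):
        self.array = array

    def __len__(self):
        return len(self.array)

    def __getitem__(self, i):
        return self.array[i]


x = np.arange(3)
y = np.arange(3, 6)

summed = x + y                        # ndarray.__add__: element-wise sum
joined = NpDataset(x) + NpDataset(y)  # Dataset.__add__: concatenation

print(type(summed).__name__, len(summed))
print(type(joined).__name__, len(joined))
print([int(joined[i]) for i in range(len(joined))])
```

The array sum keeps the length of its operands, while the dataset sum has as many items as both datasets together.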
(Intel Novel) #3

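It’s nice when you can share these goodies with your friends.

``````import numpy as np
import torch
from torch.utils.data import DataLoader

n1 = np.array([1,2,3])
t1 = torch.Tensor(n1)
d1 = DataLoader(n1)  # fed with the NumPy array
d2 = DataLoader(t1)  # fed with the Tensor
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
``````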

``````<torch.utils.data.dataloader.DataLoader object at 0x0000024C52514588>
[1 2 3]
<class 'numpy.ndarray'>
tensor([1., 2., 3.])
<class 'torch.Tensor'>
``````

As you can see, when we feed the DataLoader a NumPy array, the dataset inside it stays a NumPy array, and when we feed it a `torch.Tensor`, it stays a `torch.Tensor`.

As far as I can see there is no internal conversion, and I would say it is better to feed Tensors because that way we can rely on the CUDA machinery.
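
One detail that may be worth checking alongside this: the `dataset` attribute keeps whatever object was passed in, but the batches that come out of the loader are tensors, because the default collate function converts NumPy values when it assembles a batch. The sketch below assumes the default `collate_fn`; `batch_size=3` is an arbitrary choice.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

n1 = np.array([1, 2, 3])
loader = DataLoader(n1, batch_size=3)  # default collate_fn assumed

batch = next(iter(loader))

print(type(loader.dataset))  # the stored object is still the ndarray
print(type(batch))           # the batch itself was collated into a torch.Tensor
```

So "no conversion" holds for the stored dataset, not necessarily for what the training loop receives.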

Update

Further checking shows that the object fed to the DataLoader should be a sequence of numeric values, but there is no container type checking. Construction succeeds even for a single int `0`, or a list `[0, 1]`, …

``````n1 = 0  # also tried: [0,1], np.array([]), [1,2,3,...]
t1 = torch.Tensor(n1)
d1 = DataLoader(n1)
d2 = DataLoader(t1)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
``````
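
To separate "the constructor accepts it" from "it can be iterated", a small check like the following may help. It is a sketch assuming a reasonably recent PyTorch and the default sequential sampler; the variable names are made up.

```python
import torch
from torch.utils.data import DataLoader

# Construction: nothing inspects the argument, so a bare int is accepted.
loader = DataLoader(0)

# Iteration: the default sampler calls len() on the dataset, which an int lacks.
try:
    next(iter(loader))
    failed_on_iteration = False
except TypeError as err:
    failed_on_iteration = True
    print("iteration failed:", err)

# torch.Tensor(0) is an empty tensor, so its loader simply yields no batches.
empty_batches = list(DataLoader(torch.Tensor(0)))
print(empty_batches)
```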
#4

No conversion is made, you are right. I’d recommend using Tensors instead of NumPy arrays. I encourage you to spend some time checking out the implementation of `Dataset` and `DataLoader`; it’s pure Python, and you will see that the dataset object is stored as-is.

As I said, the object must have at least the methods `__len__` and `__getitem__` defined, otherwise it won’t work. For instance, the method `__getitem__` is what allows you to use the brackets operator:

``````>>> [1, 2, 3].__getitem__(1)
2
>>> [1, 2, 3][1]
2
>>>
``````
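
The same idea can be sketched without PyTorch at all. The class `Squares` and the function `iter_batches` below are made-up names, and the loop is only a rough stand-in for sequential, unshuffled loading, not the library's actual implementation. It just shows that `len()` and the brackets operator are all such a loop needs.

```python
class Squares:
    """Hypothetical map-style container: no PyTorch base class involved."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i


def iter_batches(data, batch_size):
    """Rough stand-in for sequential, unshuffled batch loading."""
    for start in range(0, len(data), batch_size):    # needs __len__
        stop = min(start + batch_size, len(data))
        yield [data[i] for i in range(start, stop)]  # needs __getitem__


squares = Squares(5)
print(squares[2] == squares.__getitem__(2))  # brackets call __getitem__
print(list(iter_batches(squares, batch_size=2)))
```

Any object with those two methods, including a plain list or a NumPy array, can be passed to `iter_batches` in the same way.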
(Intel Novel) #5

Yes, I checked the example you sent. Great one.

It may even be smart to implement `__add__` in your original example, because that way you can change the `ConcatDataset` type you otherwise get when you print the type of the dataset.

`d1 = DataLoader(data_x + data_y, batch_size=1)`
`print(d1.dataset)`
`print(type(d1.dataset))` # ConcatDataset type

``````import torch
import numpy as np
import torch.utils.data as data_utils
from torch.utils.data import Dataset, DataLoader

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self): return len(self.array)
    def __add__(self, ds):
        # element-wise sum of the two arrays, wrapped in a TensorDataset
        return data_utils.TensorDataset(torch.from_numpy(self.array + ds.array))
    def __getitem__(self, i): return self.array[i]

x = np.arange(3)
print(x)
y = np.arange(3, 6)
print(y)

data_x = NpDataset(x)
data_y = NpDataset(y)

print("...")
d1 = DataLoader(x + y, batch_size=1)
d2 = DataLoader(data_x + data_y, batch_size=1)

print(d1.dataset)
print(d2.dataset)
``````

outputs

``````[0 1 2]
[3 4 5]
...
[3 5 7]
<torch.utils.data.dataset.TensorDataset object at 0x000001ABF4C8F208>
``````

I am not sure what the best practice is.
Should `DataLoader.dataset` be a `Tensor`, an `np.array`, a `TensorDataset`, or a `ConcatDataset`?

I assume `ConcatDataset` is very good, as I can do `d2.dataset.datasets` and list all the `NpDataset`s inside.

But why might this be useful?
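
One possible use, offered as a sketch rather than an established best practice: since a `ConcatDataset` keeps its parts in `datasets` and their running lengths in `cumulative_sizes`, a global index can be traced back to the dataset it came from. The helper `source_of` and the names `part_a`, `part_b` and `combined` are hypothetical.

```python
import bisect

import torch
from torch.utils.data import ConcatDataset, TensorDataset

part_a = TensorDataset(torch.arange(3))
part_b = TensorDataset(torch.arange(3, 6))
combined = ConcatDataset([part_a, part_b])

print(len(combined.datasets))     # the original datasets are kept as a list
print(combined.cumulative_sizes)  # running lengths of the parts


def source_of(concat, index):
    """Hypothetical helper: position of the part that serves a global index."""
    return bisect.bisect_right(concat.cumulative_sizes, index)


print(source_of(combined, 1), source_of(combined, 4))
```

This could be handy when the parts come from different files or sources and a sample needs to be attributed to one of them.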