Can someone explain how this construct works without problems when the PyTorch documentation shows we need `TensorDataset` to feed a `DataLoader`?

```
x1 = np.array([1,2,3])
d1 = DataLoader( x1, batch_size=3)
```

`x1 = np.array([1,2,3])` isn’t a `Dataset` as properly defined by PyTorch. Actually, `Dataset` is just a very simple abstract class (pure Python).

Indeed, the snippet below works as expected, i.e., it will sample correctly:

```
import numpy as np
from torch.utils.data import DataLoader

x = np.arange(6)
d = DataLoader(x, batch_size=2)
for e in d:
    print(e)
```

It works mainly because the methods `__len__` and `__getitem__` are well defined for NumPy arrays. The `__add__` method is implemented too, but for arrays it performs element-wise addition, not concatenation. Maybe the snippet below will make it clear:

```
import numpy as np
from torch.utils.data import Dataset, DataLoader

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array

    def __len__(self):
        return len(self.array)

    def __getitem__(self, i):
        return self.array[i]

x = np.arange(3)
y = np.arange(3, 6)
data_x = NpDataset(x)
data_y = NpDataset(y)

# NumPy's __add__ sums element-wise
for e in DataLoader(x + y, batch_size=1):
    print(e)  # prints 3, 5, 7

# Dataset's __add__ concatenates the two datasets
for e in DataLoader(data_x + data_y, batch_size=1):
    print(e)  # prints 0, 1, 2, 3, 4, 5
```

It’s nice when you can share these goodies with your friends.

```
import numpy as np
import torch
from torch.utils.data import DataLoader

n1 = np.array([1,2,3])
d1 = DataLoader( n1, batch_size=3)
t1 = torch.Tensor(n1)
d2 = DataLoader( t1, batch_size=3)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
```

```
<torch.utils.data.dataloader.DataLoader object at 0x0000024C52514588>
[1 2 3]
<class 'numpy.ndarray'>
<torch.utils.data.dataloader.DataLoader object at 0x0000024C52515A20>
tensor([1., 2., 3.])
<class 'torch.Tensor'>
```

As you can see, when we feed the `DataLoader` a NumPy array, the dataset inside it stays a NumPy array, and when we feed it a tensor, it stays a `torch.Tensor`.

As far as I can see there is no internal conversion, and I would say it is better to feed tensors, because that way we can rely on CUDA.

Any comments?

A further check shows that the fed element needs to be a sequence: any kind of sequence with numeric values inside. There is no container type checking. It can work for a single int `0`, or a list `[0,1]`, …

```
n1 = 0  # other inputs tried: [0,1], np.array([]), [1,2,3,...]
d1 = DataLoader( n1, batch_size=3)
t1 = torch.Tensor(n1)
d2 = DataLoader( t1, batch_size=3)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
```

No conversion is made to the dataset object itself. You are right, I’d recommend using tensors instead of NumPy arrays. I encourage you to spend some time reading the implementation of `Dataset` and `DataLoader`; it’s pure Python, and you will see that the dataset is stored as-is.
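A small sketch of the point above, assuming the default collate function is in use (no custom `collate_fn`): the object stored in `loader.dataset` is the very same NumPy array that was passed in, while the batches produced during iteration come out as tensors. The names `loader` and `batches` are only for illustration.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

x = np.arange(6)
loader = DataLoader(x, batch_size=2)

# The loader keeps a reference to the original object; it is not copied or converted.
print(loader.dataset is x)
print(type(loader.dataset))

# With the default collate function, each yielded batch is a torch.Tensor.
batches = list(loader)
print(type(batches[0]))
```

So the stored dataset keeps its type, and any tensor conversion seen in the printed batches happens when a batch is assembled.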

As I said, the object must have at least the methods `__len__` and `__getitem__` defined, otherwise it won’t work. For instance, the method `__getitem__` is what allows you to use the brackets operator:

```
>>> [1, 2, 3].__getitem__(1)
2
>>> [1, 2, 3][1]
2
>>>
```
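To illustrate the same idea with a `DataLoader`, here is a minimal sketch using a hypothetical class `Squares` (not from PyTorch) that does not subclass `Dataset` at all and only defines `__len__` and `__getitem__`. It assumes the default sampler and collate function.

```python
from torch.utils.data import DataLoader

class Squares:
    """Hypothetical plain class: no Dataset base class, just the two methods."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i

loader = DataLoader(Squares(5), batch_size=2)
batches = [batch.tolist() for batch in loader]
print(batches)
```

The loader only ever asks for the length (to build indices) and for individual items, which is why this duck-typed object is accepted.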

Yes, I checked the example you sent. Great one.

It may even be smart to implement `__add__` in your original example, because this way you can change the `ConcatDataset` type that you otherwise get when you print the type.

`d1= DataLoader(x + y, batch_size=1)`

`print(d1.dataset)`

`print(type(d1.dataset))`

#ConcatDataset type

```
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
import torch.utils.data as data_utils

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array

    def __len__(self):
        return len(self.array)

    def __add__(self, ds):
        # Element-wise sum of the two arrays (not concatenation), wrapped in a TensorDataset
        return data_utils.TensorDataset(torch.from_numpy(self.array + ds.array))

    def __getitem__(self, i):
        return self.array[i]

x = np.arange(3)
print(x)
y = np.arange(3, 6)
print(y)
data_x = NpDataset(x)
data_y = NpDataset(y)
d1 = DataLoader(x + y, batch_size=1)
print("...")
d2 = DataLoader(data_x + data_y, batch_size=1)
print(d1.dataset)
print(d2.dataset)
```

outputs

```
[0 1 2]
[3 4 5]
...
[3 5 7]
<torch.utils.data.dataset.TensorDataset object at 0x000001ABF4C8F208>
```

I am not sure what the best practice is.

Should the `dataset` of our `DataLoader` be a `Tensor`, a NumPy array, a `TensorDataset`, or a `ConcatDataset`?

I assume `ConcatDataset` is very good, as I can do `d2.dataset.datasets` and list all the `NpDataset`s inside.

But why might this be useful?
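One possible illustration, not a statement of best practice: with the original `NpDataset` (without a custom `__add__`), the combined object is a `ConcatDataset`, which keeps the parts accessible through `datasets` and records where each part ends in `cumulative_sizes`. The variable names `combined`, `parts`, `sizes` and `values` are only for this sketch.

```python
import numpy as np
from torch.utils.data import Dataset, DataLoader, ConcatDataset

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array

    def __len__(self):
        return len(self.array)

    def __getitem__(self, i):
        return self.array[i]

data_x = NpDataset(np.arange(3))
data_y = NpDataset(np.arange(3, 6))

# Dataset.__add__ returns a ConcatDataset wrapping both parts
combined = data_x + data_y
print(type(combined))

parts = combined.datasets          # the original NpDataset objects, in order
sizes = combined.cumulative_sizes  # running end index of each part
print(len(parts), sizes)

# Iterating the loader walks through both parts back to back
values = [int(v) for batch in DataLoader(combined, batch_size=2) for v in batch]
print(values)
```

Keeping the parts separate like this means several sources can be iterated as one dataset while each one can still be inspected individually.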