Can someone explain how this construct works without problems, when the PyTorch documentation shows we need a TensorDataset to feed a DataLoader?
x1 = np.array([1,2,3])
d1 = DataLoader( x1, batch_size=3)
x1 = np.array([1,2,3]) isn’t a Dataset as properly defined by PyTorch. Actually, Dataset is just a very simple abstract class (pure Python).
Indeed, the snippet below works as expected, i.e., it will sample correctly:
import torch
import numpy as np
from torch.utils.data import DataLoader
x = np.arange(6)
d = DataLoader(x, batch_size=2)
for e in d:
    print(e)
It works mainly because the methods __len__ and __getitem__ are well defined for numpy arrays. The __add__ method is implemented too, but it doesn’t perform concatenation. Maybe the snippet below will make it clear:
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self): return len(self.array)
    def __getitem__(self, i): return self.array[i]

x = np.arange(3)
y = np.arange(3, 6)
data_x = NpDataset(x)
data_y = NpDataset(y)
# numpy's + is element-wise, so the loader sees [3, 5, 7]
for e in DataLoader(x + y, batch_size=1):
    print(e)  # prints 3, 5, 7
# Dataset's + concatenates, so the loader sees all six items
for e in DataLoader(data_x + data_y, batch_size=1):
    print(e)  # prints 0, 1, 2, 3, 4, 5
It’s nice when you can share these goodies with your friends.
n1 = np.array([1,2,3])
d1 = DataLoader( n1, batch_size=3)
t1 = torch.Tensor(n1)
d2 = DataLoader( t1, batch_size=3)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
<torch.utils.data.dataloader.DataLoader object at 0x0000024C52514588>
[1 2 3]
<class 'numpy.ndarray'>
<torch.utils.data.dataloader.DataLoader object at 0x0000024C52515A20>
tensor([1., 2., 3.])
<class 'torch.Tensor'>
As you can see, in the first case, where we feed a numpy array, the dataset inside the DataLoader is a numpy array, and in the second case it is a torch.Tensor.
As far as I can see there is no internal conversion, and I would say it is better to feed Tensors because this way we can rely on the cuda stuff.
Any comments?
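To make the "no internal conversion" observation a bit more concrete, here is a minimal sketch. It assumes a reasonably recent PyTorch with the default collate function; the variable names same_numpy, same_tensor, batch1 and batch2 are only for illustration. It checks that the loader stores the very object it was given, while the batches it yields are tensors either way.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

n1 = np.array([1, 2, 3], dtype=np.int64)
t1 = torch.from_numpy(n1)  # tensor view of the same data

d1 = DataLoader(n1, batch_size=3)
d2 = DataLoader(t1, batch_size=3)

# The loader keeps a reference to exactly the object it was given.
same_numpy = d1.dataset is n1
same_tensor = d2.dataset is t1

# The batches are built by the default collate function,
# which returns tensors in both cases.
batch1 = next(iter(d1))
batch2 = next(iter(d2))
print(same_numpy, same_tensor, type(batch1), type(batch2))
```

So the stored dataset is untouched, and the conversion to a tensor only happens per batch when the items are collated.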
A further check shows that the fed element is expected to be a sequence: any kind of sequence with numeric values inside. There is no container type checking, though. The constructor even accepts a single int 0, or a list [0,1], …
n1 = 0  # also tried: [0,1], np.array([]), [1,2,3,...]
d1 = DataLoader( n1, batch_size=3)
t1 = torch.Tensor(n1)
d2 = DataLoader( t1, batch_size=3)
print(d1)
print(d1.dataset)
print(type(d1.dataset))
print(d2)
print(d2.dataset)
print(type(d2.dataset))
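One caveat on the int case, as a hedged sketch: as far as I can tell, the constructor does not inspect the dataset, so the failure only shows up once you start iterating, because the default sampler calls len() on the dataset. The flag name failed is just for illustration, and the exact error message may differ between versions.

```python
from torch.utils.data import DataLoader

# Construction succeeds: nothing is checked at this point.
loader = DataLoader(0, batch_size=3)
print(type(loader.dataset))

# Iterating is where it breaks, since an int has no len().
try:
    next(iter(loader))
    failed = False
except TypeError as exc:
    failed = True
    print("iteration failed:", exc)
```

A list such as [0, 1] does not hit this problem, since it has both a length and integer indexing.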
No conversion is made. You are right, and I’d recommend using Tensors instead of Numpy arrays. I encourage you to spend some time checking out the implementation of Dataset and DataLoader; it’s pure Python. You will eventually see that no conversion is made.
As I said, the object must have at least the methods __len__ and __getitem__ defined, otherwise it won’t work. For instance, the method __getitem__ is what allows you to use the brackets operator:
>>> [1, 2, 3].__getitem__(1)
2
>>> [1, 2, 3][1]
2
>>>
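The same idea as a small sketch: a plain Python class that does not subclass Dataset at all, but defines __len__ and __getitem__, can be fed to a DataLoader. The class name Squares is hypothetical and only serves as an illustration.

```python
from torch.utils.data import DataLoader

class Squares:  # hypothetical example class, not a Dataset subclass
    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        return i * i

# The loader only relies on len() and integer indexing here.
batches = [b.tolist() for b in DataLoader(Squares(5), batch_size=2)]
print(batches)  # [[0, 1], [4, 9], [16]]
```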
Yes, I checked the example you sent. Great one.
It may even be smart to implement __add__ in your original example, because this way you can change the ConcatDataset type that you otherwise get when you print the type.
d1 = DataLoader(data_x + data_y, batch_size=1)
print(d1.dataset)
print(type(d1.dataset))
# ConcatDataset type
import torch
import numpy as np
from torch.utils.data import Dataset, DataLoader
import torch.utils.data as data_utils

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array
    def __len__(self): return len(self.array)
    def __add__(self, ds):
        return data_utils.TensorDataset(torch.from_numpy(self.array + ds))
    def __getitem__(self, i): return self.array[i]

x = np.arange(3)
print(x)
y = np.arange(3, 6)
print(y)
data_x = NpDataset(x)
data_y = NpDataset(y)
d1 = DataLoader(x + y, batch_size=1)
print("...")
d2 = DataLoader(data_x + data_y, batch_size=1)
print(d1.dataset)
print(d2.dataset)
outputs
[0 1 2]
[3 4 5]
...
[3 5 7]
<torch.utils.data.dataset.TensorDataset object at 0x000001ABF4C8F208>
I am not sure what the best practice is.
Should we have our DataLoader.dataset as a Tensor, an np array, a TensorDataset, or a ConcatDataset?
I assume ConcatDataset is very good, as I can do d2.dataset.datasets and list all the NpDatasets inside.
But why might this be useful?
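For reference, a minimal sketch of what a ConcatDataset exposes, assuming the plain NpDataset from earlier in the thread (the version without a custom __add__, so the default Dataset + is used). The names combined, parts, sizes and values are illustrative only.

```python
import numpy as np
from torch.utils.data import ConcatDataset, Dataset

class NpDataset(Dataset):
    def __init__(self, array):
        self.array = array

    def __len__(self):
        return len(self.array)

    def __getitem__(self, i):
        return self.array[i]

data_x = NpDataset(np.arange(3))
data_y = NpDataset(np.arange(3, 6))

# Dataset.__add__ wraps both operands in a ConcatDataset.
combined = data_x + data_y

parts = combined.datasets          # the original datasets, by reference
sizes = combined.cumulative_sizes  # running lengths used to route an index
values = [int(combined[i]) for i in range(len(combined))]
print(type(combined).__name__, sizes, values)
```

One possible use: the sources stay separate objects, so nothing is copied, and you can still inspect each one or tell which source a given index came from.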