If a Dataset returns a dictionary from its __getitem__ function, how can I get a batch of each dictionary item in my DataLoader iterator loop? Is there an automatic way, or do I have to manually extract each item of the dictionary for each sample in the batch?
If you return a dict, you would have to get all values using the specified keys in your loop:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 3, 24, 24)
        self.target = torch.randint(0, 10, (10,))

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return {'data': x, 'target': y}

    def __len__(self):
        return len(self.data)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    data = batch['data']
    target = batch['target']
    print(data.shape)
Thanks for the help.
Hi, are there any defined rules relating the type of batch returned by the DataLoader to the type of sample returned by __getitem__ in the Dataset?
For example: if __getitem__ returns a dict, will the batch be a dict of batches? If __getitem__ returns a tuple of items, will the batch be a tuple of batches, etc.?
I am interested to know this also.
There is a collate_fn argument which decides how the loader assembles samples from different datasets into batches.
According to the docs, the default collate_fn automatically converts NumPy arrays and Python numbers to tensors and preserves the structure of the data:
…
Dictionaries seem to be collated per key.
No idea about tuples or lists; I'd also like to know. (A quick check is sketched below.)
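Here is a quick check of both container types with the default collate; a minimal sketch, assuming default_collate can be imported from torch.utils.data.dataloader:

import torch
from torch.utils.data.dataloader import default_collate

# samples returned as tuples are collated element-wise into a list of batched tensors
tuple_batch = default_collate([(torch.zeros(2), 0), (torch.ones(2), 1)])
print(type(tuple_batch), tuple_batch[0].shape)  # <class 'list'> torch.Size([2, 2])

# samples returned as dicts are collated per key into a dict of batched tensors
dict_batch = default_collate([{'x': torch.zeros(2)}, {'x': torch.ones(2)}])
print(type(dict_batch), dict_batch['x'].shape)  # <class 'dict'> torch.Size([2, 2])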
The structure of the batch depends on what your __getitem__ returns.
Hi @ptrblck
I wonder whether it is necessary to return the dict this way. If we do something like the code below to return the data directly, is there a problem? (Each line of the file is a JSON string.)
import json

all_data = []
with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))

json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except json.JSONDecodeError:
        # stop at the first malformed line (e.g. a trailing empty string)
        break
return json_data
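For context, this is roughly how I would wrap that loading code in a Dataset; a sketch, where JsonLinesDataset and path are hypothetical names and each line is assumed to hold one flat JSON object:

import json
from torch.utils.data import Dataset

class JsonLinesDataset(Dataset):
    def __init__(self, path):
        # parse the whole file once: one JSON object (dict) per line
        with open(path, mode='rb') as f:
            text = f.read().decode('utf-8')
        self.samples = []
        for line in text.split('\n'):
            try:
                self.samples.append(json.loads(line))
            except json.JSONDecodeError:
                break

    def __getitem__(self, index):
        # return one dict per sample, not the whole parsed list
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

The default collate_fn would then batch the values per key; numeric values become tensors, while string values come back as lists of strings.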
I’ve got a strange behavior when using dictionary with Dataset and Dataloader.
import torch
from torch.utils.data import DataLoader
from torchtext.data import Dataset
from transformers import AutoTokenizer

class MyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.max_length = 8
        self.samples = [
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"}
        ]

    def _encode(self, sample):
        encoded = {
            "x1": self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True),
            "x2": self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True)
        }
        print(encoded)
        return encoded

    def __getitem__(self, idx):
        return self._encode(self.samples[idx])

    def __len__(self):
        return len(self.samples)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
dataset = MyDataset(tokenizer)
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)
which prints:
Encoded Sample
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
batch x1
[tensor([101, 101]), tensor([1045, 1045]), tensor([8840, 8840]), tensor([2615, 2615]), tensor([1005, 1005]), tensor([1040, 1040]), tensor([2014, 2014]), tensor([102, 102])]
Traceback (most recent call last):
...
print(x1.shape)
AttributeError: 'list' object has no attribute 'shape'
Expected (or desired) behavior:
# first batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])
# second batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])
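For reference, the list output above matches what the default collate does with equal-length Python lists: it zips them element-wise into per-position tensors. A minimal repro without the tokenizer, assuming default_collate can be imported from torch.utils.data.dataloader:

import torch
from torch.utils.data.dataloader import default_collate

# two samples whose 'x1' values are plain Python lists of ints
batch = default_collate([
    {'x1': [1, 2, 3]},
    {'x1': [4, 5, 6]},
])
print(batch['x1'])
# [tensor([1, 4]), tensor([2, 5]), tensor([3, 6])] -> a list of per-position tensors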
Could you use random tensors for the code and update the snippet, please?
I tried to reproduce it using this dummy Dataset, and it seems to return the expected results:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.samples = [
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
        ]

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=0
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)
I realize the difference now: your samples are dictionaries whose values are tensors, not lists of integers as in my version. (A fix based on that is sketched below.)
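So converting the encoded lists to tensors before returning them gives the desired [2, 8] batches; a minimal sketch of the change, where only _encode differs from my snippet above:

import torch

def _encode(self, sample):
    # wrapping the token-id lists in tensors lets default_collate stack them
    # into a [batch_size, max_length] tensor instead of zipping lists
    return {
        "x1": torch.tensor(self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True)),
        "x2": torch.tensor(self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True))
    }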
There is a bug: with

def __getitem__(self, index):
    x = self.data[index]
    y = self.target[index]
    data_dict = {'data': x, 'target': y}
    return data_dict

all items in the batch are the same as the first item, which is wrong.
I want to write it this way because sometimes the data_dict is returned by a function and has many keys.
I cannot reproduce the bug using some predefined tensors for data and target:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10).float().view(-1, 1)
        self.target = torch.arange(10).float().view(-1, 1) + 1

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        data_dict = {'data': x, 'target': y}
        return data_dict

    def __len__(self):
        return len(self.data)

dataset = MyDataset()
for batch in dataset:
    print(batch)

loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in loader:
    print(batch)
Could you post an executable code snippet to reproduce this issue?
Is there some magic happening in Dataset to make the return type of dataset[1:2] a dict and not a list? I would have thought it would be a list!
The default_collate method uses some checks and in particular tests the input against collections.abc.Mapping here; in that case it calls default_collate again on the values for each key and returns a dict. (A small demonstration is below.)
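A small demonstration of that recursion, assuming default_collate can be imported from torch.utils.data.dataloader:

import torch
from torch.utils.data.dataloader import default_collate

samples = [
    {'data': torch.zeros(3), 'target': torch.tensor(0)},
    {'data': torch.ones(3), 'target': torch.tensor(1)},
]

# the Mapping branch collates each key's values separately
batch = default_collate(samples)
print(type(batch))          # <class 'dict'>
print(batch['data'].shape)  # torch.Size([2, 3])
print(batch['target'])      # tensor([0, 1])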
Ah! Thanks. That’s very elegant