If a Dataset returns a dictionary from its __getitem__ function, how can I get a batch of each dictionary item in my DataLoader iterator loop? Is there an automatic way, or do I have to manually extract each item of the dictionary for each sample in the batch?
If you return a dict, you would have to get all values using the specified keys in your loop:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 3, 24, 24)
        self.target = torch.randint(0, 10, (10,))

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        return {'data': x, 'target': y}

    def __len__(self):
        return len(self.data)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    data = batch['data']
    target = batch['target']
    print(data.shape)
Thanks for the help.
Hi, are there any defined rules relating the type of batch returned by the DataLoader to the type of sample returned by __getitem__ in the Dataset?
For example: if __getitem__ returns a dict, will the batch be a dict of batches? If __getitem__ returns a tuple of items, will the batch be a tuple of batches, etc.?
I am interested to know this also.
There is a collate_fn argument which decides how the loader assembles samples from different datasets into batches.
According to the docs, the default collate_fn automatically converts NumPy arrays and Python numbers to tensors and preserves the structure of the data:
…
Dictionaries seem to be collated per key.
No idea about tuples or lists; I'd also like to know. (A quick check is sketched below.)
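Here is a quick check of both container types with the default collate; a minimal sketch, assuming default_collate can be imported from torch.utils.data.dataloader:

import torch
from torch.utils.data.dataloader import default_collate

# samples returned as tuples are collated element-wise into a list of batched tensors
tuple_batch = default_collate([(torch.zeros(2), 0), (torch.ones(2), 1)])
print(type(tuple_batch), tuple_batch[0].shape)  # <class 'list'> torch.Size([2, 2])

# samples returned as dicts are collated per key into a dict of batched tensors
dict_batch = default_collate([{'x': torch.zeros(2)}, {'x': torch.ones(2)}])
print(type(dict_batch), dict_batch['x'].shape)  # <class 'dict'> torch.Size([2, 2])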
The structure of the batch depends on what your __getitem__ returns.
Hi @ptrblck
I wonder whether it is necessary to return the dict this way. If we do something like the code below to return the data directly, is there a problem? (Each line of the file is a JSON string.)
import json

all_data = []
with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))

json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except json.JSONDecodeError:
        # stop at the first malformed line (e.g. a trailing empty string)
        break
return json_data
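For context, this is roughly how I would wrap that loading code in a Dataset; a sketch, where JsonLinesDataset and path are hypothetical names and each line is assumed to hold one flat JSON object:

import json
from torch.utils.data import Dataset

class JsonLinesDataset(Dataset):
    def __init__(self, path):
        # parse the whole file once: one JSON object (dict) per line
        with open(path, mode='rb') as f:
            text = f.read().decode('utf-8')
        self.samples = []
        for line in text.split('\n'):
            try:
                self.samples.append(json.loads(line))
            except json.JSONDecodeError:
                break

    def __getitem__(self, index):
        # return one dict per sample, not the whole parsed list
        return self.samples[index]

    def __len__(self):
        return len(self.samples)

The default collate_fn would then batch the values per key; numeric values become tensors, while string values come back as lists of strings.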
I’ve got a strange behavior when using dictionary with Dataset and Dataloader.
import torch
from torch.utils.data import DataLoader
from torchtext.data import Dataset
from transformers import AutoTokenizer

class MyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.max_length = 8
        self.samples = [
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"}
        ]

    def _encode(self, sample):
        encoded = {
            "x1": self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True),
            "x2": self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True)
        }
        print(encoded)
        return encoded

    def __getitem__(self, idx):
        return self._encode(self.samples[idx])

    def __len__(self):
        return len(self.samples)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
dataset = MyDataset(tokenizer)
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)
which prints:
Encoded Sample
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
{'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}
batch x1
[tensor([101, 101]), tensor([1045, 1045]), tensor([8840, 8840]), tensor([2615, 2615]), tensor([1005, 1005]), tensor([1040, 1040]), tensor([2014, 2014]), tensor([102, 102])]
Traceback (most recent call last):
...
print(x1.shape)
AttributeError: 'list' object has no attribute 'shape'
Expected (or desired) behavior:
# first batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])
# second batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])
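For reference, the list output above matches what the default collate does with equal-length Python lists: it zips them element-wise into per-position tensors. A minimal repro without the tokenizer, assuming default_collate can be imported from torch.utils.data.dataloader:

import torch
from torch.utils.data.dataloader import default_collate

# two samples whose 'x1' values are plain Python lists of ints
batch = default_collate([
    {'x1': [1, 2, 3]},
    {'x1': [4, 5, 6]},
])
print(batch['x1'])
# [tensor([1, 4]), tensor([2, 5]), tensor([3, 6])] -> a list of per-position tensors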
Could you use random tensors for the code and update the snippet, please?
I tried to reproduce it using this dummy Dataset, and it seems to return the expected results:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.samples = [
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
        ]

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=0
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)
I realize the difference now: your samples are dictionaries whose values are tensors, not lists of integers as in my version. (A fix based on that is sketched below.)
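So converting the encoded lists to tensors before returning them gives the desired [2, 8] batches; a minimal sketch of the change, where only _encode differs from my snippet above:

import torch

def _encode(self, sample):
    # wrapping the token-id lists in tensors lets default_collate stack them
    # into a [batch_size, max_length] tensor instead of zipping lists
    return {
        "x1": torch.tensor(self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True)),
        "x2": torch.tensor(self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True))
    }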
There is a bug: with

def __getitem__(self, index):
    x = self.data[index]
    y = self.target[index]
    data_dict = {'data': x, 'target': y}
    return data_dict

all items in the batch are the same as the first item, which is wrong.
I want to write it this way because sometimes the data_dict is returned by a function and has many keys.
I cannot reproduce the bug using some predefined tensors for data and target:
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10).float().view(-1, 1)
        self.target = torch.arange(10).float().view(-1, 1) + 1

    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        data_dict = {'data': x, 'target': y}
        return data_dict

    def __len__(self):
        return len(self.data)

dataset = MyDataset()
for batch in dataset:
    print(batch)

loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in loader:
    print(batch)
Could you post an executable code snippet to reproduce this issue?
Is there some magic happening in Dataset to make the return type of dataset[1:2] a dict and not a list? I would have thought it would be a list!
The default_collate method uses some checks and in particular tests the input against collections.abc.Mapping here; in that case it calls default_collate again on the values for each key and returns a dict. (A small demonstration is below.)
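A small demonstration of that recursion, assuming default_collate can be imported from torch.utils.data.dataloader:

import torch
from torch.utils.data.dataloader import default_collate

samples = [
    {'data': torch.zeros(3), 'target': torch.tensor(0)},
    {'data': torch.ones(3), 'target': torch.tensor(1)},
]

# the Mapping branch collates each key's values separately
batch = default_collate(samples)
print(type(batch))          # <class 'dict'>
print(batch['data'].shape)  # torch.Size([2, 3])
print(batch['target'])      # tensor([0, 1])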
Ah! Thanks. That’s very elegant