Dictionary in DataLoader

If a Dataset returns a dictionary from its __getitem__ method, how can I get a batch of each dictionary entry in my DataLoader iteration loop? Is this handled automatically, or do I have to manually extract each item of the dictionary for every sample in the batch?

If you return a dict, you would have to get all values using the specified keys in your loop:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 3, 24, 24)
        self.target = torch.randint(0, 10, (10,))
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        
        return {'data': x, 'target': y}
    
    def __len__(self):
        return len(self.data)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    data = batch['data']
    target = batch['target']
    print(data.shape)

Thanks for the help.

Hi, are there any defined rules relating the type of batch returned by the DataLoader to the type of sample returned by __getitem__ in the Dataset?
For example: if __getitem__ returns a dict, will the batch be a dict of batches? If __getitem__ returns a tuple of items, will the batch be a tuple of batches, etc.?

I am interested to know this as well.
There is a collate_fn argument which decides how the loader combines the samples returned by the dataset.
According to the docs, the default collate_fn only converts NumPy arrays to tensors without changing other formats.

Dictionaries seem to be collated per key.
No idea about tuples or lists.

Also want to know.

The item which is returned depends on what the return statement in __getitem__ produces.
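
For tuples, collation seems to happen element-wise, i.e. each position of the returned tuple becomes its own batched tensor. A minimal sketch (the TupleDataset name is just for illustration):

import torch
from torch.utils.data import Dataset, DataLoader

class TupleDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 3)
        self.target = torch.randint(0, 10, (10,))

    def __getitem__(self, index):
        # return a plain tuple instead of a dict
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

loader = DataLoader(TupleDataset(), batch_size=2)
for data, target in loader:
    # each tuple element is batched separately
    print(data.shape, target.shape)  # torch.Size([2, 3]) torch.Size([2])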

Hi @ptrblck

I wonder whether it is strictly necessary to return the dict this way. If we do something like the snippet below to return the data directly, is there a problem? (Each line of the file is a JSON string.)

with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))

json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except:
        break
return json_data

Answered here.

I’ve got a strange behavior when using dictionary with Dataset and Dataloader.

import torch
from torch.utils.data import DataLoader
from torchtext.data import Dataset
from transformers import AutoTokenizer 

class MyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.max_length = 8
        self.samples = [
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"}
        ]

    def _encode(self, sample):
        encoded = {
            "x1": self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True),
            "x2": self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True)
        }
        print(encoded)
        return encoded

    def __getitem__(self, idx):
        return self._encode(self.samples[idx])

    def __len__(self):
        return len(self.samples)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
dataset = MyDataset(tokenizer)
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)

which prints:

Encoded Sample
 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

batch x1
 [tensor([101, 101]), tensor([1045, 1045]), tensor([8840, 8840]), tensor([2615, 2615]), tensor([1005, 1005]), tensor([1040, 1040]), tensor([2014, 2014]), tensor([102, 102])]

Traceback (most recent call last):
  ...
print(x1.shape)
AttributeError: 'list' object has no attribute 'shape'

Expected (or desired) behavior:

# first batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])

# second batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])

Could you use random tensors in the code and update the snippet, please?
I tried to reproduce it using this dummy Dataset and it seems to return the expected results:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.samples = [
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
        ]

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)


dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=0
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)

I realize the difference now: your samples are dictionaries whose values are tensors, not lists of integers as in my version.
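
In case it helps anyone else: one way to get the stacked [2, 8] batches is to wrap the id lists in tensors inside _encode, so the default collate_fn can stack them. A minimal sketch, assuming tokenizer.encode returns a plain Python list of ids:

    def _encode(self, sample):
        # torch.tensor(...) turns each list of token ids into a 1D tensor,
        # which the default collate_fn then stacks into a [batch_size, max_length] tensor
        encoded = {
            "x1": torch.tensor(self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True)),
            "x2": torch.tensor(self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True))
        }
        return encoded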

There is a bug: with

def __getitem__(self, index):
    x = self.data[index]
    y = self.target[index]
    data_dict = {'data': x, 'target': y}
    return data_dict

all items in the batch end up the same as the first item, which is wrong.
I want to write it this way because sometimes the data_dict is returned by a function and it has many keys.

I cannot reproduce the bug using some predefined tensors for data and target:

import torch
from torch.utils.data import Dataset, DataLoader


class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.arange(10).float().view(-1, 1)
        self.target = torch.arange(10).float().view(-1, 1) + 1
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        data_dict={'data': x, 'target': y}
        return data_dict
    
    def __len__(self):
        return len(self.data)

dataset = MyDataset()
for batch in dataset:
    print(batch)

loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch in loader:
    print(batch)

Could you post an executable code snippet to reproduce this issue?

Is there some magic happening in Dataset that makes the return type of dataset[1:2] a dict and not a list? I would have thought it would be a list!

The default_collate method uses some checks and in particular checks the input against collections.abc.Mapping here, in which case it calls default_collate again on the values and returns a dict.
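
To illustrate that branch, you can call default_collate directly on a list of dict samples (a minimal sketch; the function is importable from torch.utils.data.dataloader, and recent releases also expose it as torch.utils.data.default_collate):

import torch
from torch.utils.data.dataloader import default_collate

samples = [
    {'data': torch.randn(3), 'target': torch.tensor(0)},
    {'data': torch.randn(3), 'target': torch.tensor(1)},
]

batch = default_collate(samples)
print(type(batch))           # <class 'dict'>
print(batch['data'].shape)   # torch.Size([2, 3])
print(batch['target'])       # tensor([0, 1])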

Ah! Thanks. That’s very elegant 🙂

I am facing this problem as well.
If anyone knows the solution, please let me know.