Dictionary in DataLoader

If a Dataset returns a dictionary from its __getitem__ method, how can I get a batch for each of the dictionary items in my DataLoader iterator loop? Is there an automatic way, or do I have to manually extract each item of the dictionary for every sample in the batch?

1 Like

If you return a dict, you would have to get all values using the specified keys in your loop:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 3, 24, 24)
        self.target = torch.randint(0, 10, (10,))
        
    def __getitem__(self, index):
        x = self.data[index]
        y = self.target[index]
        
        return {'data': x, 'target': y}
    
    def __len__(self):
        return len(self.data)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    data = batch['data']
    target = batch['target']
    print(data.shape)
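Each value in the dict is batched along a new first dimension, so this prints torch.Size([2, 3, 24, 24]) and target has the shape torch.Size([2]).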
5 Likes

Thanks for the help.

Hi, are there any defined rules relating the type of batch returned by the DataLoader to the type of sample returned by __getitem__ in the Dataset?
For example: if __getitem__ returns a dict, will the batch be a dict of batches? If __getitem__ returns a tuple of items, will the batch be a tuple of batches, etc.?

1 Like

I am interested to know this also.
There is a collate_fn argument which decides how the loader combines the samples from the dataset.
According to the docs, the default collate_fn only converts NumPy arrays to tensors without changing other formats.

Dictionaries seem to be collated per key.
No idea about tuples or lists.
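One way to check is to call the default collate function directly on a small list of samples. This is only a sketch and assumes a PyTorch version (1.11 or newer) where default_collate is exposed as torch.utils.data.default_collate; older releases keep it in torch.utils.data.dataloader:

import torch
from torch.utils.data import default_collate

# Dict samples -> dict of batched tensors (collated per key)
dict_samples = [
    {"data": torch.randn(3), "target": 0},
    {"data": torch.randn(3), "target": 1},
]
dict_batch = default_collate(dict_samples)
print(dict_batch["data"].shape)   # torch.Size([2, 3])
print(dict_batch["target"])       # tensor([0, 1])

# Tuple samples -> sequence of batched tensors (collated per position)
tuple_samples = [(torch.randn(3), 0), (torch.randn(3), 1)]
tuple_batch = default_collate(tuple_samples)
print(tuple_batch[0].shape)       # torch.Size([2, 3])
print(tuple_batch[1])             # tensor([0, 1])

In short, the batch mirrors the structure of a single sample, with tensors stacked along a new first dimension.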

Also want to know.

The item which is returned depends on the “return” statement in __getitem__.

Hi @ptrblck

I wonder whether it is necessary to return the dict this way. If we do something like below to return the data directly, is there a problem? (Each line of the file is a JSON string.)

with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
    all_data.extend(text.split('\n'))

json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except:
        break
return json_data

Answered here.
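For reference, a common pattern is to parse the file once in __init__ and let __getitem__ return a single parsed sample, so that the DataLoader can batch the dict values per key. A rough sketch (the class name is illustrative; current_file is the path from the snippet above):

import json
from torch.utils.data import Dataset

class JsonLinesDataset(Dataset):
    def __init__(self, current_file):
        # Parse the whole file once; each non-empty line is one JSON sample.
        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
        self.json_data = [json.loads(line) for line in text.split('\n') if line.strip()]

    def __getitem__(self, index):
        # Return one dict per index; the default collate_fn batches its values.
        return self.json_data[index]

    def __len__(self):
        return len(self.json_data)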

I’ve got strange behavior when using a dictionary with Dataset and DataLoader.

import torch
from torch.utils.data import DataLoader
from torchtext.data import Dataset
from transformers import AutoTokenizer 

class MyDataset(Dataset):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.max_length = 8
        self.samples = [
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"},
            {"x1": "I lov 'd her soon , I lov 'd her late .", "x2": "I lov 'd her late , I lov 'd her soon"}
        ]

    def _encode(self, sample):
        encoded = {
            "x1": self.tokenizer.encode(text=sample["x1"], max_length=self.max_length, pad_to_max_length=True),
            "x2": self.tokenizer.encode(text=sample["x2"], max_length=self.max_length, pad_to_max_length=True)
        }
        print(encoded)
        return encoded

    def __getitem__(self, idx):
        return self._encode(self.samples[idx])

    def __len__(self):
        return len(self.samples)

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
dataset = MyDataset(tokenizer)
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=2
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)

which prints:

Encoded Sample
 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

 {'x1': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102], 'x2': [101, 1045, 8840, 2615, 1005, 1040, 2014, 102]}

batch x1
 [tensor([101, 101]), tensor([1045, 1045]), tensor([8840, 8840]), tensor([2615, 2615]), tensor([1005, 1005]), tensor([1040, 1040]), tensor([2014, 2014]), tensor([102, 102])]

Traceback (most recent call last):
  ...
print(x1.shape)
AttributeError: 'list' object has no attribute 'shape'

Expected (or desired) behavior:

# first batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])

# second batch[x1]: torch.Size([2, 8])
tensor([
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102],
[101, 1045, 8840, 2615, 1005, 1040, 2014, 102]
])

Could you use random tensors for the code and update the snippet, please?
I tried to reproduce it using this dummy Dataset and it seems to return the expected results:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.samples = [
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
            {"x1": torch.arange(10), 'x2': torch.arange(10, 20)},
        ]

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)


dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=2,
    num_workers=0
)

for batch in loader:
    x1 = batch['x1']
    print(x1)
    print(x1.shape)
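With these tensor values, the default collate_fn stacks them along a new batch dimension, so both iterations print a [2, 10] tensor for x1.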

I realize the difference now: your samples are dictionaries whose values are tensors, not lists of integers as in my version.
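One way to get the expected [2, 8] batches, then, is to wrap the encoded lists in tensors inside _encode, so the default collate_fn can stack them. A rough, untested sketch based on the snippet above:

import torch

# Drop-in replacement for MyDataset._encode from the snippet above.
def _encode(self, sample):
    # torch.tensor(...) turns the tokenizer's list of ids into a tensor,
    # so the default collate_fn stacks them into a [batch_size, max_length] batch.
    return {
        "x1": torch.tensor(self.tokenizer.encode(
            text=sample["x1"], max_length=self.max_length, pad_to_max_length=True)),
        "x2": torch.tensor(self.tokenizer.encode(
            text=sample["x2"], max_length=self.max_length, pad_to_max_length=True)),
    }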