Dataloader for Name Generator Tutorial

Hi everyone! I’m trying to use a DataLoader in the PyTorch Name Generator Tutorial. I’ve written a simple version of the Dataset and DataLoader, but the DataLoader gives me a slightly different output than indexing the Dataset directly.

from torch.utils.data import Dataset, DataLoader

class SampleDataset(Dataset):
    def __init__(self):
        temp = {'English':['Adam','Bill','Chet']}
        categories = list(temp.keys())
        self.data = [(list(name), list(name[1:])+['<EOS>'], category) for category in categories for name in temp[category]]
        
    def __getitem__(self,index):
        return self.data[index]
    
    def __len__(self):
        return len(self.data)

x = SampleDataset()
x_DataLoader = DataLoader(x, batch_size=1)

print(x[0])

for piece in x_DataLoader:
    print(piece)
    break

(['A', 'd', 'a', 'm'], ['d', 'a', 'm', '<EOS>'], 'English')

[[('A',), ('d',), ('a',), ('m',)], [('d',), ('a',), ('m',), ('<EOS>',)], ('English',)]

I get this weird tuple around each of my characters with the DataLoader. I don’t understand why :confused:

Am I missing something obvious ? :slight_smile: Thanks so much!

Your .data is stored internally as:

[(['A', 'd', 'a', 'm'], ['d', 'a', 'm', '<EOS>'], 'English'),
 (['B', 'i', 'l', 'l'], ['i', 'l', 'l', '<EOS>'], 'English'),
 (['C', 'h', 'e', 't'], ['h', 'e', 't', '<EOS>'], 'English')]

since you are creating a list of tuples of lists here:

self.data = [(list(name), list(name[1:])+['<EOS>'], category) for category in categories for name in temp[category]]

How would you like to store the data instead?
(I couldn’t find any Dataset code when skimming through the tutorial.)

Hey @ptrblck, thanks for responding! :slight_smile: Yes, I create the tuples because the rest of my code uses them that way (it could also use lists, so that can be changed). There is no Dataset code in the tutorial (I meant that I’m trying to take the tutorial further by using Datasets/DataLoaders).

My question is not about the outer tuple becoming a list; it is about why:

['A', 'd', 'a', 'm'] becomes [('A',), ('d',), ('a',), ('m',)]

Is this because of my tuple decision above?

Thanks for the update!

The output “form” of your batches is defined by the default collate function from the DataLoader, as shown here.

You could thus pass in a custom collate function or change the data wrapping in your dataset.
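To make the origin of the one-element tuples concrete, here is a minimal pure-Python sketch. It is a simplified imitation of how default collation treats nested sequences of strings, not the actual PyTorch source, and `collate_sketch` is a hypothetical name used only for illustration. The idea is that sequences are transposed with `zip`, so with `batch_size=1` every character ends up alone in its own tuple.

```python
def collate_sketch(batch):
    """Simplified imitation of default collation for nested str sequences."""
    first = batch[0]
    if isinstance(first, str):
        # Strings are not stacked; the group of strings is returned as-is.
        return batch
    if isinstance(first, (list, tuple)):
        # Transpose: group the i-th element of every sample together,
        # then collate each group recursively.
        # Note: zip stops at the shortest sample; this sketch does no
        # length checking.
        return [collate_sketch(group) for group in zip(*batch)]
    raise TypeError(f"unsupported type: {type(first)}")


adam = (['A', 'd', 'a', 'm'], ['d', 'a', 'm', '<EOS>'], 'English')
bill = (['B', 'i', 'l', 'l'], ['i', 'l', 'l', '<EOS>'], 'English')

# batch_size=1: a batch is a list holding a single sample.
one = collate_sketch([adam])
print(one)

# batch_size=2: each tuple now holds one character per sample.
two = collate_sketch([adam, bill])
print(two[0])
```

With one sample per batch, each group produced by `zip` has exactly one element, which is why `'A'` shows up as `('A',)`. With two samples, the same position would hold `('A', 'B')` instead.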

Could you post what the desired batch should look like?

Ah okay! This is exciting!
So say I just want it in the same form as the stored data:

[(['A', 'd', 'a', 'm'], ['d', 'a', 'm', '<EOS>'], 'English'),
 (['B', 'i', 'l', 'l'], ['i', 'l', 'l', '<EOS>'], 'English'),
 (['C', 'h', 'e', 't'], ['h', 'e', 't', '<EOS>'], 'English')]

Shouldn’t this be the default? Actually, maybe I get what you’re saying. Is it that collate is where I’d get the batch ready for the model directly? Are there any resources you’d recommend?

I’m not sure I understand the question correctly.
The collate function gets the samples from multiple calls into Dataset.__getitem__ and creates a batch, which you will get in the DataLoader loop.
If you are fine with the current shape, then there is no need to write a custom collate function.
However, if you would like to change something, have a look at this doc about the collate_fn.
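As a rough sketch of the custom route, assuming the goal is a batch that keeps each sample exactly as `__getitem__` returned it: a collate function receives the list of samples and can simply hand it back. The name `keep_samples` is hypothetical, and the batch below is built by hand to stand in for what the loader would pass in.

```python
def keep_samples(batch):
    """Hypothetical collate function: return the samples unchanged."""
    # `batch` is the list of items returned by Dataset.__getitem__,
    # one item per index in the batch.
    return list(batch)


data = [
    (['A', 'd', 'a', 'm'], ['d', 'a', 'm', '<EOS>'], 'English'),
    (['B', 'i', 'l', 'l'], ['i', 'l', 'l', '<EOS>'], 'English'),
    (['C', 'h', 'e', 't'], ['h', 'e', 't', '<EOS>'], 'English'),
]

# Hand-built stand-in for the samples the loader would gather
# for one batch with batch_size=2.
batch = keep_samples(data[:2])
print(batch)
```

With the `SampleDataset` from the first post, this would be wired in through the `collate_fn` argument, e.g. `DataLoader(x, batch_size=2, collate_fn=keep_samples)`. Converting characters to tensors for the model could also happen inside such a function, but that part is left out here.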


Thanks @ptrblck, that actually answered my question :slight_smile: I know where to look now to change the format of my Dataloader batch.
