Data loader without labels?

f3ba · January 19, 2020, 6:03pm

Is there a way to use the DataLoader machinery with unlabeled data?

ptrblck · January 20, 2020, 2:11am

Yes, DataLoader doesn’t impose any conditions on the number of outputs of your Dataset, as seen here:

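import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        # 100 unlabeled samples with one feature each
        self.data = torch.randn(100, 1)

    def __getitem__(self, index):
        # Return only the sample; no target is required
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)

dataset = MyDataset()
loader = DataLoader(
    dataset,
    batch_size=5,
    num_workers=2,
    shuffle=True
)

for data in loader:
    print(data.shape)  # torch.Size([5, 1]) for each batch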

naty88 · April 11, 2021, 2:33pm

Hello, I am working with DataLoader for the first time and have run into some problems.
I have my data in a JSON file like {“doc_id_1”: [sentence1, sentence2, …], …} and want to create a DataLoader from this data so that I can then compute sentence embeddings. The problem is that there are no labels (doc_id is not a true label; it is just the ID of the document).

I defined the class in a straightforward way, as in the example above:

class DocDataset(Dataset):
    def __init__(self, json_file):
        self.data = json.load(open(json_file))

    def __getitem__(self, index):
        x = self.data[index]
        return x

    def __len__(self):
        return len(self.data)

json_file = 'dataset.json'
dataset = DocDataset(json_file)
loader = DataLoader(dataset, batch_size=5, num_workers=2, shuffle=True)

I might have a problem in the __getitem__ function, because I get a “KeyError”.

Thank you in advance!
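
A hedged note on the likely cause of the KeyError: the DataLoader's sampler passes integer positions from 0 to len(dataset) - 1 to __getitem__, while json.load returns a dict keyed by document-ID strings, so self.data[index] looks up an integer key that does not exist. One option is to flatten the dict into a list of sentences in __init__. The sketch below is plain Python so that it runs without PyTorch; the class name SentenceDataset and the sample documents are hypothetical, and in real use the class would subclass torch.utils.data.Dataset.

```python
import json
import os
import tempfile


class SentenceDataset:
    """Map-style dataset: integer index -> one sentence (no label)."""

    def __init__(self, json_file):
        # Close the file promptly instead of leaving the handle open.
        with open(json_file, encoding="utf-8") as f:
            docs = json.load(f)  # {"doc_id": [sentence, ...], ...}
        # Flatten so that positions 0..N-1 each map to exactly one sentence.
        self.sentences = [s for sents in docs.values() for s in sents]

    def __getitem__(self, index):
        return self.sentences[index]

    def __len__(self):
        return len(self.sentences)


# Hypothetical sample data in the same shape as the JSON file described above.
sample = {"doc_id_1": ["First sentence.", "Second sentence."],
          "doc_id_2": ["Third sentence."]}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False,
                                 encoding="utf-8") as tmp:
    json.dump(sample, tmp)

dataset = SentenceDataset(tmp.name)
os.remove(tmp.name)

print(len(dataset))   # 3
print(dataset[2])     # Third sentence.
```

Passing an object like this to DataLoader(dataset, batch_size=5, shuffle=True) should then yield batches that are lists of strings, since the default collate function leaves strings as they are.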

naty88 · April 11, 2021, 3:53pm

Hello again, I think I have solved the KeyError problem:

    def __getitem__(self, index):
        for i, sent_list in enumerate(self.data.values()):
            x = list(self.data.values())[i]
        return x

I can’t get data.shape as in your example, since I’m working with a list. And according to the next error I got, this must not be a list, so my definition is still wrong. Could you please tell me where I went wrong?

train_dataloader = DataLoader(dataset, batch_size=5, num_workers=2, shuffle=True)
train_loss = losses.CosineSimilarityLoss(model)
# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

AttributeError: 'list' object has no attribute 'texts'
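
One thing worth checking in the revised __getitem__: the loop never uses index, so it runs to the end and returns the last document's sentence list for every index. A minimal sketch of the difference, using a hypothetical in-memory dict in place of the JSON file (the function names are illustrative only):

```python
# Hypothetical stand-in for the loaded JSON: {doc_id: [sentences]}
data = {
    "doc_id_1": ["a1", "a2"],
    "doc_id_2": ["b1"],
    "doc_id_3": ["c1", "c2", "c3"],
}


def getitem_loop(index):
    # Mirrors the revised __getitem__: `index` is ignored, and x is
    # overwritten on each pass, so the last value always wins.
    for i, sent_list in enumerate(data.values()):
        x = list(data.values())[i]
    return x


# Build the list once (in __init__ for a real Dataset), then index into it.
documents = list(data.values())


def getitem_indexed(index):
    return documents[index]


print([getitem_loop(i) for i in range(3)])
# [['c1', 'c2', 'c3'], ['c1', 'c2', 'c3'], ['c1', 'c2', 'c3']]
print([getitem_indexed(i) for i in range(3)])
# [['a1', 'a2'], ['b1'], ['c1', 'c2', 'c3']]
```

Returning a whole list of sentences per item means the items have different lengths, which the default batching may not handle; returning a single sentence per index is probably the simpler route.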

ptrblck · April 12, 2021, 5:14am

I’m not sure where the new error is raised, as I cannot see any usage of the texts attribute. Could you check which function is trying to access this attribute and make sure it’s using the right object?
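
One way to find the function that accesses the attribute is to look at the innermost frame of the traceback. The sketch below uses only the standard library; collate_examples is a hypothetical stand-in for whatever library function reads .texts, not an actual API. If the model comes from the sentence-transformers library (an assumption based on losses.CosineSimilarityLoss and model.fit), its fit method most likely expects each dataset item to be an InputExample carrying texts and a label, which a plain list does not have; that loss also appears to need a similarity score per sentence pair, so purely unlabeled sentences may not be enough for it.

```python
import traceback


def collate_examples(batch):
    # Hypothetical stand-in for a library function that expects objects
    # with a `.texts` attribute rather than plain lists.
    collected = []
    for example in batch:
        collected.append(example.texts)
    return collected


try:
    # Each item is a plain list of sentences, as returned by the Dataset above.
    collate_examples([["sentence one", "sentence two"]])
except AttributeError as exc:
    # The last extracted frame is where the exception was actually raised.
    innermost = traceback.extract_tb(exc.__traceback__)[-1]
    culprit = innermost.name
    message = str(exc)
    print(f"raised in {culprit}() at line {innermost.lineno}: {message}")
```

In a real run, the full traceback printed by Python already contains this information; the bottom-most frame names the function and file that tried to read the attribute.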