Shuffling Concatenated Datasets

PopHSKE · March 21, 2024, 2:09am

Hello.

I am training a recommender model using DDP and currently have two datasets, one that returns a positive interaction (customer id and item id) and the other returns a negative example ( customer id and a random item id from a list of candidates for that user).

The data looks like

The first dataset takes in the user and item IDs and outputs these with a label of 1

class IntDataset(Dataset):
    def __init__(self, user_item_inters):
        self.inters = torch.LongTensor(user_item_inters) #  user and item
        
    def __len__(self):
        return len(self.inters)  # number interactions

    def __getitem__(self, idx):
        inter = self.inters[idx]

        return inter[0], inter[1], torch.DoubleTensor(np.array([1.0])).squeeze(-1)

The second one takes in the same customers as the first one but also the list of candidate items. It outputs a customer id, a chosen item and the label (0)

class NegDataset(Dataset):
    def __init__(self, custs, negs, len_negs):
        
        self.custs = torch.tensor(custs) #  user and negs
        self.negs = torch.tensor(negs)
        self.len_negs = len_negs
        
    def __len__(self):
        return len(self.custs)  # number users
    

    def __getitem__(self, idx):
        
        user = self.custs[idx] # userID
        neg = self.negs[idx]
        
        
        neg_item = neg[np.random.randint(low =0, high = self.len_negs, size =1)]

        return user, torch.tensor(neg_item).squeeze(-1),  torch.tensor(np.array([0.0])).squeeze(-1)

Then I concatenate them:

neg_dataset = NegDataset(custs = pdf_ints['cust'].values, negs = pdf_ints['wght_new_smpl'], len_negs = args.len_hard_samples)
  
int_dataset = IntDataset(user_item_inters = pdf_ints[['cust', 'item']].values)
    
dataset = torch.utils.data.ConcatDataset([int_dataset,neg_dataset])

If I load all the data in memory and shuffle the concat dataset, this seems to work. However, when i cant load into memory, I am not sure how to proceed. I can use two iterable datasets, reading from disk but these can’t be shuffled AFAIK, the concat dataset will be stacked with all the positives on top.