Best way to handle a DataLoader with a big pandas DataFrame


I’ve already read the PyTorch tutorial on loading one’s data through the DataLoader. But unlike the example, I’m not working with images. Instead I have a really big pandas DataFrame and I’m trying to build an embedding. So each time the program calls `__getitem__`, it has to randomly pick (with a prior probability distribution over the whole table) positive and negative examples for my embedding (something like skip-gram).

The first thing I tried was storing the pandas DataFrame as an attribute of the dataset class. But this ends up consuming a lot of memory when I use several workers (I assume the data gets replicated for each process). Something like this:

from torch.utils.data import Dataset


class EmbeddingData(Dataset):

    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def get_samples(self, chosen_item):
        ## randomly choose positive and negative examples with respect to chosen_item
        return positive, negative

    def __getitem__(self, idx):
        chosen_item = self.df.iloc[idx]
        positive, negative = self.get_samples(chosen_item)
        return positive, negative

Does anybody have an idea of how to streamline this process? I thought about using the big table as a global variable, but I’m not really comfortable with that idea.
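To make the global-variable idea concrete, here is a minimal sketch of what I had in mind. It assumes a Linux-style `fork` start method, where module-level NumPy arrays created before the workers spawn are inherited copy-on-write (so they are not duplicated as long as no worker writes to them). The names `ITEMS`, `WEIGHTS`, and the sampling logic inside `__getitem__` are just illustrative placeholders, not my real data:

```python
import numpy as np
import pandas as pd

# Build the table once at module level, then keep only raw NumPy arrays.
# Under fork-based DataLoader workers these arrays are shared copy-on-write
# instead of being pickled and replicated per process.
_df = pd.DataFrame({"item": np.arange(10), "weight": np.linspace(0.0, 1.0, 10)})
ITEMS = _df["item"].to_numpy()
WEIGHTS = _df["weight"].to_numpy()
del _df  # drop the DataFrame; only the contiguous arrays stay in memory


class EmbeddingData:
    """Dataset that reads from the module-level arrays instead of
    holding its own copy of the big table."""

    def __len__(self):
        return len(ITEMS)

    def __getitem__(self, idx):
        anchor = ITEMS[idx]
        # Placeholder: draw one positive and one negative index from the
        # prior distribution over the whole table.
        probs = WEIGHTS / WEIGHTS.sum()
        pos, neg = np.random.choice(len(ITEMS), size=2, p=probs)
        return anchor, ITEMS[pos], ITEMS[neg]


ds = EmbeddingData()
anchor, pos, neg = ds[3]
```

I’m still not sure this is cleaner than just accepting the per-worker copies, which is why I’m asking.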

Thanks in advance