I’ve already read the PyTorch tutorial on loading one’s data through the DataLoader. But unlike the example, I’m currently not working with images. Instead I have a really big pandas DataFrame and I’m trying to build an embedding. So each time the program calls the `__getitem__` function, it actually has to randomly pick positive and negative examples for my embedding (with a prior probability distribution over the whole table), something like skip-gram.
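To make the sampling concrete, here is roughly the scheme I have in mind (just a sketch, not my actual code; the function name and the choice of uniform negatives are placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def sample_indices(prior, n_close=1, n_rand=1):
    # sketch: draw "positive" row indices according to the prior
    # distribution over the whole table, and "negative" row indices
    # uniformly at random
    positive = rng.choice(len(prior), size=n_close, p=prior)
    negative = rng.choice(len(prior), size=n_rand)
    return positive, negative
```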
The first thing I tried was putting the pandas DataFrame as an attribute of the dataset class. But this ends up consuming a lot of memory when I use several workers (I assume the data gets replicated for each worker process). Something like this:
```python
import pandas as pd
from torch.utils.data import Dataset

class EmbeddingData(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def get_samples(self, chosen_item, n_close=1, n_rand=1):
        ## randomly choose positive and negative examples with respect to idx
        ...
        return positive, negative

    def __getitem__(self, idx):
        chosen_item = self.df.iloc[idx:(idx + 1)]
        positive, negative = self.get_samples(chosen_item)
        return positive, negative
```
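For context, this is roughly how I consume the dataset (the batch size and worker count below are just what I happen to use):

```python
from torch.utils.data import DataLoader

dataset = EmbeddingData(df)
# memory usage blows up as num_workers grows
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for positive, negative in loader:
    pass  # training step for the embedding goes here
```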
Does anybody have an idea of how to streamline this process? I thought about using the big table as a global variable, but I’m not really comfortable with that idea.
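For completeness, this is the global-variable workaround I had in mind (a sketch; `set_global_df` and the class name are just mine for illustration, and I’m not sure it actually avoids the duplication):

```python
import pandas as pd
from torch.utils.data import Dataset

# module-level table: workers started with the "fork" start method would
# inherit it from the parent process instead of each dataset instance
# carrying its own copy -- though copy-on-write may still duplicate pages
GLOBAL_DF = None

def set_global_df(df):
    global GLOBAL_DF
    GLOBAL_DF = df

class EmbeddingDataGlobal(Dataset):
    def __len__(self):
        return len(GLOBAL_DF)

    def __getitem__(self, idx):
        chosen_item = GLOBAL_DF.iloc[idx:(idx + 1)]
        # same positive/negative sampling as above would go here
        return chosen_item
```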
Thanks in advance