I’ve already read the PyTorch tutorial on loading one’s data through the DataLoader. But unlike the example, I’m currently not working with images. Instead I have a really big pandas DataFrame and I’m trying to build an embedding. So each time the program calls the `__getitem__` function, it actually has to randomly pick positive and negative examples for my embedding (with a prior probability distribution over the whole table), something like skip-gram.
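To make the sampling concrete, here is roughly the scheme I have in mind (just a sketch, not my actual code; the function name and the choice of uniform negatives are placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def sample_indices(prior, n_close=1, n_rand=1):
    # sketch: draw "positive" row indices according to the prior
    # distribution over the whole table, and "negative" row indices
    # uniformly at random
    positive = rng.choice(len(prior), size=n_close, p=prior)
    negative = rng.choice(len(prior), size=n_rand)
    return positive, negative
```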
The first thing I tried was putting the pandas DataFrame as an attribute of the dataset class. But this ends up consuming a lot of memory when I use several workers (I assume the data gets replicated for each worker process). Something like this:
```python
import pandas as pd
from torch.utils.data import Dataset

class EmbeddingData(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def get_samples(self, chosen_item, n_close=1, n_rand=1):
        ## randomly choose positive and negative examples with respect to idx
        ...
        return positive, negative

    def __getitem__(self, idx):
        chosen_item = self.df.iloc[idx:(idx + 1)]
        positive, negative = self.get_samples(chosen_item)
        return positive, negative
```
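For context, this is roughly how I consume the dataset (the batch size and worker count below are just what I happen to use):

```python
from torch.utils.data import DataLoader

dataset = EmbeddingData(df)
# memory usage blows up as num_workers grows
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for positive, negative in loader:
    pass  # training step for the embedding goes here
```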
Does anybody have an idea of how to streamline this process? I thought about using the big table as a global variable, but I’m not really comfortable with that idea.
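For completeness, this is the global-variable workaround I had in mind (a sketch; `set_global_df` and the class name are just mine for illustration, and I’m not sure it actually avoids the duplication):

```python
import pandas as pd
from torch.utils.data import Dataset

# module-level table: workers started with the "fork" start method would
# inherit it from the parent process instead of each dataset instance
# carrying its own copy -- though copy-on-write may still duplicate pages
GLOBAL_DF = None

def set_global_df(df):
    global GLOBAL_DF
    GLOBAL_DF = df

class EmbeddingDataGlobal(Dataset):
    def __len__(self):
        return len(GLOBAL_DF)

    def __getitem__(self, idx):
        chosen_item = GLOBAL_DF.iloc[idx:(idx + 1)]
        # same positive/negative sampling as above would go here
        return chosen_item
```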
Thanks in advance