I’ve already read the PyTorch tutorial on loading one’s data through the DataLoader. But unlike the example, I’m currently not working with images. Instead I have a really big pandas DataFrame and I’m trying to build an embedding. So each time the program calls the `__getitem__` function, it actually has to randomly pick positive and negative examples for my embedding (with a prior probability distribution over the whole table), something like skip-gram.
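To make the sampling concrete, here is roughly the scheme I have in mind (just a sketch, not my actual code; the function name and the choice of uniform negatives are placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def sample_indices(prior, n_close=1, n_rand=1):
    # sketch: draw "positive" row indices according to the prior
    # distribution over the whole table, and "negative" row indices
    # uniformly at random
    positive = rng.choice(len(prior), size=n_close, p=prior)
    negative = rng.choice(len(prior), size=n_rand)
    return positive, negative
```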
The first thing I tried was putting the pandas DataFrame as an attribute of the dataset class. But this ends up consuming a lot of memory when I use several workers (I assume the data gets replicated for each worker process). Something like this:
```python
import pandas as pd
from torch.utils.data import Dataset

class EmbeddingData(Dataset):
    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def get_samples(self, chosen_item, n_close=1, n_rand=1):
        ## randomly choose positive and negative examples with respect to idx
        ...
        return positive, negative

    def __getitem__(self, idx):
        chosen_item = self.df.iloc[idx:(idx + 1)]
        positive, negative = self.get_samples(chosen_item)
        return positive, negative
```
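For context, this is roughly how I consume the dataset (the batch size and worker count below are just what I happen to use):

```python
from torch.utils.data import DataLoader

dataset = EmbeddingData(df)
# memory usage blows up as num_workers grows
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

for positive, negative in loader:
    pass  # training step for the embedding goes here
```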
Does anybody have an idea of how to streamline this process? I thought about using the big table as a global variable, but I’m not really comfortable with that idea.
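For completeness, this is the global-variable workaround I had in mind (a sketch; `set_global_df` and the class name are just mine for illustration, and I’m not sure it actually avoids the duplication):

```python
import pandas as pd
from torch.utils.data import Dataset

# module-level table: workers started with the "fork" start method would
# inherit it from the parent process instead of each dataset instance
# carrying its own copy -- though copy-on-write may still duplicate pages
GLOBAL_DF = None

def set_global_df(df):
    global GLOBAL_DF
    GLOBAL_DF = df

class EmbeddingDataGlobal(Dataset):
    def __len__(self):
        return len(GLOBAL_DF)

    def __getitem__(self, idx):
        chosen_item = GLOBAL_DF.iloc[idx:(idx + 1)]
        # same positive/negative sampling as above would go here
        return chosen_item
```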
Thanks in advance