Training multiple models on different data in parallel

Hello everyone!

I’m trying to train multiple models on different data in parallel. At the end, I aggregate the parameters of the trained models, so the goal is to gain a speedup by parallelizing the training process.
Each model is trained inside a Node instance. This Node class is responsible for loading the data and instantiating the Dataset and DataLoader objects for both training and validation.
These Node instances live in separate processes, and my main process communicates with them using pipes from the torch.multiprocessing package.
However, my child processes just freeze when creating the Dataset instances. Through great detective work (printing “hi” in the code :D) I found that nothing executes after I convert the DataFrame loaded from the file to tensors.

from typing import List
import pandas as pd
import torch
from torch.utils.data import Dataset

class MyDataset(Dataset):
  def __init__(self, df: pd.DataFrame, y_labels: str | List[str], X_labels: str | List[str], seq_len: int):
    self.y_labels = y_labels
    self.X_labels = X_labels
    self.seq_len = seq_len
    # Convert the selected columns to tensors and put them in shared memory.
    self.y = torch.tensor(df[y_labels].values).float().share_memory_()  # <- freezes here: nothing prints after this line
    self.X = torch.tensor(df[X_labels].values).float().share_memory_()
    print('hi!!')
    ...
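
For context, the node processes are created roughly like this (a simplified sketch, not my exact code; node_worker, df_shards and config are just placeholder names, and the real Node class does more):

import torch.multiprocessing as mp

def node_worker(conn, df, config):
    # Runs inside the child process: build the Node, train, and send the
    # trained parameters back through the pipe.
    node = Node(df, **config)   # Node builds the datasets/loaders and the model
    node.setup()                # <- this is where the child freezes, inside MyDataset.__init__
    node.train()
    conn.send(node.model.state_dict())
    conn.close()

parent_conns, processes = [], []
for df_shard in df_shards:      # one DataFrame shard per node
    parent_conn, child_conn = mp.Pipe()
    p = mp.Process(target=node_worker, args=(child_conn, df_shard, config))
    p.start()
    parent_conns.append(parent_conn)
    processes.append(p)

state_dicts = [conn.recv() for conn in parent_conns]  # collect the trained parameters
for p in processes:
    p.join()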

I’m trying to use all my cores during training, but when I instantiate the Node that is supposed to train the model, it freezes at that exact spot. This is the code I use to split my dataset into training, validation, and test sets and build the loaders:

def setup(self):
    # 60/20/20 split into train / validation / test.
    n = self.df.shape[0]
    train_size = int(0.6 * n)
    val_size = int(0.2 * n)

    self.train_loss = []
    self.val_loss = []

    self.df_train = self.df.iloc[:train_size].copy()
    self.df_val = self.df.iloc[train_size:train_size + val_size].copy()
    self.df_test = self.df.iloc[train_size + val_size:].copy()

    self.df_train = self.scale(self.df_train)
    self.df_val = self.scale(self.df_val)
    self.df_test = self.scale(self.df_test)

    self.train_dataset = MyDataset(self.df_train, seq_len=self.sequence_length, y_labels=self.y_label, X_labels=self.X_labels)
    self.val_dataset = MyDataset(self.df_val, seq_len=self.sequence_length, y_labels=self.y_label, X_labels=self.X_labels)
    self.test_dataset = MyDataset(self.df_test, seq_len=self.sequence_length, y_labels=self.y_label, X_labels=self.X_labels)

    self.train_loader = DataLoader(self.train_dataset, batch_size=self.batch_size, shuffle=True)
    self.val_loader = DataLoader(self.val_dataset, batch_size=self.batch_size, shuffle=True)
    self.test_loader = DataLoader(self.test_dataset, batch_size=self.batch_size, shuffle=True)

I’m on Linux and I’m training only on the CPU. I’ve also called share_memory() on all my models.
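
By that I just mean the standard nn.Module call, something like this (SomeModel is a placeholder for my actual architecture):

model = SomeModel()
model.share_memory()  # moves the model's parameters into shared memory before handing it to the child process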

I am not sure why this is happening. I don’t have much experience with PyTorch, let alone multiprocessing with PyTorch. I humbly ask for your opinion and help! Thank you!

Update: the issue was solved when I switched the start method from fork to spawn. Still not clear on why, though…
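
Concretely, the change was along these lines (a sketch; the exact call site in my code may differ):

import torch.multiprocessing as mp

if __name__ == '__main__':
    # Must be called once, before any Process or Pipe is created.
    mp.set_start_method('spawn')
    main()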