Training multiple models in parallel

I’m trying to train multiple models in parallel. The models do not share weights, but they all receive the same inputs. I followed the guidelines for multiprocessing, but for some reason the newly created process hangs when concatenating multiple tensors. Code snippet attached: the code hangs on the concatenation line (the assignment to tensor_c). Running the same code without parallelism on the same machine works fine.
I’m using PyTorch 0.4.0.

import torch

def do_train(text_a, text_b):
    try:
        tensor_a = embed_text(text_a)
        tensor_b = embed_text(text_b)
        tensor_c = torch.cat([tensor_a, tensor_b], -1)  # the child process hangs here
    except Exception as e:
        print(e)

def run(text_a, text_b):
    from torch.multiprocessing import Process
    process = Process(target=do_train, args=(text_a, text_b))
    process.start()
    process.join()

As long as those models fit on your GPU(s), you don’t need multiprocessing. Just create one dataloader and feed the same data to all your models.
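A minimal sketch of that single-process approach, for illustration only. The train_all helper, the two toy models, the random data, the MSE loss and the SGD optimizer are all assumptions made for this example and do not come from the original code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_all(models, loader, epochs=1, lr=0.1):
    """Train every model on the same batches, with one optimizer per model."""
    optimizers = [torch.optim.SGD(m.parameters(), lr=lr) for m in models]
    loss_fn = nn.MSELoss()  # hypothetical loss, chosen only for the toy data
    last_losses = [None] * len(models)
    for _ in range(epochs):
        for inputs, targets in loader:
            # Each batch is loaded once and reused by every model.
            for i, (model, opt) in enumerate(zip(models, optimizers)):
                opt.zero_grad()
                loss = loss_fn(model(inputs), targets)
                loss.backward()
                opt.step()
                last_losses[i] = loss.item()
    return last_losses


# Toy data and two hypothetical models that share no weights.
torch.manual_seed(0)
data = TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
loader = DataLoader(data, batch_size=16)
models = [
    nn.Linear(10, 1),
    nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1)),
]
losses = train_all(models, loader, epochs=2)
```

The models are updated one after another within each batch, so nothing runs concurrently here; the point is only that one dataloader can serve all of them.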

So how do I train the models in parallel? What about CPU training?

I’m not at all sure you can do that (training with multiprocessing on CPU).
Keep in mind that training is a sequential algorithm, so you cannot train a model that uses shared weights this way.
I know a little about how torch multiprocessing lets you share variables between processes, but I can’t see how you could make use of that here.

To do something more or less similar to what you are doing, you should write a model-agnostic do_train function that takes the model you are working with as an input. Each call to the function should then receive a different model.

To do the cat, you should pull the outputs of both networks to the main process.
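As a rough sketch of those two suggestions: a worker function that accepts any model, and a parent process that collects the outputs and does the concatenation itself. The names do_forward and run_parallel, the two nn.Linear stand-in models, the queue-based hand-off and the "spawn" start method are all assumptions made for this example, not the original poster's code.

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp


def do_forward(index, model, inputs, queue):
    """Model-agnostic worker: runs whichever model it is handed."""
    with torch.no_grad():
        out = model(inputs)
    # Tag the result with its index so the parent can restore the order.
    # tolist() sends plain Python lists, so the result is pickled by value.
    queue.put((index, out.tolist()))


def run_parallel(models, inputs, timeout=120):
    # Assumption: the "spawn" start method, which needs the __main__ guard below.
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    procs = [ctx.Process(target=do_forward, args=(i, m, inputs, queue))
             for i, m in enumerate(models)]
    for p in procs:
        p.start()
    # Collect one result per child before joining.
    results = dict(queue.get(timeout=timeout) for _ in procs)
    for p in procs:
        p.join()
    # The concatenation happens here, in the main process.
    outputs = [torch.tensor(results[i]) for i in range(len(models))]
    return torch.cat(outputs, dim=-1)


if __name__ == "__main__":
    # Two hypothetical models with different output sizes.
    models = [nn.Linear(4, 3), nn.Linear(4, 5)]
    print(run_parallel(models, torch.randn(2, 4)).shape)
```

This only covers a forward pass. Training in the children would also need each process to own its optimizer and to send back whatever the parent needs.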

If the problem is core jumps, try using

process = Process(target=do_train, args=(text_a.clone(), text_b.clone()))

to clone the tensors before they are handed to the new process.

Thanks for your help, but cloning the tensors did not solve the issue

Can you post a minimal example that throws the error? I tried to reproduce it but couldn’t.

  1. Changing the hyperparameters (input dimension, number of layers, etc.) affects which forward steps are reached when executing from the child process. In any case, the forward pass from the child process never completes.
  2. If I change the start method from fork to spawn, it works, but this is not possible in my original code since it has no “__main__” statement.

import torch
import torch.nn as nn

import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self, input_dim, output_dim = 150,  num_layers = 8):
        super(MyModel, self).__init__()
        self.lstm_1 = nn.LSTM(input_dim, output_dim, bidirectional=True, num_layers=num_layers)
        self.lstm_2 = nn.LSTM(input_dim, output_dim, bidirectional=True, num_layers=num_layers)
        self.hidden_to_tag = nn.Linear(output_dim * 4, 10)
        self.num_layers = num_layers
        self.output_dim = output_dim

    def forward(self, inputs):
        try:
            print("Inside forward method")
            seq_len = inputs.shape[1]
            hidden = (torch.randn(self.num_layers * 2, seq_len, self.output_dim),
                      torch.randn(self.num_layers * 2, seq_len, self.output_dim))  # clean out hidden state
            out_1, hidden_1 = self.lstm_1(inputs, hidden)
            print("First LSTM done")
            out_2, hidden_2 = self.lstm_2(inputs, hidden)
            print("Sec LSTM done")

            # Concatenate the two bidirectional outputs along the feature dimension
            embeddings = torch.cat([out_1, out_2], dim=2)
            print("Concat done")

            tag_space = self.hidden_to_tag(embeddings)
            print("Hidden done")
            tag_scores = F.log_softmax(tag_space, dim=1)
            return tag_scores
        except Exception as e:
            print("Handling exception in forward pass")

import torch
from scripts.my_model import MyModel
import torch.multiprocessing as mp

print("loading: " + __name__)

output_dim = 200
num_layers = 8
input_dim = 100
seq_len = 20
batch_size = 20
parallel_model_count = 1

def create_input(input_dim = input_dim, seq_len = seq_len, batch_size = batch_size):
    inputs = [torch.randn(seq_len, input_dim) for _ in range(batch_size)]  # one (seq_len, input_dim) tensor per batch item
    inputs = torch.stack(inputs)
    return inputs

def create_targets(batch_size, seq_length):
    return torch.randint(0, 10, (batch_size * seq_length,)).long()

def do_forward(model, inputs, from_child_process=False):
    log = "Start forward from CHILD process" if from_child_process else "Start forward from MAIN process"
    print(log)
    res = model.forward(inputs)
    return res

def run():
    model = MyModel(input_dim=input_dim, output_dim=output_dim, num_layers=num_layers)
    net_inputs = create_input()
    res = do_forward(model, net_inputs)

    print("Done first forward")

    jobs = []
    ctx = mp.get_context('fork')

    for i in range(parallel_model_count):
        p = ctx.Process(target=do_forward, args=(model, net_inputs, True))
        jobs.append(p)
        p.start()

    print("Started child process")

    # Wait for every child process to finish
    for proc in jobs:
        proc.join()
    print("Done all flow")

from scripts.run_inner import run
print("loading: " + __name__)

if __name__ == '__main__':
    run()

Well, I’m afraid it runs fine for me.
I tried:

output_dim = 200
num_layers = 50
input_dim = 1000
seq_len = 200
batch_size = 20
parallel_model_count = 5

Are you using PyTorch 1? This worked for me with PyTorch 0.4.1 on an Intel CPU.
The kernel crashed when I used the above configuration with a batch size of 50, but it only displayed a "cannot allocate memory" error.