I have a custom model class which is basically a seq2seq network (encoder-decoder).
I have 4 GPUs available and need to parallelize the training across them.
Here is the main part of my training file:
import torch
import torch.nn as nn

model = Model(in_size, num_rnn_units_per_layer, out_size,
              num_rnn_layers, dropout, embed_size=embed_size)
if is_cuda:
    model.cuda()
model = nn.DataParallel(model, dim=0).cuda()
num_gpu = torch.cuda.device_count()

hidden_enc = model.module.encoder.init_hidden(batch_size, num_gpu)
hidden_dec = model.module.decoder.init_hidden(batch_size, num_gpu)

for x, y in train_reader.iter():
    print(x.shape)       # input shape is "bsz x seq_len"
    output, hidden_enc, hidden_dec = model(x, hidden_enc, hidden_dec)
    print(output.shape)  # output shape is "bsz/num_gpu x seq_len"
init_hidden is the function that initializes the hidden state of the encoder/decoder:
def init_hidden(self, bsz, num_gpu):
    bsz //= num_gpu  # pre-divide the batch, since DataParallel splits the input across GPUs
    return Variable(torch.randn(2 * self.n_layers, bsz, self.hidden_size)).cuda()
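To make the shapes concrete, here is a standalone sketch of init_hidden with placeholder sizes (n_layers=2 and hidden_size=100 are made up for illustration, and I dropped .cuda() so it runs anywhere):

import torch
from torch.autograd import Variable

# Standalone copy of init_hidden with placeholder sizes, purely to show
# the resulting hidden-state shape after the num_gpu division.
def init_hidden(bsz, num_gpu, n_layers=2, hidden_size=100):
    bsz //= num_gpu  # split the batch across GPUs
    return Variable(torch.randn(2 * n_layers, bsz, hidden_size))

h = init_hidden(32, 4)
print(h.size())  # torch.Size([4, 8, 100]); note the batch dimension sits on dim 1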
As mentioned in the comments of the training file, what happens is:
After a forward call of the model on the full input, I only get a partial output.
That is, if the input shape is bsz x seq_len, the output shape comes back as (bsz / num_gpu) x seq_len.
I'm confident the divisor is num_gpu because I've tried the training with different numbers of GPUs, and the output batch dimension always shrinks by exactly that factor. (It works fine with 1 GPU, of course.)
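For comparison, here is a minimal sketch with a hypothetical toy module (not my actual Model) showing the shape behaviour I expected from nn.DataParallel: the input is scattered along dim 0, the replicas run in parallel, and the per-GPU outputs are gathered back, so the output batch size should match the input batch size:

import torch
import torch.nn as nn
from torch.autograd import Variable

# Hypothetical toy module, just to check DataParallel's scatter/gather shapes.
class Toy(nn.Module):
    def __init__(self):
        super(Toy, self).__init__()
        self.fc = nn.Linear(8, 8)

    def forward(self, x):
        return self.fc(x)

toy = nn.DataParallel(Toy(), dim=0).cuda()
x = Variable(torch.randn(16, 8)).cuda()  # bsz = 16
out = toy(x)
print(out.size())  # torch.Size([16, 8]): outputs are gathered back to the full bsz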
Any help on this is much appreciated!