How to use DataParallel with custom classes?

I have this RNN class:

import torch
import torch.nn as nn
from torch.autograd import Variable

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers

        self.encoder = nn.Embedding(input_size, hidden_size)
        self.GRU = nn.GRU(hidden_size, hidden_size, n_layers)
        self.decoder = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden, batch_size):
        # seq_len is defined globally in the full script
        input = self.encoder(input.view(batch_size, seq_len))
        input = input.permute(1, 0, 2)  # (batch, seq, features) -> (seq, batch, features)
        output, hidden = self.GRU(input, hidden)
        output = self.decoder(output.view(batch_size, self.hidden_size))
        return output, hidden

    def init_hidden(self, batch_size):
        return Variable(torch.randn(self.n_layers, batch_size, self.hidden_size).cuda())

Note that both the forward and init_hidden functions receive a batch_size argument.

Then I create my model using DataParallel to run it on multiple GPUs:

net = torch.nn.DataParallel(RNN(n_chars, hidden_size, n_chars, n_layers)).cuda()

Then I start training the model:

for inp, target in training_set:
    hidden = net.init_hidden(batch_size)
    ...

However, I get this error:

Traceback (most recent call last):
  File "char-rnn.py", line 178, in <module>
    hidden = net.init_hidden(batch_size)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 237, in __getattr__
    return object.__getattr__(self, name)
AttributeError: type object 'object' has no attribute '__getattr__'

What is the correct way to wrap my RNN class into DataParallel, so that it can use init_hidden function?

P.S. here’s the full code


I think I can access the init_hidden() method using net.module.init_hidden(), at least it runs past that line now.
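In other words, DataParallel keeps the wrapped model in its .module attribute, so custom methods like init_hidden have to be called through it (a minimal sketch based on my code above):

net = torch.nn.DataParallel(RNN(n_chars, hidden_size, n_chars, n_layers)).cuda()
hidden = net.module.init_hidden(batch_size)    # custom method lives on the underlying module
output, hidden = net(inp, hidden, batch_size)  # the forward pass still goes through the wrapper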

It still does not work though, because with batch_size=64, I want to give each of my four GPUs 16 training samples, but instead I get this error:

Traceback (most recent call last):
  File "char-rnn.py", line 185, in <module>
    output, hidden = net(c, hidden, batch_size)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 61, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 71, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 45, in parallel_apply
    raise output
RuntimeError: size '[64 x 1]' is invalid for input of with 16 elements at /py/conda-bld/pytorch_1490981920203/work/torch/lib/TH/THStorage.c:59

So it seems like each GPU (or maybe just one GPU, not sure) gets the full batch of 64 samples. Any ideas?

Did you try storing the hidden state as an attribute of the model?

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers, batch_size):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers 
        
        self.encoder = nn.Embedding(input_size, hidden_size)
        self.GRU = nn.GRU(hidden_size, hidden_size, n_layers) 
        self.decoder = nn.Linear(hidden_size, output_size)
        self.init_hidden(batch_size)
        
    def forward(self, input, batch_size):
        # seq_len is defined globally in the full script
        input = self.encoder(input.view(batch_size, seq_len))
        input = input.permute(1, 0, 2)
        output, self.hidden = self.GRU(input, self.hidden)
        output = self.decoder(output.view(batch_size, self.hidden_size))
        return output

    def init_hidden(self, batch_size):
        self.hidden = Variable(torch.randn(self.n_layers, batch_size, self.hidden_size).cuda())

Thanks! I tried your suggestion, but it only addresses my initial issue, which I had already worked around with the net.module.init_hidden call; it doesn’t help with the second issue, the batch not being split between the GPUs.

My guess is it’s related to the tensors inside the forward function not having batch_size as their first dimension. I used the batch_first=True parameter in the GRU initialization, so that I could use inputs shaped as (batch, seq_len). However, I’m not sure what to do with the hidden tensor, which is (num_layers, batch, hidden_size).
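One idea (just a sketch, I haven’t tried it) would be to keep the hidden state batch-first outside the model, so DataParallel can scatter it along dim 0 like the input, and transpose it back inside forward:

    def forward(self, input, hidden):
        # hidden arrives as (batch, num_layers, hidden_size) so DataParallel splits it on dim 0;
        # transpose it back to the (num_layers, batch, hidden_size) layout the GRU expects
        hidden = hidden.transpose(0, 1).contiguous()
        input = self.encoder(input)               # (batch, seq_len) -> (batch, seq_len, hidden_size)
        output, hidden = self.GRU(input, hidden)  # batch_first=True
        output = self.decoder(output)
        return output, hidden.transpose(0, 1)     # hand the hidden state back batch-first

    def init_hidden(self, batch_size):
        # batch-first shape on purpose, so DataParallel can scatter it along dim 0
        return Variable(torch.randn(batch_size, self.n_layers, self.hidden_size).cuda())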

For now, I’m using the RNN class like this:

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers 
        
        self.encoder = nn.Embedding(input_size, hidden_size)
        self.GRU = nn.GRU(hidden_size, hidden_size, n_layers, batch_first=True) 
        self.decoder = nn.Linear(hidden_size, output_size)
        
        
    def forward(self, input, batch_size):
        self.init_hidden(batch_size)
        input = self.encoder(input)
        output, self.hidden = self.GRU(input, self.hidden) 
        output = self.decoder(output) #batch_first
        return output
    
    def init_hidden(self, batch_size):
        self.hidden = Variable(torch.randn(self.n_layers, batch_size, self.hidden_size).cuda())

I get the following error (batch_size=64, hidden_size=256, 2 layers, 4 GPUs):

Traceback (most recent call last):
  File "char-rnn.py", line 232, in <module>
    output = net(c, batch_size)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 61, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 71, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 45, in parallel_apply
    raise output
RuntimeError: Expected hidden size (2, 16L, 256), got (2L, 64L, 256L)

Here’s the full modified code: https://gist.github.com/michaelklachko/540428fc112f5a6b06e842bb6a3f5e1e

This should work:

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size, n_layers, batch_size):
        super(RNN, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.encoder = nn.Embedding(input_size, hidden_size)
        self.GRU = nn.GRU(hidden_size, hidden_size, n_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_size, output_size)
        self.init_hidden(batch_size)  # batch_size has to be passed to the constructor

    def forward(self, input, batch_size):
        input = self.encoder(input)
        output, self.hidden = self.GRU(input, self.hidden)
        output = self.decoder(output.squeeze())  # batch_first: output is (batch, 1, hidden), drop the seq dim
        return output

    def init_hidden(self, batch_size):
        self.hidden = Variable(torch.randn(self.n_layers, batch_size, self.hidden_size).cuda())

It seems that the GRU adds an extra dimension to the output: the output size was (64, 1, 256), and just squeezing it did the trick.
Note that you should init the hidden state only once, so the call to init_hidden belongs in the __init__ function (maybe it was a typo).

Also, after correcting this bug and some others, I couldn’t run your code over the whole set because of an out-of-memory error, but maybe it will be OK on your system.


Does not work for me, unfortunately. The problem happens in this line in the forward function (before it gets to the decoder):

output, self.hidden = self.GRU(input, self.hidden)

Can you try running this code, and post the output? I reduced the hidden_size, so you should be able to fit it in memory.

I’m getting the following:

...
forward function: encoding input of shape: (16L, 1L)
forward function: encoding input of shape: (16L, 1L)
forward function: encoding input of shape: (16L, 1L)
forward function: encoding input of shape: (16L, 1L)
forward function: GRU processing input of shape: (16L, 1L, 100L)
forward function: GRU processing input of shape: (16L, 1L, 100L)
forward function: GRU processing input of shape: (16L, 1L, 100L)
forward function: GRU processing input of shape: (16L, 1L, 100L)

Traceback (most recent call last):
  File "gru2.py", line 166, in <module>
    output = net(c, batch_size)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 206, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 61, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 71, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs)
  File "/root/miniconda2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 45, in parallel_apply
    raise output
RuntimeError: Expected hidden size (2, 16L, 100), got (2L, 64L, 100L)

Note that the encoder is executed correctly in parallel on 4 GPUs, but the GRU call fails because the hidden tensor does not get split (I’m guessing because its leading dimension is not batch_size).

you could try using the batch_first option for nn.GRU, maybe? (so that the leading dimension is indeed the batch)

alternatively, you can manually write your DataParallel in terms of scatter/gather/parallel_apply
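Roughly something like this (an untested sketch; the device ids are just an example, and it assumes forward takes (input, hidden) with a batch-first input, and that net is the plain RNN module, not the DataParallel wrapper):

from torch.nn.parallel import replicate, scatter, parallel_apply, gather

device_ids = [0, 1, 2, 3]

replicas = replicate(net, device_ids)           # one copy of the model per GPU
inputs = scatter(input, device_ids, dim=0)      # split the batch-first input along dim 0
hiddens = scatter(hidden, device_ids, dim=1)    # split the hidden state along its batch dim (dim 1)
outputs = parallel_apply(replicas, list(zip(inputs, hiddens)))
output = gather(outputs, device_ids[0], dim=0)  # collect the per-GPU outputs back on GPU 0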

Were you able to use the DataParallel function with your RNN class @michaelklachko ?
I seem to face the same issue.

Something like this should do the trick:

net = torch.nn.DataParallel(RNN(n_chars, hidden_size, n_chars, n_layers), dim=1).cuda()
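With dim=1, DataParallel scatters every tensor argument along dimension 1, so with the GRU’s default sequence-first layout both the input (seq_len, batch, features) and the hidden state (num_layers, batch, hidden_size) get split along their batch dimension. A rough usage sketch (assuming forward takes (input, hidden) and infers the batch size from the input instead of taking it as an argument):

net = torch.nn.DataParallel(RNN(n_chars, hidden_size, n_chars, n_layers), dim=1).cuda()

hidden = net.module.init_hidden(batch_size)   # (n_layers, batch, hidden_size)
for inp, target in training_set:
    # inp: (seq_len, batch); both inp and hidden are scattered along dim 1 (the batch dim)
    output, hidden = net(inp, hidden)
    # ... loss, backward, optimizer step as before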


Hey @michaelklachko,
Thanks for the prompt reply, but I am still unable to train my code on multiple GPUs. It would be great if you could have a look at this thread: DataParallel not working as expected