LSTM model on cuda fails during backward calls

lakehanne · May 18, 2017, 4:13pm

How do you correctly ship an lstm model class to gpu?

I have a model defined as:

class StackRegressive(nn.Module):
    def __init__(self, **kwargs):
        super(StackRegressive, self).__init__()
        # Backprop Through Time (Recurrent Layer) Params
        self.noutputs       = kwargs['noutputs']
        self.num_layers     = kwargs['numLayers']
        self.input_size     = kwargs['inputSize']
        self.hidden_size    = kwargs['nHidden']
        self.batch_size     = kwargs['batchSize']
        self.noutputs       = kwargs['noutputs']
        self.cuda           = kwargs['cuda']

        self.criterion = nn.MSELoss(size_average=False)
        self.fc = nn.Linear(32, self.noutputs)

        #define the recurrent connections
        self.lstm1 = nn.LSTM(self.input_size, self.hidden_size[0], self.num_layers, bias=False, batch_first=False, dropout=0.3)
        self.lstm2 = nn.LSTM(self.hidden_size[0], self.hidden_size[1], self.num_layers, bias=False, batch_first=False, dropout=0.3)
        self.fc    = nn.Linear(self.hidden_size[1], self.noutputs)
      
        if self.cuda:
            self.lstm1 = self.lstm1.cuda()
            self.lstm2 = self.lstm2.cuda()
            self.fc    = self.fc.cuda()

    def forward(self, x):
        nBatch = x.size(0)

        # Forward propagate RNN layer 1
        out, state_0 = self.lstm1(x)

        # Forward propagate RNN layer 2
        out, state_1 = self.lstm2(out)

        # Decode hidden state of last time step
        out = self.fc(out[:, -1, :])

        out = out.view(nBatch, -1)

        return out

When I construct the model’s instance, e.g.

regressor = StackRegressive(res_cube=res_classifier, inputSize=128, nHidden=[64,32,12], noutputs=12, batchSize=args.cbatchSize, cuda=args.cuda, numLayers=2)

I am able to run the program. But occasionally, I get a bad_cast runtime error:

Traceback (most recent call last):
  File "./main.py", line 390, in <module>
    main(args)
  File "./main.py", line 384, in main
    trainClassifierRegressor(train_loader, bbox_loader, resnet, args)
  File "./main.py", line 322, in trainClassifierRegressor
    rloss.backward()
  File "/home/lex/anaconda2/envs/py27/lib/python2.7/site-packages/torch/autograd/variable.py", line 146, in backward
    self._execution_engine.run_backward((self,), (gradient,), retain_variables)
RuntimeError: std::bad_cast

Why does it fail during the backward() call? When I run the model on the CPU, I do not have this problem.

Gautam_Bhattacharya · May 18, 2017, 6:19pm

I am pretty new, so there is a very good chance I am wrong.
Have you tried not calling .cuda() in the class definition, and instead calling it after you instantiate the class?

myawesomenet = StackRegressive()
myawesomenet.cuda()

lakehanne · May 18, 2017, 6:22pm

That gives:

  File "train.py", line 283, in trainClassifierRegressor
    regressor = regressor.cuda()
TypeError: 'bool' object is not callable

Gautam_Bhattacharya · May 18, 2017, 6:25pm

And if you just do regressor.cuda()?
(without the "regressor = ")
I am out of ideas after that.

lakehanne · May 18, 2017, 6:26pm

The former is not an in-place transfer of an object to the GPU. So it must be the latter, I am sure.

smb · May 18, 2017, 6:46pm

It is an in-place operation (although, unfortunately, the function name does not make that obvious).

http://pytorch.org/docs/_modules/torch/nn/modules/module.html#Module

miguelvr · May 19, 2017, 12:20am

You can do this at the end of the __init__ function:

if torch.cuda.is_available():
    self.cuda()

instead of the multiple calls… However, I don’t know if it’s the source of the error.

EDIT: Also, it is probably not a good idea to have an attribute with the same name as a class method (self.cuda and self.cuda()).

lakehanne · May 21, 2017, 12:02am

Thank you! The self.cuda variable was the problem. Do not know why I did not foresee it would be an issue.
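
For anyone who lands here with the same "'bool' object is not callable" error, here is a minimal sketch of what seems to be going on. It uses a hypothetical stand-in class instead of the real torch.nn.Module so it runs without PyTorch or a GPU; the names Module, Shadowed, Renamed, try_move and use_cuda are made up for illustration. The assumption is that nn.Module stores a plain bool like any ordinary Python instance attribute, so assigning self.cuda = True hides the inherited cuda() method.

```python
class Module:
    """Hypothetical stand-in for torch.nn.Module; it only mimics .cuda()."""

    def __init__(self):
        self.device = "cpu"

    def cuda(self):
        # Mimics the behaviour discussed above: modify in place, return self.
        self.device = "cuda"
        return self


class Shadowed(Module):
    def __init__(self, cuda):
        super().__init__()
        # The instance attribute hides the inherited cuda() method.
        self.cuda = cuda


class Renamed(Module):
    def __init__(self, use_cuda):
        super().__init__()
        # A distinct name (an assumption, any non-clashing name works)
        # leaves the cuda() method reachable.
        self.use_cuda = use_cuda


def try_move(model):
    """Return the device after calling .cuda(), or the error text on failure."""
    try:
        return model.cuda().device
    except TypeError as err:
        return str(err)


shadowed_result = try_move(Shadowed(cuda=True))
renamed_result = try_move(Renamed(use_cuda=True))
print(shadowed_result)  # 'bool' object is not callable
print(renamed_result)   # cuda
```

With the flag renamed, both regressor.cuda() and regressor = regressor.cuda() behave the same way in this sketch, since the method returns the object it just modified.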