Is there any case that 'cuda' mode doesn't work?

malofleur · December 28, 2018, 7:22am

I designed my LSTM model like this:

import torch
import torch.nn as nn
from torch.autograd import Variable

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
class LSTMpred(nn.Module):
    def __init__(self, input_size, hidden_dim):
        super(LSTMpred, self).__init__()
        self.input_dim = input_size
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(input_size, hidden_dim)
        self.hidden2out = nn.Linear(hidden_dim, 1)
        self.hidden = self.initHidden()

    def initHidden(self):
        return (Variable(torch.zeros(1, 1, self.hidden_dim)),
                Variable(torch.zeros(1, 1, self.hidden_dim)))

    def forward(self, *input):
        x = input[0]
        lstm_out, self.hidden = self.lstm(
            x.view(len(x), 1, -1), self.hidden
        )
        outdat = self.hidden2out(lstm_out.view(len(x), -1))
        return outdat

I use model = LSTMModel.LSTMpred(1,40).to(device) to initialize my model, but when I try to train it, it gets stuck at the line modelout = model(indata), and the whole program exits after a few seconds with the message Process finished with exit code -1073741819 (0xC0000005).
However, when I set the device to the CPU, like device = torch.device('cpu'), it works.
I’m confused. Why does this happen? I would appreciate your help.

smth · December 29, 2018, 1:04am

Is it possible that CUDA itself is not working correctly on your machine?
Do any other programs work correctly in CUDA mode?

malofleur · December 29, 2018, 1:13am

Yes. If I run the MNIST example model provided by PyTorch itself with the device set to 'cuda', it works.

smth · December 29, 2018, 1:32am

One thing I do see is that model = LSTMModel.LSTMpred(1,40).to(device) does not move self.hidden over to the GPU, because it’s not registered as a parameter or a buffer.

I think you want to call self.initHidden() in the forward function, rather than in the constructor.
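To illustrate the point, here is a minimal sketch of which attributes a module-level .to(...) call touches. It uses a dtype conversion as a stand-in for a device move so that it runs without a GPU; the same registration rules apply to .to('cuda'). The Demo class and its attribute names are hypothetical and not part of the original post.

```python
import torch
import torch.nn as nn


class Demo(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # Parameters of submodules are registered, so .to(...) converts them.
        self.linear = nn.Linear(2, 2)
        # A registered buffer is also tracked by the module.
        self.register_buffer("buf", torch.zeros(2))
        # A plain tensor attribute is NOT tracked; .to(...) leaves it alone.
        self.plain = torch.zeros(2)


m = Demo().to(torch.float64)
print(m.linear.weight.dtype)  # converted along with the module
print(m.buf.dtype)            # converted along with the module
print(m.plain.dtype)          # left at the default float32
```

The tuple stored in self.hidden in the original model behaves like the plain attribute here: it stays wherever it was created.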

Something like:

def forward(self, *input):
    hidden = self.initHidden()

self.hidden doesn’t get moved to the GPU because it is neither an instance of nn.Parameter nor declared as a buffer via https://pytorch.org/docs/stable/nn.html?highlight=register_buffer#torch.nn.Module.register_buffer, so PyTorch doesn’t know to move it onto the GPU when you call .to('cuda').
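A fuller sketch of that suggestion, assuming the hidden state should be reset on every forward call: the zeros are created on the same device and dtype as the input, so nothing is left behind on the CPU. The init_hidden signature, the dropped Variable wrapper, and the example input shape are assumptions for illustration. This addresses only the device placement; whether it resolves the crash is not established in this thread.

```python
import torch
import torch.nn as nn


class LSTMpred(nn.Module):
    def __init__(self, input_size, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.lstm = nn.LSTM(input_size, hidden_dim)
        self.hidden2out = nn.Linear(hidden_dim, 1)

    def init_hidden(self, x):
        # Create (h0, c0) on the same device and dtype as the input tensor.
        shape = (1, 1, self.hidden_dim)
        h0 = torch.zeros(shape, device=x.device, dtype=x.dtype)
        c0 = torch.zeros(shape, device=x.device, dtype=x.dtype)
        return h0, c0

    def forward(self, x):
        # Fresh hidden state per call, so no CPU tensor lingers on the module.
        hidden = self.init_hidden(x)
        lstm_out, _ = self.lstm(x.view(len(x), 1, -1), hidden)
        return self.hidden2out(lstm_out.view(len(x), -1))


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = LSTMpred(1, 40).to(device)
indata = torch.randn(5, 1, device=device)  # assumed shape: 5 time steps, 1 feature
modelout = model(indata)
print(modelout.shape)
```

If the hidden state has to persist across calls instead, it would need to be moved explicitly or stored in a way the module tracks.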

malofleur · December 29, 2018, 2:16am

It still doesn’t work.
I tried debugging to find the problem, and it gets stuck at the following code (file: torch/nn/_functions/rnn.py):

output, hy, cy, reserve, new_weight_buf = torch._cudnn_rnn(
            input, weight_arr, weight_stride0,
            flat_weight,
            hx, cx,
            mode, hidden_size, num_layers,
            batch_first, dropout, train, bool(bidirectional),
            list(batch_sizes.data) if variable_length else (),
            dropout_ts)

These are the parameters used above.

Can you figure out the reason?

_ike · October 11, 2019, 5:57am

Did you figure this out? I have run into the same problem.