Could you rerun your code via `CUDA_LAUNCH_BLOCKING=1 python script.py args` and post the stack trace here?
The error points to an invalid kernel launch configuration. Are you using any custom CUDA code in your application? If not, which PyTorch and CUDA versions are you using, and on which GPU?
After moving the model to the GPU I got the error “rnn: hx is not contiguous”, so I added `.contiguous()` to the 5th line of `forward()` to fix it. Could the CUDA error have something to do with that?
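For reference, here is a minimal sketch of what I mean (placeholder module, not my actual code): cuDNN expects a contiguous hidden state, so a non-contiguous `hx` triggers the error, and calling `.contiguous()` on it works around it.

```python
import torch

# Minimal sketch (placeholder module, not the actual code):
# cuDNN requires a contiguous hidden state, so a non-contiguous hx
# raises "rnn: hx is not contiguous" on the GPU.
rnn = torch.nn.RNN(input_size=4, hidden_size=8, batch_first=True)
x = torch.randn(2, 3, 4)

# expand() creates a non-contiguous view (stride 0 in the batch dim)
hx = torch.zeros(1, 1, 8).expand(1, 2, 8)
assert not hx.is_contiguous()

# .contiguous() copies the data into dense memory, avoiding the error
out, h_n = rnn(x, hx.contiguous())
```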
It could be related. Could you post an executable code snippet that reproduces the initial error using this module?
It should work if you set this environment variable before importing any other library that might initialize the CUDA context. Since it’s often not straightforward to do this properly in a Jupyter notebook, I usually recommend running the script in a terminal instead.
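For example, in a plain Python script the variable can also be set from Python itself, as long as this happens at the very top, before the first CUDA-related import (a minimal sketch):

```python
import os

# Set the variable before importing torch (or any other library that
# could initialize the CUDA context); later changes would have no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # CUDA kernel launches after this import run synchronously
```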