CUDA error in Colab

I am trying to run a script that uses CUDA in a Google Colab notebook. However, I get the following error when I initialize my neural ODE network:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Here is where the error is occurring:

# Some code above

# Set data type to doubles
torch.set_default_tensor_type(torch.DoubleTensor)

# Set the device
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Some code below

class NeuralModel(nn.Module):
    """
    A simple neural ODE with nlayers fully connected internal layers and ninternal internal variables

    The network should account for (stress, internal_state, strain_rate, T) = 3 + nstate inputs
    and have (stress, internal_state) = 1 + nstate outputs

    Args:
        w_in:                       (n_features, n_inputs) size tensor containing the weights for the input layer
        b_in:                       (n_features,) size tensor containing the biases for the input layer
        w_hid:                      (n_features, n_features, n_layers) size tensor containing the weights for each of n_layers hidden layers
        b_hid:                      (n_features, n_layers) size tensor containing the biases for each of n_layers hidden layers
        w_out:                      (n_outputs, n_features) size tensor containing the weights for the output layer
        b_out:                      (n_outputs,) size tensor containing the biases for the output layer

        activation (optional):      the activation function to use in the hidden layers; default is Sigmoid
        out_activation (optional):  the activation function to use in the output layer; default is Sigmoid
    """

    def __init__(self, w_in, b_in, w_hid, b_hid, w_out, b_out, erate, T, time, activation = nn.Sigmoid()):
        super().__init__()
        
        self.w_in = w_in
        self.b_in = b_in
        self.w_hid = w_hid
        self.b_hid = b_hid
        self.w_out = w_out
        self.b_out = b_out

        self.activation = activation

        # Check that the number of output features is exactly 2 less than the number of input features
        if self.w_in.shape[1] - self.w_out.shape[0] != 2:
            raise ValueError("The number of input features must be exactly 2 greater than the number of output features")
        
        self.model = self.network_factory()
        self.initialize_weights()

        self.model.nsize = self.w_out.shape[0]

        self.d0 = torch.zeros((1000,)).to(device)

        self.force1_interp = utility.ArbitraryBatchTimeSeriesInterpolator(time, erate)
        self.force2_interp = utility.ArbitraryBatchTimeSeriesInterpolator(time, T)

    
    def network_factory(self):
        '''
        Simple factory function to create the network
        '''
        layers = []
        layers.append(nn.Linear(self.w_in.shape[1], self.w_in.shape[0]))
        layers.append(self.activation)

        for i in range(self.w_hid.shape[2]):
            layers.append(nn.Linear(self.w_hid.shape[1], self.w_hid.shape[0]))
            layers.append(self.activation)

        layers.append(nn.Linear(self.w_out.shape[1], self.w_out.shape[0]))

        return nn.Sequential(*layers)
    
   # Redacted some other code

Specifically, the error points to the line self.d0 = torch.zeros((1000,)).to(device). I have also set my device to 'cuda:0'.

  1. This is the first time I am encountering this error, so what is the issue here?
  2. How can I resolve this issue?
  3. How can I prevent this issue from occurring again in the future?

Thanks a lot, and I appreciate the help.


The error means that device refers to an invalid CUDA device. Are you sure a GPU is available and not masked via environment variables (e.g. CUDA_VISIBLE_DEVICES)?
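A minimal sketch of how this could be checked: compare the index in the device string with the number of GPUs PyTorch can see, and look at CUDA_VISIBLE_DEVICES. The helper names (parse_ordinal, ordinal_is_valid, diagnose) are made up for illustration; only torch.cuda.is_available() and torch.cuda.device_count() are actual PyTorch calls.

```python
import os


def parse_ordinal(device_str):
    """Return the GPU index in a device string, or None for non-CUDA devices.

    A bare 'cuda' is treated as index 0 here, which is a simplification:
    PyTorch resolves it to the current device.
    """
    kind, _, index = device_str.partition(":")
    if kind != "cuda":
        return None
    return int(index) if index else 0


def ordinal_is_valid(device_str, device_count):
    """True if the device string refers to a GPU index below device_count."""
    ordinal = parse_ordinal(device_str)
    if ordinal is None:
        return True  # CPU (or another backend): nothing to validate here
    return 0 <= ordinal < device_count


def diagnose(device_str="cuda:0"):
    """Collect what PyTorch sees; needs torch installed (imported lazily)."""
    import torch

    count = torch.cuda.device_count()
    return {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "is_available": torch.cuda.is_available(),
        "device_count": count,
        "ordinal_valid": ordinal_is_valid(device_str, count),
    }
```

Calling print(diagnose('cuda:0')) right before the failing line shows whether the process can see the GPU at that point. If device_count is 0 or CUDA_VISIBLE_DEVICES is an empty string, the GPU is hidden from the process.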

I have set the notebook to use the T4 GPU, and torch.cuda.is_available() returned True.

Edit:

I also believe that Torch is recognizing the GPU correctly:

torch.cuda.get_device_name(0)

Out:

Tesla T4

In that case the error message might be misleading, and something else might be failing before it is raised. Did you try exporting CUDA_LAUNCH_BLOCKING=1 before running the script, as suggested in the error message?
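One notebook-specific caveat: an export in a `!` shell cell runs in a separate subprocess and does not change the environment of the Python kernel that is already running. Below is a sketch of setting the variable from Python instead, assuming it runs in a fresh runtime before any CUDA work. The helper name enable_blocking_launches is illustrative and not part of PyTorch.

```python
import os
import sys


def enable_blocking_launches():
    """Set CUDA_LAUNCH_BLOCKING=1 for this process.

    Returns True if torch had not been imported yet, i.e. the setting is
    certain to be in place before CUDA is initialized.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return "torch" not in sys.modules


set_early = enable_blocking_launches()
if not set_early:
    # Conservative hint: the variable may have been set too late to matter.
    print("torch is already imported; if CUDA was already used, "
          "restart the runtime and run this cell first")
```

With the variable in place from the start, the stack trace should point at the call that actually failed rather than at a later one.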

Yes, I did. I still get the same error message.

I got around this error by running the Python files themselves on Colab; I couldn’t find a direct fix.
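The post does not say exactly how the files were run. One plausible reading, stated here as an assumption, is that the script was run in a fresh Python process (for example with `!python` followed by the script name in a Colab cell) rather than inside the notebook kernel. A fresh process does not inherit any CUDA state left over from earlier cells, which may be why it helped. A rough sketch with a hypothetical helper run_in_fresh_process:

```python
import os
import subprocess
import sys


def run_in_fresh_process(script_path, extra_env=None):
    """Run a Python script in a new interpreter and return the CompletedProcess."""
    env = dict(os.environ)
    env.update(extra_env or {})  # e.g. {"CUDA_LAUNCH_BLOCKING": "1"}
    return subprocess.run(
        [sys.executable, script_path],
        env=env,
        capture_output=True,
        text=True,
    )
```

The result's stdout and stderr attributes hold the script's output, so a CUDA error raised inside the script can still be inspected from the notebook.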

Can you please explain in detail what you did? I am facing a similar problem in Colab.