Model fails on CUDA, not on CPU

Hi,

I’m trying to implement the model from a paper in PyTorch and have run into a very strange error. When the model is on the CPU, it runs with no problems. However, when I move it to the GPU, I get a warning about compacting weights and the first LSTM of the model fails. The model itself is below:

import torch.nn as nn
import torch
import torch.nn.functional as F

class AbduallahTransformerBlock(nn.Module):
    def __init__(self, input_shape, dropout=0.1, mha_dropout=0.1, n_heads=2, ff_dim=4):
        super().__init__()

        self.dropout = nn.Dropout(dropout)
        self.n_heads = n_heads
        self.ff_dim = ff_dim
        self.mha_dropout = mha_dropout

        self.mha = nn.MultiheadAttention(
            embed_dim=input_shape[1], num_heads=self.n_heads, dropout=self.mha_dropout
        )

        self.batch_norm = nn.BatchNorm1d(input_shape[1])

        self.conv1d_1 = nn.Conv1d(
            in_channels=input_shape[0],
            out_channels=self.ff_dim,
            kernel_size=1,
        )

        self.conv1d_2 = nn.Conv1d(
            in_channels=self.ff_dim, out_channels=input_shape[0], kernel_size=1
        )

        self.lstm = nn.LSTM(
            hidden_size=400, input_size=input_shape[1], num_layers=1, batch_first=True
        )

    def forward(self, input):
        x = self.dropout(input)
        print("TRANSFORMER BLOCK INPUT: ", x.shape)

        x, _ = self.mha(x, x, x)
        print("TRANSFORMER BLOCK MHA OUTPUT: ", x.shape)

        x = self.batch_norm(x.permute(0, 2, 1)).permute(0, 2, 1)

        x = x + input

        x = F.relu(self.conv1d_1(x))
        print("TRANSFORMER BLOCK CONV1D_1 OUTPUT: ", x.shape)

        x = F.relu(self.conv1d_2(x))
        print("TRANSFORMER BLOCK CONV1D_2 OUTPUT: ", x.shape)

        self.lstm.flatten_parameters()
        x, (h_n, c_n) = self.lstm(x)
        print("TRANSFORMER BLOCK LSTM OUTPUT: ", x.shape)

        return x


class SolarFlareNet(nn.Module):
    def __init__(
        self,
        input_size: tuple,
        dropout: float = 0.4,
        n_blocks: int = 4,
        out_logits: bool = True,
    ) -> None:
        super().__init__()
        print("input_size: ", input_size)
        self.out_logits = out_logits

        self.batch_norm_1 = nn.BatchNorm1d(input_size[1])

        self.conv1d_1 = nn.Conv1d(
            in_channels=input_size[0], out_channels=32, kernel_size=1
        )

        self.conv1d_2 = nn.Conv1d(in_channels=32, out_channels=32, kernel_size=1)

        self.lstm = nn.LSTM(
            input_size=32,
            hidden_size=400,
            batch_first=True,
        )

        self.dropout = nn.Dropout(dropout)

        self.batch_norm_2 = nn.BatchNorm1d(400)

        self.transformer_input_shape = (32, 400)

        self.transformer_encoder = nn.Sequential(
            *[
                AbduallahTransformerBlock(
                    self.transformer_input_shape,
                    dropout=dropout,
                    mha_dropout=dropout,
                    n_heads=2,
                    ff_dim=14,
                )
                for _ in range(n_blocks)
            ]
        )

        self.dense_1 = nn.Linear(
            self.transformer_input_shape[0] * self.transformer_input_shape[1], 200
        )

        self.dense_2 = nn.Linear(200, 500)

        self.dense_3 = nn.Linear(500, 1)

    def forward(self, input, metadata):
        x = input
        print("Input shape", x.shape)
        x = self.batch_norm_1(x.permute(0, 2, 1)).permute(0, 2, 1)

        print("Batch norm shape", x.shape)

        x = F.relu(self.conv1d_1(x))
        print("Conv1d_1 shape", x.shape)

        x = F.relu(self.conv1d_2(x))

        print("Conv1d_2 shape", x.shape)

        x, (h_n, c_n) = self.lstm(x)

        print("LSTM shape", x.shape)

        x = self.dropout(x)
        x = self.batch_norm_2(x.permute(0, 2, 1)).permute(0, 2, 1)

        print("Transformer input shape", x.shape)

        x = self.transformer_encoder(x)

        print("Transformer out size", x.shape)
        # x = torch.mean(x, dim=1)

        # Flatten the output
        x = torch.flatten(x, start_dim=1)

        x = F.relu(self.dense_1(x))
        x = F.relu(self.dense_2(x))
        x = self.dense_3(x)

        if not self.out_logits:
            x = torch.sigmoid(x)

        return x

To reproduce

model = SolarFlareNet((60, 40))
test_input = torch.rand((16,60,40))

model.to("cpu")
model(test_input, metadata={}) # This will run

model.to("cuda")
model(test_input.to("cuda"), metadata={}) # This will raise the errors below

And the errors I get when running on the GPU are:

/home/julio/.local/lib/python3.10/site-packages/torch/nn/modules/rnn.py:879: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters(). (Triggered internally at ../aten/src/ATen/native/cudnn/RNN.cpp:982.)
  result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[55], line 1
----> 1 model(fake_input.to("cuda:0"), metadata={})

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

Cell In[49], line 68, in SolarFlareNet.forward(self, input, metadata)
     64 x = F.relu(self.conv1d_2(x))
     66 print("Conv1d_2 shape", x.shape)
---> 68 x, (h_n, c_n) = self.lstm(x)
     70 print("LSTM shape", x.shape)
     72 x = self.dropout(x)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.local/lib/python3.10/site-packages/torch/nn/modules/rnn.py:879, in LSTM.forward(self, input, hx)
    876         hx = self.permute_hidden(hx, sorted_indices)
    878 if batch_sizes is None:
--> 879     result = _VF.lstm(input, hx, self._flat_weights, self.bias, self.num_layers,
    880                       self.dropout, self.training, self.bidirectional, self.batch_first)
    881 else:
    882     result = _VF.lstm(input, batch_sizes, hx, self._flat_weights, self.bias,
    883                       self.num_layers, self.dropout, self.training, self.bidirectional)

RuntimeError: shape '[64000, 1]' is invalid for input of size 51200

If anyone can help me find the source of this error, I’d really appreciate it. I’ve tested this both in my training script (remote machine) and on my local machine in an IPython instance (to make sure the error isn’t coming from any of the other training bits), and both produce the same error. I’ve also tried putting self.lstm.flatten_parameters() in the forward method before the call to the LSTM, and the same thing keeps happening.
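For reference, this is roughly the re-flattening the warning text suggests, as a minimal sketch (done once after the move to the GPU instead of inside forward):

import torch.nn as nn

def flatten_lstm_weights(module):
    # Re-compact the cuDNN weight buffers, as the UserWarning recommends.
    if isinstance(module, nn.LSTM):
        module.flatten_parameters()

model = SolarFlareNet((60, 40)).to("cuda")
model.apply(flatten_lstm_weights)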

EDIT: Added missing class declaration and removed relative import.

Your code doesn’t work:

    from .parts.abduallah_transformer_block import AbduallahTransformerBlock

ImportError: attempted relative import with no known parent package

My bad, I forgot to add one of the class declarations and to remove the relative import. It should work now, thanks for pointing it out!

Thanks for fixing the code.
Both runs fail with the same error:

model = SolarFlareNet((60, 40))
test_input = torch.rand((16,60,40))

model.to("cpu")
model(test_input, metadata={})
# RuntimeError: input.size(-1) must be equal to input_size. Expected 32, got 40

model.to("cuda")
model(test_input.to("cuda"), metadata={})
# RuntimeError: input.size(-1) must be equal to input_size. Expected 32, got 40
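
If I’m reading the shapes right, the mismatch is that the Conv1d layers leave the channel dimension in dim 1, so the LSTM (which with batch_first=True takes its feature size from the last dimension) sees 40 features instead of the 32 it was built with. A minimal standalone sketch of just that part (not your exact code):

import torch
import torch.nn as nn

x = torch.rand(16, 60, 40)                        # (batch, 60, 40)
conv = nn.Conv1d(in_channels=60, out_channels=32, kernel_size=1)
y = conv(x)                                       # (16, 32, 40): channels stay in dim 1
lstm = nn.LSTM(input_size=32, hidden_size=400, batch_first=True)
# lstm(y) raises the error above: the last dim (40) is read as the feature
# size, but the LSTM expects input_size=32.
out, _ = lstm(y.permute(0, 2, 1))                 # (16, 40, 32) -> runs, out: (16, 40, 400)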

Mmm, that’s strange. I’ve just tested it again and the CPU one runs for me, with the same error as before on the GPU. Could it be related to the PyTorch version? I’m using 2.1.0+cu121. Does it happen during the init call?

Another thing I’ve noticed is that, when I use a random test input, the first LSTM sometimes returns NaNs everywhere and sometimes doesn’t. Is that to be expected with random inputs?

So, the problem seems to have been a mismatch in dimensions. How that could have been working on the CPU and not on the GPU I have no idea, but it’s solved now.

Unsure, but maybe some shape checks were missing or broken in 2.1.0+cu121. I used the latest nightly binary from today, by the way.
Were you able to locate the shape mismatches and fix them?

Yes! Since I was translating some code from TensorFlow, I had assumed some things about how Conv1D works there that weren’t true; after changing that, everything works. Thanks!
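
In case it helps anyone else: as far as I understand it, the key difference is that Keras Conv1D is channels-last (batch, steps, channels) while PyTorch’s Conv1d is channels-first (batch, channels, steps). A rough sketch of the kind of change involved (not the exact code from the model):

import torch
import torch.nn as nn

x = torch.rand(16, 60, 40)                        # (batch, steps=60, features=40)
# PyTorch's Conv1d wants (batch, channels, steps), so permute around the conv
# to convolve over the steps axis the way Keras Conv1D does by default.
conv = nn.Conv1d(in_channels=40, out_channels=32, kernel_size=1)
y = conv(x.permute(0, 2, 1)).permute(0, 2, 1)     # back to (16, 60, 32)
lstm = nn.LSTM(input_size=32, hidden_size=400, batch_first=True)
out, _ = lstm(y)                                  # (16, 60, 400)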
