Bad results when training with RTX 30-series cards

Hi all,
I have an RNN that gives very good results when I train it with CUDA 10.1 on Google Colab or with CUDA 11.0 on a Quadro P4000. When I train it on any RTX 30-series card, however, the gradients always diverge too soon and the result is not as good (sometimes it does not converge at all).

It’s the same code and the same hyper-parameters.

Has anyone encountered the same problem?

Could you post the model definition and explain your use case a bit?
Which 30 series are you using exactly?

Here’s a copy of my LSTM definition:

import torch
import torch.nn as nn

class LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, output_dim):
        super(LSTM, self).__init__()
        # Hidden dimensions
        self.hidden_dim = hidden_dim
        # Number of hidden layers
        self.num_layers = num_layers
        # Building your LSTM
        # batch_first=True causes input/output tensors to be of shape
        # (batch_dim, seq_dim, feature_dim)
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers, batch_first=True, dropout=0.3)
        # Readout layer
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Initialize hidden state with zeros
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).cuda().requires_grad_()
        # Initialize cell state
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).cuda().requires_grad_()
        # One time step
        # We need to detach as we are doing truncated backpropagation through time (BPTT)
        # If we don't, we'll backprop all the way to the start even after going through another batch
        out, (hn, cn) = self.lstm(x, (h0.detach(), c0.detach()))
        # Index hidden state of last time step
        # out.size() --> 100, 28, 100
        # out[:, -1, :] --> 100, 100 --> just want last time step hidden states!
        out = self.fc(out[:, -1, :])
        # out.size() --> 100, 10
        return out

model = LSTM(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_layers=num_layers)

model =

loss_fn = torch.nn.MSELoss(reduction='mean')

optimizer = torch.optim.Adam(model.parameters(), lr=0.005)

I’m using an LSTM RNN for time-series prediction: it uses past trajectory data to predict the next position.

The training problem I described is best shown in the two images below.

The top graph is from training on a Google Colab T4 or our Quadro P4000.
The bottom graph is from training on our 3070 and 3090.

The train score diverges much earlier and the model doesn’t perform as well.

Thank you

Could you post the scale of the loss as well?
Also, how reproducible is this issue? I.e. if you rerun the code with different seeds, is the run on the 3090 always diverging earlier? Since the model is diverging in each case, I’m unsure if an “earlier divergence” would be problematic (it would be, if the model always converged in one setup).
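One way to check the seed question is to repeat the same training run under several fixed seeds on each GPU and compare the resulting curves. The sketch below is only an illustration: `set_seed`, `run_with_seeds`, and the `train_fn` callback are hypothetical names, not part of the code posted in this thread, and `train_fn` stands in for whatever function builds the model, trains it, and returns a score.

```python
import random

import numpy as np
import torch


def set_seed(seed):
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    # torch.manual_seed seeds the CPU generator and the CUDA generators.
    torch.manual_seed(seed)


def run_with_seeds(train_fn, seeds=(0, 1, 2)):
    """Call train_fn once per seed and collect whatever it returns.

    train_fn is a hypothetical zero-argument callable that should create
    the model, train it, and return a result (e.g. a final loss).
    """
    results = {}
    for seed in seeds:
        set_seed(seed)
        results[seed] = train_fn()
    return results
```

If the 30-series runs are worse than the P4000 runs for every seed, the difference is unlikely to be explained by random initialization alone.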

Here’s an older dataset with a different set of hyperparameters. The top graph is from training with Colab and the P4000; the bottom graph is from training with the 3070 and 3090. The issue is always reproducible in my case: the accuracy of the model trained on the 30-series cards is always lower. I’m training for 1000-1500 epochs.

Thank you! Which dataset are you using?

I’m using simulation data that we generated in our lab.

Would it be possible to share (some of) this data or explain its statistics?

Hi, thanks for your support with my venture into deep learning.
Unfortunately the data is confidential for now.
I’m new to ML and am only using the RNN for curve fitting in my engineering work.
The network takes in voltage data from a number of sensors and predicts the position of the device in terms of x, y, and rotation about the Z axis. The training data is labeled with the measured position.

Sure, you shouldn’t share internal data.
Would it be possible to get the input shapes of each batch as well as just the min./max./mean of the input and target, so that we could create a fake dataset in order to reproduce the issue?
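The requested numbers (shape, min, max, mean) can be collected without exposing the data itself. A minimal sketch, assuming the inputs and targets are already available as tensors; `describe` is a hypothetical helper name, not something from the posted code:

```python
import torch


def describe(name, tensor):
    """Return the shape and basic statistics of a tensor as a dict."""
    t = tensor.float()
    return {
        "name": name,
        "shape": tuple(t.shape),
        "min": t.min().item(),
        "max": t.max().item(),
        "mean": t.mean().item(),
    }
```

Calling it on one input batch and one target batch, e.g. `print(describe("input", x_batch))`, gives enough information to build a synthetic dataset of the same layout.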

I’m not sure what you mean by input shape, but I normalize the data to [-1, 1] for all parameters, and the scaling happens outside of the model.
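The normalization code itself was not posted, so the following is only a sketch of one common way to scale data to [-1, 1]: per-column min-max scaling with NumPy. The function name `scale_to_unit_range` and the per-column choice are assumptions for illustration.

```python
import numpy as np


def scale_to_unit_range(data, data_min=None, data_max=None):
    """Linearly map each column of `data` to the range [-1, 1].

    If data_min / data_max are not given, they are computed from `data`.
    Passing the training-set values lets validation and test data be
    scaled with the same constants.
    """
    data = np.asarray(data, dtype=np.float64)
    if data_min is None:
        data_min = data.min(axis=0)
    if data_max is None:
        data_max = data.max(axis=0)
    # Avoid division by zero for constant columns.
    span = np.where(data_max > data_min, data_max - data_min, 1.0)
    scaled = 2.0 * (data - data_min) / span - 1.0
    return scaled, data_min, data_max
```

Returning the minimum and maximum alongside the scaled array makes it possible to undo the scaling on the model's predictions later.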

I meant the input tensor shapes, e.g. as in input = torch.randn(2, 3, 4, 5).

Here’s my data loader code:

def load_data(x_data, y_data, step_size):
    data_samples = []
    labels = []

    for i in range(int(x_data.shape[0]/step_size)):
        data_samples.append(x_data[i*step_size: i*step_size + step_size, :])
        labels.append(y_data[i*step_size + step_size-1, : ])

    shuffle_ls = list(zip(data_samples, labels))
    data_samples, labels = zip(*shuffle_ls)

    test_set_size = int(np.round(0.15*len(data_samples)))
    val_set_size   = int(np.round(0.15*len(data_samples)))
    train_set_size = len(data_samples) - test_set_size - val_set_size

    x_train = np.asarray(data_samples[0:train_set_size])
    y_train = np.asarray(labels[0:train_set_size])

    x_val  = np.asarray(data_samples[train_set_size:train_set_size + val_set_size])
    y_val  = np.asarray(labels[train_set_size:train_set_size + val_set_size])

    x_test  = np.asarray(data_samples[train_set_size + val_set_size: ])
    y_test  = np.asarray(labels[train_set_size + val_set_size: ])

    return [x_train, y_train, x_val, y_val, x_test, y_test]

x_train, y_train, x_val, y_val, x_test, y_test = load_data(x_data_normalized, y_data_normalized, step_size = 150)

print("X_train shape: ", x_train.shape)
print("Y_train shape: ", y_train.shape)
print("X_val shape: ", x_val.shape)
print("Y_val shape: ", y_val.shape)
print("X_test shape: ", x_test.shape)
print("Y_test shape: ", y_test.shape)

# make training and test sets in torch
x_train = torch.from_numpy(x_train).type(torch.Tensor)
x_test = torch.from_numpy(x_test).type(torch.Tensor)
x_val = torch.from_numpy(x_val).type(torch.Tensor)

y_train = torch.from_numpy(y_train).type(torch.Tensor)
y_test = torch.from_numpy(y_test).type(torch.Tensor)
y_val = torch.from_numpy(y_val).type(torch.Tensor)

print("Tensor Size: ", "="*30)

print(y_train.size(), x_train.size())
print(y_val.size(), x_val.size())
print(y_test.size(), x_test.size())

print("Making Data Loader: ", "="*50)

batch_size = 128

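train = torch.utils.data.TensorDataset(x_train, y_train)
val   = torch.utils.data.TensorDataset(x_val, y_val)
test  = torch.utils.data.TensorDataset(x_test, y_test)

train_loader = torch.utils.data.DataLoader(dataset=train, batch_size=batch_size, shuffle=False)
val_loader   = torch.utils.data.DataLoader(dataset=val, batch_size=batch_size, shuffle=False)
test_loader  = torch.utils.data.DataLoader(dataset=test, batch_size=batch_size, shuffle=False)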

And the tensor shapes:

X_train shape:  (10500, 150, 6)
Y_train shape:  (10500, 2)
X_val shape:  (2250, 150, 6)
Y_val shape:  (2250, 2)
X_test shape:  (2250, 150, 6)
Y_test shape:  (2250, 2)
Tensor Size:  ==============================
torch.Size([10500, 2]) torch.Size([10500, 150, 6])
torch.Size([2250, 2]) torch.Size([2250, 150, 6])
torch.Size([2250, 2]) torch.Size([2250, 150, 6])
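With these shapes, a stand-in dataset for reproducing the issue could look like the sketch below. It assumes uniform random values in [-1, 1), which matches only the reported shapes, value range, and batch size of 128; it carries none of the temporal structure of the real sensor data. `make_fake_loader` is a hypothetical helper name.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_fake_loader(num_samples=10500, seq_len=150, num_features=6,
                     num_targets=2, batch_size=128, seed=0):
    """Build a DataLoader of random data shaped like the training set."""
    gen = torch.Generator().manual_seed(seed)
    # torch.rand samples from [0, 1); rescale to [-1, 1).
    x = torch.rand(num_samples, seq_len, num_features, generator=gen) * 2 - 1
    y = torch.rand(num_samples, num_targets, generator=gen) * 2 - 1
    return DataLoader(TensorDataset(x, y), batch_size=batch_size, shuffle=False)
```

Each batch from `make_fake_loader()` then has an input of shape (128, 150, 6) and a target of shape (128, 2), except for a smaller final batch.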