Compatibility problems between the RTX 5090 and PyTorch

When training neural network models with PyTorch (both Stable 2.7.1 and Preview (Nightly)), the RTX 5090 sometimes performs worse than the RTX 4090. The test code is shown below. With the big dataset, the RTX 5090 is slower than the RTX 4090, but with the small dataset the RTX 5090 is faster. I am confused by this behavior.

import torch
import torch.nn as nn
import torch.optim as optim
import time

# Run on the GPU when available; the comparison above is between GPUs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# (1) Small data
# x = torch.rand((10000, 3), device=device)
# y = torch.rand((10000, 1), device=device)

# (2) Big data
x = torch.rand((100000, 3), device=device)
y = torch.rand((100000, 1), device=device)

class Net(nn.Module):
    """Fully connected network with sine activations."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(3, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 200)
        self.fc4 = nn.Linear(200, 200)
        self.fc5 = nn.Linear(200, 1)

    def forward(self, x):
        x = torch.sin(self.fc1(x))
        x = torch.sin(self.fc2(x))
        x = torch.sin(self.fc3(x))
        x = torch.sin(self.fc4(x))
        x = self.fc5(x)
        return x

# Make sure no GPU work is pending before starting the timer.
if device.type == "cuda":
    torch.cuda.synchronize()
T1 = time.time()

model = Net().to(device)
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(300):
    optimizer.zero_grad()
    outputs = model(x)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()

    if epoch % 100 == 0:
        print(f'Epoch {epoch}, Loss: {loss.item():.4f}')

# CUDA kernels launch asynchronously; synchronize so the elapsed
# time includes all queued GPU work.
if device.type == "cuda":
    torch.cuda.synchronize()
T2 = time.time()
print(f'Elapsed: {T2 - T1:.3f} s')

You could profile the end-to-end use case with e.g. Nsight Systems to narrow down the bottlenecks in your workload, as they could come from e.g. the data loading and processing stage rather than the GPU compute itself (since the small dataset performs well on the 5090 vs. the 4090).
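PyTorch's built-in profiler is also a quick way to see which kernels dominate. A minimal sketch, assuming the model, data, criterion, and optimizer from your script above are already defined and on the GPU:

import torch
from torch.profiler import profile, ProfilerActivity

# Assumes model, x, y, criterion, optimizer from the script above.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(20):  # a few iterations are enough for a profile
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Sort by accumulated GPU time to see where the time goes.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

For Nsight Systems, something like `nsys profile -o report python your_script.py` captures a timeline you can inspect in the GUI.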

Thank you for your prompt reply. In Preview (Nightly) I conducted several tests and found that for convolutional neural networks the 5090 had a significant advantage over the 4090. However, when training fully connected networks, especially on large datasets, the 5090 seemed to perform worse than the 4090. The data was not loaded or preprocessed; it was generated directly with torch.rand.
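For what it's worth, the loop can also be timed on the device side with CUDA events, which excludes host-side overhead from the measurement. A sketch, assuming the same model and data as above:

import torch

# Assumes model, x, y, criterion, optimizer from the script above.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(300):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
end.record()
torch.cuda.synchronize()

# elapsed_time returns the GPU time between the two events in milliseconds.
print(f'{start.elapsed_time(end):.1f} ms')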