Hi PyTorch community,
I’m experiencing unusually slow training on my new NVIDIA GeForce RTX 5070 Laptop GPU. Even with the latest PyTorch nightly (2.9.0.dev20250704+cu128) and CUDA 12.8 installed, GPU training is far slower than expected: about 10 seconds per batch for a simple ResNet-18 on CIFAR-10 (resized to 224×224) with batch size 512. Training the same batch size on the CPU also takes about 10 seconds, so the GPU gives essentially no speedup, which makes no sense to me. Worse still, without the manual optimizations in the script below (cudnn.benchmark, TF32, AMP, torch.compile), a GPU batch takes about 20 seconds. I have never installed any older CUDA versions on this machine.
Environment details (collected with roughly the script shown after this list):
Torch version : 2.9.0.dev20250704+cu128
CUDA available : True
Torch CUDA version : 12.8
CUDA device count : 1
Device 0 name : NVIDIA GeForce RTX 5070 Laptop GPU
Compute capability : sm_120
Supported arch list : ['sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120']
Total memory (MB) : 8150
Basic GPU compute : OK
cuDNN & TF32 Settings:
cudnn.benchmark : False
TF32 matmul allowed : False
TF32 cudnn allowed : True
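For completeness, the numbers above were gathered with roughly this script (a sketch using standard torch APIs, not the exact script I ran):

import torch

print("Torch version      :", torch.__version__)
print("CUDA available     :", torch.cuda.is_available())
print("Torch CUDA version :", torch.version.cuda)
print("CUDA device count  :", torch.cuda.device_count())

props = torch.cuda.get_device_properties(0)
print("Device 0 name      :", props.name)
print("Compute capability :", f"sm_{props.major}{props.minor}")
print("Supported arch list:", torch.cuda.get_arch_list())
print("Total memory (MB)  :", props.total_memory // (1024 * 1024))

# Tiny matmul as a basic GPU compute sanity check
x = torch.randn(1024, 1024, device="cuda")
torch.cuda.synchronize()
print("Basic GPU compute  :", "OK" if torch.isfinite(x @ x).all() else "FAILED")

print("cudnn.benchmark    :", torch.backends.cudnn.benchmark)
print("TF32 matmul allowed:", torch.backends.cuda.matmul.allow_tf32)
print("TF32 cudnn allowed :", torch.backends.cudnn.allow_tf32)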
I tested torch.compile() with both the default and reduce-overhead modes. The default mode compiled and ran successfully, but reduce-overhead failed with an overflow error:
Testing with compile mode='reduce-overhead':
reduce-overhead mode failed: Python int too large to convert to C long
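For reference, the compile-mode check was roughly the following (a sketch, not the exact script; it uses the same ResNet-18 and 512×3×224×224 input shape as the benchmark below):

import torch
from torchvision import models

model = models.resnet18(weights=None, num_classes=10).cuda()
x = torch.randn(512, 3, 224, 224, device="cuda")

for mode in ("default", "reduce-overhead"):
    print(f"Testing with compile mode='{mode}':")
    compiled = torch.compile(model, mode=mode)
    try:
        with torch.no_grad():
            compiled(x)          # trigger compilation + one forward pass
        torch.cuda.synchronize()
        print(f"{mode} mode ran successfully")
    except Exception as e:
        print(f"{mode} mode failed: {e}")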
Code snippet for training benchmark:
import time
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms, models
import torch.backends.cudnn as cudnn
import torch.amp as amp
def main():
    # Select device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Using device:", device)

    # —— Global performance flags ——
    cudnn.benchmark = True                        # Enable cuDNN autotuner
    torch.backends.cuda.matmul.allow_tf32 = True  # Enable TF32 for matmul
    torch.backends.cudnn.allow_tf32 = True        # Enable TF32 for cuDNN

    # —— Data loading setup ——
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=(0.5, 0.5, 0.5),
                             std=(0.5, 0.5, 0.5))
    ])
    train_dataset = datasets.CIFAR10(
        root='.', train=True, download=True, transform=transform
    )
    train_loader = torch.utils.data.DataLoader(
        train_dataset,
        batch_size=512,
        shuffle=True,
        num_workers=8,            # Number of data-loading workers
        pin_memory=True,          # Use pinned memory for faster host-to-device transfers
        persistent_workers=True,  # Keep workers alive between epochs
        prefetch_factor=2         # Number of batches to prefetch per worker
    )

    # —— Model creation & torch.compile ——
    model = models.resnet18(weights=None, num_classes=10).to(device)
    model = torch.compile(model, backend="eager")  # Eager backend: graph capture only, skips Inductor codegen

    # —— Optimizer, loss, and mixed precision setup ——
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    loss_fn = nn.CrossEntropyLoss()
    scaler = amp.GradScaler()

    # —— Training loop (20 batches) ——
    model.train()
    print("\n📘 Starting training for 20 batches")
    batch_times = []
    for batch_idx, (images, labels) in enumerate(train_loader):
        if batch_idx >= 20:
            break
        print(f"\n🟢 Batch {batch_idx}")
        t0 = time.time()

        # Move data to device
        images = images.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)
        torch.cuda.synchronize()
        print(f"  🔸 Data transfer: {time.time() - t0:.3f}s")

        optimizer.zero_grad()

        # Forward pass + loss
        t1 = time.time()
        with amp.autocast(device_type='cuda'):
            outputs = model(images)
            loss = loss_fn(outputs, labels)
        torch.cuda.synchronize()
        print(f"  🔸 Forward + loss: {time.time() - t1:.3f}s")

        # Backward pass + optimizer step
        t2 = time.time()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        torch.cuda.synchronize()
        print(f"  🔸 Backward + step: {time.time() - t2:.3f}s")

        print(f"  ✅ Loss: {loss.item():.4f}")
        batch_times.append(time.time() - t0)

    avg_time = sum(batch_times) / len(batch_times) if batch_times else 0.0
    print(f"\n✅ Average batch time: {avg_time:.3f}s")


if __name__ == "__main__":
    main()
Result:
📘 Starting training for 20 batches
🟢 Batch 0
🔸 Data transfer: 0.819s
🔸 Forward + loss: 5.622s
🔸 Backward + step: 7.883s
✅ Loss: 2.4147
🟢 Batch 1
🔸 Data transfer: 0.022s
🔸 Forward + loss: 0.246s
🔸 Backward + step: 1.893s
✅ Loss: 2.3446
🟢 Batch 2
🔸 Data transfer: 0.021s
🔸 Forward + loss: 4.790s
🔸 Backward + step: 6.619s
✅ Loss: 2.8196
🟢 Batch 3
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.910s
🔸 Backward + step: 6.618s
✅ Loss: 2.4265
🟢 Batch 4
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.827s
🔸 Backward + step: 6.633s
✅ Loss: 2.1797
🟢 Batch 5
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.835s
🔸 Backward + step: 6.681s
✅ Loss: 2.0355
🟢 Batch 6
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.976s
🔸 Backward + step: 6.618s
✅ Loss: 1.9456
🟢 Batch 7
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.819s
🔸 Backward + step: 6.621s
✅ Loss: 1.8671
🟢 Batch 8
🔸 Data transfer: 0.023s
🔸 Forward + loss: 4.803s
🔸 Backward + step: 6.612s
✅ Loss: 1.8940
🟢 Batch 9
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.798s
🔸 Backward + step: 6.638s
✅ Loss: 1.8690
🟢 Batch 10
🔸 Data transfer: 0.021s
🔸 Forward + loss: 4.828s
🔸 Backward + step: 6.638s
✅ Loss: 1.7905
🟢 Batch 11
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.859s
🔸 Backward + step: 6.957s
✅ Loss: 1.8161
🟢 Batch 12
🔸 Data transfer: 0.022s
🔸 Forward + loss: 5.010s
🔸 Backward + step: 6.836s
✅ Loss: 1.6921
🟢 Batch 13
🔸 Data transfer: 0.021s
🔸 Forward + loss: 4.990s
🔸 Backward + step: 6.869s
✅ Loss: 1.7362
🟢 Batch 14
🔸 Data transfer: 0.022s
🔸 Forward + loss: 4.996s
🔸 Backward + step: 6.884s
✅ Loss: 1.7184
🟢 Batch 15
🔸 Data transfer: 0.022s
🔸 Forward + loss: 5.033s
🔸 Backward + step: 6.869s
✅ Loss: 1.7337
... ...
✅ Average batch time: 11.345s
Questions:
1. Is there any known issue or additional configuration needed for the RTX 5070 Laptop GPU (sm_120) to reach expected performance?
2. Are my cuDNN and TF32 settings optimal? Should cudnn.benchmark and TF32 matmul be enabled?
3. Could the PyTorch nightly 2.9.0 + CUDA 12.8 build have missing or incomplete optimizations for this new GPU architecture?
4. Any debugging tips or profiling tools recommended to identify bottlenecks? (I sketched a torch.profiler starting point after this list.)
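In case it helps with question 4, this is the torch.profiler snippet I was planning to start from (a sketch based on the documented profiler API; the schedule values and row_limit are arbitrary choices):

import torch
from torch.profiler import profile, schedule, ProfilerActivity

def profile_batches(model, loader, loss_fn, optimizer, device="cuda", steps=5):
    model.train()
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
        record_shapes=True,
    ) as prof:
        for step, (images, labels) in enumerate(loader):
            if step >= steps:
                break
            images = images.to(device, non_blocking=True)
            labels = labels.to(device, non_blocking=True)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            prof.step()  # advance the profiler schedule
    # Break down time per op, sorted by total CUDA time
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))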
Thanks in advance for your help!