I’m trying to understand something about training loops in PyTorch. I timed this function (modified from the docs):
for batch, (X, y) in enumerate(dataloader):
    print("Shape: ", X.shape)
    X, y = X.to(device), y.to(device)
    for i in range(10000):
        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    break
This function will run training on a single batch 10,000 times (I understand this is pointless, as it would just memorize that specific batch).
The reason I wrote it is to understand why it takes as long as it does, given that all of the following are true:
- The network is really small ([784 in] → [512 dense] → ReLU → [10 out]).
- I ran this on an RTX 4090: a 24 GB card with 16k CUDA cores and a ~2.3 GHz clock.
- The 10,000 iterations take about 3.6 seconds to complete, i.e. roughly 360 µs per iteration.
From what I understand about deep learning on GPUs, 3.6 seconds seems way too high.
If the GPU has plenty of memory to keep the entire model resident (along with the batch and the gradients), 10,000 iterations of a 3-layer network should complete in well under a few hundred thousand clock cycles. (This could also be a complete misunderstanding of how this works; please feel free to correct me if it’s wrong.)
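To put a rough number on that, here is a back-of-envelope FLOP estimate (the batch size isn’t shown in the snippet above, so 64 is an assumption, and the 4090’s peak fp32 throughput is an approximate spec-sheet figure):

# Rough per-iteration FLOP count (assumptions: batch size of 64, which the
# snippet above doesn't show; backward pass costed at ~2x the forward pass;
# a multiply-add counted as 2 FLOPs; biases and the ReLU ignored).
batch_size = 64
fwd_flops = 2 * batch_size * (784 * 512 + 512 * 10)  # the two Linear layers
total_flops = 3 * fwd_flops                           # forward + ~2x backward
peak_fp32 = 82e12                                     # approximate RTX 4090 peak fp32 FLOP/s
print(f"{total_flops:.2e} FLOPs/iter -> ~{total_flops / peak_fp32 * 1e6:.2f} µs at peak")

That works out to on the order of 10^8 FLOPs and a couple of microseconds of pure math per iteration, so even a very sloppy estimate is nowhere near the ~360 µs per iteration I’m measuring.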
My best theory for why:
The for loop runs on the CPU, so on every iteration the GPU has to report back to the CPU that the iteration finished before the loop can continue. But even that communication overhead should be marginal and shouldn’t cause training to be this slow, should it?
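One way I could check this theory (a sketch, not something from my actual run; the batch size, optimizer, and loss below are stand-ins, since they aren’t shown above) is to profile the loop and compare the CPU-side time, which includes Python overhead and kernel launches, against the time the kernels actually spend running on the GPU:

import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

device = "cuda"

# Stand-ins matching the description above (784 -> 512 dense -> ReLU -> 10 out);
# batch size 64, SGD, and cross-entropy are assumptions, not taken from my code.
model = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
X = torch.randn(64, 784, device=device)
y = torch.randint(0, 10, (64,), device=device)

def step():
    pred = model(X)
    loss = loss_fn(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Warm up so one-time CUDA initialization doesn't skew the numbers.
for _ in range(10):
    step()
torch.cuda.synchronize()

# Profile a few hundred iterations; the table splits time spent on the CPU
# (Python + launching kernels) from time the kernels spend executing on the GPU.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(200):
        step()
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

If the CPU-side totals dominate the CUDA totals, that would point at per-iteration launch/Python overhead rather than the GPU being slow at the math.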
I can’t find any answers on a) whether it should be faster, and b) why it isn’t.
I’d appreciate any insights. Thanks.