Slow batch_gradient

I am working on a simple regression problem (X, Y) with ~200K samples, and I tried batch feeding versus no-batch feeding. It seems that when I feed the data in batches, training is significantly slower than using no batches at all. I was expecting the opposite. Here is my code. Can you let me know if there is a bug causing this issue? I attached the major part of the code, thank you!
This is super slow

temp_dataloader = DataLoader(dataset=temp_data, batch_size=256, shuffle=False)
learning_rate = 0.001
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 100
for epoch in range(num_epochs):
    for X, Y in temp_dataloader:
        X = X.to(device)
        Y = Y.to(device)
        y_pred = model(X)  # output is already on device; no extra .to(device) needed
        loss = criterion(y_pred, Y)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if (epoch + 1) % 10 == 0:
        print(f'epoch : {epoch+1} loss: {loss.item()}')

And this is super fast
X = temp_data.x.to(device)
Y = temp_data.y.to(device)
learning_rate = 0.001
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
num_epochs = 100
for epoch in range(num_epochs):
    y_pred = model(X)  # output is already on device; no extra .to(device) needed
    loss = criterion(y_pred, Y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    if (epoch + 1) % 10 == 0:
        print(f'epoch : {epoch+1} loss: {loss.item()}')

It depends on what you are profiling.
Executing a batch with more than one sample is expected to be slower than executing a single sample, since more operations are performed per step. However, the epoch time would usually decrease, i.e. the samples-per-second throughput would be higher with batched training, unless your code is bottlenecked by another operation and the actual compute speedup is not visible.

Thank you for your prompt response @ptrblck! I was actually trying to understand whether there is a bottleneck in the code. This is a fairly simple example; I printed the execution time of each operation in each batch. It seems it takes a ridiculously long time to compute the gradients (~10 sec). The very same operation takes a fraction of a second if no batches are used.
I was wondering if you can provide a further insight, thank you!

Based on this description I think you might not be synchronizing the code properly, so your profiling might be wrong. CUDA operations are executed asynchronously, so you would have to synchronize the code via torch.cuda.synchronize() before starting and before stopping each timer. Otherwise, the next blocking operation accumulates the execution time of all previously launched kernels, which makes a single line (e.g. loss.backward()) look far slower than it really is.
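A minimal sketch of the timing pattern described above; the helper name timed is my own, and the guard makes it also run (as a plain timer) on a machine without CUDA:

```python
import time
import torch

def timed(fn, *args):
    """Time fn(*args), synchronizing around it when CUDA is in use."""
    # Flush any pending kernels from earlier code so their execution
    # time is not attributed to fn.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn(*args)
    # Wait for the kernels launched by fn to actually finish before
    # stopping the timer; without this you only measure the launch.
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, time.perf_counter() - start

# Example: time a matmul (shapes are arbitrary)
a = torch.randn(256, 256)
b = torch.randn(256, 256)
out, elapsed = timed(torch.mm, a, b)
print(f'matmul took {elapsed:.6f}s')
```

With this pattern, per-operation timings inside the batched loop should become meaningful and the "~10 sec backward" number will likely shrink to its true cost.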