I just upgraded from libtorch 2.1.0 to libtorch 2.10.0, and 2.10.0 seems to be about 500,000 microseconds slower when training a simple feed-forward neural network. It is the same C++ code on the same machine; the only difference is the libtorch version. I am on Ubuntu 24 with an Intel Xeon CPU, and I am using Intel MKL with both versions.
Is libtorch 2.10.0 known to be slower than 2.1.0?
Couple of things to narrow this down before guessing:
- Is that 500,000 microseconds per step or total across the run?
- What’s your batch size and model size? Small FFNs hit dispatcher overhead more, larger ones are bound by matmul speed — totally different debugging paths.
- Quick check — are both builds actually linking the same MKL? Run
ldd ./your_binary | grep -i mkl on each.
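For the per-step versus total question, a small timing helper can make the answer unambiguous. This is only a sketch in plain C++: `StepTimings` and `time_steps` are hypothetical names, not libtorch APIs, and the `step` callback is assumed to wrap one forward/backward/optimizer iteration of your loop.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper, not part of libtorch: summary of per-step timings.
struct StepTimings {
    std::vector<long long> per_step_us;  // one entry per measured step
    long long total_us = 0;              // sum over the measured steps
    long long median_us = 0;             // median of the measured steps
};

// Calls `step` (warmup + measured) times and times only the measured calls.
StepTimings time_steps(const std::function<void()>& step,
                       std::size_t warmup, std::size_t measured) {
    using clock = std::chrono::steady_clock;

    // Warm-up iterations are executed but their timings are discarded.
    for (std::size_t i = 0; i < warmup; ++i) step();

    StepTimings out;
    out.per_step_us.reserve(measured);
    for (std::size_t i = 0; i < measured; ++i) {
        const auto t0 = clock::now();
        step();
        const auto t1 = clock::now();
        const long long us =
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        out.per_step_us.push_back(us);
        out.total_us += us;
    }

    if (!out.per_step_us.empty()) {
        std::vector<long long> sorted = out.per_step_us;
        std::sort(sorted.begin(), sorted.end());
        out.median_us = sorted[sorted.size() / 2];  // upper median for even counts
    }
    return out;
}
```

Running something like `time_steps(step, 10, 100)` against each libtorch build and comparing `median_us` would show whether the gap is per step, while `total_us` gives the cumulative figure. The warm-up count of 10 is an arbitrary choice for illustration.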
Best way to find the regression is to profile both versions and diff. Add this around your training loop:
#include <torch/csrc/autograd/profiler.h>
{
    torch::autograd::profiler::RecordProfile guard("trace_2_10.json");
    // run ~100 steps
}
Then drop both trace files into chrome://tracing or perfetto.dev side by side. Whichever op regressed will jump out.
Two things worth trying first since they often explain “version X is suddenly slower”:
- Set OMP_NUM_THREADS and MKL_NUM_THREADS explicitly to your physical core count — different default thread behavior between releases can look like a regression.
- Pin in code: at::set_num_threads(N) and at::set_num_interop_threads(1).
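Since thread defaults are a common culprit, it may also help to log the thread-related settings at startup in both builds so the two runs can be compared line by line. A minimal sketch in plain C++ follows; `env_or_unset` and `thread_env_summary` are hypothetical helpers, not libtorch functions. Note that `std::thread::hardware_concurrency()` reports logical CPUs, which can be higher than the physical core count when hyper-threading is enabled.

```cpp
#include <cstdlib>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper: value of an environment variable,
// or "(unset)" when it is not defined.
std::string env_or_unset(const char* name) {
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Hypothetical helper: one-line summary of the thread-related settings
// that the process sees at the moment of the call.
std::string thread_env_summary() {
    std::ostringstream os;
    os << "OMP_NUM_THREADS=" << env_or_unset("OMP_NUM_THREADS")
       << " MKL_NUM_THREADS=" << env_or_unset("MKL_NUM_THREADS")
       << " logical_cpus=" << std::thread::hardware_concurrency();
    return os.str();
}
```

Printing `thread_env_summary()` once at the top of `main` in each build gives a quick record of what the process was started with, before any in-code thread settings are applied.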
If you spot a specific op that regressed, post the trace snippet and we can dig deeper.
Thanks
/Aditya
Thank you for your response. Here is what I have:
for (int i = 0; i < 100 /*900*/; ++i) {
    std::chrono::steady_clock::time_point begin_time = std::chrono::steady_clock::now();
    torch::Tensor prediction = net->forward(pytorch_data.train_features_.index({train_indices.index({torch::indexing::Slice(start, end, 1)}), torch::indexing::Slice(2, n_variables + 2, 1)})); // for when I use the order_id and time for logging
    torch::nn::functional::BinaryCrossEntropyFuncOptions op;
    op.weight(train_weights_);
    loss /= accumulation_step;
    loss = loss.requires_grad_(true);
    loss.backward();
    if ((i + 1) % accumulation_step == 0) {
        optimizer.step();
        optimizer.zero_grad();
    }
    start = end;
    end += train_indices.sizes()[0] / 5;
    if (end > train_indices.sizes()[0] / 1.25) {
        amount_of_zeros_predicted_within_range = false;
        if (batch_iter > train_indices.sizes()[0] / 5)
            batch_iter = 0;
        start = batch_iter;
        end = start + train_indices.sizes()[0] / 5;
        batch_iter += batch_iter;
        if (end > train_indices.sizes()[0] - train_indices.sizes()[0] / 5) {
            batch_iter = 5;
            start = batch_iter;
            end = start + train_indices.sizes()[0] / 5; //12000;//rty25000;;
            batch_iter += batch_iter;
        }
        std::chrono::steady_clock::time_point end_time = std::chrono::steady_clock::now();
        std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end_time - begin_time).count() << "[µs]" << std::endl;
    }
    // other stuff, but not included in the timer
}
What’s your batch size and model size? Small FFNs hit dispatcher overhead more, larger ones are bound by matmul speed — totally different debugging paths.
train_indices.sizes()[0] // could be a couple hundred thousand examples
The model size is below, where n_features = 148:
// Gain is written as 5.0 / 3.0 because the integer expression 5/3 evaluates to 1 in C++.
fc1 = register_module("fc1", torch::nn::Linear(torch::nn::LinearOptions((n_features), 750).bias(false)));
torch::nn::init::xavier_normal_(fc1->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc1->weight,0,.1);
fc2 = register_module("fc2", torch::nn::Linear(torch::nn::LinearOptions(750, 625).bias(false)));
torch::nn::init::xavier_normal_(fc2->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc2->weight,0,.1);
fc3 = register_module("fc3", torch::nn::Linear(torch::nn::LinearOptions(625, 560).bias(false)));
torch::nn::init::xavier_normal_(fc3->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc3->weight,0,1);
fc4 = register_module("fc4", torch::nn::Linear(torch::nn::LinearOptions(560, 80).bias(false)));
torch::nn::init::xavier_normal_(fc4->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc4->weight,0,1);
fc5 = register_module("fc5", torch::nn::Linear(torch::nn::LinearOptions(80, 65).bias(false)));
torch::nn::init::xavier_normal_(fc5->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc5->weight,0,1);
fc6 = register_module("fc6", torch::nn::Linear(torch::nn::LinearOptions(65, 75).bias(false)));
torch::nn::init::xavier_normal_(fc6->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc6->weight,0,1);
fc7 = register_module("fc7", torch::nn::Linear(torch::nn::LinearOptions(75, 1).bias(false)));
torch::nn::init::xavier_normal_(fc7->weight, 5.0 / 3.0);//torch::nn::init::normal_(fc7->weight,0,1);
Quick check — are both builds actually linking the same MKL? Run ldd ./your_binary | grep -i mkl on each.
And I was already doing the following:
setenv("OMP_NUM_THREADS", "1", 1);
setenv("MKL_NUM_THREADS", "1", 1);
torch::set_num_threads(5);
torch::set_num_interop_threads(5);
I think I might have been a little hasty in asking this question. I don’t think I recorded enough observations before I asked this question.
Thank you again for your response
Cheers,
Ryan