Libtorch 2.10.0 slower than libtorch 2.1.0?

I just upgraded from libtorch 2.1.0 to libtorch 2.10.0, and 2.10.0 seems to be about 500,000 microseconds (0.5 s) slower when training a simple feed-forward neural network. It is the same C++ code on the same machine; the only difference is the libtorch version. I am on Ubuntu 24 with an Intel Xeon CPU, and I am using Intel MKL for both versions.

Is libtorch 2.10.0 known to be slower than 2.1.0?

A couple of things to narrow this down before guessing:

  • Is that 500,000 microseconds per step or total across the run?
  • What’s your batch size and model size? Small FFNs are hit harder by dispatcher overhead, while larger ones are bound by matmul speed, so the two call for different debugging paths.
  • Quick check: are both builds actually linking the same MKL? Run ldd ./your_binary | grep -i mkl on each.
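To make the per-step versus total question easy to answer, here is a minimal timing sketch that uses only the standard library. The names time_steps_us, summarize, and TimingSummary are hypothetical helpers, not part of libtorch, and the callable you pass in stands in for one forward/backward/optimizer step of your own loop.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper (not a libtorch API): calls step(i) n_steps times and
// returns the wall-clock duration of each call in microseconds.
template <typename Step>
std::vector<std::int64_t> time_steps_us(std::size_t n_steps, Step&& step) {
    std::vector<std::int64_t> durations;
    durations.reserve(n_steps);
    for (std::size_t i = 0; i < n_steps; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        step(i);
        const auto t1 = std::chrono::steady_clock::now();
        durations.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    }
    return durations;
}

struct TimingSummary {
    std::int64_t total_us;   // sum over all steps
    std::int64_t median_us;  // upper median when the count is even
    std::int64_t max_us;     // slowest single step
};

// Takes the vector by value because it is sorted to find the median.
TimingSummary summarize(std::vector<std::int64_t> durations) {
    assert(!durations.empty());
    std::sort(durations.begin(), durations.end());
    const std::int64_t total =
        std::accumulate(durations.begin(), durations.end(), std::int64_t{0});
    return {total, durations[durations.size() / 2], durations.back()};
}
```

Printing total_us, median_us, and max_us for each libtorch version shows whether the gap is spread evenly over every step or concentrated in a few slow ones (the first few steps are often slower, so it may be worth dropping them before summarizing).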

Best way to find the regression is to profile both versions and diff. Add this around your training loop:

#include <torch/csrc/autograd/profiler.h>

{
    torch::autograd::profiler::RecordProfile guard("trace_2_10.json");
    // run ~100 steps
}

Then drop both trace files into chrome://tracing or perfetto.dev side by side. Whichever op regressed will jump out.
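If eyeballing the two traces is not conclusive, the comparison can also be done numerically once you have a total time per op name for each version. The sketch below assumes you have already extracted those totals yourself (for example, by summing event durations per name from each trace); OpTotals, OpDelta, and rank_regressions are hypothetical names, and nothing here parses the trace file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Total time per op name for one run, in microseconds (assumed to be
// extracted from a trace beforehand; this sketch does no parsing).
using OpTotals = std::map<std::string, std::int64_t>;

struct OpDelta {
    std::string name;
    std::int64_t old_us;
    std::int64_t new_us;
    std::int64_t delta_us() const { return new_us - old_us; }
};

// Hypothetical helper: pairs up ops by name and sorts them so the largest
// slowdown comes first. Ops missing from one run count as 0 in that run.
std::vector<OpDelta> rank_regressions(const OpTotals& old_run,
                                      const OpTotals& new_run) {
    std::vector<OpDelta> deltas;
    for (const auto& entry : old_run) {
        assert(entry.second >= 0);
        const auto it = new_run.find(entry.first);
        const std::int64_t new_us = (it == new_run.end()) ? 0 : it->second;
        deltas.push_back({entry.first, entry.second, new_us});
    }
    for (const auto& entry : new_run) {
        assert(entry.second >= 0);
        if (old_run.find(entry.first) == old_run.end()) {
            deltas.push_back({entry.first, 0, entry.second});
        }
    }
    std::sort(deltas.begin(), deltas.end(),
              [](const OpDelta& a, const OpDelta& b) {
                  return a.delta_us() > b.delta_us();
              });
    return deltas;
}
```

The first few entries of the result are the ops worth posting here. An op that appears only in the newer run is also a useful signal, since it can point to an extra copy or conversion that the older version did not perform.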

Two things worth trying first since they often explain “version X is suddenly slower”:

  1. Set OMP_NUM_THREADS and MKL_NUM_THREADS explicitly to your physical core count; different default thread behavior between releases can look like a regression.
  2. Pin in code: at::set_num_threads(N) and at::set_num_interop_threads(1).
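To confirm that both builds really run with the same thread settings, it can help to print what the process sees at startup. This is a small standard-library sketch; env_or_unset and thread_settings_line are hypothetical helper names, and the output format is just one possible choice.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <thread>

// Returns the value of an environment variable, or "(unset)" if it is absent.
std::string env_or_unset(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value ? std::string(value) : std::string("(unset)");
}

// Hypothetical helper: one line to log at the start of each run.
// Note: hardware_concurrency() reports logical CPUs (and may return 0 if
// unknown), so with hyper-threading it can exceed the physical core count.
std::string thread_settings_line() {
    return "OMP_NUM_THREADS=" + env_or_unset("OMP_NUM_THREADS") +
           " MKL_NUM_THREADS=" + env_or_unset("MKL_NUM_THREADS") +
           " logical_cpus=" + std::to_string(std::thread::hardware_concurrency());
}
```

Log this line once in each build, ideally next to the value of at::get_num_threads(), and compare the two outputs. Setting the variables in the shell before launching the binary is the safer route, because a threading runtime may read them before main() gets a chance to call setenv.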

If you spot a specific op that regressed, post the trace snippet and we can dig deeper.

Thanks

/Aditya

Thank you for your response. Here is what I have:

  • Is that 500,000 microseconds per step or total across the run?

    • It is measured across the training loop below:
    for (int i = 0; i < 100 /*900*/; ++i) {
        std::chrono::steady_clock::time_point begin_time = std::chrono::steady_clock::now();
        // for when I use the order_id and time for logging
        torch::Tensor prediction = net->forward(pytorch_data.train_features_.index({train_indices.index({torch::indexing::Slice(start, end, 1)}), torch::indexing::Slice(2, n_variables + 2, 1)}));
        torch::nn::functional::BinaryCrossEntropyFuncOptions op;
        op.weight(train_weights_);
        // (the line that computes loss is not shown in this snippet)
        loss /= accumulation_step;
        loss = loss.requires_grad_(true);
        loss.backward();
        if ((i + 1) % accumulation_step == 0) {
            optimizer.step();
            optimizer.zero_grad();
        }

        start = end;
        end += train_indices.sizes()[0] / 5;
        if (end > train_indices.sizes()[0] / 1.25) {
            amount_of_zeros_predicted_within_range = false;
            if (batch_iter > train_indices.sizes()[0] / 5)
                batch_iter = 0;
            start = batch_iter;
            end = start + train_indices.sizes()[0] / 5;
            batch_iter += batch_iter;
            if (end > train_indices.sizes()[0] - train_indices.sizes()[0] / 5) {
                batch_iter = 5;
                start = batch_iter;
                end = start + train_indices.sizes()[0] / 5; //12000;//rty25000;;
                batch_iter += batch_iter;
            }
            std::chrono::steady_clock::time_point end_time = std::chrono::steady_clock::now();
            std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end_time - begin_time).count() << "[µs]" << std::endl;
        }

        // other stuff, not included in the timer
    }
  • What’s your batch size and model size? Small FFNs are hit harder by dispatcher overhead, while larger ones are bound by matmul speed, so the two call for different debugging paths.

    • train_indices.sizes()[0] could be a couple hundred thousand examples

    • The model is defined below, where n_features = 148:

      // Gain written as 5.0 / 3: the integer expression 5/3 evaluates to 1 in C++.
      fc1 = register_module("fc1", torch::nn::Linear(torch::nn::LinearOptions(n_features, 750).bias(false)));
      torch::nn::init::xavier_normal_(fc1->weight, 5.0 / 3); //torch::nn::init::normal_(fc1->weight,0,.1);

      fc2 = register_module("fc2", torch::nn::Linear(torch::nn::LinearOptions(750, 625).bias(false)));
      torch::nn::init::xavier_normal_(fc2->weight, 5.0 / 3); //torch::nn::init::normal_(fc2->weight,0,.1);

      fc3 = register_module("fc3", torch::nn::Linear(torch::nn::LinearOptions(625, 560).bias(false)));
      torch::nn::init::xavier_normal_(fc3->weight, 5.0 / 3); //torch::nn::init::normal_(fc3->weight,0,1);

      fc4 = register_module("fc4", torch::nn::Linear(torch::nn::LinearOptions(560, 80).bias(false)));
      torch::nn::init::xavier_normal_(fc4->weight, 5.0 / 3); //torch::nn::init::normal_(fc4->weight,0,1);

      fc5 = register_module("fc5", torch::nn::Linear(torch::nn::LinearOptions(80, 65).bias(false)));
      torch::nn::init::xavier_normal_(fc5->weight, 5.0 / 3); //torch::nn::init::normal_(fc5->weight,0,1);

      fc6 = register_module("fc6", torch::nn::Linear(torch::nn::LinearOptions(65, 75).bias(false)));
      torch::nn::init::xavier_normal_(fc6->weight, 5.0 / 3); //torch::nn::init::normal_(fc6->weight,0,1);

      fc7 = register_module("fc7", torch::nn::Linear(torch::nn::LinearOptions(75, 1).bias(false)));
      torch::nn::init::xavier_normal_(fc7->weight, 5.0 / 3); //torch::nn::init::normal_(fc7->weight,0,1);

  • Quick check: are both builds actually linking the same MKL? Run ldd ./your_binary | grep -i mkl on each.

    • Yes

and I was already doing the following:

setenv("OMP_NUM_THREADS", "1", 1);
setenv("MKL_NUM_THREADS", "1", 1);
torch::set_num_threads(5);
torch::set_num_interop_threads(5);

I think I might have been a little hasty in asking this question. I don’t think I recorded enough observations before I asked this question.

Thank you again for your response

Cheers,

Ryan