Reduced throughput with vector-add in libtorch

Hello,
I want to use libtorch as a tensor, linear algebra, and machine learning library for a scientific code base. While profiling my code I noticed that it falls well short of the memory throughput I expected (measured as bandwidth in Gb/s). I was able to reproduce this behavior with a simple vector-add example, where libtorch reaches only about 58% of the bandwidth of an equivalent std::vector implementation (34.6 vs. 60.0 Gb/s). What can I do to improve the situation?
Here are the details:
The benchmark looks roughly like this:

auto options = torch::TensorOptions()
                        .dtype(torch::kFloat64);

torch::Tensor T1 = torch::randn(N,options);
torch::Tensor T2 = torch::randn(N,options);
torch::Tensor T3 = torch::randn(N,options);

double start = omp_get_wtime();
T3 = T1+T2;
double end = omp_get_wtime();

std::cout << "Timing: " << end-start << " s\n";
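
A single timed call can be noisy and may include one-off costs, so the measurement is better repeated. The following is a minimal sketch of such a repeat-and-collect harness in plain C++ (no libtorch), with std::chrono standing in for omp_get_wtime. The names Stats, time_repeated, and vector_add are hypothetical and are not taken from the linked repository; the kernel is serial here for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type: summary statistics over repeated timings, in seconds.
struct Stats {
    double min, max, mean, stddev;
};

// Runs `kernel` `warmup` times untimed, then `repeats` times timed,
// and returns min/max/mean/standard deviation of the timed runs.
template <typename Kernel>
Stats time_repeated(Kernel&& kernel, int warmup, int repeats) {
    assert(repeats > 0);
    for (int i = 0; i < warmup; ++i) kernel();

    std::vector<double> t(static_cast<std::size_t>(repeats));
    for (auto& sample : t) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto end = std::chrono::steady_clock::now();
        sample = std::chrono::duration<double>(end - start).count();
    }

    Stats s{};
    s.min = *std::min_element(t.begin(), t.end());
    s.max = *std::max_element(t.begin(), t.end());
    double sum = 0.0;
    for (double x : t) sum += x;
    s.mean = sum / repeats;
    double sq = 0.0;
    for (double x : t) sq += (x - s.mean) * (x - s.mean);
    s.stddev = std::sqrt(sq / repeats);  // population standard deviation
    return s;
}

// Plain std::vector add into a preallocated output (serial in this sketch).
void vector_add(const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& c) {
    const std::size_t n = c.size();
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}
```

A call such as `time_repeated([&] { vector_add(a, b, c); }, 2, 10)` would then give the min, max, and mean timings for the kernel. The same wrapper could be placed around the `T3 = T1+T2` line, under the assumption that each call is independent of the previous one.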

For comparison, I implemented the same operation with std::vector. The std::vector version reaches peak bandwidth. Here are the results:

  • Torch Vector Add
* Memory Footprint   : 1.99931 Gb
* Min Execution Time : 0.171147 s
* Max Execution Time : 0.201715 s
* Mean Execution Time: 0.173588 +/- 0.00298119 s
* Mean Bandwidth     : 34.5527 +/- 0.593407 Gb/s
* Number threads     : 40
* use GPU            : false
  • std::vector Vector Add
* Memory Footprint   : 1.99931 Gb
* Min Execution Time : 0.0994474 s
* Max Execution Time : 0.133856 s
* Mean Execution Time: 0.0999729 +/- 0.00342435 s
* Mean Bandwidth     : 59.9955 +/- 2.05501 Gb/s
* Number threads     : 40/40
* use GPU            : false
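
As a cross-check, the figures above appear consistent with a bandwidth computed as three times the memory footprint divided by the mean execution time (two arrays read, one written), with the uncertainty propagated from the timing error. This is inferred from the numbers alone, so treat the interpretation of "Memory Footprint" as the size of a single array as an assumption. The function names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Bandwidth in Gb/s for a vector add that touches three arrays
// (two reads, one write), each `footprint_gb` in size.
// Assumption: the reported "Memory Footprint" is the size of ONE array.
double bandwidth(double footprint_gb, double seconds) {
    assert(seconds > 0.0);
    return 3.0 * footprint_gb / seconds;
}

// First-order error propagation: the relative error of the bandwidth
// equals the relative error of the measured time.
double bandwidth_error(double footprint_gb, double seconds, double seconds_err) {
    return bandwidth(footprint_gb, seconds) * seconds_err / seconds;
}
```

With the values reported above, `bandwidth(1.99931, 0.173588)` gives about 34.55 Gb/s for libtorch and `bandwidth(1.99931, 0.0999729)` about 60.00 Gb/s for std::vector, matching the listed mean bandwidths.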

I am testing this on a dedicated node with one Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. For this particular test I use the default OpenMP environment. Here is my CMakeLists.txt:

cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)

find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fopenmp")

add_executable(bandwidth_torch bandwidth_torch.cpp)
target_link_libraries(bandwidth_torch "${TORCH_LIBRARIES}")
set_property(TARGET bandwidth_torch PROPERTY CXX_STANDARD 14)

add_executable(bandwidth_cpp bandwidth_cpp.cpp)
set_property(TARGET bandwidth_cpp PROPERTY CXX_STANDARD 14)

The compiler is c++ (Debian 10.2.1-6) 10.2.1 20210110.
The torch version I am using was installed via pip3:

~> pip3 show torch
Name: torch
Version: 1.10.0+cpu

The full code is available on GitHub: Marcel-Rodekamp/TorchBandwidth.