Hello,
I aim to use libtorch as a tensor/linear algebra/machine learning library for a scientific code base. While benchmarking my code I noticed that its memory throughput (measured as bandwidth in Gb/s) falls well short of what the machine can deliver. I was able to reproduce this behavior with a simple vector add example, where I get roughly half the peak bandwidth. Now the question is: what can I do to improve the situation?
Here are the details:
The benchmark I am doing looks something like this:
auto options = torch::TensorOptions().dtype(torch::kFloat64);
torch::Tensor T1 = torch::randn(N,options);
torch::Tensor T2 = torch::randn(N,options);
torch::Tensor T3 = torch::randn(N,options);
double start = omp_get_wtime();
T3 = T1+T2;
double end = omp_get_wtime();
std::cout << "Timing: " << end-start << " s\n";
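The snippet above times a single run, while the results further down report min, max, and mean times over several runs. A minimal sketch of how such statistics can be collected follows. It is illustrative only and not the exact code from the repository: `Stats`, `summarize`, and `time_runs` are made-up names, `std::chrono` replaces `omp_get_wtime` so the sketch builds without OpenMP, and the spread is taken to be a population standard deviation, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper type: summary of repeated timings, all in seconds.
struct Stats {
    double min, max, mean, stddev;
};

// Reduce a list of per-run timings to min, max, mean and population
// standard deviation (divides by n, not n - 1).
Stats summarize(const std::vector<double>& t) {
    assert(!t.empty());
    const auto mm = std::minmax_element(t.begin(), t.end());
    const double n = static_cast<double>(t.size());
    const double mean = std::accumulate(t.begin(), t.end(), 0.0) / n;
    double var = 0.0;
    for (double x : t) var += (x - mean) * (x - mean);
    var /= n;
    return Stats{*mm.first, *mm.second, mean, std::sqrt(var)};
}

// Run `kernel` `reps` times, timing each run on its own.
template <typename Kernel>
Stats time_runs(Kernel&& kernel, std::size_t reps) {
    std::vector<double> t(reps);
    for (std::size_t i = 0; i < reps; ++i) {
        const auto start = std::chrono::steady_clock::now();
        kernel();
        const auto end = std::chrono::steady_clock::now();
        t[i] = std::chrono::duration<double>(end - start).count();
    }
    return summarize(t);
}
```

With this, something like `time_runs([&] { T3 = T1 + T2; }, 10)` would time the statement from the snippet above ten times.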
I did a similar thing with std::vectors to have some comparison. The std::vector version obtained peak bandwidth. Here are the results:
- Torch Vector Add
* Memory Footprint : 1.99931 Gb
* Min Execution Time : 0.171147 s
* Max Execution Time : 0.201715 s
* Mean Execution Time: 0.173588 +/- 0.00298119 s
* Mean Bandwidth : 34.5527 +/- 0.593407 Gb/s
* Number threads : 40
* use GPU : false
- std::vector Vector Add
* Memory Footprint : 1.99931 Gb
* Min Execution Time : 0.0994474 s
* Max Execution Time : 0.133856 s
* Mean Execution Time: 0.0999729 +/- 0.00342435 s
* Mean Bandwidth : 59.9955 +/- 2.05501 Gb/s
* Number threads : 40/40
* use GPU : false
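The bandwidth figures above are consistent with counting three arrays of memory traffic per add (two reads and one write): 3 × 1.99931 / 0.0999729 ≈ 60 and 3 × 1.99931 / 0.173588 ≈ 34.55. A minimal sketch of the std::vector kernel and of that calculation, assuming this traffic model; `vector_add` and `bandwidth` are illustrative names and not the code from the repository:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise c = a + b into a preallocated output vector.
// The OpenMP pragma is only active when compiled with -fopenmp.
void vector_add(const std::vector<double>& a,
                const std::vector<double>& b,
                std::vector<double>& c) {
    assert(a.size() == b.size() && a.size() == c.size());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(a.size());
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Assumed traffic model: each add reads two vectors and writes one,
// so the data moved is three times the footprint of a single vector.
// The result is in "footprint units per second".
double bandwidth(double footprint_per_vector, double seconds) {
    return 3.0 * footprint_per_vector / seconds;
}
```

Under this model, `bandwidth(1.99931, 0.0999729)` reproduces the std::vector figure and `bandwidth(1.99931, 0.173588)` the libtorch one.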
I am testing this on a dedicated node with one Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz. For this particular test I use the default OpenMP environment. Here is my CMakeLists.txt:
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(example-app)
find_package(Torch REQUIRED)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -fopenmp")
add_executable(bandwidth_torch bandwidth_torch.cpp)
target_link_libraries(bandwidth_torch "${TORCH_LIBRARIES}")
set_property(TARGET bandwidth_torch PROPERTY CXX_STANDARD 14)
add_executable(bandwidth_cpp bandwidth_cpp.cpp)
set_property(TARGET bandwidth_cpp PROPERTY CXX_STANDARD 14)
The compiler is c++ (Debian 10.2.1-6) 10.2.1 20210110.
The torch version I am using is installed via pip3:
~> pip3 show torch
Name: torch
Version: 1.10.0+cpu
The full code is available on GitHub: Marcel-Rodekamp/TorchBandwidth (same link as above).