Significant performance difference between an extension compiled directly with g++ and the same extension built by "python3 setup.py install"

Hi!

I wrote a C++ extension for torch that implements a custom convolution function.

First, for testing, I compiled this function directly with g++; the latency is 5 milliseconds.

Second, I integrated the function into torch and installed the extension with setuptools, following the steps in the tutorial provided by torch. However, the latency is now 16 milliseconds.

The function invocation itself costs about 1-2 ms, so why does the performance differ so much?

The direct g++ compilation was done with

g++ -pthread -mavx2 -mfma ...

and the directives in the source file include

#pragma GCC diagnostic ignored "-Wformat"

#pragma STDC FP_CONTRACT ON

#pragma GCC optimize("O3","unroll-loops","omit-frame-pointer","inline") //Optimization flags

// #pragma GCC option("arch=native","tune=native","no-zero-upper") //Enable AVX

#pragma GCC target("avx")

These directives were also included in the file built by setuptools. The “setup.py” file is

setup(
    name = 'cusconv_cpp',
    ext_modules=[
        CppExtension(name='cusconv_cpp', sources=['src/cusconv.cpp'],
        extra_compile_args={'cxx': ['-O3', '-pthread', '-mavx2', '-mfma']})
    ],
    cmdclass={
        'build_ext': BuildExtension
    })

The build log output by setuptools is

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include/torch/csrc/api/include -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include/TH -I/home/max/.local/lib/python3.6/site-packages/torch/lib/include/THC -I/usr/include/python3.6m -c src/indconv.cpp -o build/temp.linux-x86_64-3.6/src/indconv.o -O3 -pthread -mavx2 -mfma -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=indconv_cpp -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11

which does include those flags, but many other flags are used as well.
Does anyone have any ideas?
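To see where the extra flags in the log come from, it can help to print the compiler settings recorded in the Python installation itself, since on Linux the distutils/setuptools build normally starts from those values and appends extra_compile_args afterwards. The following is a minimal sketch using the standard-library sysconfig module; the helper name default_build_flags is hypothetical, and the exact set of variables is platform-dependent (some may be None).

```python
import sysconfig


def default_build_flags():
    """Return compiler-related config variables baked into this Python build."""
    names = ("CC", "CXX", "OPT", "CFLAGS", "LDSHARED")
    # get_config_var returns None for variables the platform does not define.
    return {name: sysconfig.get_config_var(name) for name in names}


if __name__ == "__main__":
    for name, value in default_build_flags().items():
        print(f"{name} = {value}")
```

Comparing this output with the build log shows which flags are defaults from the interpreter's build configuration and which come from setup.py.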

Are your timings taken at the same place in your C++ code in both cases? Or do you measure the time in Python for one and in C++ for the other?

Thanks. I did all the timing in C++, invoking the same function and timing the same region.

Another point: I tried both “-O3” and “-O1” in extra_compile_args in “setup.py” and there is no difference, it is always 16 ms.
Then I tried removing “O3” from the #pragmas in the source file and compiling it directly with “g++”; the latency becomes 35 ms. Adding “-O1” gives 6-7 ms.

Removing all extra_compile_args in “setup.py” gives the same performance. It seems that the flags passed in “setup.py” have no effect, and that some default flags used by setuptools prevent further optimization.
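One thing that can be checked directly from the log is which optimization level wins when both -O2 and -O3 appear. GCC documents that when several -O options are given, the last one is the one that takes effect. The sketch below applies that rule to a command line; the helper name effective_opt_level is hypothetical, and the command string is an abbreviated copy of the compile line from the build log above.

```python
import shlex


def effective_opt_level(command_line):
    """Return the last -O<level> token of a GCC command line, or None."""
    last = None
    for token in shlex.split(command_line):
        # Matches -O, -O0..-O3, -Os, -Ofast, -Og. Linker pass-throughs such as
        # -Wl,-O1 start with -Wl and the output flag -o is lowercase, so both
        # are skipped.
        if token.startswith("-O"):
            last = token
    return last


# Abbreviated version of the compile command from the build log.
log = (
    "x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall "
    "-c src/indconv.cpp -o build/indconv.o -O3 -pthread -mavx2 -mfma"
)
print(effective_opt_level(log))  # -O3
```

Under that rule the -O2 default is overridden by the -O3 appended from setup.py, which suggests the slowdown is more likely to come from one of the other default flags or from the runtime environment than from the optimization level itself.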

Could you say a bit more regarding the flags?
Usually -O3 should be the highest level of optimisation, right?

After many days, I still have not found a solution to this. I ran exactly the same code on an old machine and there was no problem. Now I have tried again on another new machine and the same problem occurs.

My setup.py:

import os
# os.environ['CC'] = 'g++'
# os.environ['CXX'] = 'g++'
from setuptools import setup
import sys
import re
from distutils import sysconfig
from distutils.core import setup, Extension

# if sys.platform == 'linux' or sys.platform == 'darwin':
#   sysconfig.get_config_var(None)  # to fill up _config_vars
#   d = sysconfig._config_vars
#   for x in ['OPT', 'CFLAGS', 'PY_CFLAGS', 'PY_CORE_CFLAGS', 'CONFIGURE_CFLAGS', 'LDSHARED']:
#     d[x] = re.sub(' -g ', ' ', d[x])
#     d[x] = re.sub('^-g ', '',  d[x])
#     d[x] = re.sub(' -g$', '',  d[x])
from torch.utils.cpp_extension import BuildExtension, CppExtension



setup(
    name = 'op_conv_ext',
    ext_modules=[
        CppExtension(name='op_conv_ext', sources=['extension/conv3x3s1.cpp'],
       extra_compile_args={'cxx': ['-O3', '-pthread', '-mavx2', '-mfma', '-funroll-loops', '-fomit-frame-pointer', '-finline-small-functions', '-finline-functions-called-once', 
           '-finline-functions', '-ffp-contract=on', "-mno-vzeroupper", '-m64', '-g', '-DNDEBUG[-march=native] [-mtune=native]' ]},
       extra_link_args=['-O3', '-pthread', '-mavx2', '-mfma', '-funroll-loops', 
           '-fomit-frame-pointer', '-finline-small-functions', '-finline-functions-called-once', 
           '-finline-functions', '-ffp-contract=on', "-mno-vzeroupper", '-m64', '-g', '-DNDEBUG[-march=native] [-mtune=native]' ],
    )],
    cmdclass={
        'build_ext': BuildExtension
    })

The output from python3 setup.py is

x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/zyx/.local/lib/python3.6/site-packages/torch/include -I/home/zyx/.local/lib/python3.6/site-packages/torch/include/torch/csrc/api/include -I/home/zyx/.local/lib/python3.6/site-packages/torch/include/TH -I/home/zyx/.local/lib/python3.6/site-packages/torch/include/THC -I/usr/include/python3.6m -c extension/conv3x3s1.cpp -o build/temp.linux-x86_64-3.6/extension/conv3x3s1.o -O3 -pthread -mavx2 -mfma -funroll-loops -fomit-frame-pointer -finline-small-functions -finline-functions-called-once -finline-functions -ffp-contract=on -mno-vzeroupper -m64 -g -DNDEBUG[-march=native] [-mtune=native] -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=op_conv_ext -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++11
x86_64-linux-gnu-g++ -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.6/extension/conv3x3s1.o -o build/lib.linux-x86_64-3.6/op_conv_ext.cpython-36m-x86_64-linux-gnu.so -O3 -pthread -mavx2 -mfma -funroll-loops -fomit-frame-pointer -finline-small-functions -finline-functions-called-once -finline-functions -ffp-contract=on -mno-vzeroupper -m64 -g -DNDEBUG[-march=native] [-mtune=native]

The flags in the source file are the same.
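One detail in the setup.py above is worth checking: the last entry of each flag list, '-DNDEBUG[-march=native] [-mtune=native]', is a single string, so it appears to reach the compiler as one -D argument (the log shows it unchanged). If so, -march=native and -mtune=native are probably not in effect. A small sanity check like the following sketch can catch such entries before building; suspicious_flags is a hypothetical helper, and treating whitespace or square brackets as suspicious is an assumption, not a setuptools rule.

```python
def suspicious_flags(flags):
    """Return flags that contain whitespace or square brackets."""
    return [f for f in flags if any(ch in f for ch in " \t[]")]


# A subset of the flag list from the setup.py above.
flags = ['-O3', '-mavx2', '-mfma', '-g',
         '-DNDEBUG[-march=native] [-mtune=native]']
print(suspicious_flags(flags))
```

Each intended option would normally be its own list element, for example '-DNDEBUG', '-march=native' and '-mtune=native' as three separate strings.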

Now the latency is 35 ms when invoked from Python, and 25 ms when invoked from the C++ program built directly with g++.
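When comparing the Python-invoked number with the standalone C++ number, it helps to measure the Python side with warm-up runs and several repeats so that one-off start-up costs are not counted. This is a minimal sketch using only the standard library; time_ms is a hypothetical helper, and the workload shown is a stand-in rather than the actual extension call.

```python
import time


def time_ms(fn, *args, warmup=5, repeats=50):
    """Return a median-like wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)  # warm caches and any lazy initialisation
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    samples.sort()
    # Middle element (upper middle when the number of samples is even).
    return samples[len(samples) // 2]


# Stand-in workload; replace with the extension call being measured.
print(f"{time_ms(sum, range(100000)):.3f} ms")
```

The difference between this figure and the time measured inside the C++ function gives a rough estimate of the call overhead from Python.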