How to extract PTX assembly of CUDA kernels?

Hi all,

I was trying out some ideas for making CUDA operations faster, and realized it would be very helpful to look at the PTX assembly generated from the .cu files.

Is there an easy way to extract PTX from the compiled PyTorch library, or to find the exact nvcc command used to compile each .cu file? (If I could find the command, I think I could add the -ptx option to generate PTX output.)
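As a quick sanity check outside the PyTorch build, nvcc can emit PTX for a single file directly. This is just a sketch; the file name mykernel.cu is a made-up example:

```shell
# Compile one .cu file to PTX only; -ptx stops compilation after
# the PTX generation stage, and -o names the output file.
nvcc -ptx -lineinfo mykernel.cu -o mykernel.ptx

# Inspect the result; ".visible .entry" lines mark the kernels.
grep ".visible .entry" mykernel.ptx
```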

Also, when I run nvvp (the NVIDIA Visual Profiler) and examine individual kernel calls, I see this message:

No source File Mapping

The source-assembly viewer could not be shown because source-file mappings are missing from the kernel.
You can enable source-file mappings by using the -lineinfo flag when compiling the kernels.

…so, how can I add -lineinfo to nvcc command-line options when building PyTorch?

Once upon a time there was an env variable that you could set to add additional flags to nvcc. Now it looks like all extra flags are added directly in aten/CMakeLists.txt, e.g. https://github.com/pytorch/pytorch/blob/master/aten/CMakeLists.txt#L73

Wow, it worked. Thanks!

Now I only have to find out how to extract PTX…

@ngimel TORCH_NVCC_FLAGS still exists https://github.com/pytorch/pytorch/blob/master/aten/CMakeLists.txt#L82


Cool! I am blind, I guess.
@jick, the -keep option should leave .ptx files behind, though I don’t know where they’ll end up during the PyTorch build process. Alternatively, I think you can use cuobjdump to dump the PTX from the compiled libATen.
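For reference, cuobjdump can pull PTX out of an already-built binary, assuming PTX was actually embedded in the fatbinary at compile time. A sketch, with the library name as an example:

```shell
# List which PTX images are embedded in the shared library.
cuobjdump --list-ptx libATen.so

# Dump all embedded PTX to stdout (redirect to a file to browse).
cuobjdump --dump-ptx libATen.so > libATen.ptx
```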

For the record, -keep was a great success. I can give it something like:

TORCH_NVCC_FLAGS="-lineinfo -keep -keep-dir (directory to store these files)" \
  python setup.py build develop

One drawback is that this command creates ~5 GB of intermediate files, when I only need a few of them.

Also, adding -src-in-ptx could produce an even better PTX file (with source lines interspersed with the PTX code), but unfortunately it sometimes triggers a segfault in nvcc during the build. (I still managed to get the PTX file I wanted by running it several times. YMMV.)
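Putting the flags from this thread together, the full invocation might look like the following. The keep directory is an arbitrary example, and -src-in-ptx is the flag that occasionally segfaulted:

```shell
TORCH_NVCC_FLAGS="-lineinfo -src-in-ptx -keep -keep-dir /tmp/ptx-out" \
  python setup.py build develop
```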

If you only need selected ones, you could have the build process show the invocations and just rerun them with additional flags…

Best regards

Thomas
(Who used to own a 20MB hard drive. :slight_smile: )

If you only need selected ones, you could have the build process show the invocations and just rerun them with additional flags…

Actually, I tried that too, but I found that searching for the correct invocation and massaging it is not as convenient as I had hoped. (Maybe I’m not very good with CMake.) For one thing, those nvcc argument lists are huge and easily span a dozen lines in my editor. :slight_smile:

Well, if I had to do it every five minutes, I’d probably try to find a way to automate it, because “re-building the whole of PyTorch with a different argument” takes longer than that. On the other hand, I could use a coffee break! :stuck_out_tongue:

I usually redirect the output to a file (python3 setup.py bdist_wheel > log.txt 2>&1 or so) and then use grep somesource.c log.txt to find the line with the full command. It’ll be long, but it will be precise. :slight_smile:
Well, you have a solution that works for you, that is the important bit.
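A tiny self-contained sketch of that grep workflow, with a made-up two-entry log file standing in for the real build output:

```shell
# Synthetic stand-in for the redirected build log; the file names
# and the nvcc command line here are invented for illustration.
cat > log.txt <<'EOF'
[1/165] Building NVCC (Device) object src/ATen/ATen_generated_IndexLinear.cu.o
/usr/local/cuda/bin/nvcc -lineinfo -O3 -c IndexLinear.cu -o IndexLinear.cu.o
[2/165] Building NVCC (Device) object src/ATen/ATen_generated_THCTensorMathBlas.cu.o
EOF

# Grep for the source file of interest; this prints both the build
# step line and the full compile command that can be re-run by hand.
grep IndexLinear log.txt
```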

Hmm, interesting. When I run python setup.py, this is what my output looks like:

[1/165] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THCUNN/ATen_generated_IndexLinear.cu.o
[2/165] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THCUNN/ATen_generated_SpatialConvolutionMM.cu.o
[3/165] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THCUNN/ATen_generated_VolumetricUpSamplingNearest.cu.o
[4/165] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THCUNN/ATen_generated_SpatialDilatedMaxPooling.cu.o
[5/165] Building NVCC (Device) object src/ATen/CMakeFiles/ATen.dir/__/THC/ATen_generated_THCTensorMathBlas.cu.o
......

Is there a command-line option that tells CMake to show the actual commands? (Sorry, this must be a very basic question, but I couldn’t find an answer anywhere.)

Not sure if it matters, but I have Ninja installed, following the recommendations in CONTRIBUTING.md.