Understanding Model Compilation/Optimization

I’m experimenting with the idea of generating TorchScript IR as part of a project I’m working on, and would like to better understand how it’s compiled and optimized.

I’ve followed the tutorial here: Loading a TorchScript Model in C++ — PyTorch Tutorials 1.7.1 documentation and successfully loaded a TorchScript model into C++ and ran it on some example data. I’ve also found the overview here: pytorch/OVERVIEW.md at master · pytorch/pytorch · GitHub helpful.

What I’d like to know is:

  1. When I load a module in C++, is there any compilation / optimization that can be performed before the first time I call the forward method? Or do optimizations only happen with a JIT that runs after data is first passed in?

  2. If there are JIT optimizations to be done when I call the forward method, do the optimizations persist so that next time I call the forward method on the same module object, they won’t have to be performed again?

  3. If a module/model gets optimized by the JIT, do any of those optimizations change the output I’d get if I now wrote the model back out to a TorchScript .pt file?

  4. Where does kernel fusion happen? Is that part of the JIT, or is that something that has to be figured out when generating the IR in python?

  5. Does PyTorch have the ability to dynamically decide whether an operation should be run on CPU or GPU based on things like data size?

If there’s some resource I haven’t yet seen that could help me better understand this stuff, a pointer towards it would be appreciated. Thanks for your help.

I’m happy to break these questions up into different posts if that would be helpful.

Hi johnc1231,

  1. Not unless you invoke it. Currently the only AOT optimization is torch.jit.freeze — PyTorch master documentation. It does a limited set of optimizations on 1.7, a little more on master, and I’m working on adding more AOT stuff. I will add a C++ API equivalent by 1.8.

  2. Yes, they do persist. However, if you pass in tensors of a different type, we may respecialize.

  3. JIT optimizations will not affect saving your model.

  4. That happens as part of the JIT. It happens after we profile the tensors that are seen at runtime.

  5. No. One of the main design points of PyTorch is that it does not automatically decide which parts of a module to run on GPU vs. CPU, in both eager and JIT modes. That is controlled by the user.
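To make the torch.jit.freeze answer concrete, here is a minimal sketch of invoking it ahead of time. This assumes a PyTorch build where torch.jit.freeze is available (1.7+); the Triple module is just an illustrative stand-in:

```python
import torch

class Triple(torch.nn.Module):
    def forward(self, x):
        return x + x + x

# Scripting produces a ScriptModule; freezing requires eval mode, inlines
# parameters/attributes into the graph, and runs a limited set of AOT
# optimizations before the first forward call.
scripted = torch.jit.script(Triple().eval())
frozen = torch.jit.freeze(scripted)

x = torch.rand(4, 4)
out = frozen(x)
# The frozen module computes the same result as the eager module.
print(torch.allclose(out, Triple()(x)))
```

You can inspect `frozen.graph` before and after freezing to see what was folded in.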

To better understand optimizations I would suggest running a simple file like:

import torch

@torch.jit.script
def foo(x):
    return x + x + x

foo(torch.rand([4, 4], device='cuda'))
foo(torch.rand([4, 4], device='cuda'))

with PYTORCH_JIT_LOG_LEVEL='profiling_graph_executor_impl' python foo.py
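If you’d rather not set the environment variable, you can also poke at the IR from Python. A small sketch (the `.graph` attribute on a scripted function shows the TorchScript IR as recorded from the source, before the profiling executor’s runtime optimizations):

```python
import torch

@torch.jit.script
def foo(x):
    return x + x + x

x = torch.rand(4, 4)
foo(x)
foo(x)  # a second call lets the profiling executor specialize

# The IR produced from the Python source, prior to runtime optimization.
print(foo.graph)
```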


Thanks for a very helpful response, and for planning to add freeze to the C++ API. I’ll experiment with the JIT log level printouts.

@eellison I did what you suggested, and I noticed the final optimized output looks like:

[DUMP profiling_graph_executor_impl.cpp:620] Optimized Graph:
[DUMP profiling_graph_executor_impl.cpp:620] graph(%x.1 : Tensor):
[DUMP profiling_graph_executor_impl.cpp:620]   %1 : int = prim::Constant[value=1]()
[DUMP profiling_graph_executor_impl.cpp:620]   %4 : Tensor = aten::add(%x.1, %x.1, %1) # triple_add.py:5:11
[DUMP profiling_graph_executor_impl.cpp:620]   %7 : Tensor = aten::add(%4, %x.1, %1) # triple_add.py:5:11
[DUMP profiling_graph_executor_impl.cpp:620]   return (%7)

It still makes two separate calls to add. I would think part of kernel fusion is that it could somehow make one call to add that takes in all three inputs. Am I misinterpreting the graph, or am I misunderstanding what kernel fusion does?

Were you using CUDA or CPU? Could you post a more complete repro? CPU fusion is in progress and targeting 1.9.

Ah yeah, I was using CPU, that would explain it. I didn’t realize that GPU fusion was working but CPU fusion was in progress. Thank you.