I’m experimenting with the idea of generating TorchScript IR as part of a project I’m working on, and would like to better understand how it’s compiled and optimized.
When I load a module in C++, is there any compilation/optimization that can be performed before the first time I call the forward method? Or do optimizations only happen with a JIT that runs after data is first passed in?
If there are JIT optimizations to be done when I call the forward method, do the optimizations persist so that the next time I call the forward method on the same module object, they won’t have to be performed again?
If a module/model gets optimized by the JIT, do any of those optimizations change the output I’d get if I now wrote the model back out to a TorchScript .pt file?
Where does kernel fusion happen? Is that part of the JIT, or is that something that has to be figured out when generating the IR in Python?
Does PyTorch have the ability to dynamically decide whether an operation should be run on CPU or GPU based on things like data size?
If there’s some resource I haven’t yet seen that could help me better understand this stuff, a pointer towards it would be appreciated. Thanks for your help.
Yes, they do persist; however, if you pass in tensors of a different type we may respecialize.
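A small sketch of what that looks like from the user's side (the function name is just illustrative; the respecialization itself happens inside the JIT and isn't directly visible here):

import torch

@torch.jit.script
def triple(x):
    return x + x + x

# The first calls let the profiling executor record the shapes/dtypes
# it sees and specialize the graph for float32 inputs.
a = triple(torch.rand(4, 4))
b = triple(torch.rand(4, 4))  # reuses the float32 specialization

# A different dtype no longer matches the recorded profile, so the
# JIT may respecialize for float64 behind the scenes.
c = triple(torch.rand(4, 4, dtype=torch.float64))
print(c.dtype)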
JIT optimizations will not affect saving your model.
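You can convince yourself of this by round-tripping a scripted module after it has been run — the serialized code is still the original TorchScript, not the optimized runtime graph (a sketch using an in-memory buffer instead of a .pt file):

import io
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x + x + x

m = torch.jit.script(M())
m(torch.rand(4, 4))  # run it so the JIT has a chance to profile/optimize

buf = io.BytesIO()
torch.jit.save(m, buf)
buf.seek(0)
reloaded = torch.jit.load(buf)

# The saved code is the original TorchScript, unaffected by runtime optimization.
print(reloaded.code)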
That happens as part of the JIT. It’s happens after we profile the tensors that are seen at runtime.
No, one of the main design points of PyTorch is that it does not automatically decide which parts of the module to run on GPU vs CPU, for both eager and JIT. That is controlled by the user.
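In other words, placement follows the device of the tensors you create, and moving data is always an explicit user decision — e.g. (a sketch that falls back to CPU when no GPU is present):

import torch

x = torch.rand(512, 512)        # created on CPU, so the matmul below runs on CPU
y = x @ x

if torch.cuda.is_available():
    x_gpu = x.to('cuda')        # explicit transfer by the user
    y_gpu = x_gpu @ x_gpu       # now the matmul runs on the GPU
    y = y_gpu.cpu()             # coming back is explicit too

print(y.device)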
To better understand optimizations I would suggest running a simple file like:
import torch
@torch.jit.script
def foo(x):
    return x + x + x
foo(torch.rand([4, 4], device='cuda'))
foo(torch.rand([4, 4], device='cuda'))
and running it with PYTORCH_JIT_LOG_LEVEL=profiling_graph_executor_impl python foo.py
It still makes two separate calls to add. I would think part of kernel fusion is that it could somehow make one call to add that takes in all three inputs. Am I misinterpreting the graph, or am I misunderstanding what kernel fusion does?