Questions about Pytorch 2.0 tutorial and TORCH_COMPILE_DEBUG=1

I’m going through the PyTorch 2.0 tutorial here: Accelerating Hugging Face and TIMM models with PyTorch 2.0 | PyTorch.

When I get to the section “A toy example” there’s a point where they say you can run the toy model with:

TORCH_COMPILE_DEBUG=1 python trig.py
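
For reference, trig.py here is just the toy example from that section. Reconstructed from the trace below (the lines outside fn are paraphrased from the tutorial, so they may differ slightly), it looks like:

import torch

def fn(x, y):
    a = torch.sin(x).cuda()
    b = torch.sin(y).cuda()
    return a + b

new_fn = torch.compile(fn, backend="inductor")
input_tensor = torch.randn(10000).to(device="cuda:0")
out = new_fn(input_tensor, input_tensor)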

And this apparently is supposed to generate the output:

@pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 10000
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.sin(tmp0)
    tmp2 = tl.sin(tmp1)
    tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)

However, I do not see that output; I just see a lot of debug output. I also notice that a torch_compile_debug directory has been created, so I went into it to see if that’s where I could find the specified output. There’s a timestamp directory, and under it I see two subdirectories: aot_torchinductor and torchdynamo.

In torchdynamo there’s a debug.log with the following:

skipping __init__ /usr/lib/python3.10/contextlib.py
skipping __enter__ /usr/lib/python3.10/contextlib.py
skipping __init__ /usr/lib/python3.10/contextlib.py
skipping __enter__ /usr/lib/python3.10/contextlib.py
skipping enable_dynamic /home/phil/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
Step 1: torchdynamo start tracing fn
TRACE starts_line /home/phil/devel/pytorch/trig.py:3
TRACE LOAD_GLOBAL torch []
TRACE LOAD_ATTR sin [TorchVariable(<module 'torch' from '/home/phil/.local/lib/python3.10/site-packages/torch/__init__.py'>)]
TRACE LOAD_FAST x [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>)]
TRACE CALL_FUNCTION 1 [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>), TensorVariable()]
TRACE LOAD_ATTR cuda [TensorVariable()]
TRACE CALL_FUNCTION 0 [GetAttrVariable(TensorVariable(), cuda)]
TRACE STORE_FAST a [TensorVariable()]
TRACE starts_line /home/phil/devel/pytorch/trig.py:4
TRACE LOAD_GLOBAL torch []
TRACE LOAD_ATTR sin [TorchVariable(<module 'torch' from '/home/phil/.local/lib/python3.10/site-packages/torch/__init__.py'>)]
TRACE LOAD_FAST y [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>)]
TRACE CALL_FUNCTION 1 [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>), TensorVariable()]
TRACE LOAD_ATTR cuda [TensorVariable()]
TRACE CALL_FUNCTION 0 [GetAttrVariable(TensorVariable(), cuda)]
TRACE STORE_FAST b [TensorVariable()]
TRACE starts_line /home/phil/devel/pytorch/trig.py:5
TRACE LOAD_FAST a []
TRACE LOAD_FAST b [TensorVariable()]
TRACE BINARY_ADD None [TensorVariable(), TensorVariable()]
TRACE RETURN_VALUE None [TensorVariable()]
Step 1: torchdynamo done tracing fn (RETURN_VALUE)
RETURN_VALUE triggered compile
COMPILING GRAPH due to None
Step 2: calling compiler function debug_wrapper
Step 2: done compiler function debug_wrapper
skipping _fn /home/phil/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
skipping nothing /home/phil/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
skipping __exit__ /usr/lib/python3.10/contextlib.py
skipping __exit__ /usr/lib/python3.10/contextlib.py

In the aot_torchinductor dir there’s a file called aot_model__0_debug.log that has:

[aot_autograd.py:1054 DEBUG] ====== Forward (only) graph 0 ======
[aot_autograd.py:1055 DEBUG] class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[10000], arg1_1: f32[10000]):
        # File: /home/phil/devel/pytorch/trig.py:3, code: a = torch.sin(x).cuda()
        sin: f32[10000] = torch.ops.aten.sin.default(arg0_1);  arg0_1 = None

        # File: /home/phil/devel/pytorch/trig.py:4, code: b = torch.sin(y).cuda()
        sin_1: f32[10000] = torch.ops.aten.sin.default(arg1_1);  arg1_1 = None

        # File: /home/phil/devel/pytorch/trig.py:5, code: return a + b
        add: f32[10000] = torch.ops.aten.add.Tensor(sin, sin_1);  sin = sin_1 = None
        return (add,)

This looks a bit closer to what I was expecting, but it’s still quite different.
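
(For what it’s worth, the graph dynamo captures can also be printed directly, without the debug env var, by passing a custom backend to torch.compile; the aten-level graph above is produced later by AOTAutograd. Minimal sketch, using the same toy fn:)

import torch

def fn(x, y):
    a = torch.sin(x).cuda()
    b = torch.sin(y).cuda()
    return a + b

def inspect_backend(gm, example_inputs):
    # gm is the torch.fx.GraphModule dynamo captured for fn; print its
    # generated Python code, then just run the captured graph eagerly.
    print(gm.code)
    return gm.forward

compiled_fn = torch.compile(fn, backend=inspect_backend)
compiled_fn(torch.randn(10000), torch.randn(10000))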

Also, further down in the tutorial, in the Hugging Face/BERT example, it says:

If you remove the `to(device="cuda:0")` from the model and `encoded_input` then PyTorch 2.0 will generate C++ kernels that will be optimized for running on your CPU.

I tried that, but don’t see any C++ output anywhere.
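
For reference, the CPU-only variant I tried looks roughly like this (the tutorial’s Hugging Face BERT snippet with the .to(device="cuda:0") calls removed; model name and tokenizer usage as in the tutorial):

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")   # no .to(device="cuda:0")
model = torch.compile(model)

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")     # no .to(device="cuda:0")
output = model(**encoded_input)
# With everything on the CPU, inductor is supposed to generate C++/OpenMP
# kernels instead of Triton ones.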


Hi,

I think you can find what you expect at …/aot_torchinductor/model__0_inference_0.0/output_code.py.


@EstherBear is correct; there should be a tmp folder printed in your console that shows where the inductor code is written. We could make this clearer and package everything in the same directory. cc @mlazos


By default the Python output code should be written to torch_compile_debug/run_<timestamp>/aot_torchinductor/model_…/output_code.py.

There will be a separate model_… directory for each subgraph that is compiled.
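
If you’re not sure which model_… directory to look in, here is a minimal sketch (standard library only, assuming you run it from the directory that contains torch_compile_debug) that lists every generated output_code.py:

import glob
import os

# Each compiled subgraph gets its own model_... directory; list all of the
# Inductor-generated wrapper files found under torch_compile_debug.
for path in sorted(glob.glob("torch_compile_debug/**/output_code.py", recursive=True)):
    print(path, "-", os.path.getsize(path), "bytes")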