I’m going through the PyTorch 2.0 tutorial “Accelerating Hugging Face and TIMM models with PyTorch 2.0”.
When I get to the section “A toy example” there’s a point where they say you can run the toy model with:
TORCH_COMPILE_DEBUG=1 python trig.py
This is apparently supposed to generate the following output:
@pointwise(size_hints=[16384], filename=__file__, meta={'signature': {0: '*fp32', 1: '*fp32', 2: 'i32'}, 'device': 0, 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=())]})
@triton.jit
def kernel(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 10000
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = tl.sin(tmp0)
    tmp2 = tl.sin(tmp1)
    tl.store(out_ptr0 + (x0 + tl.zeros([XBLOCK], tl.int32)), tmp2, xmask)
However, I do not see that output. I see a lot of debug output. I also noticed that a torch_compile_debug directory had been created, so I went into it to see if that’s where I could find the specified output. It contains a timestamped directory, and under that I see two subdirectories: aot_torchinductor and torchdynamo.
In torchdynamo there’s a debug.log with the following:
skipping __init__ /usr/lib/python3.10/contextlib.py
skipping __enter__ /usr/lib/python3.10/contextlib.py
skipping __init__ /usr/lib/python3.10/contextlib.py
skipping __enter__ /usr/lib/python3.10/contextlib.py
skipping enable_dynamic /home/phil/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
Step 1: torchdynamo start tracing fn
TRACE starts_line /home/phil/devel/pytorch/trig.py:3
TRACE LOAD_GLOBAL torch []
TRACE LOAD_ATTR sin [TorchVariable(<module 'torch' from '/home/phil/.local/lib/python3.10/site-packages/torch/__init__.py'>)]
TRACE LOAD_FAST x [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>)]
TRACE CALL_FUNCTION 1 [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>), TensorVariable()]
TRACE LOAD_ATTR cuda [TensorVariable()]
TRACE CALL_FUNCTION 0 [GetAttrVariable(TensorVariable(), cuda)]
TRACE STORE_FAST a [TensorVariable()]
TRACE starts_line /home/phil/devel/pytorch/trig.py:4
TRACE LOAD_GLOBAL torch []
TRACE LOAD_ATTR sin [TorchVariable(<module 'torch' from '/home/phil/.local/lib/python3.10/site-packages/torch/__init__.py'>)]
TRACE LOAD_FAST y [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>)]
TRACE CALL_FUNCTION 1 [TorchVariable(<built-in method sin of type object at 0x7fa801eb5080>), TensorVariable()]
TRACE LOAD_ATTR cuda [TensorVariable()]
TRACE CALL_FUNCTION 0 [GetAttrVariable(TensorVariable(), cuda)]
TRACE STORE_FAST b [TensorVariable()]
TRACE starts_line /home/phil/devel/pytorch/trig.py:5
TRACE LOAD_FAST a []
TRACE LOAD_FAST b [TensorVariable()]
TRACE BINARY_ADD None [TensorVariable(), TensorVariable()]
TRACE RETURN_VALUE None [TensorVariable()]
Step 1: torchdynamo done tracing fn (RETURN_VALUE)
RETURN_VALUE triggered compile
COMPILING GRAPH due to None
Step 2: calling compiler function debug_wrapper
Step 2: done compiler function debug_wrapper
skipping _fn /home/phil/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
skipping nothing /home/phil/.local/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py
skipping __exit__ /usr/lib/python3.10/contextlib.py
skipping __exit__ /usr/lib/python3.10/contextlib.py
In the aot_torchinductor dir there’s a file called aot_model__0_debug.log that has:
[aot_autograd.py:1054 DEBUG] ====== Forward (only) graph 0 ======
[aot_autograd.py:1055 DEBUG] class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: f32[10000], arg1_1: f32[10000]):
        # File: /home/phil/devel/pytorch/trig.py:3, code: a = torch.sin(x).cuda()
        sin: f32[10000] = torch.ops.aten.sin.default(arg0_1); arg0_1 = None
        # File: /home/phil/devel/pytorch/trig.py:4, code: b = torch.sin(y).cuda()
        sin_1: f32[10000] = torch.ops.aten.sin.default(arg1_1); arg1_1 = None
        # File: /home/phil/devel/pytorch/trig.py:5, code: return a + b
        add: f32[10000] = torch.ops.aten.add.Tensor(sin, sin_1); sin = sin_1 = None
        return (add,)
This looks a bit closer to the expected output, but it is still quite different: it is an ATen-level graph, not a Triton kernel.
Also, further down in the tutorial, in the Hugging Face/BERT example, it says:
If you remove the `to(device="cuda:0")` from the model and `encoded_input` then PyTorch 2.0 will generate C++ kernels that will be optimized for running on your CPU.
I tried that, but don’t see any C++ output anywhere.
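To rule out that I’m simply overlooking a file, something like the script below could search the whole torch_compile_debug directory for generated kernel text. This is only a sketch: the helper name find_files_containing is made up for illustration, the "@triton.jit" marker is taken from the tutorial’s expected output, and the 'extern "C"' marker is my guess at what a generated C++ kernel might contain.

```python
import os


def find_files_containing(root, markers):
    """Return sorted paths under `root` whose text contains any marker string."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" lets us scan files that are not valid UTF-8
                with open(path, "r", encoding="utf-8", errors="ignore") as f:
                    text = f.read()
            except OSError:
                continue  # unreadable file, skip it
            if any(marker in text for marker in markers):
                hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # "@triton.jit" comes from the tutorial output; 'extern "C"' is an assumption
    markers = ["@triton.jit", 'extern "C"']
    for path in find_files_containing("torch_compile_debug", markers):
        print(path)
```

Run from the directory that contains torch_compile_debug, it prints every file that mentions one of the markers, or nothing if no such file exists.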