FX graph reordering a torch model's positional args based on execution order

It seems like the order of the args in an FX graph is determined by the order in which the args are first used during execution, rather than the order in which they appear in the model's signature.

Is this expected behaviour? If it is, how can I obtain an FX graph that keeps the original argument order, or otherwise work around the reordering?

I’ve attached an example below.

import torch

def model(x_1, x_2, x_3):
    x_5 = torch.nn.functional.conv2d(x_1, x_2, x_3)
    return x_5

def model_1(x_1, x_2, x_3):
    x_4 = x_2  # x_2 is touched before x_1 here
    x_5 = torch.nn.functional.conv2d(x_1, x_4, x_3)
    return x_5

x_1 = torch.randn([4, 6, 8, 4])  # input
x_2 = torch.randn([4, 6, 2, 2])  # weight
x_3 = torch.randn([4])           # bias

def get_fx(model, args):
    graphs = []

    # Custom Dynamo backend: record and print each captured FX graph,
    # then hand it back to be run as-is.
    def some_backend(graph_module, sample_inputs):
        nonlocal graphs
        graph_module.print_readable()
        graphs.append(graph_module)
        return graph_module

    torch._dynamo.reset()
    torch.compile(model, backend=some_backend)(*args)
    return graphs

graph_module = get_fx(model, [x_1, x_2, x_3])
output = graph_module[0](x_1, x_2, x_3)

graph_module_1 = get_fx(model_1, [x_1, x_2, x_3])
output_1 = graph_module_1[0](x_1, x_2, x_3)

In the code above, I have two models. The only difference is that in model_1, the argument x_2 is used (assigned to x_4) before x_1.

The helper get_fx() compiles each model with a custom backend so it can capture and print the FX graphs that Dynamo produces.

The print for graph_module shows

def forward(self, L_x_1_ : torch.Tensor, L_x_2_ : torch.Tensor, L_x_3_ : torch.Tensor):
    l_x_1_ = L_x_1_
    l_x_2_ = L_x_2_
    l_x_3_ = L_x_3_
        
    x_5 = torch.conv2d(l_x_1_, l_x_2_, l_x_3_);  l_x_1_ = l_x_2_ = l_x_3_ = None
    return (x_5,)

which is expected.

The print for graph_module_1 shows

def forward(self, L_x_2_ : torch.Tensor, L_x_1_ : torch.Tensor, L_x_3_ : torch.Tensor):
    x_4 = L_x_2_
    l_x_1_ = L_x_1_
    l_x_3_ = L_x_3_
        
    x_5 = torch.conv2d(l_x_1_, x_4, l_x_3_);  l_x_1_ = x_4 = l_x_3_ = None
    return (x_5,)

We can see that the graph's forward() now expects x_2 before x_1.

As a result, I can call graph_module[0](x_1, x_2, x_3) without issue, but calling graph_module_1[0](x_1, x_2, x_3) raises an error, because the inputs end up bound to the wrong parameters.
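
Passing the inputs in the order the captured forward() declares them should run fine, which suggests only the calling convention changed (a quick check based on the printed signature above):

# Matches the captured signature (L_x_2_, L_x_1_, L_x_3_).
output_1 = graph_module_1[0](x_2, x_1, x_3)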

This is indeed intended behaviour: Dynamo orders the graph inputs by the order in which they are first used. If you need the original calling convention preserved, torch.export will maintain the correct input mapping.
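
Here is a minimal sketch of the torch.export route (assuming a recent PyTorch where torch.export.export takes an nn.Module plus example inputs); the exported program keeps the placeholders in the order of the original signature, and the module returned by exported.module() can be called with the original positional order:

import torch

class Model1(torch.nn.Module):
    def forward(self, x_1, x_2, x_3):
        x_4 = x_2
        return torch.nn.functional.conv2d(x_1, x_4, x_3)

# Same example tensors as above.
x_1 = torch.randn([4, 6, 8, 4])
x_2 = torch.randn([4, 6, 2, 2])
x_3 = torch.randn([4])

exported = torch.export.export(Model1(), (x_1, x_2, x_3))
exported.graph_module.print_readable()

# Calling convention matches the original model definition.
output = exported.module()(x_1, x_2, x_3)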