PyTorch 2.0 torch.compile works on CPU but not GPU

Any idea what could be going on? Is it a bug or some misconfiguration?

WSL 2: 5.10.102.1-microsoft-standard-WSL2
Torch version: 2.0.0+cu118
CUDA: 11.8
Python: 3.10.9
cuDNN: 8700
GPU: NVIDIA GeForce RTX 3090
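
(For reference, a small sketch using standard PyTorch/stdlib calls that reproduces the version report above:)

import platform
import torch

print("Kernel:", platform.release())             # 5.10.102.1-microsoft-standard-WSL2
print("Torch:", torch.__version__)               # 2.0.0+cu118
print("CUDA:", torch.version.cuda)               # 11.8
print("Python:", platform.python_version())      # 3.10.9
print("cuDNN:", torch.backends.cudnn.version())  # 8700
print("GPU:", torch.cuda.get_device_name(0))     # NVIDIA GeForce RTX 3090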


import torch


class TestSig(torch.nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return torch.sigmoid(x)


torch._dynamo.config.verbose = True

opt_cpu = torch.compile(TestSig())
print("cpu:", opt_cpu(torch.randn(1)))

cuda_eager = TestSig().cuda()
print("cuda eager:", cuda_eager(torch.randn(1).cuda()))

opt_cuda = torch.compile(TestSig()).cuda()  # torch.compile(TestSig().cuda()) also fails
print("cuda opt:", opt_cuda(torch.randn(1).cuda()))

cpu: tensor([0.1461]) 
cuda eager: tensor([0.7304], device='cuda:0')


---------------------------------------------------------------------------
_RemoteTraceback                          Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/tech/.local/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 549, in _worker_compile
    kernel.precompile(warm_cache_only_with_cc=cc)
  File "/home/tech/.local/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 69, in precompile
    self.launchers = [
  File "/home/tech/.local/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 70, in <listcomp>
    self._precompile_config(c, warm_cache_only_with_cc)
  File "/home/tech/.local/lib/python3.10/site-packages/torch/_inductor/triton_ops/autotune.py", line 83, in _precompile_config
    triton.compile(
  File "/home/tech/.local/lib/python3.10/site-packages/triton/compiler.py", line 1587, in compile
    so_path = make_stub(name, signature, constants)
  File "/home/tech/.local/lib/python3.10/site-packages/triton/compiler.py", line 1476, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/home/tech/.local/lib/python3.10/site-packages/triton/compiler.py", line 1391, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/bin/gcc', '/tmp/tmp95fa71kl/main.c', '-O3', '-I/usr/local/cuda/include', '-I/usr/local/include/python3.10', '-I/tmp/tmp95fa71kl', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp95fa71kl/triton_.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/wsl/lib']' returned non-zero exit status 1.
"""

The above exception was the direct cause of the following exception:

CalledProcessError                        Traceback (most recent call last)
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/output_graph.py:670, in OutputGraph.call_user_compiler(self, gm)
--> 670   compiled_fn = compiler_fn(gm, self.fake_example_inputs())
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py:1055, in wrap_backend_debug.<locals>.debug_wrapper(gm, example_inputs, **kwargs)
--> 1055  compiled_gm = compiler_fn(gm, example_inputs)
File ~/.local/lib/python3.10/site-packages/torch/__init__.py:1390, in _TorchCompileInductorWrapper.__call__(self, model_, inputs_)
--> 1390  return compile_fx(model_, inputs_, config_patches=self.config)
File ~/.local/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:455, in compile_fx(model_, example_inputs_, inner_compile, config_patches)
--> 455   return aot_autograd(
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/backends/common.py:48, in aot_autograd.<locals>.compiler_fn(gm, example_inputs)
---> 48   cg = aot_module_simplified(gm, example_inputs, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:2805, in aot_module_simplified(mod, args, fw_compiler, bw_compiler, partition_fn, decompositions, hasher_type, static_argnums, keep_inference_input_mutations)
--> 2805  compiled_fn = create_aot_dispatcher_function(
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/utils.py:163, in dynamo_timed.<locals>.dynamo_timed_inner.<locals>.time_wrapper(*args, **kwargs)
--> 163   r = func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:2498, in create_aot_dispatcher_function(flat_fn, flat_args, aot_config)
--> 2498  compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config)
File ~/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:1713, in aot_wrapper_dedupe(flat_fn, flat_args, aot_config, compiler_fn)
--> 1713  return compiler_fn(flat_fn, leaf_flat_args, aot_config)
File ~/.local/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py:1326, in aot_dispatch_base(flat_fn, flat_args, aot_config)
--> 1326  compiled_fw = aot_config.fw_compiler(fw_module, flat_args_with_views_handled)
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/utils.py:163, in dynamo_timed.<locals>.dynamo_timed_inner.<locals>.time_wrapper(*args, **kwargs)
--> 163   r = func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:430, in compile_fx.<locals>.fw_compiler(model, example_inputs)
--> 430   return inner_compile(
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/debug_utils.py:595, in wrap_compiler_debug.<locals>.debug_wrapper(gm, example_inputs, **kwargs)
--> 595   compiled_fn = compiler_fn(gm, example_inputs)
File ~/.local/lib/python3.10/site-packages/torch/_inductor/debug.py:239, in DebugContext.wrap.<locals>.inner(*args, **kwargs)
--> 239   return fn(*args, **kwargs)
File /usr/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
---> 79   return func(*args, **kwds)
File ~/.local/lib/python3.10/site-packages/torch/_inductor/compile_fx.py:177, in compile_fx_inner(gm, example_inputs, cudagraphs, num_fixed, is_backward, graph_id)
--> 177   compiled_fn = graph.compile_to_fn()
File ~/.local/lib/python3.10/site-packages/torch/_inductor/graph.py:586, in GraphLowering.compile_to_fn(self)
--> 586   return self.compile_to_module().call
File ~/.local/lib/python3.10/site-packages/torch/_dynamo/utils.py:163, in dynamo_timed.<locals>.dynamo_timed_inner.<locals>.time_wrapper(*args, **kwargs)
--> 163   r = func(*args, **kwargs)
File ~/.local/lib/python3.10/site-packages/torch/_inductor/graph.py:575, in GraphLowering.compile_to_module(self)
--> 575   mod = PyCodeCache.load(code)
File ~/.local/lib/python3.10/site-packages/torch/_inductor/codecache.py:528, in PyCodeCache.load(cls, source_code)
--> 528   exec(code, mod.__dict__, mod.__dict__)
File /tmp/torchinductor_tech/fl/cflyzetaelrdigwqk7eeqcd4ltjygnu2ngsoprkrcxeecyg274xg.py:42
---> 42   async_compile.wait(globals())
File ~/.local/lib/python3.10/site-packages/torch/_inductor/codecache.py:715, in AsyncCompile.wait(self, scope)
--> 715   scope[key] = result.result()
File ~/.local/lib/python3.10/site-packages/torch/_inductor/codecache.py:573, in TritonFuture.result(self)
--> 573   self.future.result()
File /usr/lib/python3.10/concurrent/futures/_base.py:458, in Future.result(self, timeout)
--> 458   return self.__get_result()
File /usr/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
--> 403   raise self._exception

CalledProcessError: Command '['/bin/gcc', '/tmp/tmp95fa71kl/main.c', '-O3', '-I/usr/local/cuda/include', '-I/usr/local/include/python3.10', '-I/tmp/tmp95fa71kl', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp95fa71kl/triton_.cpython-310-x86_64-linux-gnu.so', '-L/usr/lib/wsl/lib']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

BackendCompilerFailed                     Traceback (most recent call last)
Cell In[31], line 15

...

You can suppress this exception and fall back to eager by setting: torch._dynamo.config.suppress_errors = True
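
One thing I can try to dig deeper (a sketch, not verified): subprocess.check_call discards gcc's stderr, so the traceback never shows the actual compiler or linker error. Re-running an equivalent command by hand with captured output should surface it. The stub main.c below is a stand-in I made up (the original /tmp/tmp95fa71kl directory is gone after the crash); the include/link flags are copied verbatim from the failing command.

import pathlib
import subprocess
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "main.c"
    # Minimal stand-in exercising the same headers/libraries Triton's stub needs.
    src.write_text("#include <Python.h>\n#include <cuda.h>\nint dummy(void) { return 0; }\n")
    cmd = [
        "/bin/gcc", str(src), "-O3",
        "-I/usr/local/cuda/include",        # cuda.h from the CUDA toolkit
        "-I/usr/local/include/python3.10",  # Python.h (python3.10 dev headers)
        f"-I{tmp}", "-shared", "-fPIC",
        "-lcuda",                           # libcuda.so, on WSL2 under /usr/lib/wsl/lib
        "-o", str(pathlib.Path(tmp) / "triton_test.so"),
        "-L/usr/lib/wsl/lib",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.returncode)
    print(result.stderr)  # the actual error that check_call hid

If this reports a missing Python.h or "cannot find -lcuda", the likely culprits would be missing python3.10 dev headers or the libcuda stub in /usr/lib/wsl/lib, but I haven't confirmed either.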


I cannot reproduce the issue in my 2.0.0+cu118 environment and get:

cpu: tensor([0.4367])
cuda eager: tensor([0.7714], device='cuda:0')
cuda opt: tensor([0.4290], device='cuda:0')

but I'm on native Linux, not WSL2. Could you create an issue on GitHub so that the code owners can track and fix it, please?
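
When you create the issue, attaching PyTorch's bundled environment report usually speeds up triage, e.g.:

# Collect the full environment report to paste into the GitHub issue
# (ships with PyTorch; also runnable as `python -m torch.utils.collect_env`).
from torch.utils.collect_env import main as collect_env

collect_env()  # prints OS, GCC, CUDA/cuDNN, driver, and installed package versions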


Thanks for trying to reproduce it; will do.