Model trained with PyTorch 2.0 on multiple GPUs does not work with one GPU

Hello,

I just trained a model on 3 A100 cards using PyTorch 2.0.1. Training works fine, and inference on 3 GPU cards also runs normally. However, when I tried to run inference with PyTorch 2.0 on a single card, the following error occurred:

In file included from /tmp/tmpfjotpdpx/main.c:2:
/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/third_party/cuda/include/cuda.h:55:10: fatal error: stdlib.h: No such file or directory
   55 | #include <stdlib.h>
      |          ^~~~~~~~~~
compilation terminated.
In file included from /tmp/tmpy4fwn19e/main.c:2:
/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/third_party/cuda/include/cuda.h:55:10: fatal error: stdlib.h: No such file or directory
   55 | #include <stdlib.h>
      |          ^~~~~~~~~~
compilation terminated.
compilation terminated.
In file included from /tmp/tmptis__htf/main.c:2:
/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/third_party/cuda/include/cuda.h:55:10: fatal error: stdlib.h: No such file or directory
   55 | #include <stdlib.h>
      |          ^~~~~~~~~~
compilation terminated.
In file included from /tmp/tmpajqnwuaa/main.c:2:
/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/third_party/cuda/include/cuda.h:55:10: fatal error: stdlib.h: No such file or directory
   55 | #include <stdlib.h>
      |          ^~~~~~~~~~
compilation terminated.
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/torch/_inductor/codecache.py", line 549, in _worker_compile
    kernel.precompile(warm_cache_only_with_cc=cc)
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/torch/_inductor/triton_ops/autotune.py", line 69, in precompile
    self.launchers = [
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/torch/_inductor/triton_ops/autotune.py", line 70, in <listcomp>
    self._precompile_config(c, warm_cache_only_with_cc)
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/torch/_inductor/triton_ops/autotune.py", line 83, in _precompile_config
    triton.compile(
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/compiler.py", line 1587, in compile
    so_path = make_stub(name, signature, constants)
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/compiler.py", line 1476, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/compiler.py", line 1391, in _build
    ret = subprocess.check_call(cc_cmd)
  File "/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/opt/ohpc/pub/compiler/gcc/9.4.0/bin/gcc', '/tmp/tmp2haucfn0/main.c', '-O3', '-I/beegfs/userhome/gabrielpan/.conda/envs/torch2/lib/python3.8/site-packages/triton/third_party/cuda/include', '-I/beegfs/userhome/gabrielpan/.conda/envs/torch2/include/python3.8', '-I/tmp/tmp2haucfn0', '-shared', '-fPIC', '-lcuda', '-o', '/tmp/tmp2haucfn0/triton_.cpython-38-x86_64-linux-gnu.so', '-L/usr/lib64']' returned non-zero exit status 1.
"""

I then tried running inference with the same model using PyTorch 1.13 on 1 GPU card, and it worked fine.

Has anyone met the same issue, and could anyone help me with it?

Best Regards,
Gabriel

It seems to work after I changed the compile backend to torch.compile(self.model, backend="aot_eager"),
but I don’t know whether I should change the training compile backend as well.
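
For reference, this is roughly what I am doing now (a simplified sketch; the small Sequential stands in for my actual model):

import torch
import torch.nn as nn

# Placeholder for my actual model; the real one is much larger.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU()).cuda().eval()

# Workaround: aot_eager skips the Inductor/Triton compilation step that
# fails here and runs the captured graphs with PyTorch eager instead.
compiled_model = torch.compile(model, backend="aot_eager")

with torch.no_grad():
    output = compiled_model(torch.randn(8, 128, device="cuda"))
print(output.shape)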

The error message points to a missing system header (stdlib.h) raised from OpenAI/Triton, and I wouldn’t know how that could depend on the number of GPUs used.
Did you change anything else in your environment, and was torch.compile working before?

I didn’t change anything but the device number. The interesting thing is that the model works on 1 GPU after I changed the compile backend to torch.compile(self.model, backend="aot_eager"), following the suggestion in PyTorch 2.0 compile problem in mac. Honestly, I have no idea how or why it works…

From the docs:

torch.compile(..., backend="aot_eager") which runs torchdynamo to capture a forward graph, and then AOTAutograd to trace the backward graph without any additional backend compiler steps. PyTorch eager will then be used to run the forward and backward graphs. If this fails then there’s an issue with AOTAutograd.

which explains why it’s not failing anymore: OpenAI/Triton is no longer used, and the captured graphs run in PyTorch eager mode instead. It would be interesting to get any code which could reproduce the issue.
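
Even something as small as the following would help (a minimal sketch using the default Inductor backend, which should trigger the Triton stub compilation; the model here is just a placeholder):

import torch
import torch.nn as nn

# Any module that Inductor lowers to Triton kernels should do as a repro.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU()).cuda()

# The default backend is "inductor", which invokes Triton and the gcc
# stub build that fails in the posted traceback.
compiled_model = torch.compile(model)

x = torch.randn(8, 64, device="cuda")
out = compiled_model(x)  # compilation is triggered on the first call
print(out.shape)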
