RuntimeError: GET was unable to find an engine to execute this computation

I’m currently adapting an existing model to run on Replicate, which involves building a container and pushing it to their servers. They use a tool called cog, which is basically an abstraction around Docker.

When running the model within the container, I get the errors below.

I tried running the container elsewhere as well as on Replicate and get the same errors, so clearly something is wrong with my container. I suspect a wrong version of PyTorch or of another package.

Could anybody help me figure this out, please? Thanks!

root@e9a4ecf8be50:/src# python main.py --config configs/text.yaml prompt="a photo of an icecream" save_path=icecream
Number of points at initialisation :  5000
[INFO] loading SD...
Loading pipeline components...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 6/6 [00:01<00:00,  3.31it/s]
[INFO] loaded SD!
  0%|                                                                                                                                                           | 0/500 [00:00<?, ?it/s]Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
  0%|                                                                                                                                                           | 0/500 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/src/main.py", line 902, in <module>
    gui.train(opt.iters)
  File "/src/main.py", line 878, in train
    self.train_step()
  File "/src/main.py", line 258, in train_step
    loss.backward()
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: GET was unable to find an engine to execute this computation

Here’s some debug info about my current configuration:

docker --version

Docker version 24.0.7, build afdd53b

docker container: nvcc --version

root@e9a4ecf8be50:/src# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

docker container: nvidia-smi

root@e9a4ecf8be50:/src# nvidia-smi
Sun Dec 10 19:59:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05   Driver Version: 525.147.05   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P8    14W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

docker container: pytorch version

>>> import torch
>>> torch.__version__
'2.1.1+cu121'

lsb_release -a

Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy

cog.yaml (contains packages, Python version, CUDA version, etc.)

build:
  gpu: true
  cuda: "12.1"
  system_packages:
    - "libgl1-mesa-glx"
    - "libegl1-mesa-dev"
  python_version: "3.10"
  python_packages:
    - "tqdm"
    - "rich"
    - "ninja"
    - "numpy"
    - "pandas"
    - "scipy"
    - "scikit-learn"
    - "matplotlib"
    - "opencv-python"
    - "imageio"
    - "imageio-ffmpeg"
    - "omegaconf"
    - "torch==2.1.0"
    - "einops"
    - "plyfile"
    - "pygltflib"
    - "dearpygui"
    - "huggingface_hub"
    - "diffusers"
    - "accelerate"
    - "transformers"
    - "xatlas"
    - "trimesh"
    - "PyMCubes"
    - "pymeshlab"
    - "rembg[gpu,cli]"
  run:
    - "git clone --recursive https://github.com/ashawkey/diff-gaussian-rasterization"
    - "pip install ./diff-gaussian-rasterization"
    - "pip install git+https://github.com/dreamgaussian/dreamgaussian/#subdirectory=simple-knn"
    - "pip install git+https://github.com/NVlabs/nvdiffrast/"
    - "pip install git+https://github.com/ashawkey/kiuikit"
    - "pip install git+https://github.com/bytedance/MVDream"
    - "echo 'READY.'"

It seems your cuDNN installation is broken or you are trying to mix different cuDNN versions. If you are using the PyTorch binaries, you could uninstall your locally installed cuDNN package or remove it from the library path so it won’t be loaded.
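
If it helps, you can check which CUDA runtime and cuDNN build PyTorch itself reports, and compare that against the libcudnn*.so.8 files under /usr/lib/x86_64-linux-gnu (the version numbers in the comments are examples, not taken from this thread):

import torch

# CUDA runtime the wheel was built against, e.g. '12.1'
print(torch.version.cuda)
# cuDNN version PyTorch actually loads, e.g. 8902 for cuDNN 8.9.2
print(torch.backends.cudnn.version())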

It seems like cuDNN is provided by the base image that Replicate’s cog utility uses. I just found out that they have had issues with this in the past → RuntimeError: cuDNN version incompatibility · Issue #815 · replicate/cog · GitHub

I don’t really know how I could install a different version of cuDNN since it seems to be part of the base image they provide, but I’ll look into it. At least this gives me a good lead to follow. Thanks!

You don’t need to install CUDA or cuDNN locally if you are using the PyTorch binaries, as they already ship with these dependencies. Only a properly installed NVIDIA driver is needed.

But what if I’m running PyTorch from inside a Docker container? I assume the version of PyTorch still needs to be compatible with the version of the driver installed on the host machine. I double-checked all the versions and made sure they were compatible with each other. I’m probably missing something :face_with_monocle:

So basically, does that mean my version of PyTorch is wrong and I need a version that’s compatible with the driver only?

I looked up every compatibility matrix I could find, and it seems like everything should be compatible. (You can see all the versions in my OP.)

Thanks for your help. Greatly appreciated!

That’s right. You would need to use a properly installed NVIDIA driver, but don’t need a locally installed CUDA toolkit or cuDNN, since these are shipped as dependencies in the PyTorch binaries.
Your locally installed CUDA toolkit (including cuDNN) would only be used if you build PyTorch from source or build custom CUDA extensions.

No, it means that your locally installed CUDA toolkit and cuDNN conflict with the PyTorch binaries. Either rebuild PyTorch from source if you explicitly want to use this CUDA toolkit and cuDNN, or delete them (or remove them from the LD_LIBRARY_PATH).
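
A quick way to check whether a system-wide cuDNN is visible to the dynamic loader is something like this sketch (ctypes.util.find_library consults ldconfig on Linux):

import os
import ctypes.util

# directories searched before the bundled libraries, if set
print(os.environ.get("LD_LIBRARY_PATH"))
# e.g. 'libcudnn.so.8' if a system-wide cuDNN is registered, else None
print(ctypes.util.find_library("cudnn"))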

It seems you are right! I removed cuDNN from the container (apt-get remove libcudnn8) and the error disappeared :partying_face: so it was a conflict between what’s in the container and what’s bundled with PyTorch!
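
For anyone landing here: since the crash happened in loss.backward(), a quick sanity check along these lines (my own sketch, not part of the original code) exercises the same cuDNN training kernels that previously failed to load:

import torch

# a conv forward + backward goes through the cuDNN kernels
# (libcudnn_cnn_infer / libcudnn_cnn_train) from the error above
x = torch.randn(1, 3, 32, 32, device="cuda", requires_grad=True)
conv = torch.nn.Conv2d(3, 8, kernel_size=3).cuda()
conv(x).sum().backward()
print("cuDNN backward OK, version:", torch.backends.cudnn.version())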

Now my model almost runs to completion, but I encountered the following error, which I’m researching at the moment.

[F glutil.cpp:338] eglInitialize() failed
Aborted (core dumped)

That’s a different story, though. I’m glad to finally understand what was happening in my OP.

Thanks so much for your help, you’re a real life saver!

Good to hear it’s working now!
If I’m not mistaken, the new error is related to OpenGL. I’m not sure whether you are trying to visualize something inside your container, but visualizations might not be supported there.
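
The eglInitialize() failure looks like it comes from nvdiffrast (the glutil.cpp in the message matches its source tree). If so, one possible workaround, assuming a recent nvdiffrast release, is to use its CUDA rasterizer instead of the OpenGL/EGL one, so no EGL display is needed inside the container:

import nvdiffrast.torch as dr

# RasterizeCudaContext avoids the OpenGL/EGL path entirely
# (available in nvdiffrast >= 0.3.0; untested here, just a pointer)
glctx = dr.RasterizeCudaContext()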
