I'm currently adapting an existing model to run on Replicate, which means building a container image and pushing it to their servers. They use a tool called cog, which is essentially an abstraction over Docker.
When I run the model inside the container, I get the errors below.
I ran the container on another machine as well as on Replicate and got the same errors in both places, so the problem seems to be in the container itself. I suspect a mismatched version of PyTorch or of another package.
Could anybody help me figure this out? Thanks!
root@e9a4ecf8be50:/src# python main.py --config configs/text.yaml prompt="a photo of an icecream" save_path=icecream
Number of points at initialisation : 5000
[INFO] loading SD...
Loading pipeline components...: 100%|██████████| 6/6 [00:01<00:00, 3.31it/s]
[INFO] loaded SD!
0%| | 0/500 [00:00<?, ?it/s]Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
Could not load library libcudnn_cnn_train.so.8. Error: /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8: undefined symbol: _ZTIN10cask_cudnn14BaseKernelInfoE, version libcudnn_cnn_infer.so.8
0%| | 0/500 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/src/main.py", line 902, in <module>
    gui.train(opt.iters)
  File "/src/main.py", line 878, in train
    self.train_step()
  File "/src/main.py", line 258, in train_step
    loss.backward()
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/root/.pyenv/versions/3.11.7/lib/python3.11/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: GET was unable to find an engine to execute this computation
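The error above names the system-wide /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8, whereas pip wheels of PyTorch usually bring their own cuDNN under site-packages. An undefined-symbol error like this can occur when libraries from two different cuDNN builds get mixed, so one thing worth checking is which cuDNN copies exist in the container. Below is a minimal sketch of such a check. The function name find_cudnn_libraries and the list of search roots are assumptions for illustration, not part of cog or PyTorch.

```python
import glob
import os
import sys


def find_cudnn_libraries(search_roots):
    """Return sorted real paths of libcudnn shared objects under the given roots."""
    found = set()
    for root in search_roots:
        # "**" with recursive=True also matches files directly inside root.
        pattern = os.path.join(root, "**", "libcudnn*.so*")
        for path in glob.glob(pattern, recursive=True):
            # realpath collapses symlinks so each library is listed once.
            found.add(os.path.realpath(path))
    return sorted(found)


if __name__ == "__main__":
    # Assumed locations: the system library directory from the error message,
    # plus the places where pip-installed wheels commonly put cuDNN.
    roots = ["/usr/lib/x86_64-linux-gnu"]
    for entry in sys.path:
        if entry.endswith("site-packages"):
            roots.append(os.path.join(entry, "nvidia"))
            roots.append(os.path.join(entry, "torch", "lib"))
    for path in find_cudnn_libraries(roots):
        print(path)
```

If this prints cuDNN files both under /usr/lib/x86_64-linux-gnu and under site-packages, the two copies may be of different versions, which would be consistent with the symbol error.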
Here's some debug info about my current configuration:
docker --version
Docker version 24.0.7, build afdd53b
docker container: nvcc --version
root@e9a4ecf8be50:/src# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0
docker container: nvidia-smi
root@e9a4ecf8be50:/src# nvidia-smi
Sun Dec 10 19:59:51 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05 Driver Version: 525.147.05 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 26C P8 14W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
docker container: pytorch version
>>> import torch
>>> torch.__version__
'2.1.1+cu121'
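The torch version alone does not show which CUDA and cuDNN versions PyTorch itself reports, so a slightly fuller check may help. This is a minimal sketch: collect_versions is a hypothetical helper name, and it only uses torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.backends.cudnn.version(). It returns empty fields if torch is not installed.

```python
import importlib.util


def collect_versions():
    """Gather the versions PyTorch reports; fields stay None if torch is missing."""
    info = {"torch": None, "cuda_build": None, "cudnn": None, "cuda_available": None}
    if importlib.util.find_spec("torch") is None:
        return info

    import torch

    info["torch"] = torch.__version__
    # CUDA version the wheel was built against (None for CPU-only builds).
    info["cuda_build"] = torch.version.cuda
    info["cuda_available"] = torch.cuda.is_available()
    try:
        # cuDNN version as an integer, or None if cuDNN is unavailable.
        info["cudnn"] = torch.backends.cudnn.version()
    except RuntimeError as exc:
        # Loading cuDNN can itself fail on a version mismatch; record the message.
        info["cudnn"] = f"error: {exc}"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Comparing the reported cuDNN value with the version of the libcudnn files installed under /usr/lib/x86_64-linux-gnu may show whether the two disagree.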
lsb_release
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy
cog.yaml (contains packages, python version, cuda version, etc.)
build:
  gpu: true
  cuda: "12.1"
  system_packages:
    - "libgl1-mesa-glx"
    - "libegl1-mesa-dev"
  python_version: "3.10"
  python_packages:
    - "tqdm"
    - "rich"
    - "ninja"
    - "numpy"
    - "pandas"
    - "scipy"
    - "scikit-learn"
    - "matplotlib"
    - "opencv-python"
    - "imageio"
    - "imageio-ffmpeg"
    - "omegaconf"
    - "torch==2.1.0"
    - "einops"
    - "plyfile"
    - "pygltflib"
    - "dearpygui"
    - "huggingface_hub"
    - "diffusers"
    - "accelerate"
    - "transformers"
    - "xatlas"
    - "trimesh"
    - "PyMCubes"
    - "pymeshlab"
    - "rembg[gpu,cli]"
  run:
    - "git clone --recursive https://github.com/ashawkey/diff-gaussian-rasterization"
    - "pip install ./diff-gaussian-rasterization"
    - "pip install git+https://github.com/dreamgaussian/dreamgaussian/#subdirectory=simple-knn"
    - "pip install git+https://github.com/NVlabs/nvdiffrast/"
    - "pip install git+https://github.com/ashawkey/kiuikit"
    - "pip install git+https://github.com/bytedance/MVDream"
    - "echo 'READY.'"