Yeah, here are some more details. My PyTorch version is torch==1.3.0+cu100. Here's the script I'm running for inference:
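It's essentially the stock "Loading a TorchScript Model in C++" tutorial code plus a CUDA availability check; a minimal sketch of what it does (not the exact file, and the 1x3x224x224 dummy input is just the usual ResNet example shape):

// main.cpp -- sketch of the inference script (assumes the standard tutorial structure)
#include <torch/script.h>
#include <torch/cuda.h>
#include <iostream>
#include <vector>

int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: cpp-inference <path-to-traced-module>\n";
    return -1;
  }

  // Pick the device; this is what prints the "CUDA ... available!" line in the output below.
  torch::Device device(torch::kCPU);
  if (torch::cuda::is_available()) {
    std::cout << "CUDA is available! Training on GPU." << std::endl;
    device = torch::Device(torch::kCUDA);
  } else {
    std::cout << "CUDA not available! Training on CPU." << std::endl;
  }

  // Deserialize the module traced in Python.
  torch::jit::script::Module module;
  try {
    module = torch::jit::load(argv[1]);
  } catch (const c10::Error& e) {
    std::cerr << "error loading the model\n";
    return -1;
  }
  std::cout << "ok" << std::endl;

  module.to(device);

  // Dummy forward pass; print the first five logits.
  std::vector<torch::jit::IValue> inputs;
  inputs.push_back(torch::ones({1, 3, 224, 224}).to(device));
  at::Tensor output = module.forward(inputs).toTensor();
  std::cout << output.slice(/*dim=*/1, /*start=*/0, /*end=*/5) << std::endl;
  return 0;
}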
Then I run this:
mkdir build && cd build
cp ../traced_resnet_model.pt .
cmake -DCMAKE_PREFIX_PATH=/home/bfortuner/pytorch_libs/libtorch ..
make
./cpp-inference traced_resnet_model.pt
If I download this version of libtorch: https://download.pytorch.org/libtorch/cu101/libtorch-shared-with-deps-1.3.0.zip, I'm able to run inference on CPU, but it doesn't recognize my GPU (this makes sense, because the CUDA versions don't match).
If I download this version of libtorch: https://download.pytorch.org/libtorch/cu100/libtorch-shared-with-deps-1.3.0.zip, I'm able to run inference on CPU, but it still doesn't recognize my GPU:
(.venvpy3) bfortuner@bfortuner-desktop:~/workplace/pytorch_spike/build$ ./cpp-inference ../traced_resnet_model.pt
CUDA not available! Training on CPU.
ok
-0.0172 -0.5685 0.2170 -0.8681 0.3364
[ Variable[CPUFloatType]{1,5} ]
If I try to load the model onto the GPU anyway (see the snippet after the trace), I get this:
CUDA not available! Training on CPU.
ok
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: CUDA driver version is insufficient for CUDA runtime version (getDevice at /pytorch/c10/cuda/impl/CUDAGuardImpl.h:37)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f99fa1fd813 in /home/bfortuner/pytorch_libs/libtorch/lib/libc10.so)
frame #1: <unknown function> + 0x1234b (0x7f99f846334b in /home/bfortuner/pytorch_libs/libtorch/lib/libc10_cuda.so)
frame #2: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool) + 0x686 (0x7f99fbff5c96 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #3: <unknown function> + 0x1efb7f1 (0x7f99fc31d7f1 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #4: <unknown function> + 0x3a8c14b (0x7f99fdeae14b in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #5: <unknown function> + 0x40411d2 (0x7f99fe4631d2 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #6: torch::jit::script::Module::to_impl(c10::optional<c10::Device> const&, c10::optional<c10::ScalarType> const&, bool) + 0x13e (0x7f99fe46730e in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #7: torch::jit::script::Module::to_impl(c10::optional<c10::Device> const&, c10::optional<c10::ScalarType> const&, bool) + 0xb0 (0x7f99fe467280 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #8: torch::jit::script::Module::to(c10::Device, bool) + 0x29 (0x7f99fe4676a9 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #9: main + 0x176 (0x407e7e in ./cpp-inference)
frame #10: __libc_start_main + 0xf5 (0x7f99f9477f45 in /lib/x86_64-linux-gnu/libc.so.6)
frame #11: ./cpp-inference() [0x407ba9]
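To be clear, the "load onto GPU anyway" run above is just the sketch earlier with the device hard-coded instead of going through the availability check; something like:

torch::Device device(torch::kCUDA);  // instead of the torch::cuda::is_available() branch
module.to(device);                   // this is the call that throws the c10::Error above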
If I download this version of libtorch: https://download.pytorch.org/libtorch/cu100/libtorch-shared-with-deps-latest.zip, it recognizes the GPU, but fails to load the model on both CPU and GPU.
CUDA is available! Training on GPU.
terminate called after throwing an instance of 'c10::Error'
what(): [enforce fail at inline_container.cc:137] . PytorchStreamReader failed closing reader: file not found
frame #0: c10::ThrowEnforceNotMet(char const*, int, char const*, std::string const&, void const*) + 0x47 (0x7fc2b7236e17 in /home/bfortuner/pytorch_libs/libtorch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::valid(char const*) + 0x6b (0x7fc2b9f944cb in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::~PyTorchStreamReader() + 0x1f (0x7fc2b9f9451f in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #3: <unknown function> + 0x3c13b97 (0x7fc2bb070b97 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #4: torch::jit::load(std::unique_ptr<caffe2::serialize::ReadAdapterInterface, std::default_delete<caffe2::serialize::ReadAdapterInterface> >, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x238 (0x7fc2bb077ba8 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #5: torch::jit::load(std::string const&, c10::optional<c10::Device>, std::unordered_map<std::string, std::string, std::hash<std::string>, std::equal_to<std::string>, std::allocator<std::pair<std::string const, std::string> > >&) + 0x69 (0x7fc2bb077cd9 in /home/bfortuner/pytorch_libs/libtorch/lib/libtorch.so)
frame #6: main + 0x125 (0x4220ab in ./cpp-inference)
frame #7: __libc_start_main + 0xf5 (0x7fc2b64adf45 in /lib/x86_64-linux-gnu/libc.so.6)
frame #8: ./cpp-inference() [0x420e89]