TX2 (ARM64) Install for Python 3 - Continued from nVidia Forum

I'm building from source, and I can't get it to compile properly on the Jetson TX2, an ARM64 device.

I followed the recommendations from nVidia's forums.

I can get it to compile with Python 2.7, but not Python 3. I had to disable NCCL support, which is desktop-only CUDA, and that got me to about 27% of the build. You can see the nVidia conversation here: https://devtalk.nvidia.com/default/topic/1042821/pytorch-install-broken/?offset=7#5291321

For now, PyTorch keeps dying at the same spot when compiling onnx and onnx-tensorrt, both of which would obviously be required to deploy for production inferencing on the TX2. The latest errors are:

third_party/onnx-tensorrt/CMakeFiles/nvonnxparser.dir/build.make:134: recipe for target 'third_party/onnx-tensorrt/CMakeFiles/nvonnxparser.dir/onnx2trt_utils.cpp.o' failed
make[2]: *** [third_party/onnx-tensorrt/CMakeFiles/nvonnxparser.dir/onnx2trt_utils.cpp.o] Error 1
CMakeFiles/Makefile2:1488: recipe for target 'third_party/onnx-tensorrt/CMakeFiles/nvonnxparser.dir/all' failed
make[1]: *** [third_party/onnx-tensorrt/CMakeFiles/nvonnxparser.dir/all] Error 2
[ 27%] Built target python_copy_files
Makefile:160: recipe for target 'all' failed
make: *** [all] Error 2
Failed to run 'bash …/tools/build_pytorch_libs.sh --use-cuda --use-nnpack caffe2 libshm gloo c10d THD'

I'd be open to not compiling it at all and simply using an ONNX-exported model, but I have no idea how that would fit into my Python production code. After all, the model is only part of the PyTorch code; I still use Torch for translating image formats, evaluations, etc.
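
Just to sketch what I mean (this is a hypothetical example, not code from my project; the model, input shape, and file name are placeholders), the export side would look something like this:

import torch
import torch.nn as nn

# Placeholder model standing in for the real network
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1)).eval()

# Dummy input with the shape the model expects; torch.onnx.export traces the
# model with it and serializes the resulting graph to an .onnx file
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "model.onnx")

The open question is what consumes model.onnx on the TX2 side, since the surrounding preprocessing and evaluation code still depends on Torch.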

Not sure where to go from here, but I'd really like to stick with this library and figure out inferencing for edge devices. Combined with fast.ai's recent v1.0 release, there's some pretty amazing work that can be done.

Could you post or upload the complete build log? The error seems to be missing in the current snippet.

It’s a LOT of code, so the log is too long to post here. You can view the read-only text file here:
https://www.dropbox.com/s/5qklhphl4sjdg7a/log.txt?dl=0

Here’s the setup command I ran.
sudo python3 setup.py install

It looks like ONNX has some problems with protobuf.
Did you pull from master before trying to rebuild PyTorch?
If so, could you call git submodule update --init --recursive and try to build again?
I ran into similar issues when some submodules weren’t properly updated.
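
If the protobuf suspicion turns out to be right, a quick check like this (just a sketch; what it prints depends entirely on your setup) shows which protobuf and onnx Python packages are being picked up, since a version mismatch between those and the system protobuf can cause onnx build failures:

import google.protobuf
print("protobuf:", google.protobuf.__version__)

try:
    import onnx
    print("onnx:", onnx.__version__)
except ImportError as exc:
    print("onnx not importable:", exc)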

OK, so I completely wiped it and followed the standard install directions from PyTorch, rather than the nVidia-recommended pytorch_jetson_install.sh from Dusty.

The only thing I added was these changes, as recommended by nVidia, since NCCL is only for desktop CUDA GPUs:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index f7b24b728..f75f610ed 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -93,7 +93,7 @@ option(USE_LMDB "Use LMDB" ON)
option(USE_METAL "Use Metal for iOS build" ON)
option(USE_MOBILE_OPENGL "Use OpenGL for mobile code" ON)
option(USE_NATIVE_ARCH "Use -march=native" OFF)
-option(USE_NCCL "Use NCCL" ON)
+option(USE_NCCL "Use NCCL" OFF)
option(USE_SYSTEM_NCCL "Use system-wide NCCL" OFF)
option(USE_NNAPI "Use NNAPI" OFF)
option(USE_NNPACK "Use NNPACK" ON)

diff --git a/setup.py b/setup.py
index 99817f346..e39042b83 100644
--- a/setup.py
+++ b/setup.py
@@ -195,6 +195,7 @@ IS_LINUX = (platform.system() == 'Linux')

BUILD_PYTORCH = check_env_flag('BUILD_PYTORCH')
USE_CUDA_STATIC_LINK = check_env_flag('USE_CUDA_STATIC_LINK')
RERUN_CMAKE = True
+USE_NCCL = False

NUM_JOBS = multiprocessing.cpu_count()
max_jobs = os.getenv("MAX_JOBS")

When I do that, leave TensorRT support disabled, and run the install with Python 3, it does compile on the fresh install! I do believe I will eventually need TensorRT support on the TX2, so I'll keep plugging away at that.

For now, however, when trying to import it into a Python shell, I get this:

Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nvidia/pytorch/torch/__init__.py", line 84, in <module>
    from torch._C import *
ImportError: No module named 'torch._C'
>>> exit()

Good to hear you could compile it!

Are you starting the Python shell in your build directory?
If so, could you switch to a different directory and try again?
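
To see what's going on, something like this (a hypothetical check) shows which "torch" Python would resolve from the current directory, without actually importing it; inside the source checkout it points at the un-built torch/ folder, which has no compiled torch._C extension:

import importlib.util

spec = importlib.util.find_spec("torch")
print(spec.origin if spec else "torch not found")

# From the build directory this prints something like
# /home/nvidia/pytorch/torch/__init__.py; from anywhere else it should point
# at the installed copy in dist-packages/site-packages.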

Duh! Thank you!!

So, it compiles without TensorRT support for now.
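
As a quick sanity check of the fresh Python 3 build (a sketch of what I'd run, not captured output from my board), this confirms the CUDA backend was compiled in and that the TX2's GPU is visible:

import torch

print(torch.__version__)
print(torch.cuda.is_available())          # expect True for a CUDA build
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # the Tegra GPU on the TX2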

I think it's enough to start exploring the transfer of PyTorch models from desktop GPUs, via ONNX and TensorRT 3.0, to the Jetson TX2.

Thank you so much for the support here and on Twitter, @ptrblck!

You are welcome! 🙂
I'm glad it's working for now, and I'm sure we can manage to build it with TensorRT in the next iteration.

Let me know how your experiments go with deploying PyTorch through ONNX!

I got it working on the AGX Xavier. I'll take a crack at the TX2 and fix any structural problems (and commit patches to master).
