Libtorch returns NaN on Arm64

Hi!
I’m using PyTorch’s C++ frontend on an NVIDIA Orin NX (Arm64). JetPack is 5.1.2, and I’m using the torch installed in /home/nvidia/.local/lib/python3.8/site-packages/torch/ from the Python wheel provided by NVIDIA (torch-2.1.0a0+41361538.nv23.06-cp38-cp38-linux_aarch64.whl).

Basically, after a few JIT inferences on an IValue vector inputs,
at::Tensor action_mean_tensor = policy_.forward(inputs).toTensor(); starts returning NaN, even with small input values.

I tried to create a minimal application to replicate the issue:
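
Roughly, the loop looks like this. This is a simplified sketch, not the exact pastebin code: the model path, observation values, and iteration count here are placeholders.

```cpp
#include <torch/script.h>

#include <iostream>
#include <vector>

int main() {
  // Placeholder model path, standing in for the actual TorchScript policy.
  torch::jit::script::Module policy = torch::jit::load("policy.pt");
  policy.eval();

  for (int i = 0; i < 5; ++i) {
    // 28 placeholder observation values, matching the {1, 28} shape below.
    std::vector<float> obs(28, 0.0f);

    // from_blob() wraps obs.data() directly; no copy is made.
    torch::Tensor obs_tensor =
        torch::from_blob(obs.data(), {1, 28}, torch::kFloat32);

    std::vector<torch::jit::IValue> inputs;
    inputs.push_back(obs_tensor);

    at::Tensor action_mean_tensor = policy.forward(inputs).toTensor();
    std::cout << "action_mean_tensor: " << action_mean_tensor << std::endl;
  }
}
```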

In my original, bigger ROS 2 node, I generated new observations on every iteration with env.step(obs), and I easily got NaNs within a few iterations.
On my Orin NX, with the example above, I always get NaNs at the third iteration.

For some reason, removing #include <torch/torch.h> seems to remove the NaNs. Unfortunately, on the bigger ROS 2 node the same workaround only guarantees about 20 good iterations (instead of 2); eventually it starts alternating between NaNs and good values.

So far I have found only three workarounds that always give good results:

  1. Appending clone() to
    torch::Tensor obs_tensor = torch::from_blob(obs.data(), {1, 28}, torch::kFloat32);
    (see the sketch after this list)

  2. Compiling with -O1, -O2, or -O3. Somehow g++’s optimizations change how the memory is handled and fix (or hide) the issue.

  3. Printing the obs to stdout before passing them to from_blob. Funnily enough, the NaNs stop if I print the obs values :thinking:
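
For reference, this is what workaround 1 looks like in context, as a minimal sketch (the helper name is made up for illustration): clone() copies the buffer wrapped by from_blob() into tensor-owned storage, so the result no longer aliases obs.data().

```cpp
#include <torch/torch.h>

#include <vector>

// Hypothetical helper illustrating workaround 1: clone() copies the data
// wrapped by from_blob() into storage owned by the tensor itself, so the
// result stays valid even after obs is destroyed or overwritten.
torch::Tensor make_obs_tensor(std::vector<float>& obs) {
  return torch::from_blob(obs.data(), {1, 28}, torch::kFloat32).clone();
}
```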

I hope somebody can give some insight into this strange issue.

While I was trying to debug, I found reports of similar errors, as well as the recent memory issues related to the C++ frontend.

Your workarounds suggest that obs.data() becomes invalid, i.e., goes out of scope. torch::from_blob will use the underlying memory directly, and it is your responsibility to guarantee the data stays valid and won’t be freed.
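
To make the failure mode concrete, here is a contrived example (not your code) where the wrapped buffer dies before the tensor is read:

```cpp
#include <torch/torch.h>

#include <iostream>
#include <vector>

int main() {
  torch::Tensor t;
  {
    std::vector<float> obs(28, 1.0f);
    // t aliases obs.data(); from_blob() does not copy or take ownership.
    t = torch::from_blob(obs.data(), {1, 28}, torch::kFloat32);
  }  // obs is destroyed here, so t now points at freed memory.

  // Undefined behavior: this may print ones, garbage, or NaN, depending
  // on what reuses the freed block, which matches the symptom above.
  std::cout << t << std::endl;
}
```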

I agree with you, @ptrblck, but I don’t see why obs would not be valid in my pastebin above.

This is the output from my x86_64 laptop with Ubuntu 22.04 and (Py)Torch 2.3.0 (cxx11 ABI, installed standalone), compiled with colcon:

```text
inputs: Columns 1 to 10 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000

Columns 11 to 20 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 -3.9500 2.3000 1.6500

Columns 21 to 28 0.6500 0.9000 1.6500 0.0000 0.0000 0.0000 0.0000 0.0000
[ CPUFloatType{1,28} ]
stds: 0.3192
0.1498
0.1619
0.4095
[ CPUFloatType{4} ]
random_sample: 1.8895
0.4808
-0.1694
0.9750
[ CPUFloatType{4} ]
action_mean_tensor: 0.8503 -0.7974 0.4450 -0.8844
[ CPUFloatType{1,4} ]
Action: 1.606528 -0.383409 -0.075384 -0.862261
inputs: Columns 1 to 10 -0.0076 0.0016 0.0101 0.8693 0.1434 -0.4731 -0.1096 0.9891 0.0984 0.4821

Columns 11 to 20 -0.0337 0.8755 -0.3785 0.0787 0.5042 -3.4581 -25.0000 -6.6201 -3.9424 2.2984

Columns 21 to 28 1.6399 0.6576 0.8984 1.6399 1.4953 -0.1383 -1.2073 -0.4413
[ CPUFloatType{1,28} ]
stds: 0.3192
0.1498
0.1619
0.4095
[ CPUFloatType{4} ]
random_sample: -0.7742
-2.2891
-0.8089
0.3901
[ CPUFloatType{4} ]
action_mean_tensor: 1.7387 -0.1351 0.0408 -0.4169
[ CPUFloatType{1,4} ]
Action: -1.346089 0.309363 -0.033036 -0.162666
inputs: Columns 1 to 10 -0.0216 0.0067 0.0304 0.7917 0.4566 -0.4058 -0.3988 0.8895 0.2228 0.4627

Columns 11 to 20 -0.0146 0.8864 -0.7031 0.2570 1.0170 -6.9673 1.4978 -15.0000 -3.9284 2.2933

Columns 21 to 28 1.6196 0.6716 0.8933 1.6196 2.1924 -0.2787 0.0599 -1.1601
[ CPUFloatType{1,28} ]
stds: 0.3192
0.1498
0.1619
0.4095
[ CPUFloatType{4} ]
random_sample: 1.5993
-1.9500
-0.9119
-1.6646
[ CPUFloatType{4} ]
action_mean_tensor: 1.8872 0.0295 -0.7745 0.0846
[ CPUFloatType{1,4} ]
Action: 3.018301 -0.057587 0.706246 -0.140806
```

This is the output from the Orin NX, which uses the (Py)Torch 2.1.0 wheel (Ubuntu 20.04), compiled with colcon:

```text
PyTorch version: 2.1.0
inputs: Columns 1 to 10 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000 0.0000 1.0000 0.0000 0.0000

Columns 11 to 20 0.0000 1.0000 0.0000 0.0000 0.0000 0.0000 0.0000 -3.9500 2.3000 1.6500

Columns 21 to 28 0.6500 0.9000 1.6500 0.0000 0.0000 0.0000 0.0000 0.0000
[ CPUFloatType{1,28} ]
stds: 0.3192
0.1498
0.1619
0.4095
[ CPUFloatType{4} ]
random_sample: -0.7111
0.0454
-0.2371
-0.6546
[ CPUFloatType{4} ]
action_mean_tensor: 0.8503 -0.7974 0.4450 -0.8844
[ CPUFloatType{1,4} ]
Action: -0.604600 -0.036172 -0.105509 0.578925
inputs: Columns 1 to 10 -0.0076 0.0016 0.0101 0.8693 0.1434 -0.4731 -0.1096 0.9891 0.0984 0.4821

Columns 11 to 20 -0.0337 0.8755 -0.3785 0.0787 0.5042 -3.4581 -25.0000 -6.6201 -3.9424 2.2984

Columns 21 to 28 1.6399 0.6576 0.8984 1.6399 1.4953 -0.1383 -1.2073 -0.4413
[ CPUFloatType{1,28} ]
stds: 0.3192
0.1498
0.1619
0.4095
[ CPUFloatType{4} ]
random_sample: 2.1585
-1.5461
-0.8058
0.0587
[ CPUFloatType{4} ]
action_mean_tensor: 1.7387 -0.1351 0.0408 -0.4169
[ CPUFloatType{1,4} ]
Action: 3.752915 0.208950 -0.032909 -0.024473
inputs: Columns 1 to 10 -0.0216 0.0067 0.0304 0.7917 0.4566 -0.4058 -0.3988 0.8895 0.2228 0.4627

Columns 11 to 20 -0.0146 0.8864 -0.7031 0.2570 1.0170 -6.9673 1.4978 -15.0000 -3.9284 2.2933

Columns 21 to 28 1.6196 0.6716 0.8933 1.6196 2.1924 -0.2787 0.0599 -1.1601
[ CPUFloatType{1,28} ]
stds: 0.3192
0.1498
0.1619
0.4095
[ CPUFloatType{4} ]
random_sample: 0.5744
-0.5015
-0.2751
-1.0906
[ CPUFloatType{4} ]
action_mean_tensor: -nan -nan -nan -nan
[ CPUFloatType{1,4} ]
Action: -nan -nan -nan -nan
```