F.log_softmax returns nan on Raspbian OS with torch wheels, but works fine in Anaconda (Windows)

Lahiru_Welagedara · December 7, 2020, 6:59am

Hello guys,

I have been working on a PyTorch model for testing purposes. So far I have trained the model using Anaconda on Windows, and it works fine. I wanted to train the same model on a Raspberry Pi 4 (aarch64), so I did the following:

Deployed Raspbian Lite OS.
Installed torch from prebuilt wheels, since the Raspberry Pi has an ARM architecture (I got the wheels for torch v1.7 and torchvision v0.8.1 from https://mathinf.eu/pytorch/arm64/).

Torch imported successfully, and the model ran on the Raspberry Pi without errors.
However, the loss came out as nan. I debugged and found no issue with the data or the data transformations (no nan inputs), but F.log_softmax(r_out2, dim=1) returns nan values right from the first batch of data.
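One way to narrow this kind of problem down is to recompute log-softmax for a single row without torch and compare the result with what F.log_softmax returns on the same row. The following is only a minimal sketch in pure Python; the helper names reference_log_softmax and first_bad_index are made up for illustration and are not part of torch.

```python
import math


def reference_log_softmax(row):
    """Numerically stable log-softmax of one row of logits (pure Python).

    Hypothetical helper for cross-checking; not a torch API.
    """
    if not all(math.isfinite(x) for x in row):
        raise ValueError("input row already contains nan or inf")
    m = max(row)
    # log-sum-exp trick: subtract the max before exponentiating to avoid overflow
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def first_bad_index(values):
    """Return the index of the first nan/inf in values, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    expected = reference_log_softmax(logits)
    print(expected, first_bad_index(expected))
```

On the Pi, one row of the real input could be passed in as r_out2[0].detach().tolist() and the result compared with F.log_softmax(r_out2, dim=1)[0].tolist(). If the pure-Python values are finite while the torch values are nan, that points at the torch build rather than at the data.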

Anaconda environment versions -
numpy.__version__ - 1.18.5
torch.__version__ - 1.7.0
torchvision.__version__ - 0.8.1

Raspberry Pi versions -
torch - v1.7
torchvision - v0.8.1
numpy - v1.18.5 (I also tried upgrading to the latest version, with no luck)

I have no clue why the same model and the same data return nan on the Pi when they run perfectly well in the Anaconda environment. So far, the only problem I have found is with log_softmax.

Does it have something to do with the torch wheels I installed?
Or does it have something to do with the processing power of the Raspberry Pi?

I would be very glad if someone could help me with this issue.

Thanks!

Lahiru_Welagedara · December 7, 2020, 12:06pm

Solution found:
The issue was with the torch 1.7 wheel from https://mathinf.eu/pytorch/arm64/.
The model worked fine with the torch 1.6 wheel available at https://mathinf.eu/pytorch/arm64/.
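After swapping wheels, it can help to confirm which versions are actually installed in the active environment. The sketch below uses only the standard library and assumes Python 3.8 or newer (for importlib.metadata); installed_version is a made-up helper name. Printing torch.__version__ from a Python session is the more direct check.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages mentioned in this thread.
for name in ("torch", "torchvision", "numpy"):
    print(name, installed_version(name))
```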