NaN values appear in model parallelism

I have written a model whose parts are split across two identical GPUs. During training, I get warnings about NaN or Inf values. But with the same model, same data, same RNG, and the same kind of GPU on the same HPC, using only one GPU, I get no warnings.

What is the cause of this?

How are you performing model parallelism? Are you using PyTorch RPC? A simple reproducible script would help a lot in understanding the root cause of the issue.

Since your model works with 1 GPU but not 2, my guess is that it has something to do with CUDA synchronization, which could be producing garbage values. To narrow it down, please post an example of the code, your PyTorch version, and any frameworks you are using.
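Until a reproducible script is available, one way to narrow this down is to find the first submodule whose output contains NaN or Inf. The sketch below is only an illustration, not your actual setup: the helper name `find_nonfinite_outputs` and the toy `nn.Sequential` model are made up for this example, and it assumes each submodule returns a single tensor. It synchronizes every visible GPU before the forward pass, so a pending kernel or copy should not be mistaken for a numerical problem.

```python
import torch
import torch.nn as nn


def find_nonfinite_outputs(model, inputs):
    """Run one forward pass; return names of submodules whose output has NaN/Inf."""
    bad = []
    handles = []

    def make_hook(name):
        def hook(module, hook_inputs, output):
            # Only plain tensor outputs are checked in this sketch.
            if torch.is_tensor(output) and not torch.isfinite(output).all():
                bad.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is the empty string
            handles.append(module.register_forward_hook(make_hook(name)))

    try:
        if torch.cuda.is_available():
            # Wait for pending work on every visible GPU before checking values.
            for device_index in range(torch.cuda.device_count()):
                torch.cuda.synchronize(device_index)
        with torch.no_grad():
            model(inputs)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return bad


# Hypothetical toy model standing in for the real one.
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

clean = find_nonfinite_outputs(model, torch.ones(1, 4))

poisoned_input = torch.ones(1, 4)
poisoned_input[0, 0] = float("inf")  # deliberately inject a non-finite value
bad = find_nonfinite_outputs(model, poisoned_input)

print("clean run:", clean)
print("poisoned run:", bad)
```

If the first failing submodule is the one that receives activations from the other GPU, that would point toward the device-to-device transfer rather than the math itself. For the backward pass, `torch.autograd.set_detect_anomaly(True)` can report the operation that produced a NaN gradient, at the cost of slower training.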

I don't think I can make a simple script to reproduce the problem, but I will try; it will take a lot of time.
I'm using Horovod and NVIDIA Apex. Here is my conda environment:

# packages in environment at /home/anhvd/miniconda3/envs/uniter:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
absl-py                   0.12.0                   pypi_0    pypi
anyio                     2.2.0                    pypi_0    pypi
apex                      0.1                      pypi_0    pypi
argon2-cffi               20.1.0                   pypi_0    pypi
async-generator           1.10                     pypi_0    pypi
attrs                     20.3.0                   pypi_0    pypi
autopep8                  1.5.6                    pypi_0    pypi
babel                     2.9.1                    pypi_0    pypi
backcall                  0.2.0                    pypi_0    pypi
blas                      1.0                         mkl
bleach                    3.3.0                    pypi_0    pypi
boto3                     1.17.59                  pypi_0    pypi
botocore                  1.20.59                  pypi_0    pypi
ca-certificates           2021.4.13            h06a4308_1
cachetools                4.2.2                    pypi_0    pypi
certifi                   2020.12.5        py36h06a4308_0
cffi                      1.14.5                   pypi_0    pypi
chardet                   4.0.0                    pypi_0    pypi
click                     7.1.2                    pypi_0    pypi
cloudpickle               1.6.0                    pypi_0    pypi
cmake                     3.18.4.post1             pypi_0    pypi
colorama                  0.4.4                    pypi_0    pypi
contextvars               2.4                      pypi_0    pypi
cytoolz                   0.11.0                   pypi_0    pypi
dataclasses               0.8                      pypi_0    pypi
decorator                 5.0.7                    pypi_0    pypi
defusedxml                0.7.1                    pypi_0    pypi
deprecation               2.1.0                    pypi_0    pypi
entrypoints               0.3                      pypi_0    pypi
google-auth               1.30.0                   pypi_0    pypi
google-auth-oauthlib      0.4.4                    pypi_0    pypi
grpcio                    1.37.0                   pypi_0    pypi
horovod                   0.21.3                   pypi_0    pypi
idna                      2.10                     pypi_0    pypi
immutables                0.15                     pypi_0    pypi
importlib-metadata        4.0.1                    pypi_0    pypi
intel-openmp              2021.2.0           h06a4308_610
ipdb                      0.12                     pypi_0    pypi
ipykernel                 5.5.3                    pypi_0    pypi
ipython                   7.16.1                   pypi_0    pypi
ipython-genutils          0.2.0                    pypi_0    pypi
jedi                      0.18.0                   pypi_0    pypi
jinja2                    2.11.3                   pypi_0    pypi
jmespath                  0.10.0                   pypi_0    pypi
joblib                    1.0.1              pyhd3eb1b0_0
json5                     0.9.5                    pypi_0    pypi
jsonschema                3.2.0                    pypi_0    pypi
jupyter-client            6.1.12                   pypi_0    pypi
jupyter-core              4.7.1                    pypi_0    pypi
jupyter-packaging         0.9.2                    pypi_0    pypi
jupyter-server            1.6.4                    pypi_0    pypi
jupyterlab                3.0.14                   pypi_0    pypi
jupyterlab-pygments       0.1.2                    pypi_0    pypi
jupyterlab-server         2.5.0                    pypi_0    pypi
ld_impl_linux-64          2.33.1               h53a641e_7
libffi                    3.3                  he6710b0_2
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libstdcxx-ng              9.1.0                hdf63c60_0
lmdb                      0.97                     pypi_0    pypi
lz4                       2.1.9                    pypi_0    pypi
markdown                  3.3.4                    pypi_0    pypi
markupsafe                1.1.1                    pypi_0    pypi
mistune                   0.8.4                    pypi_0    pypi
mkl                       2020.2                      256
mkl-service               2.3.0            py36he8ac12f_0
mkl_fft                   1.3.0            py36h54f3939_0
mkl_random                1.1.1            py36h0573a6f_0
msgpack                   1.0.2                    pypi_0    pypi
msgpack-numpy             0.4.7.1                  pypi_0    pypi
nbclassic                 0.2.7                    pypi_0    pypi
nbclient                  0.5.3                    pypi_0    pypi
nbconvert                 6.0.7                    pypi_0    pypi
nbformat                  5.1.3                    pypi_0    pypi
ncurses                   6.2                  he6710b0_1
nest-asyncio              1.5.1                    pypi_0    pypi
notebook                  6.3.0                    pypi_0    pypi
numpy                     1.19.2           py36h54aff64_0
numpy-base                1.19.2           py36hfa32c7d_0
oauthlib                  3.1.0                    pypi_0    pypi
openssl                   1.1.1k               h27cfd23_0
packaging                 20.9                     pypi_0    pypi
pandas                    1.1.5                    pypi_0    pypi
pandocfilters             1.4.3                    pypi_0    pypi
parso                     0.8.2                    pypi_0    pypi
pexpect                   4.8.0                    pypi_0    pypi
pickleshare               0.7.5                    pypi_0    pypi
pip                       21.0.1           py36h06a4308_0
pretty-errors             1.2.20                   pypi_0    pypi
prometheus-client         0.10.1                   pypi_0    pypi
prompt-toolkit            3.0.18                   pypi_0    pypi
protobuf                  3.15.8                   pypi_0    pypi
psutil                    5.8.0                    pypi_0    pypi
ptyprocess                0.7.0                    pypi_0    pypi
pyasn1                    0.4.8                    pypi_0    pypi
pyasn1-modules            0.2.8                    pypi_0    pypi
pycodestyle               2.7.0                    pypi_0    pypi
pycparser                 2.20                     pypi_0    pypi
pygments                  2.8.1                    pypi_0    pypi
pyparsing                 2.4.7                    pypi_0    pypi
pyrsistent                0.17.3                   pypi_0    pypi
python                    3.6.13               hdb3f193_0
python-dateutil           2.8.1                    pypi_0    pypi
pytorch-pretrained-bert   0.6.2                    pypi_0    pypi
pytz                      2021.1                   pypi_0    pypi
pyyaml                    5.4.1                    pypi_0    pypi
pyzmq                     22.0.3                   pypi_0    pypi
readline                  8.1                  h27cfd23_0
regex                     2021.4.4                 pypi_0    pypi
requests                  2.25.1                   pypi_0    pypi
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.7.2                    pypi_0    pypi
s3transfer                0.4.2                    pypi_0    pypi
scikit-learn              0.24.1           py36ha9443f7_0
scipy                     1.5.2            py36h0b6359f_0
send2trash                1.5.0                    pypi_0    pypi
setuptools                52.0.0           py36h06a4308_0
six                       1.15.0           py36h06a4308_0
sklearn                   0.0                      pypi_0    pypi
sniffio                   1.2.0                    pypi_0    pypi
sqlite                    3.35.4               hdfb4753_0
tensorboard               2.5.0                    pypi_0    pypi
tensorboard-data-server   0.6.0                    pypi_0    pypi
tensorboard-plugin-wit    1.8.0                    pypi_0    pypi
tensorboardx              1.7                      pypi_0    pypi
terminado                 0.9.4                    pypi_0    pypi
testpath                  0.4.4                    pypi_0    pypi
threadpoolctl             2.1.0              pyh5ca1d4c_0
tk                        8.6.10               hbc83047_0
toml                      0.10.2                   pypi_0    pypi
tomlkit                   0.7.0                    pypi_0    pypi
toolz                     0.11.1                   pypi_0    pypi
torch                     1.8.1                    pypi_0    pypi
tornado                   6.1                      pypi_0    pypi
tqdm                      4.60.0                   pypi_0    pypi
traitlets                 4.3.3                    pypi_0    pypi
typing-extensions         3.7.4.3                  pypi_0    pypi
urllib3                   1.26.4                   pypi_0    pypi
wcwidth                   0.2.5                    pypi_0    pypi
webencodings              0.5.1                    pypi_0    pypi
werkzeug                  1.0.1                    pypi_0    pypi
wheel                     0.36.2             pyhd3eb1b0_0
xz                        5.2.5                h7b6447c_0
zipp                      3.4.1                    pypi_0    pypi
zlib                      1.2.11               h7b6447c_3