I have coded a model whose parts are split across two identical GPUs. During training, I get warnings about NaN or Inf values. BUT with the same model, same data, same RNG seed, same HPC, and the same kind of GPU (only one of them instead of two), I get no warning.
What could be causing this?
How are you performing model parallelism? Are you using PyTorch RPC? A simple reproducible script would help a lot in understanding the root cause of the issue.
Since your model works with 1 GPU but not 2, my guess is that it has something to do with CUDA synchronization, with one GPU reading data before the other has finished writing it and getting garbage values. Please post an example of the code, your PyTorch version, and any frameworks you are using.
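In the meantime, something like the sketch below might help narrow down where the first NaN/Inf shows up. It is only an illustration, not your setup: `part_a`, `part_b`, and `first_nonfinite` are made-up names, the two `Linear` layers stand in for the two halves of your model, and it falls back to CPU when two GPUs are not available. The explicit synchronize is there only for debugging, to rule out reading a tensor before a pending copy has finished.

```python
import torch


def first_nonfinite(named_tensors):
    """Return the name of the first tensor containing NaN or Inf, else None."""
    for name, tensor in named_tensors:
        if not torch.isfinite(tensor).all():
            return name
    return None


# Use two GPUs when available; otherwise fall back to CPU so the sketch still runs.
if torch.cuda.device_count() >= 2:
    dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
else:
    dev0 = dev1 = torch.device("cpu")

torch.manual_seed(0)
part_a = torch.nn.Linear(8, 8).to(dev0)  # hypothetical first half of the model
part_b = torch.nn.Linear(8, 1).to(dev1)  # hypothetical second half of the model

x = torch.randn(4, 8, device=dev0)
hidden = part_a(x)
hidden_on_dev1 = hidden.to(dev1)  # cross-device copy (a no-op on the CPU fallback)

# Debugging only: wait for all pending kernels and copies on every visible GPU.
if torch.cuda.is_available():
    for idx in range(torch.cuda.device_count()):
        torch.cuda.synchronize(idx)

out = part_b(hidden_on_dev1)

# Check each stage in order; the first name reported is where the values go bad.
bad = first_nonfinite(
    [("hidden", hidden), ("hidden_on_dev1", hidden_on_dev1), ("out", out)]
)
print("first non-finite tensor:", bad)
```

If `hidden` is fine but `hidden_on_dev1` is not, that would point at the transfer between devices rather than at the model itself. For the backward pass, `torch.autograd.set_detect_anomaly(True)` can also report which operation produced a NaN gradient, at the cost of slower training.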
I don't think I can easily make a simple script to reproduce the problem, but I will try. It will take a lot of time.
I’m using Horovod and NVIDIA apex. Here is my conda environment.
# packages in environment at /home/anhvd/miniconda3/envs/uniter:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
absl-py 0.12.0 pypi_0 pypi
anyio 2.2.0 pypi_0 pypi
apex 0.1 pypi_0 pypi
argon2-cffi 20.1.0 pypi_0 pypi
async-generator 1.10 pypi_0 pypi
attrs 20.3.0 pypi_0 pypi
autopep8 1.5.6 pypi_0 pypi
babel 2.9.1 pypi_0 pypi
backcall 0.2.0 pypi_0 pypi
blas 1.0 mkl
bleach 3.3.0 pypi_0 pypi
boto3 1.17.59 pypi_0 pypi
botocore 1.20.59 pypi_0 pypi
ca-certificates 2021.4.13 h06a4308_1
cachetools 4.2.2 pypi_0 pypi
certifi 2020.12.5 py36h06a4308_0
cffi 1.14.5 pypi_0 pypi
chardet 4.0.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.6.0 pypi_0 pypi
cmake 3.18.4.post1 pypi_0 pypi
colorama 0.4.4 pypi_0 pypi
contextvars 2.4 pypi_0 pypi
cytoolz 0.11.0 pypi_0 pypi
dataclasses 0.8 pypi_0 pypi
decorator 5.0.7 pypi_0 pypi
defusedxml 0.7.1 pypi_0 pypi
deprecation 2.1.0 pypi_0 pypi
entrypoints 0.3 pypi_0 pypi
google-auth 1.30.0 pypi_0 pypi
google-auth-oauthlib 0.4.4 pypi_0 pypi
grpcio 1.37.0 pypi_0 pypi
horovod 0.21.3 pypi_0 pypi
idna 2.10 pypi_0 pypi
immutables 0.15 pypi_0 pypi
importlib-metadata 4.0.1 pypi_0 pypi
intel-openmp 2021.2.0 h06a4308_610
ipdb 0.12 pypi_0 pypi
ipykernel 5.5.3 pypi_0 pypi
ipython 7.16.1 pypi_0 pypi
ipython-genutils 0.2.0 pypi_0 pypi
jedi 0.18.0 pypi_0 pypi
jinja2 2.11.3 pypi_0 pypi
jmespath 0.10.0 pypi_0 pypi
joblib 1.0.1 pyhd3eb1b0_0
json5 0.9.5 pypi_0 pypi
jsonschema 3.2.0 pypi_0 pypi
jupyter-client 6.1.12 pypi_0 pypi
jupyter-core 4.7.1 pypi_0 pypi
jupyter-packaging 0.9.2 pypi_0 pypi
jupyter-server 1.6.4 pypi_0 pypi
jupyterlab 3.0.14 pypi_0 pypi
jupyterlab-pygments 0.1.2 pypi_0 pypi
jupyterlab-server 2.5.0 pypi_0 pypi
ld_impl_linux-64 2.33.1 h53a641e_7
libffi 3.3 he6710b0_2
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
libstdcxx-ng 9.1.0 hdf63c60_0
lmdb 0.97 pypi_0 pypi
lz4 2.1.9 pypi_0 pypi
markdown 3.3.4 pypi_0 pypi
markupsafe 1.1.1 pypi_0 pypi
mistune 0.8.4 pypi_0 pypi
mkl 2020.2 256
mkl-service 2.3.0 py36he8ac12f_0
mkl_fft 1.3.0 py36h54f3939_0
mkl_random 1.1.1 py36h0573a6f_0
msgpack 1.0.2 pypi_0 pypi
msgpack-numpy 0.4.7.1 pypi_0 pypi
nbclassic 0.2.7 pypi_0 pypi
nbclient 0.5.3 pypi_0 pypi
nbconvert 6.0.7 pypi_0 pypi
nbformat 5.1.3 pypi_0 pypi
ncurses 6.2 he6710b0_1
nest-asyncio 1.5.1 pypi_0 pypi
notebook 6.3.0 pypi_0 pypi
numpy 1.19.2 py36h54aff64_0
numpy-base 1.19.2 py36hfa32c7d_0
oauthlib 3.1.0 pypi_0 pypi
openssl 1.1.1k h27cfd23_0
packaging 20.9 pypi_0 pypi
pandas 1.1.5 pypi_0 pypi
pandocfilters 1.4.3 pypi_0 pypi
parso 0.8.2 pypi_0 pypi
pexpect 4.8.0 pypi_0 pypi
pickleshare 0.7.5 pypi_0 pypi
pip 21.0.1 py36h06a4308_0
pretty-errors 1.2.20 pypi_0 pypi
prometheus-client 0.10.1 pypi_0 pypi
prompt-toolkit 3.0.18 pypi_0 pypi
protobuf 3.15.8 pypi_0 pypi
psutil 5.8.0 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pycodestyle 2.7.0 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
pygments 2.8.1 pypi_0 pypi
pyparsing 2.4.7 pypi_0 pypi
pyrsistent 0.17.3 pypi_0 pypi
python 3.6.13 hdb3f193_0
python-dateutil 2.8.1 pypi_0 pypi
pytorch-pretrained-bert 0.6.2 pypi_0 pypi
pytz 2021.1 pypi_0 pypi
pyyaml 5.4.1 pypi_0 pypi
pyzmq 22.0.3 pypi_0 pypi
readline 8.1 h27cfd23_0
regex 2021.4.4 pypi_0 pypi
requests 2.25.1 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rsa 4.7.2 pypi_0 pypi
s3transfer 0.4.2 pypi_0 pypi
scikit-learn 0.24.1 py36ha9443f7_0
scipy 1.5.2 py36h0b6359f_0
send2trash 1.5.0 pypi_0 pypi
setuptools 52.0.0 py36h06a4308_0
six 1.15.0 py36h06a4308_0
sklearn 0.0 pypi_0 pypi
sniffio 1.2.0 pypi_0 pypi
sqlite 3.35.4 hdfb4753_0
tensorboard 2.5.0 pypi_0 pypi
tensorboard-data-server 0.6.0 pypi_0 pypi
tensorboard-plugin-wit 1.8.0 pypi_0 pypi
tensorboardx 1.7 pypi_0 pypi
terminado 0.9.4 pypi_0 pypi
testpath 0.4.4 pypi_0 pypi
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
toml 0.10.2 pypi_0 pypi
tomlkit 0.7.0 pypi_0 pypi
toolz 0.11.1 pypi_0 pypi
torch 1.8.1 pypi_0 pypi
tornado 6.1 pypi_0 pypi
tqdm 4.60.0 pypi_0 pypi
traitlets 4.3.3 pypi_0 pypi
typing-extensions 3.7.4.3 pypi_0 pypi
urllib3 1.26.4 pypi_0 pypi
wcwidth 0.2.5 pypi_0 pypi
webencodings 0.5.1 pypi_0 pypi
werkzeug 1.0.1 pypi_0 pypi
wheel 0.36.2 pyhd3eb1b0_0
xz 5.2.5 h7b6447c_0
zipp 3.4.1 pypi_0 pypi
zlib 1.2.11 h7b6447c_3